Upscaling our big data platform

Very large-scale internet measurements generate a lot of data. Two years ago, we prepared ourselves for a sudden increase in the number of SamKnows measurement agents in the world, from thousands to millions, which meant a huge increase in data. So, we built an entirely new cloud-based platform to handle our measurement results: SamKnows One.

SamKnows deals with a lot of data. We’re constantly collecting measurement results that accurately reflect the quality of fixed-line and mobile internet connections.

But two years ago, we were about to have so much data that we’d topple over. So, we took a deep breath and totally revamped our analytics and data platform, which is now called SamKnows One.

Here’s how we did it.

Back to the beginning…

Ten years ago, we started out by collecting fixed-line performance data using an older version of the SamKnows Whitebox. The Whitebox connects to the router and runs tests from the edge of the home to a SamKnows test server. Using traditional database technologies, such as MySQL, we built an analytics and data storage system that could easily handle thousands of measurement agents, which we then replicated for each SamKnows account.

But come 2016, things changed. We started to embed our SamKnows test suite into the routers of very large ISPs, along with the wireless access points of major mesh network vendors. Other ISPs expressed strong interest, citing embedded testing as the most accurate, flexible, and cost-efficient way for them to monitor the health of their networks. To put things in context, we were looking to go from between 2,000 and 15,000 measurement agents per account to millions. We needed to scale up our data platform, and fast.

Piling on the pressure, we had also recently released two new solutions for mobile: our mobile SDKs and Rapid Build Framework apps for Android and iOS. This was an exciting step for us, and although we’d worked in mobile before, these products were set to dramatically increase the number of measurement agents running SamKnows tests.

Building the SamKnows Platform

In March last year, we started to build a single platform to handle and analyse all this data. We engineered our SamKnows Platform, which later became the underlying platform for analytics in SamKnows One, our powerful analytics and agent management platform.

When it came to building SamKnows One, we started by working out what we wanted the platform to do. Here’s our list of core requirements:

1. Able to continuously ingest large numbers of small rows.

2. Easily slot into our existing data ingestion process.

3. Able to aggregate time-series data fast for SamKnows One analytics queries.

4. Export raw data in CSVs.

5. Query data from applications powering products across the SamKnows platform including alerting, charting, visualisation, and various APIs.

6. Ensure that the size of a client’s data wouldn’t affect the experience of other clients using the platform.

7. Filter and split by custom metadata attached to measurement agents (such as speed tier) without the platform slowing down.

8. Be able to support one hundred million agents and beyond in SamKnows One.

9. Have the ability to quickly access individual testing agents’ data for a consumer view, for mobile apps and for ISP customer care staff so that they could help find speedy solutions when the internet was underperforming for a particular customer.
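To make requirements 3 and 7 concrete, here is a minimal Python sketch of the kind of query the platform has to answer: averaging a metric per hour and splitting the result by a custom metadata attribute. The function name, the row layout, and the "speed_tier" attribute key are assumptions made for illustration; in the real platform this work is done by the query engine over far larger data sets, not in application code.

```python
from collections import defaultdict
from datetime import datetime, timezone
from statistics import mean


def aggregate_hourly(rows, agent_metadata, split_by="speed_tier"):
    """Average a metric per (hour, metadata value).

    rows: iterable of (agent_id, unix_timestamp, value) tuples (hypothetical layout).
    agent_metadata: dict mapping agent_id to a dict of custom attributes.
    """
    groups = defaultdict(list)
    for agent_id, ts, value in rows:
        # Truncate the timestamp to the start of its hour (UTC).
        hour = datetime.fromtimestamp(ts, tz=timezone.utc).replace(
            minute=0, second=0, microsecond=0
        )
        # Agents without the attribute fall into an explicit "unknown" split.
        split = agent_metadata.get(agent_id, {}).get(split_by, "unknown")
        groups[(hour.isoformat(), split)].append(value)
    # One averaged value per (hour, split) pair, in sorted key order.
    return {key: mean(values) for key, values in sorted(groups.items())}
```

Calling aggregate_hourly(rows, metadata) returns a dictionary keyed by (hour, speed tier), which is the shape a chart split by speed tier would need.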

After a lot of work with proofs of concept, we eventually settled on a technology stack that consisted, at its base, of a few distinct parts:

1. Hadoop HDFS — our storage layer.

HDFS is a distributed file system that acts as our storage layer. It allows us to leverage lots of servers that work together to form a single file system that is highly redundant and resilient while still being efficient.

2. Apache Hive — our database.

Hive is a data warehouse system built on top of Hadoop that provides a SQL-like interface. It allows us to give our metric data a structure, and to partition and bucket the time-series data, so that it’s easy to read from and easy to write to. Hive can also act as a query engine by running queries as Hadoop MapReduce jobs.

3. PrestoDB — our analytics query engine.

PrestoDB is a distributed query engine designed for analytics queries, developed by Facebook and used by other companies such as Uber, Airbnb, and Netflix. Although Hive could fulfil this purpose, Presto was much better at running analytics over large amounts of time-series data.
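The partitioning and bucketing mentioned for Hive can be pictured as a rule that decides where each row lives on the file system. The Python sketch below is an illustration under stated assumptions: the table name, the "dt" partition column, the bucket count, and the modulo hash are all hypothetical, and the hash is a simplified stand-in rather than Hive's actual bucketing function.

```python
from datetime import datetime, timezone

NUM_BUCKETS = 32  # hypothetical bucket count, chosen only for this example


def storage_location(table, agent_id, unix_ts, num_buckets=NUM_BUCKETS):
    """Return an illustrative partition directory and bucket file for one row."""
    # Partition: one directory per day, written in the `column=value` style.
    day = datetime.fromtimestamp(unix_ts, tz=timezone.utc).strftime("%Y-%m-%d")
    # Bucket: map the agent id onto a fixed number of files so that all rows
    # for one agent within a day land in the same file (simplified hash).
    bucket = (agent_id & 0x7FFFFFFF) % num_buckets
    return f"{table}/dt={day}/{bucket:06d}_0"


print(storage_location("download_speed", 1234567, 1514764800))
# prints download_speed/dt=2018-01-01/000007_0
```

With a layout like this, a query over one week only needs to read seven partition directories, and a lookup for a single agent only needs one bucket file per day, which is what makes both the analytics queries and the per-agent views practical.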

These technology choices allow us to unify many individual platforms for different clients into a single scalable and dynamic platform: SamKnows One. This platform includes multiple distinct data clusters, for resilience, maintenance, and data locality purposes, and utilises these three primary services, which work together to store and analyse the data.

The finishing touches

After establishing our data platform, we had to find a way to select and filter the data so that it could be displayed, processed, and exported in SamKnows One. The front end is a JavaScript web application that runs in the browser and is powered by PHP APIs. Those PHP APIs talk to our data platform (through Presto queries) to extract the metric data and blend it with any extra metadata.

We built a PHP library to interact with Presto so that we could work with the stored data. As with all Presto communication, the library works over HTTP. It handles submitting queries and returning and streaming results into PHP, where we build the APIs that power everything data-related on SamKnows One: the analytics functionality to visualise the data, and the Data API to get raw data. We will open source this library in the future, so keep checking back to this blog for further announcements.
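For readers curious what talking to Presto over HTTP involves, here is a minimal Python sketch of the general pattern: submit the query text with a POST to /v1/statement, then keep following the nextUri link in each response and yield rows as they arrive. It is not the PHP library described above. The function names, the default catalog and schema, and the injectable transport parameter are assumptions made for illustration, and a production client would also handle retries, timeouts, and cancellation.

```python
import json
import urllib.request


def http_transport(method, url, body=None, headers=None):
    """Send one HTTP request and decode the JSON response."""
    data = body.encode("utf-8") if body is not None else None
    request = urllib.request.Request(
        url, data=data, headers=headers or {}, method=method
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def stream_query(coordinator_url, sql, user, catalog="hive", schema="default",
                 transport=http_transport):
    """Yield result rows one at a time as the coordinator returns them."""
    headers = {
        "X-Presto-User": user,
        "X-Presto-Catalog": catalog,
        "X-Presto-Schema": schema,
    }
    # Submit the query text; the response describes the query's first state.
    page = transport(
        "POST", coordinator_url.rstrip("/") + "/v1/statement", sql, headers
    )
    while True:
        if "error" in page:
            raise RuntimeError(page["error"].get("message", "query failed"))
        # A page may or may not carry rows, depending on query progress.
        for row in page.get("data", []):
            yield row
        next_uri = page.get("nextUri")
        if next_uri is None:
            return  # no nextUri means the query has finished
        page = transport("GET", next_uri, None, headers)
```

Because stream_query is a generator, the caller can start processing the first rows before the query has finished, which is the same streaming behaviour the paragraph above describes. The transport argument exists only so the paging logic can be exercised without a running cluster.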

Get in touch

If you’d like to learn more about SamKnows One, our embedded solution, or any of our mobile products, please contact the team at