Moving the data pipeline

Find out more about how we're storing our data in a faster, more efficient and scalable way with Google BigQuery.

At SamKnows, we deal with a LOT of data. As we've grown, we've naturally taken on more and more of it, and as our database has grown, we've taken the necessary steps to accommodate it. It's a bit like renting a flat: you accumulate things (dishware, plants, new family members), and the more things you have, the more your space needs to grow to fit them. The same goes for us. Giving our data a comfortable and, ideally, sustainable home is crucial! We needed a place that could handle an enormous amount of information and accommodate more as we go.

Having a reliable infrastructure for this data is very important. Previously, we were using a Presto-based pipeline, which handled data from many millions of devices. But eventually our data grew too big for its boots and needed a new home: one that required less maintenance and could accommodate growth. So we chose to move our data from Presto to BigQuery.

BigQuery is built for scalability and supports streaming ingestion of data. The entire stack is deployed in Google Cloud Platform, rather than in our own data centers, which allows us to scale up much more easily as demand grows. Moreover, we can host different parts of our database in different regions, which helps us meet different compliance and regulatory requirements.
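To illustrate what streaming ingestion looks like in practice, here is a minimal sketch using the `google-cloud-bigquery` Python client. The table path, row schema, and function names are hypothetical examples, not SamKnows' actual schema or code:

```python
from datetime import datetime, timezone


def build_row(unit_id, metric, value, ts=None):
    """Shape one measurement as a JSON-serializable row for a streaming insert.
    The field names here are illustrative, not a real schema."""
    ts = ts or datetime.now(timezone.utc)
    return {
        "unit_id": unit_id,
        "metric": metric,
        "value": value,
        "measured_at": ts.isoformat(),
    }


def stream_rows(rows, table="my-project.measurements.results"):
    """Stream rows into a BigQuery table.
    Requires the google-cloud-bigquery package and valid credentials."""
    from google.cloud import bigquery

    client = bigquery.Client()
    # insert_rows_json returns a list of per-row errors (empty on success)
    errors = client.insert_rows_json(table, rows)
    if errors:
        raise RuntimeError(f"Streaming insert failed: {errors}")
```

Rows streamed this way become queryable within seconds, which is what removes the old batch-load delay.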

How does this data transition work?

The process of moving data is a delicate one. We were dealing with 14 years of data! It was also important not to cause any downtime on SamKnows One as we moved from one service to the other, so we had to make sure the new pipeline was working properly before we stopped using Presto. The process went like this:

  1. Move the old data
  2. Update the pipeline, so that we send the new data to BigQuery
  3. Make sure that measurement data flows all the way through the pipeline
  4. Change the SamKnows One application
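One common way to run steps 2 and 3 safely is a dual-write stage: while migrating, each new measurement goes to both the old and the new store, so the new path can be verified before the application switches over. The sketch below is illustrative only; the class and sink names are not SamKnows' actual code:

```python
class ListSink:
    """A stand-in for a real storage backend, used here for illustration."""

    def __init__(self):
        self.rows = []

    def write(self, row):
        self.rows.append(row)


class DualWriter:
    """Write each measurement to both the old and new sinks during cutover,
    so the two stores can be compared before the application switches over."""

    def __init__(self, old_sink, new_sink):
        self.old_sink = old_sink
        self.new_sink = new_sink
        self.new_sink_failures = 0

    def write(self, row):
        # The old path stays authoritative until verification passes,
        # so a failure on the new path must never block the write.
        self.old_sink.write(row)
        try:
            self.new_sink.write(row)
        except Exception:
            self.new_sink_failures += 1
```

Once the two stores agree and the failure count stays at zero, the application can be pointed at the new store and the old path retired.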

Once the change went through, end users shouldn't have noticed any difference in what was presented on their dashboard, other than the fact that test results were available in SamKnows One much faster. After the switch, the time between a measurement being taken and it appearing on your dashboard dropped from 30 minutes to around 5 seconds.

Our new data pipeline

Real-time data

By re-architecting the data pipeline to remove batch processing, we can now stream data with almost no delay between a measurement being taken and the results being available to our customers in SamKnows One. Real-time data has allowed us to develop new products such as FaultFinder, which automatically alerts ISPs to any anomaly in their performance data that's impacting customer experience. This helps ISPs fix faults before a customer even notices there's a problem, enhancing customer experience and reducing support costs. Our Realspeed test compares performance at the home router with performance on a device over Wi-Fi. Real-time data lets a customer care agent see user-initiated test results instantly, so they can troubleshoot home Wi-Fi problems with customers over the phone and resolve calls faster.

Disconnection chart in ConstantCare

The new Disconnection chart in ConstantCare displays an up-to-the-second view of uptime or downtime on the connection, essential for quickly validating connectivity problems with your customers' connections. These are just a few features we’ve built recently that rely on a real-time flow of data, but it will undoubtedly be a key component of more future products from SamKnows.
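As a rough illustration of how an up-to-the-second disconnection view can be derived, here is a sketch that flags gaps between device heartbeats as downtime. The heartbeat model, function name, and 30-second threshold are assumptions for the example, not the actual ConstantCare implementation:

```python
from datetime import datetime, timedelta


def find_disconnections(heartbeats, gap=timedelta(seconds=30)):
    """Given chronologically sorted heartbeat timestamps from a device,
    return (start, end) pairs where the interval between consecutive
    heartbeats exceeds the threshold, i.e. the periods a disconnection
    chart would render as downtime."""
    outages = []
    for prev, cur in zip(heartbeats, heartbeats[1:]):
        if cur - prev > gap:
            outages.append((prev, cur))
    return outages
```

Because the heartbeats are streamed rather than batch-loaded, the chart can reflect a disconnection within seconds of it happening.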

If you would like to learn more about real-time data and how we’re continuously improving our products, request a demo!
