The 3 V’s of Big Data: Velocity Remains A Challenge for Many

Do you remember the early days of Big Data and the three V’s: Volume, Variety, and Velocity? Some people got a bit carried away with the alliteration and came up with even longer lists, but those three were at the heart of most of them. Data management vendors have made considerable efforts to tackle those problems because they were, and still are, a driving factor in new projects for their customers and prospects (and thus wonderful revenue opportunities for the vendors). But while many companies have succeeded in overcoming their volume and variety challenges, quite a few still contend with velocity issues.

Big Data Volume and Variety have been addressed

The “Big” in Big Data started out meaning “large” (the first V, Volume), which really meant “too large for us to adequately handle right now”. A massive amount of new data was being generated by and on the Internet, and the thinking was (and still is) that it should all be stored for analysis later. Storage keeps getting bigger, improving in both total capacity and data density, so dealing with volume didn’t require much of a shift in technology.

Variety was a bit more of a challenge – it required a new way of thinking. Much of the new data being generated was not structured (relational) data that fit easily into existing relational database management systems; it included unknown types of data or unstructured data. Companies wanted a way to store it all very quickly, without first analyzing types or performing the transformations that would have let it fit into their RDBMSs. That problem was addressed by creating new kinds of databases – non-relational, or NoSQL, databases – that could store any data, especially unstructured data, very quickly. But that introduced another problem, which I’ll touch on in the next section.
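To make that concrete, here is a minimal sketch of schema-less ingestion, assuming a local MongoDB instance and the pymongo driver (my choice for illustration, not something prescribed above). The point is that records with completely different shapes can be written to the same collection immediately, with no upfront modeling or transformation.

```python
# A minimal sketch of schema-less ingestion, assuming a MongoDB instance
# reachable on localhost and the pymongo driver installed. Documents with
# completely different shapes land in the same collection without any
# upfront schema definition or transformation step.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
events = client["bigdata_demo"]["events"]  # hypothetical database/collection names

# Three "events" with different structures -- none of them had to be modeled first.
events.insert_one({"type": "click", "user_id": 42, "page": "/pricing"})
events.insert_one({"type": "sensor", "device": "thermostat-7", "temp_c": 21.5, "tags": ["hvac"]})
events.insert_one({"type": "log_line", "raw": "GET /api/v1/items 200 12ms"})
```

In a relational system, each of those new shapes would have required a schema change or a transformation step before the write could happen; here the write just happens.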

Big Data Velocity

Big Data Velocity has been the most challenging V to conquer, and it remains a hurdle for many companies. It is especially tricky, and important, since it has a compounding effect on the other Vs. Storing large amounts of data isn’t necessarily a big challenge for some companies, but storing it as quickly as it arrives – and, more importantly, being able to analyze it in real time the moment it arrives, rather than in the typical batch jobs that churn through terabytes or petabytes of data over hours or even days – is still just a hope and a dream for many.

By creating NoSQL data stores to handle lots of data and different types of data (volume and variety), companies like Google, Meta (formerly Facebook), and LinkedIn built tools that were fast enough to ingest lots of different data quickly – the Big Data trifecta, all three V’s – but something had to be sacrificed. Most of the latency was really just moved downstream, reappearing when someone wanted to make use of that data, perhaps by analyzing it. Now they couldn’t analyze it quickly or easily, as they could with their relational data, because no metadata (schema) had been stored to tell them exactly what each piece of data was and where it was. They had prioritized storing the data quickly over retrieving and analyzing it quickly, so they were willing to sacrifice the metadata.

The key takeaway is that these new NoSQL data stores did not eliminate latency; they moved it further downstream in the data processing pipeline, to the point of analysis. The tradeoff for extremely fast ingestion was analysis that was slow and not real-time.
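Here is a small, illustrative sketch of what that downstream latency looks like in practice. The file and field names are made up; the point is that without a stored schema, every analytical question has to scan, parse, and interpret the raw records at read time – work a relational system would have done once, at ingestion.

```python
# Illustrative sketch of "latency moved downstream": with no stored schema,
# every analytical question must scan and interpret the raw records at read
# time. The file name and field names are invented for the example.
import json

def average_temperature(path: str) -> float:
    """Scan a newline-delimited JSON file and average a field that may be
    missing, named differently, or stored as a string -- all of which must
    be discovered while reading, not at write time."""
    total, count = 0.0, 0
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            value = record.get("temp_c", record.get("temperature"))
            if value is None:
                continue  # schema-on-read: silently skip records we can't interpret
            try:
                total += float(value)
                count += 1
            except (TypeError, ValueError):
                continue
    return total / count if count else float("nan")

# In a relational system, the type checks and field resolution happened once,
# at ingestion; here they happen on every scan, for every query.
```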

Why is speed (velocity) so critical in data management?

  • People don’t like to wait. If your application takes too long to respond, users may simply stop using it and find a competitor with faster response times.
  • Less time on one thing leaves more time for others. If a multi-step process is barely meeting a critical SLA, cutting some time out of any one of the steps gives you more leeway to meet that SLA. And many of today’s business services are measured against SLAs – violations of which can be very costly.
  • Faster speeds give you time to iterate. Sometimes actions are best guesses, and responses are needed to fine-tune the action in subsequent steps. Think of a data scientist modeling data for a new algorithm: the process involves extracting data from a source system, developing a model, testing the model against a separate test set of data, fine-tuning it, testing again, and so on. The faster each of those steps can be, the more accurate the data scientist can make the model in a given amount of time (a quick back-of-the-envelope sketch follows this list). Take the data scientist out of the loop, as many new machine learning applications now do, and you get similar benefits from algorithms that can run and improve more quickly.
  • Your company wants to make its actions, like sales and marketing activities, more accurate (and valuable), which may lead you to add even more data sources to your processes. Initiatives like 360° Customer Views pull together multiple disparate data sources, each of which may add a little data that improves the resolution of your company’s view of the customer. That, in turn, makes it possible to offer better, more timely promotions or to improve the customer’s satisfaction when they have a problem that needs resolving. Adding new data sources means more data coming at you that must be handled in the same amount of time, increasing the required speed of data ingestion.
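As promised in the iteration bullet above, here is a back-of-the-envelope sketch of how cycle time translates into tuning iterations. All of the numbers are invented for illustration.

```python
# Back-of-the-envelope sketch of the model-tuning bullet above: with a fixed
# working day, cutting the extract/train/test cycle time directly multiplies
# how many tuning iterations fit. All numbers are invented for illustration.
time_budget_hours = 8.0

for cycle_hours in (2.0, 1.0, 0.5):
    iterations = int(time_budget_hours // cycle_hours)
    print(f"{cycle_hours:>4} h per cycle -> {iterations} tuning iterations per day")

# Prints roughly:
#  2.0 h per cycle -> 4 tuning iterations per day
#  1.0 h per cycle -> 8 tuning iterations per day
#  0.5 h per cycle -> 16 tuning iterations per day
```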

What is driving this “need for speed”?

The World is going “Ops”

The data management world is in transition due to velocity. Companies are driven to improve their top and bottom lines – increasing revenues and decreasing costs – which implies increasing efficiency. Time is a critical factor in efficiency (do more, or do better, with fewer resources in less time). Following on, information technology (IT) groups are driven to improve the time to value and the return on investment of their projects, matching similar efforts in other parts of the company, like engineering/development. It isn’t sufficient for a company to develop a new application in the vacuum of a pristine, fully controllable, never-failing development environment and then throw it over the wall to the production people in IT to make it work in the real world (which is neither fully controllable nor free of failures).

DevOps, MLOps, AIOps, DataOps. Ops, of course, stands for operations. Development Operations (DevOps) strives to have all code and applications being developed transition easily and seamlessly from development and test into production. Agile development breaks the long, slow waterfall development process into much more manageable, modular function blocks that can be worked on and deployed independently, without the need for long upgrade processes and downtime. DevOps has become so popular that it has spawned Ops variants across the spectrum of IT and data management functions – broadly ITOps and DataOps, and more specifically things like AIOps and MLOps.

The Cloud

The Cloud is a contributing factor in the operationalization of IT. Applications deployed in the cloud need to stay up and running, even while new functionality is added or bugs are fixed. Perhaps more importantly, deployment infrastructures like Kubernetes are used to let these applications scale up and down seamlessly, so IT teams need to make sure that all of their applications are “cloud-native” and can be properly managed by automated systems like Kubernetes.

Result: The Rise of Real-Time

Acting in real time means acting in the moment, reacting to newly created data (fresh data) as soon as possible. To achieve it, we need to examine every source of latency in our current systems. We have been writing a lot about real-time because we think we offer a solution that enables true real-time operations, but we see many companies still stuck at “near real-time” since that is the best that past technologies were capable of.

What can you do?

Simplify your solutions

Complex IT architectures can hide latency. Each interface, each dependency on a previous component, adds a potential time sink. Reexamine the overall architecture to determine whether there are ways to simplify or streamline it. Perhaps one of the components has gained new capabilities or functionality since the system was originally designed, making another component redundant; eliminating it simplifies the solution, removes interfaces, reduces latency and thus improves velocity.
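A toy example of how those interfaces add up – the component names and millisecond figures below are entirely hypothetical, but they show why eliminating a redundant hop pays off directly in end-to-end latency.

```python
# Toy illustration of how per-hop latencies add up across a pipeline. The
# component names and millisecond figures are invented; the point is that
# removing a now-redundant hop shrinks end-to-end latency directly.
current_pipeline_ms = {
    "ingest gateway": 15,
    "staging queue": 40,
    "transform service": 120,
    "legacy enrichment step": 200,   # candidate for elimination
    "analytics store write": 60,
}

simplified = {k: v for k, v in current_pipeline_ms.items() if k != "legacy enrichment step"}

print("current:   ", sum(current_pipeline_ms.values()), "ms end to end")
print("simplified:", sum(simplified.values()), "ms end to end")
# current:    435 ms end to end
# simplified: 235 ms end to end
```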

Reexamine and challenge your assumptions, “standards”, and defaults

Sometimes we do what we’ve always done and use what we’ve always used. Companies often have standard platforms or tools that their development teams “must” use. There are good reasons for standards, like making sure platforms and tools are reliable, scalable, and secure through rigorous certification processes. But those standards can put severe limitations on the capabilities of new systems. If you assume you must use your company’s standard relational database system, your solution will look very different than if you could use a non-relational database or a message queue.
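As a sketch of how much the “default” shapes a design, here is the same event written two ways: through a standard relational path with a declared schema, and handed off to a queue for downstream consumers to interpret later. sqlite3 and Python’s queue.Queue are stand-ins chosen for this example; a real deployment would use the corporate RDBMS and an actual message broker.

```python
# A contrast sketch of how the "default" shapes a design: the same event
# written through a relational path versus handed to a queue for downstream
# consumers. sqlite3 and queue.Queue are stand-ins for illustration only.
import json
import queue
import sqlite3

event = {"user_id": 42, "action": "checkout", "amount": 99.95}

# Relational default: schema declared up front, write blocks until committed.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE events (user_id INTEGER, action TEXT, amount REAL)")
db.execute("INSERT INTO events VALUES (?, ?, ?)",
           (event["user_id"], event["action"], event["amount"]))
db.commit()

# Queue-first alternative: the producer just hands off the raw event and moves
# on; consumers decide later how (and whether) to structure it.
events_queue: "queue.Queue[str]" = queue.Queue()
events_queue.put(json.dumps(event))
```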

If velocity remains a challenge for your company, reexamine your existing solutions. Are you still using the fastest, most efficient data pipeline? Are all of the components still the fastest in their categories? Or are they fast enough? Build in periodic system reviews so these kinds of assessments can be made. Other factors are just as important as speed (reliability, for example), so the fastest isn’t always the best, but it is good to be aware of possible performance improvements.

Technology changes quickly, and our aversion to changing our beloved system designs can blind us to the availability of better, faster, more efficient components. On the other hand, it is better to use as much of what is in place now as possible, tweaking rather than replacing. Ripping and replacing is fraught with risk and often unnecessary – your current production system is working (I assume), so it is usually better to figure out how to make it better than to implement a brand-new, unproven system.

Think in terms of “Best If Used By” for your new data

And finally, keep in mind that new, “fresh” data goes stale very quickly – the value of that data drops dramatically minutes, seconds, or sometimes even milliseconds after it is first produced. After that, it just gets added to the massive collections of old data that companies store and, most likely, will never use.
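One way to reason about this “best if used by” window is to model the business value of an event as decaying with age. The exponential shape and the 30-second half-life below are assumptions for illustration, not measured numbers.

```python
# A hedged sketch of the "best if used by" idea: treat the business value of a
# fresh event as decaying over time. The exponential shape and the 30-second
# half-life are assumptions for illustration, not a measured curve.
def remaining_value(age_seconds: float, half_life_seconds: float = 30.0) -> float:
    """Fraction of the original value left after a given age."""
    return 0.5 ** (age_seconds / half_life_seconds)

for age in (0, 5, 30, 120, 600):
    print(f"after {age:>3} s: {remaining_value(age):.1%} of the value remains")
# after   0 s: 100.0% of the value remains
# after   5 s: 89.1% of the value remains
# after  30 s: 50.0% of the value remains
# after 120 s: 6.2% of the value remains
# after 600 s: 0.0% of the value remains
```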

In the realm of real-time data, you “use it or lose it” when it comes to the opportunity to make the right offer to the right person at the right time. That fresh new data is what tells you and your system when the time is right, but it is a fleeting moment that you need to be able to react to quickly. Don’t squander the opportunity – make sure your systems can support acting on real-time data. Seeing how much data is being wasted (not acted upon) and how much value is being lost might help you justify new systems that can overcome your company’s data velocity challenges.

Here’s some other velocity and real-time content we’ve created recently that you may find interesting: