The Future of Data Management


I recently posed a question to colleagues who I believe have strong product vision. The basic question was:

"What does the Data Management platform of the future look like? What will be the 'norm' in 5 years?".

The question is muddled to start with, because Data Management is a crowded area: Business Intelligence, Data Warehousing and Data Lakes all fall under this banner. The good thing was that most of my colleagues knew what I was really asking: "What are the new fields of Data Management that will revolutionize data, rather than just 'what does a better Data Warehouse look like?'"

The interesting part was that all of us roughly aligned on the same vision, including well-respected analysts in the field. The future is exciting, to say the least. Not to discount incremental innovation - you can already presume there will be a faster Data Warehouse and a more scalable Master Data Management platform - but there will also be a fundamentally new piece of the puzzle: the Data Fabric. It will remove the majority of the manual work we do in data projects today and will activate data on a fundamentally new level.

Before we dive into the future of data management, we have to pay homage to the initiators. I believe the first obvious initiator of the push towards a data revolution was the cloud. It is not only the scale, but also the almost "app store"-like ease of bringing components together, e.g. "You want BI? Great, click here and tell me where to get the data from. Now you have BI."

Another initiator was streaming. The simple idea was to make a fundamental pivot in the industry (one which will still take years to establish): that data is pushed, instead of pulled. Why is this the right innovation? Because it works better with the cloud story, in that we get more consistent and predictable processing rather than staged and 'choppy' processing. Systems will need to be rearchitected to take full advantage of streaming, but this is the innovation we needed.
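
To make the pivot concrete, here is a minimal sketch of the two models in Python, using an in-memory queue as a stand-in for a real streaming backbone (the names and timings are purely illustrative):

```python
import queue
import threading
import time

# Pull model: a scheduler wakes up, drains whatever has accumulated,
# then goes back to sleep -- staged, 'choppy' processing.
def pull_batch(buffer: list) -> None:
    batch, buffer[:] = buffer[:], []
    print(f"pull: processed a batch of {len(batch)} records")

# Push model: each record is handed to the consumer the moment it
# arrives, giving steady, predictable processing.
def push_consumer(events: queue.Queue) -> None:
    while (record := events.get()) is not None:  # None is a stop sentinel
        print(f"push: processed {record} on arrival")

if __name__ == "__main__":
    # Pull: records pile up, then are processed all at once.
    buffer = [{"id": i} for i in range(3)]
    pull_batch(buffer)

    # Push: the producer drives the consumer directly.
    events: queue.Queue = queue.Queue()
    worker = threading.Thread(target=push_consumer, args=(events,))
    worker.start()
    for i in range(3):
        events.put({"id": i})
        time.sleep(0.1)
    events.put(None)
    worker.join()
```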

The other innovation is that some vendors are taking the approach of "give me data and I will tell you what to do with it". Power BI has this with its insights, DataRobot does it by suggesting which ML templates to run on your data, and ThoughtSpot has also been a key player in this market. These systems aren't perfect, and more innovation is needed, but they are getting closer to enabling the future of data management.
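
As a flavour of what "give me data and I will tell you what to do with it" means in practice, here is a toy heuristic. It is not how any of these vendors actually work; the rules and thresholds are invented purely for illustration:

```python
from datetime import date

def suggest_actions(rows: list[dict]) -> list[str]:
    """Inspect a dataset and propose next steps, in the spirit of
    'give me data and I will tell you what to do with it'."""
    if not rows:
        return []
    sample = rows[0]
    numeric = [k for k, v in sample.items() if isinstance(v, (int, float))]
    temporal = [k for k, v in sample.items() if isinstance(v, date)]
    categorical = [k for k, v in sample.items() if isinstance(v, str)]

    suggestions = []
    if temporal and numeric:
        suggestions.append(f"Trend chart: plot {numeric[0]} over {temporal[0]}")
    if categorical and numeric:
        suggestions.append(f"Breakdown: aggregate {numeric[0]} by {categorical[0]}")
    if len(numeric) >= 2:
        suggestions.append(f"ML template: predict {numeric[-1]} from {numeric[:-1]}")
    return suggestions

rows = [{"region": "EU", "month": date(2021, 1, 1), "revenue": 120.0, "cost": 80.0}]
for s in suggest_actions(rows):
    print(s)
```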


So what is the future of data management then? 

The future of data management is a radical simplification of the orchestration of data across your business: no more ETL, no more modelling, and a mesh of services that constantly work on your data. I spoke to an analyst recently who said something poignant: "all data has joins, it's just about identifying them". I strongly agree, and this IS the future of data management - connections. The result is an organic, self-evolving network of relationships, enrichment and enablement. The good news is that this is something that can technically be achieved already today.
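
A minimal sketch of what "identifying the joins" could look like: score candidate column pairs by how much of one column's values appear in the other. The sample tables, the 0.8 threshold and the containment metric are all illustrative assumptions:

```python
def candidate_joins(left: list[dict], right: list[dict],
                    threshold: float = 0.8) -> list[tuple[str, str, float]]:
    """Find likely join keys between two tables by value overlap."""
    left_cols = {k: {row[k] for row in left} for k in left[0]}
    right_cols = {k: {row[k] for row in right} for k in right[0]}
    matches = []
    for lname, lvals in left_cols.items():
        for rname, rvals in right_cols.items():
            containment = len(lvals & rvals) / len(lvals)
            if containment >= threshold:
                matches.append((lname, rname, containment))
    return sorted(matches, key=lambda m: -m[2])

orders = [{"order_id": 1, "cust": "C1"}, {"order_id": 2, "cust": "C2"}]
customers = [{"id": "C1", "country": "DE"}, {"id": "C2", "country": "AU"}]
print(candidate_joins(orders, customers))
# [('cust', 'id', 1.0)] -- the hidden join, discovered from the data itself
```

A real Fabric would do this continuously, across every dataset it can see, which is what turns isolated tables into that self-evolving network of relationships.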

But how do you achieve this? Isn't this all science fiction? No. It does require an evolution in the way services discover each other, but fundamentally it can and will work.

The complexity in building this future is also the pace of change. What works today won't work tomorrow, because of the rate of innovation. It is on the Vendor to keep up, and that is doable.

In my opinion, the modern data management platform will soon have to include:

  1. Cloud Empowered - not just that it runs in the cloud, but that it thrives on the cloud.
  2. A Native Polyglot Backbone, i.e. you don't have to choose the underlying infrastructure. It can also adapt to your stack, e.g. use SQL in Azure, the respective database in AWS, etc. The Fabric should orchestrate the best bang for buck and whatever is necessary to achieve your goals. For example, if it believes it can service queries faster using a Time Series database, it will discover that possibility, ask services to provide that function and self-optimise for you.
  3. Extremely Easy Integration with Cloud Products, e.g. so damn easy to bring in Logging, Databricks, Metrics and future cloud products, even across cloud providers. (You might be asking: then why even give me a choice, just make it SaaS. Yes, I think this will happen too, but it requires you to be a huge Vendor for customers to trust handing data to a service outside their tenant. Salesforce can do it; most Vendors can't.)
  4. Active Metadata, or data that never stops working for you. It is crazy that we run data jobs to move data into a central place and then it just sits there, waiting for you to do something with it.
  5. Testable Data Infrastructure, i.e. DataOps: data that can be rolled back, version-controlled, deployed and covered with tests.
  6. Self Modelling, i.e. we know how to model for scale, so why force that onto users? Why can't the platform tell the user "you would be best to do this on Databricks", or at least make the suggestion? Why can't we self-optimise indexes based on what is queried? (A minimal sketch of this idea follows the list.)
  7. A Simplified Data Integration story, i.e. too many of our customers still have the CDC, Batch, Push, Replication discussion. I understand why, and we will still talk about it for some time - but this will go away. We will eventually just think about "getting data", not how.
  8. Data Platforms that speak back to the user, i.e. "I can do this, shall I?", or "I have detected Databricks, want to send data there?"
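
To make point 6 concrete, here is a minimal sketch of workload-driven self-optimisation: watch which columns are filtered on, and suggest an index once a column is queried often enough. The query log, the naive parsing and the threshold are all illustrative assumptions:

```python
from collections import Counter

def suggest_indexes(query_log: list[str], min_hits: int = 3) -> list[str]:
    """Suggest indexes for columns that keep appearing in WHERE clauses --
    self-optimisation driven by the workload, not by a human modeller."""
    hits: Counter = Counter()
    for sql in query_log:
        _, _, predicate = sql.lower().partition(" where ")
        for clause in predicate.split(" and "):
            column = clause.split("=")[0].strip()
            if column:
                hits[column] += 1
    return [f"CREATE INDEX idx_{col} ON t ({col})"
            for col, count in hits.items() if count >= min_hits]

log = ["SELECT * FROM t WHERE country = 'DE'",
       "SELECT * FROM t WHERE country = 'AU' AND city = 'Sydney'",
       "SELECT * FROM t WHERE country = 'US'"]
print(suggest_indexes(log))
# ["CREATE INDEX idx_country ON t (country)"]
```

The same pattern extends to point 8: instead of silently creating the index, the platform would ask "I can make these queries faster, shall I?".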

Essentially, there is a need for a mesh of services that can actively look inside your data and discover the possibilities. The challenge is that most services are not set up to preview their results (DataRobot is a good example of one that does). This needs to happen so that a mesh of services can work on your data without you prescribing the data. This would obviously be opt-in; alternatively, if Vendors can provide their service in a way that sandboxes your data, it can technically be achieved without giving away the keys to the castle.
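
A sketch of what that sandbox idea could look like: the external service only ever sees a small, masked sample, and returns a preview of what it could do with the full dataset. Everything here (the masking rule, the sample size, the service itself) is a hypothetical stand-in:

```python
import random

def sandbox_sample(rows: list[dict], sensitive: set[str], n: int = 2) -> list[dict]:
    """Hand a service a small, masked sample instead of the keys to the castle."""
    sample = random.sample(rows, min(n, len(rows)))
    return [{k: ("***" if k in sensitive else v) for k, v in row.items()}
            for row in sample]

def preview_service(sample: list[dict]) -> str:
    """A stand-in for an external service previewing what it could offer."""
    columns = ", ".join(sample[0]) if sample else "nothing"
    return f"I can profile and enrich these columns: {columns}. Shall I?"

rows = [{"email": "a@x.com", "country": "DE"},
        {"email": "b@y.com", "country": "AU"}]
print(preview_service(sandbox_sample(rows, sensitive={"email"})))
```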

As for an example of data never standing still: imagine you have a company record with a value for the country. We can easily enrich this from other sources; from it we can find other ways to uniquely identify the country, e.g. ISO codes and abbreviations. We can also gain the population, capital city and main industries. From those industries, you can gain the top 10 companies in each industry and continue to follow that trail. You might argue that this brings a lot of noise into the situation, but you don't need to surface this information - rather, use it as a background network of data. This is an example of the idea that all data has joins, even something as simple as a country. It is also the concept that data shouldn't sleep; it should be constantly working for you.
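
Here is that country example as a minimal sketch. The hard-coded lookup tables are hypothetical stand-ins for real external enrichment sources; the point is the chain, where each enrichment opens the door to the next:

```python
# Hypothetical stand-ins for external enrichment sources.
COUNTRIES = {
    "Germany": {"iso2": "DE", "iso3": "DEU", "capital": "Berlin",
                "industries": ["automotive", "chemicals"]},
}
TOP_COMPANIES = {"automotive": ["Volkswagen", "Mercedes-Benz", "BMW"]}

def enrich_company(record: dict) -> dict:
    """Follow the trail from a single country field: ISO codes, capital,
    industries, then the top companies in those industries. The result
    is kept as a background network of data, not surfaced wholesale."""
    country = COUNTRIES.get(record.get("country", ""), {})
    network = dict(record)
    network["country_codes"] = [country.get("iso2"), country.get("iso3")]
    network["capital"] = country.get("capital")
    network["related_companies"] = {
        industry: TOP_COMPANIES.get(industry, [])
        for industry in country.get("industries", [])
    }
    return network

print(enrich_company({"name": "Acme GmbH", "country": "Germany"}))
```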

Now is the era where the data management field needs an "orchestrator". We have been given tools, frameworks and platforms - there is now the need for the orchestrator, and that is the Data Fabric. Microservices are a good analogy: they had their advantages, but with every advantage you typically take on disadvantages, and the composition of those services became the hard part. This is exactly the state we are now in with data. Composition is the hardest part, we need strong composition, and that is the idea of the Data Fabric. It is not hard to envisage; in fact, the DevOps world is showing the path. Kubernetes is the orchestration of infrastructure and services; the Data Fabric is the orchestration of data within those services.

Fast forwarding to 10 years in the future of data management: the pie-in-the-sky vision is complete convergence. Data Lake, Data Warehouse and Data Fabric all blend into one product. This will establish what companies will define as their Data Foundation. I will keep it short on the 10-year vision, as that is too far off to give details.