Tuesday 20 February 2018

The Marriage of Data Management and Machine Learning

What do they know about each other anyway? © istock.com zamuruev
I was recently interviewed for the ACM SIGMOD Blog on the marriage of data management and machine learning, a big topic in the big data community. It is a joint interview with colleagues from Amazon, Google, San Diego and Stanford. The interview is from the point of view of the data management community, but it may still be of interest to anyone who wants to understand the synergies between ML and big data management. Thanks to Azza and Paolo for putting this together.


If you are interested, here are a couple more links on that topic:

Friday 9 February 2018

What is a Data Lake?

Let's hope that there is crystal clear water in that data lake. © istock.com mark_lipson

Several of our customers talk about "data lakes" these days. In our experience, however, the term "data lake" can be quite confusing.

Here is an attempt at a clarification.

Business data has traditionally been kept in highly structured relational databases as well as specialized analytical systems such as data warehouses. However, with the advent of big data it has become more and more difficult to manage and analyze all that data through relational databases or NoSQL systems. Data lakes therefore collect all data of a company in raw format in a central place, without enforcing schemas or performing any data cleaning or data import operations up front. Those operations are only performed as a second step. Thus, full flexibility for data alignment and analytics is preserved.

Technically, data lakes are typically implemented as a distributed file system (like HDFS), and all data belonging to a company is collected in that place. All further analysis, be it structured queries, data mining, traditional machine learning or deep learning, is then done in a structure-as-you-go or pay-as-you-go fashion.

For instance, the raw data in the data lake is distilled, cleaned, and enriched into crisp and clear information interactively, step by step, using an appropriate combination of workflows and tools. So in contrast to relational database systems, which own the data, the data in a data lake is not necessarily owned by a specific tool or system, but rather shared among different tools.
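To make the two-step idea concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: a temporary local directory stands in for the distributed file system, and the file names, the `customer` and `amount` fields, and the helpers `ingest_raw`, `read_orders` and `demo` are all hypothetical. Files are first dropped into the lake untouched; a schema and some cleaning are applied only when the data is read for analysis.

```python
import csv
import json
import tempfile
from pathlib import Path


def ingest_raw(lake: Path, name: str, payload: str) -> Path:
    """Step 1: store the payload as-is; no schema, no cleaning."""
    target = lake / "raw" / name
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(payload, encoding="utf-8")
    return target


def read_orders(lake: Path) -> list:
    """Step 2: impose a schema and clean the data only at read time."""
    rows = []
    for path in sorted((lake / "raw").iterdir()):
        if path.suffix == ".csv":
            with path.open(newline="", encoding="utf-8") as f:
                records = list(csv.DictReader(f))
        elif path.suffix == ".jsonl":
            text = path.read_text(encoding="utf-8")
            records = [json.loads(line) for line in text.splitlines() if line.strip()]
        else:
            continue  # unknown formats simply stay in the lake
        for rec in records:
            try:
                rows.append({
                    "customer": str(rec["customer"]).strip().lower(),
                    "amount": float(rec["amount"]),
                })
            except (KeyError, TypeError, ValueError):
                continue  # dirty record: skipped here, but still kept in raw form
    return rows


def demo() -> list:
    # A temporary directory plays the role of the data lake (assumption).
    with tempfile.TemporaryDirectory() as tmp:
        lake = Path(tmp)
        ingest_raw(lake, "shop_a.csv", "customer,amount\nAlice,10.5\nBob,n/a\n")
        ingest_raw(
            lake,
            "shop_b.jsonl",
            '{"customer": " ALICE ", "amount": "4.5"}\n{"customer": "Carol"}\n',
        )
        return read_orders(lake)


if __name__ == "__main__":
    print(demo())
```

Note that the raw files are never modified: a different tool could later read the same files with a different schema, which is the sense in which the data is shared rather than owned.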

The idea of a data lake is very similar to the concept of a dataspace, where data from different sources gets integrated over time. Data lakes match the typical exploratory workflow of data scientists very well, as data scientists rarely rely on managing data in relational database systems.

A risk of data lakes is a misconception about how structured the data is and how ready it is for analysis. From a technical point of view, a data lake is nothing but a distributed file system storing some of your business data. However, just collecting these files in a central place won't help much (this very insight kicked off the idea of having a central database in the 1960s!). By blindly collecting data in such a data lake, one may quickly end up with a data swamp.

This means that a data lake should be treated as a starting point for data assimilation, data cleaning and data transformation steps rather than as a final data architecture (it is not even an architecture).

There exists a myriad of techniques and tools that can help in that process.

So in summary: calling "data lakes" a technology is too big a word.