Saturday 3 March 2018

A.I. will be like Electrical Current

A.I. will be like electrical current, © guvendemir

I am just returning from bitkom's Big-Data.AI Summit 2018. These were two days packed with talks about data, A.I., data science, and everything in between. The conference attracted about 1,200 people, mostly from industry, but also from academia here and there (like myself). It was a very inspiring event. Congrats to bitkom for pulling this off!

A selection of some loosely connected takeaways:
  • Wave–particle duality => A.I.-Big Data duality: A.I. and Big Data are like wave and particle in physics: they are two views on the same thing. It therefore does not make sense to talk about A.I. without talking about (Big) Data, and vice versa; the two are heavily intertwined. So it was a very good idea to host both topics at the same conference. There should be much more interaction between these two fields, ahemm, I mean "views".
  • Bla bla bla: The confusion in industry about all these buzzwords like big data, data lakes, NoSQL, machine learning, AI, data science, <you name it> is insane. This insanity is good to a certain degree (when you want to sell stuff to laymen), but also quite bad (when you are trying to understand what this is all about). As academics, our role here is to lift the fog.
  • "big data"="data lake": These days, big data is typically read as either "large data" or "something with HDFS and a data lake". So, as observed in the past 10 years already, the term "big data" is a moving target.
  • It is a looonnnngggg pipeline. People sometimes ignore that the entire pipeline required to analyze data is pretty long. This pipeline includes collecting data, cleaning data, curating data, normalizing data, managing data, selecting the right features (which is often more important than picking the right machine learning model anyway), ..., and, yes, eventually doing some fancy, or often just very old-fashioned, machine learning. But getting there may take a while.

    Do you remember why data warehousing projects typically fail? Correct, it is not due to machine learning; it is due to difficult data cleaning and integration tasks. And many machine learning problems sound to me like good old warehousing projects where the final step, the data warehouse, is replaced by some ML/AI stuff.
  • No data. Some companies don't even have the data to do a meaningful analysis. Their task is then to identify which data should be collected in the future to allow for any meaningful analysis. This is much better than doing nothing: if you don't have the right data, you can't do meaningful analysis. Can you afford to wait another year until you even have some toy data to play with?
  • A.I. will be like electrical current. Supervised machine learning allows us to learn a function f(x)=y. We train models and learn that function with some data and then test it with some other data. In production, we use f() to make some prediction, i.e., we name the function predict():=f(). So we have one function. One.

    But any piece of software has anywhere from a handful to zillions of functions. So what happens if we gradually start replacing those functions in our software? This was the topic of a very interesting workshop, which unfortunately discussed this effect only briefly. I mean, forget about the performance implications; these will be solved and are non-issues for many types of software anyway these days. What kind of software will that be where many of its parts are simply trained models? How probabilistic will that software be? What will control flow look like in such software? It will depend on the probabilistic outcome of some model. This sounds pretty scary. Or maybe not.

    So, imagine your favorite word processor (or database system or whatever) being implemented as a bunch of functions which are actually trained models. This sounds pretty nondeterministic. But hey, this might actually be an improvement over current word processors (or whatever software you have in mind), and more deterministic and robust than what we have today...

    So if you say "ML/AI will become a commodity", "ML/AI is the new oil", well, I feel this is probably not even strong enough. The atoms and molecules of software are functions. And those atoms and molecules will not necessarily be hand-crafted and coded anymore. Soon.

    This also applies to hardware. Some or many of those functions will be replaced by trained models. These functions will be everywhere and they will be used everywhere.

    Just like electrical current.
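
To make the predict():=f() idea a bit more concrete, here is a minimal sketch in plain Python. Everything in it is made up for illustration: the toy data, the k-nearest-neighbour "model", and the names train(), accuracy() and choose_layout() are assumptions, not something shown at the summit. It learns f from some data, tests it on other data, and then lets an ordinary if-statement branch on the model's output.

```python
from typing import Callable, List, Tuple

Sample = Tuple[float, int]  # (x, y) with label y in {0, 1}


def train(samples: List[Sample], k: int = 3) -> Callable[[float], float]:
    """Learn f(x): the fraction of the k nearest training points labelled 1."""
    def f(x: float) -> float:
        nearest = sorted(samples, key=lambda s: abs(s[0] - x))[:k]
        return sum(y for _, y in nearest) / len(nearest)
    return f


def accuracy(f: Callable[[float], float], samples: List[Sample]) -> float:
    """Share of samples where thresholding f(x) at 0.5 matches the label."""
    hits = sum((f(x) >= 0.5) == bool(y) for x, y in samples)
    return hits / len(samples)


# Hypothetical toy data: x is a document length, y says whether it needs
# a multi-page layout.
train_data = [(10.0, 0), (20.0, 0), (30.0, 0), (40.0, 0),
              (60.0, 1), (70.0, 1), (80.0, 1), (90.0, 1)]
test_data = [(15.0, 0), (35.0, 0), (65.0, 1), (85.0, 1)]

predict = train(train_data)            # in production: predict() := f()
test_accuracy = accuracy(predict, test_data)


def choose_layout(doc_length: float) -> str:
    """Ordinary control flow whose branch is decided by a model output."""
    if predict(doc_length) >= 0.5:
        return "multi-page"
    return "single-page"
```

Note that choose_layout() looks like any other hand-written function from the outside; only the branch condition comes from a trained model. A piece of software made of many such functions is what the questions above are about.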

Tuesday 20 February 2018

The Marriage of Data Management and Machine Learning

What do they know about each other anyway? © zamuruev
I was recently interviewed for the ACM SIGMOD Blog on the marriage of data management and machine learning, which is a big topic in the big data community. This is a joint interview with colleagues from Amazon, Google, San Diego and Stanford. The interview is from the point of view of the data management community, but may still be of interest to anyone who wants to understand the synergies of ML and big data management. Thanks to Azza and Paolo for putting this together.

If you are interested, here are a couple more links on that topic:

Friday 9 February 2018

What is a Data Lake?

Let's hope that there is crystal clear water in that data lake. © mark_lipson

Several of our customers talk about "data lakes" these days. In our experience, however, the term "data lake" can be quite confusing.

Here is an attempt at a clarification.

Business data has traditionally been kept in highly structured relational databases as well as specialized analytical systems such as data warehouses. However, with the advent of big data it is becoming more and more difficult to manage and analyze all that data through relational databases or NoSQL systems. Therefore, data lakes collect all data of a company in raw format in a central place, without initially enforcing schemas or any other data cleaning or data import operations. Those operations are only performed as a second step. Thus, full flexibility for data alignment and analytics is preserved.

Technically, data lakes are typically implemented as a distributed file system (like HDFS), and all data belonging to a company is collected in such a place. All further analysis, be it structured queries, data mining, traditional machine learning or deep learning, is then done in a structure-as-you-go or pay-as-you-go fashion.

For instance, the raw data in the data lake is distilled, cleaned, and enriched into crisp and clear information interactively, step by step, using an appropriate combination of workflows and tools. So, in contrast to relational database systems (which own the data), in a data lake the data is not necessarily owned by a specific tool or system, but rather shared among different tools.
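
As a rough illustration of this two-step idea, here is a small Python sketch. It is only a toy under stated assumptions: a temporary local directory stands in for the distributed file system, and the file names, the "customer"/"amount" fields, and the functions ingest() and read_orders() are all hypothetical. Step 1 stores raw files as-is; step 2 applies structure and cleaning only when the data is read.

```python
import csv
import json
import tempfile
from pathlib import Path


def ingest(lake: Path, name: str, raw: str) -> Path:
    """Step 1: store the raw content as-is; no schema, no cleaning."""
    path = lake / name
    path.write_text(raw, encoding="utf-8")
    return path


def read_orders(lake: Path):
    """Step 2: impose structure and clean the data only when reading."""
    for path in sorted(lake.iterdir()):
        if path.suffix == ".csv":
            with path.open(newline="", encoding="utf-8") as fh:
                rows = list(csv.DictReader(fh))
        elif path.suffix == ".jsonl":
            text = path.read_text(encoding="utf-8")
            rows = [json.loads(line) for line in text.splitlines() if line.strip()]
        else:
            continue  # unknown format: stays in the lake untouched
        for row in rows:
            try:
                yield {"customer": str(row["customer"]).strip().lower(),
                       "amount": float(row["amount"])}
            except (KeyError, ValueError, TypeError):
                continue  # record does not fit the schema we ask for here


with tempfile.TemporaryDirectory() as tmp:
    lake = Path(tmp)  # stand-in for something like an HDFS directory
    ingest(lake, "shop_a.csv", "customer,amount\nAlice,10.5\nBob,n/a\n")
    ingest(lake, "shop_b.jsonl",
           '{"customer": " alice ", "amount": 4}\n{"customer": "Carol"}\n')
    orders = list(read_orders(lake))
```

The raw files are never modified, so another tool could read the same files with a different schema. That is the sense in which no single system owns the data.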

The idea of a data lake is very similar to the concept of a dataspace, where data from different sources gets integrated over time. Data lakes match the typical exploratory workflow of data scientists very well, as they rarely rely on managing data in relational database systems.

A risk of data lakes is the misconception about the degree of structuredness and readiness for data analysis. From a technical point of view, a data lake is nothing but a distributed file system storing some of your business data. However, just collecting these files in a central place won't help much (in fact, this insight kicked off the idea of having a central database in the 1960s!). By blindly collecting data in such a data lake, one may quickly end up with a data swamp.

This means that a data lake should be treated as a starting point for data assimilation, data cleaning, and data transformation steps rather than as a final data architecture (it is not even an architecture).

There exists a myriad of techniques and tools that can help in that process.

So in summary: calling "data lakes" a technology is too big of a word.