VLDB2020: Keynote Speakers
As we witness the data science revolution, each research community legitimately reflects on its relevance and place in this new landscape. The database research community has at least three reasons to feel empowered by this revolution. These have to do with the pervasiveness of relational data in data science, the widespread need for efficient data processing, and the new processing challenges posed by data science workloads beyond the classical database workloads. The first two reasons are widely acknowledged as core to the community’s raison d’être. The third reason explains the longevity of the success of relational database management systems: whenever a promising new data-centric technology surfaces, research is under way to show that it can be captured naturally by variations or extensions of the existing relational techniques. Like Star Trek’s Borg Collective co-opting the technology and knowledge of alien species, the Relational Data Borg assimilates ideas and applications from adjacent fields to adapt to new requirements and become ever more powerful and versatile. Unlike the former, the latter moves fast, has great skin complexion, and is reasonably happy. Resistance is futile in either case.
In this talk, I will make the case for a first-principles approach to machine learning over relational databases that has guided recent developments in database systems and theory. This includes theoretical developments on the algebraic and combinatorial structure of relational data processing. It also includes systems developments on compilation for hybrid database and learning workloads and on computation sharing across aggregates in learning-specific batches. Such developments can dramatically boost the performance of machine learning.
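As a toy illustration of computation sharing across a batch of aggregates, the sketch below computes the aggregates COUNT, SUM(x), SUM(y), and SUM(x*y) over the join of two relations without materializing the join. Aggregates of this kind are what a least-squares model over the joined data needs. The schema R(a, x), S(a, y) and the function names are hypothetical, chosen only for this example; this is a minimal sketch of the general idea, not the systems described in the talk.

```python
from collections import defaultdict


def aggregates_over_join(R, S):
    """Batch of aggregates over R(a, x) JOIN S(a, y), without building the join.

    Returns (COUNT, SUM(x), SUM(y), SUM(x*y)). The schema is an assumption
    made for illustration only.
    """
    # One pass per relation: per-key partial aggregates [count, sum].
    r_stats = defaultdict(lambda: [0, 0.0])
    for a, x in R:
        r_stats[a][0] += 1
        r_stats[a][1] += x

    s_stats = defaultdict(lambda: [0, 0.0])
    for a, y in S:
        s_stats[a][0] += 1
        s_stats[a][1] += y

    # Combine the partial aggregates per join key. All four results reuse
    # the same per-key counts and sums, which is the shared computation.
    count = sum_x = sum_y = sum_xy = 0.0
    for a, (rc, rx) in r_stats.items():
        if a not in s_stats:
            continue  # key has no join partner in S
        sc, sy = s_stats[a]
        count += rc * sc
        sum_x += rx * sc
        sum_y += rc * sy
        sum_xy += rx * sy
    return count, sum_x, sum_y, sum_xy


def aggregates_naive(R, S):
    """Reference version: materialize the join, then aggregate."""
    joined = [(x, y) for a, x in R for b, y in S if a == b]
    return (
        float(len(joined)),
        float(sum(x for x, _ in joined)),
        float(sum(y for _, y in joined)),
        float(sum(x * y for x, y in joined)),
    )
```

The shared version touches each input tuple once, while the naive version does work proportional to the size of the join result, which can be much larger than the inputs.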
This work is the outcome of extensive collaboration of the author with colleagues from relationalAI (https://www.relational.ai), in particular Mahmoud Abo Khamis, Molham Aref, Hung Ngo, and XuanLong Nguyen, and from the FDB research project (https://fdbresearch.github.io/), in particular Ahmet Kara, Milos Nikolic, Maximilian Schleich, Amir Shaikhha, and Haozhe Zhang.
The need for responsible data management intensifies with the growing impact of data on society. One central locus of the societal impact of data is Automated Decision Systems (ADS), socio-legal-technical systems that are used broadly in industry, non-profits, and government. ADS process data about people, help make decisions that are consequential to people’s lives, are designed with the stated goals of improving efficiency and promoting equitable access to opportunity, involve a combination of human and automated decision making, and are subject to auditing for legal compliance and to public disclosure. They may or may not use AI, and may or may not operate with a high degree of autonomy, but they rely heavily on data.
In this talk I hope to convince you that the data management community should play a central role in the responsible design, development, use, and oversight of ADS. I outline a technical research agenda and also argue that, to make progress, we may need to step outside our engineering comfort zone and start reasoning in terms of values and beliefs, in addition to checking results against known ground truths and optimizing for efficiency objectives. This seems high-risk, but one of the upsides is being able to explain to our children what we do and why it matters.
This talk will have two parts. The first part, on “out-of-order execution” algorithms, is a long-term vision that has become reality; the second part, on COVID-19 information, is the current reality that may lead to future advances. First, I will talk about the “out-of-order execution” algorithms that we have been working on for more than 10 years. The idea is simple, but it took years to understand the essence of the out-of-order execution principle. We have verified significant speedups for a variety of queries and datasets over disk-based and flash-based database systems. A practical application enabled by out-of-order execution is a healthcare data platform supporting interactive analytics on country-scale insurance claims (approximately two hundred billion records) in Japan. Out-of-order execution reduced the typical query response time from many days to a few minutes, enabling active and productive use by medical and public administration researchers to improve treatments based on outcomes from real data covering the entire country. Second, I will describe some of our recent research advances for obtaining and managing COVID-19 information in several urgently useful application areas.