How to enable and ensure the responsible use of data?
Responsible Data Science
Using data in a responsible manner is an integral part of any Digital Society. Responsible data science promotes best practices that maximize the availability of high quality data while limiting the potential for misuse that could erode fundamental rights and undermine public trust in digital technologies. Responsible data scientists take steps to make the data they depend on findable, accessible, interoperable and reusable (FAIR) while ensuring the fairness, accuracy, confidentiality and transparency (FACT) of the algorithms and tools they create. Furthermore, their work must be placed in the context of broader social, legal, and ethical aspects. These and related challenges are addressed in the program line on Responsible Data Science.
Key activities of the Responsible Data Science theme target:
- Reliable and trustworthy approaches for data engineering, data management, data science and modern machine learning and AI, interoperability and reusability
- Human-empowering solutions for the digital society
- Algorithmic fairness, transparency, and explainability
- The social, legal, and ethical aspects of responsible data science
The research activities from the postdocs address different aspects from the FACT and FAIR principles, and their social, ethical and regulatory as well as legal embedding, in different contexts, to provide a fertile breeding ground for new research activities and collaborations and for broad impact. The disciplines and skills that are employed and advanced in this program line include: machine learning, artificial intelligence, data science, natural language processing, human computation, bioinformatics, law, ethics and critical data studies.
Academics that are working on finding solutions to societal challenges related to Responsible Data Science:
I am a Distinguished Professor of Data Science at Maastricht University. My research focuses on the development of computational methods for the responsible use and scalable integration of FAIR (Findable, Accessible, Interoperable and Reusable) data and services. My group combines semantic web technologies with machine learning and network analysis for drug discovery and personalized medicine. I also lead a new inter-faculty Institute of Data Science at Maastricht University whose focus is to bring together science, technology, and social, legal and ethical aspects to strengthen communities, accelerate scientific discovery, and improve health and well-being.
I seek to develop a transdisciplinary research and education programme that examines how data science and artificial intelligence can best be harnessed to tackle pressing issues in an increasingly digital society. A key part of my work will be to study data science methods to enhance multi-disciplinary collaboration, to create new streams of interdisciplinary education, and to identify effective means by which responsible data science research and innovation can be more tightly coupled for the benefit of a data science savvy society.
Frank van Harmelen is a professor in Knowledge Representation & Reasoning in the Computer Science department (Faculty of Science) at the Vrije Universiteit Amsterdam. Since 2000, he has played a leading role in the development of the Semantic Web, which aims to make data on the web semantically interpretable by machines through formal representations. He was co-PI on the first European Semantic Web project (OnToKnowledge, 1999), which laid the foundations for the Web Ontology Language OWL. OWL has become a worldwide standard, is in wide commercial use, and has become the basis for an entire research community. In recent years, he pioneered the development of large-scale reasoning engines. He was scientific director of the 10m euro EU-funded Large Knowledge Collider, a platform for distributed computation over semantic graphs with billions of edges. The prize-winning work with his student Jacopo Urbani has improved the state of the art by two orders of magnitude. He is scientific director of The Network Institute. In this interdisciplinary research institute some 150 researchers from the Faculties of Social Science, Humanities and Computer Science collaborate on research topics in computational Social Science and e-Humanities. He is a guest professor at the University of Science and Technology in Wuhan, China.
Dick den Hertog is professor of Business Analytics / Operations Research at Tilburg University. His research interests cover various fields in prescriptive analytics, in particular linear and nonlinear optimisation. In recent years his main focus has been on robust optimisation and simulation-based optimisation. He is also active in applying the theory in real-life applications. In particular, he is interested in applications that contribute to a better society. For many years he has been involved in research to optimise water safety; he is also developing better optimisation models and techniques for cancer treatment, and he recently became involved in research to optimise the food supply chain for the World Food Programme.
As a computer scientist trained in databases, my research has always concerned how to effectively get meaningful information out of large datasets. Concentrating in particular on large sets of web data, most of my research has focussed on how to attach meaning to web data, for example for the purpose of enabling web-based information systems to offer user-adapted or personalised information to their users. Being in the centre of data science, this means my research now is devoted to the theory and technology that enables developers and users of data-driven systems to trust the information that the systems provide.
With the Digital Society programme, we can increase awareness of the fundamental role that data science plays in many of the research efforts studying the Digital Society. As joint universities, we can further develop the research around data science to ensure that all research that applies large data sets and data science can effectively rely on the insights derived from the data, and thus that all researchers who apply data science can do so in a responsible manner.
Mykola Pechenizkiy is a Full Professor at the department of Mathematics and Computer Science, Eindhoven University of Technology (TU/e), where he holds the Data Mining Chair. His research interests include data science, knowledge discovery and data mining, responsible analytics, including ethics/discrimination-awareness, context-aware predictive analytics, handling concept drift and reoccurring contexts, automation of feature construction and analytics on evolving networks. His core expertise and research interests are in predictive analytics and knowledge discovery from evolving data, and in their application to real-world problems in industry, medicine and education. At the Data Science Center Eindhoven, he leads the Customer Journey interdisciplinary research program aiming at developing techniques for informed and responsible analytics.
Corien Prins is Professor of Law and Information Technology at Tilburg Law School and was president of the Tilburg Institute of Law, Technology, and Society (TILT). Since 2017, she has been Chair of the Netherlands Scientific Council for Government Policy (WRR).
Since 2009 she has been a member of the Royal Netherlands Academy of Arts and Sciences (KNAW), Chair of the Supervisory Board of Erasmus University Rotterdam and a member of the selection Advisory Committee of the parquet of the Supreme Court.
Linnet Taylor is Assistant Professor of Data Ethics, Law and Policy at the Tilburg Institute for Law, Technology, and Society (TILT). Her research focuses on global data justice – the development of a conceptual framework for the ethical and beneficial governance of data technologies based on insights from technology users and providers around the world.
Christof Monz is an associate professor in computer science at the Informatics Institute, University of Amsterdam, where he leads the Language Technology Lab. His research interests lie in the area of multilingual natural language processing and neural machine translation in particular. He was awarded an NWO Vidi grant to work on generative aspects of machine translation. Christof studied Computational Linguistics at the University of Stuttgart, Germany, and obtained his PhD in Computer Science from the University of Amsterdam. After working as a post-doc at the University of Maryland, College Park, USA, and as a lecturer at Queen Mary, University of London, UK, he returned to the University of Amsterdam.
Postdoctoral researchers working in the Responsible Data Science track.
Fairness, Confidentiality and Transparency for subsymbolic AI
One of the most fundamental benefits of deep learning is the ability to learn directly from raw data. From a grid of pixels or a stream of characters, we can build up a representation step by step, towards high-level semantic predictions. This end-to-end character means that there are often very few symbols involved. Often, the only symbolic knowledge included in a deep learning task is the classification or discrete action produced at the end of the pipeline.
Fairness, confidentiality and transparency are well-studied, and often well-understood subjects, but only from a symbolic perspective:
- Fairness is framed in terms of protected attributes in the data, such as race or gender. When these are not made explicit but hidden in sensory data (like an image), little is known about how to approach these problems, and how such methods translate to subsymbolic domains.
- Confidentiality, similarly, is well studied in the symbolic domain of feature-based machine learning. Differential privacy, in particular, gives a precise framework for how to treat a dataset in machine learning to ensure that no personally identifiable information can be recovered from the resulting model. Treatments exist for deep learning, but it is not clear how the requirement of privacy translates to the subsymbolic domain.
- Transparency, specifically explainability, shows the divide between symbolic and subsymbolic methods most acutely. Explanations comprise a small number of discrete reasoning steps over a discrete and limited space of symbols. However, the only truly performant learning methods we have in the domains of language, image, robotics, game-playing and recommendation work in a profoundly connectionist manner, often making very dense connections between many inputs and many outputs.
Thus, we believe that the interplay between symbolic and subsymbolic representations is key to creating machine learning models that satisfy the principles of FACT.
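To make the symbolic notion of confidentiality mentioned above more concrete, the following is a minimal sketch of the Laplace mechanism, one standard way of achieving differential privacy for a simple counting query. It is an illustration only, not the method used in this project: the function names, the example records and the choice of epsilon are all assumptions. Note that differential privacy bounds how much any single record can influence the released value; it does not by itself make recovery of information impossible.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon, rng=None):
    """Release a noisy count of the records that satisfy `predicate`.

    Adding or removing one record changes a count by at most 1
    (sensitivity 1), so Laplace noise with scale 1/epsilon gives
    epsilon-differential privacy for this single query.
    """
    rng = rng or random.Random()
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Hypothetical example data: ages of five participants.
ages = [34, 71, 68, 45, 80]
noisy = private_count(ages, lambda age: age >= 65, epsilon=1.0)
print(f"noisy count of participants aged 65 or over: {noisy:.2f}")
```

A smaller epsilon adds more noise and gives a stronger guarantee. The attribute being counted here is an explicit, symbolic feature, which is exactly the setting that does not carry over directly to raw pixels or character streams.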
Structure and sparsity. Structure in machine learning refers to the methods used to convey prior knowledge about the domain to a learning system. At the extreme end, this takes the form of logical rules and reasoning steps that are known to be true, which the learning algorithm should learn to satisfy in its solution. These can be relaxed to inductive biases that create a preference for models with a certain structure, that can be overridden by the data.
The consequence of such inductive biases is often that a model becomes more sparse. Instead of connecting every input with every potential output, we can remove connections and restrict the model space in other ways. This creates a sparse model. Therefore, we believe studying sparsity allows us to create structure, which will allow us to create fair, confidential and transparent models.
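The idea of removing connections can be sketched with a masked linear layer. This is a toy illustration under our own assumptions (the weights, the mask and the helper names are hypothetical), not the models studied in the project. A binary mask zeroes out connections, so each output depends on only a small, inspectable set of inputs.

```python
def masked_linear(x, weights, mask):
    """Compute y[j] = sum_i x[i] * weights[i][j] * mask[i][j].

    `mask[i][j]` is 1 if input i may influence output j and 0 otherwise.
    """
    n_outputs = len(weights[0])
    return [
        sum(x[i] * weights[i][j] * mask[i][j] for i in range(len(x)))
        for j in range(n_outputs)
    ]


def inputs_used(mask, j):
    """List the inputs that output j is allowed to depend on."""
    return [i for i, row in enumerate(mask) if row[j]]


# Hypothetical layer: 4 inputs, 2 outputs.
weights = [[0.5, 1.0], [-1.0, 2.0], [0.25, 0.5], [1.5, -0.5]]
# Structure as prior knowledge: output 0 sees inputs 0 and 1 only,
# output 1 sees inputs 2 and 3 only.
mask = [[1, 0], [1, 0], [0, 1], [0, 1]]

x = [1.0, 2.0, 3.0, 4.0]
print(masked_linear(x, weights, mask))
print(inputs_used(mask, 0))
```

Because output 0 is connected to inputs 0 and 1 only, changing input 3 cannot affect it. That property can be read directly from the mask, which is the kind of structural guarantee a dense layer does not offer.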
FAIR, FACT, and the law
What: We investigate the relationship and interplay of the FAIR and FACT principles with legal provisions (mainly at supranational level, with a particular focus on the European Union) concerning the use of data for different applications.
The aim of the research is to minimise undesired effects deriving from possible frictions between FAIR and FACT principles and legal provisions, as well as to gain insight into the concrete ways in which general principles developed by multiple stakeholders based on meta-legal concepts can interact with the law, for future applications in developing regulatory interventions and strategies for new and/or disruptive technologies.
Why: General principles play an important role in guiding the research and development of technologies, as well as the related applications and business practices. In the context of information technologies, and of digital and data-based industries in particular, general principles have often been developed by the stakeholders participating in those industries, often through the interplay of state actors and exchange with the fields of ethics and law. General principles have offered several advantages over the last few decades, as fast-developing new technologies required a flexibility and technical expertise that more traditional regulatory interventions often could not offer. General principles have also served as a basis for developing law-based interventions, in an alternation of soft and hard regulation. A typical example of this alternation is the Fair Information Practice Principles (FIPPs), which offer guidelines for the use and management of data in relation to (informational) privacy and accuracy; they originated in the United States in the 1970s and were restated by the FTC in 1998 following the suggestions of a group of experts and stakeholders. The FIPPs served as a basis for the development of regulatory interventions in the field of data and privacy, including Directive 95/46/EC, the precursor of the current General Data Protection Regulation. A similar process has occurred with regard to Privacy by Design, a term developed by Ann Cavoukian and formalised by Canadian authorities. General principles represent the point where the positions of industry, technology, civil society, ethical studies and law meet. Due to the complexity of the reality to which general principles relate, however, the risk of undesired effects deriving from their application and their interaction with other regulatory interventions must be taken into consideration.
For this reason, we investigate the interplay between the FAIR and FACT principles and the laws applicable to certain data uses and practices, focusing in particular on two questions: i) to what extent are the FAIR and FACT principles compatible with the laws and regulations concerning the use and application of data, in particular in the fields of Big Data and Data Protection; and ii) how, if at all, do the FAIR and FACT principles translate into legal requirements for certain industries (to be selected).
How: The research is based on thorough legal research, carried out with a functional approach by leveraging the expertise present in the Digital Society and its network, in order to concretely analyse the issues, necessities, and interests underlying FAIR and FACT and the way they are implemented in certain industries and/or technologies.
Outcome: Multidisciplinary paper(s) concerning the relationship between the FAIR and FACT principles and regulation via legal provisions in specific industries chosen as case studies, based also on the expertise present in the Digital Society – Responsible Data Track. Possible grant proposal(s) to be discussed and developed at a later stage.
Fair, Transparent and Human-centered Information
What: We align with the goals of the Digital Society program in terms of FAIR and FACT principles along four main aspects: fairness, transparency, confidentiality and re-usability. We make use of explanations to address the aspects of fairness and transparency in algorithmic decisions. Furthermore, we align current algorithmic and technological approaches to user-driven requirements, where the research is guided by users’ needs and purposes and the outcome is explainable. We address the re-usability aspect by making the data and the code produced during this research available to the community, for re-usability and replicability purposes. We also aim to provide free, transparent access to the requirements gathered through various user studies in a fully confidential manner (i.e., responses from individual participants cannot be identified). Our main use-case is focused on media, and more specifically on identifying credibility markers for news broadcasts and on the summarisation of these news broadcasts.
Why: News, whether traditional text-based news or broadcasts, receives tremendous attention due to its potential impact on the people who consume it. News is prone to appear biased and to contain misinformation due to (1) the media source, (2) the political orientation of the citizens consuming it and (3) different algorithmic decisions, such as filtering, personalization and generation of data. These aspects are currently investigated in existing research projects within the Big Data Route of the Dutch National Research Agenda, namely Capture Bias and Fair News. The research we perform in the Digital Society Responsible Data Science track complements and extends these projects by (1) investigating various credibility markers of news broadcasts, (2) incorporating user-driven algorithmic goals, i.e., user preferences, purposes, semantics, etc., and by (3) generating explanations for credibility assessments and algorithmic decisions.
Even though people have effortless access to information, consuming all of it is not trivial. To understand and contextualize a topic, an overwhelming amount of video material needs to be watched. We can therefore facilitate better (e.g., representative, balanced) overviews by splitting news broadcasts into smaller units, i.e., video summaries, which are easier to distribute and consume on social media. However, additional information should be provided alongside the generated video summaries to comply with the fairness and transparency goals: (1) understand and make explicit how well a given summary represents the original video and what its potential for misinformation is, (2) produce explainable video summaries (e.g., show which topics/viewpoints of the original video are covered and not covered in the video summary) and (3) assess the quality of the video summaries in terms of misinformation and bias, as well as in terms of how well they represent different aspects of the original video as a whole.
How: Within the project, we investigate the suitability of using data enrichment, data semantics, and data provenance to support responsible data exploration and understanding. Data semantics and enrichment (i.e., what is happening in the video (events), who appears (participants), where it happens (places) and when it happens (time period)) will be used to generate visual explanations for understanding which concepts are not included in a video summary and, ultimately, to assess how representative a video summary is with regard to the original video.
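As a rough illustration of how enriched concepts could support such an assessment, the sketch below compares the concepts annotated in an original video with those that survive in its summary, per facet (events, participants, places, time period). The data layout, the function names and the simple set-overlap measure are assumptions made for this example; the project may use richer semantics and different measures.

```python
def coverage_report(original, summary):
    """For each facet, list covered and missing concepts and a coverage ratio."""
    report = {}
    for facet, concepts in original.items():
        kept = concepts & summary.get(facet, set())
        missing = concepts - kept
        ratio = len(kept) / len(concepts) if concepts else 1.0
        report[facet] = {
            "covered": sorted(kept),
            "missing": sorted(missing),
            "ratio": ratio,
        }
    return report


def overall_coverage(report):
    """Fraction of all annotated concepts that appear in the summary."""
    covered = sum(len(entry["covered"]) for entry in report.values())
    total = covered + sum(len(entry["missing"]) for entry in report.values())
    return covered / total if total else 1.0


# Hypothetical annotations for one broadcast and its summary.
original = {
    "events": {"election debate", "protest"},
    "participants": {"candidate A", "candidate B", "moderator"},
    "places": {"Amsterdam"},
    "time": {"2019"},
}
summary = {
    "events": {"election debate"},
    "participants": {"candidate A", "candidate B"},
    "places": {"Amsterdam"},
}

report = coverage_report(original, summary)
for facet, entry in report.items():
    print(facet, entry["ratio"], "missing:", entry["missing"])
print("overall:", round(overall_coverage(report), 2))
```

The "missing" lists are the part that could feed an explanation: they state which events, participants, places or periods a viewer of the summary would not see.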
Outcome: Explainable, human-centered information
FAIR and FACT data science principles within the (neuro)science domain
The implementation of the FAIR and FACT data science principles, which are important within the RDS research line, at Maastricht University can be separated into research and education & training activities.
The BReIN research program
What: Within the context of the BRightlands e-Infrastructure for Neurohealth (BReIN) program, we aim to create a digitized data management environment to collect, standardize, and store patient and experimental (meta)data, in the context of sporadic Alzheimer’s disease, in accordance with the FAIR principles (“Findable, Accessible, Interoperable, Reusable”). An important aspect here is the adequate protection of the privacy of affected patients. This is where the FACT (“Fairness, Accuracy, Confidentiality and Transparency”) data science principles are applied.
Why: Sporadic Alzheimer’s disease is the most common form of the disease and can affect adults at any age, but usually occurs after age 65. Sporadic Alzheimer’s disease occurs due to a complex combination of our genes, our environment, and our lifestyle. The generation of a central database for sporadic Alzheimer’s disease, in which both the data files and the metadata are captured, will allow for the discovery of hidden and new relationships between different datasets within the database. This could lead to new discoveries of how environmental and lifestyle factors affect molecular mechanisms within our body, resulting in sporadic Alzheimer’s disease.
How: Within the BReIN project, data files are stored within a central database. Metadata, describing the characteristics of patients and experiments, are captured using a range of existing curation tools together with domain-specific ontologies and controlled vocabularies. In addition to the storage and curation of (meta)data, a computational workflow is created to generate analysis results in a FAIR and reproducible manner.
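The role of controlled vocabularies in metadata capture can be illustrated with a small validation sketch. Everything in it is hypothetical: the field names, the allowed terms and the record are invented for this example and do not describe the actual BReIN schema, ontologies or curation tools.

```python
# Hypothetical controlled vocabulary: allowed terms per metadata field.
CONTROLLED_VOCABULARY = {
    "diagnosis": {"sporadic Alzheimer's disease", "control"},
    "sample_type": {"blood", "cerebrospinal fluid", "brain tissue"},
}

# Hypothetical minimum set of fields every record must carry.
REQUIRED_FIELDS = ("record_id", "diagnosis", "sample_type")


def validate_metadata(record):
    """Return a list of problems; an empty list means the record passes."""
    problems = []
    for field in REQUIRED_FIELDS:
        if field not in record:
            problems.append(f"missing field: {field}")
    for field, allowed in CONTROLLED_VOCABULARY.items():
        value = record.get(field)
        if value is not None and value not in allowed:
            problems.append(f"{field}: '{value}' is not a controlled term")
    return problems


record = {"record_id": "R-001", "diagnosis": "control", "sample_type": "plasma"}
print(validate_metadata(record))
```

Checking terms at capture time keeps free-text variants out of the database, which is one precondition for the interoperability and reusability that the FAIR principles ask for.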
Outcome: A data-rich and well-annotated database (according to the FAIR principles) for sporadic Alzheimer’s disease will help researchers working in the field to obtain a better overview of what is already known about the disease. New connections can be made between different types of data which otherwise would not have been discovered. The data will also serve as a basis for machine-learning algorithms for comparing diseased patients and controls. Analyzing hundreds of parameters simultaneously can reveal new biomarkers of the disease. By applying the FACT data principles to the data, the privacy of patients can be preserved.