Flexible Information Integration:
A Note to the BIOSPICE DARPA Community

Val Tannen
University of Pennsylvania
New data sources materialize. Newly invented data formats are put in use. The integrated views requested by clients change often. In such an environment we need an information integration technology that can survive drastic changes gracefully. For the purposes of this note I will call this flexible information integration.

The fundamental trade-off in information integration is between availability and freshness (some people prefer to call this trade-off quality of service vs. quality of data, QoS vs. QoD). Availability is best supported by warehousing the integrated data. Freshness is best supported when the data is integrated on demand (on-the-fly). That is, each client query is answered by querying in turn the appropriate data sources and then processing the answers into an answer for the client. Availability is often considered more important and warehousing solutions are more common. But the need for flexibility changes this "equation".

It is difficult to do flexible warehousing because this may require changing the structure (schema) of the warehouse, and the programs that update/refresh it, too often. It is significantly easier to do flexible integration on demand, provided that the integration and transformation "process" (or "logic") is described clearly and concisely in a high-level language. The flexible paradigm that I am advocating here starts with an architecture in which the actual data integration is performed by software components called mediators. However, its most important aspect is the use of high-level descriptions for what the mediators do.

By "high" level I mean the level of, say, SQL. (After all, we use SQL itself to define relational database views.) Changing SQL clauses in response to changes in the schema of the database or in the structure of the desired output is not that hard. It is certainly easier to change an SQL program of a few lines than to change a thousand-line program in a general-purpose programming language. High-level languages are also better at "self-documentation", that is, the code itself explains what it does. There are even better options than SQL, for example the ODMG standard query language, OQL.
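To make the point about SQL views concrete, here is a minimal sketch using Python's standard sqlite3 module. All table, column, and view names (gene, gene2, organism, human_gene) are hypothetical and chosen only for illustration. The schema of the underlying data changes, the few lines that define the view are rewritten, and the client query stays the same.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Original schema: the organism is stored inline with each gene.
conn.executescript("""
    CREATE TABLE gene (id TEXT, symbol TEXT, organism TEXT);
    INSERT INTO gene VALUES ('g1', 'TP53', 'human'), ('g2', 'Trp53', 'mouse');
    CREATE VIEW human_gene AS
        SELECT id, symbol FROM gene WHERE organism = 'human';
""")

CLIENT_QUERY = "SELECT symbol FROM human_gene"
before = conn.execute(CLIENT_QUERY).fetchall()

# Schema change: organisms move to their own table.
# Only the view definition is rewritten; the client query is untouched.
conn.executescript("""
    DROP VIEW human_gene;
    CREATE TABLE organism (org_id INTEGER, name TEXT);
    INSERT INTO organism VALUES (1, 'human'), (2, 'mouse');
    CREATE TABLE gene2 (id TEXT, symbol TEXT, org_id INTEGER);
    INSERT INTO gene2 VALUES ('g1', 'TP53', 1), ('g2', 'Trp53', 2);
    CREATE VIEW human_gene AS
        SELECT g.id, g.symbol
        FROM gene2 g JOIN organism o ON g.org_id = o.org_id
        WHERE o.name = 'human';
""")

after = conn.execute(CLIENT_QUERY).fetchall()
```

In this sketch the entire adaptation to the new schema is confined to the view definition, which is the kind of locality that a high-level description is meant to provide.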

The flexible information integration architecture we are discussing is sketched in Figure 1 below. The clients see only a high-level interface, an ontology. The term is normally used in conjunction with semantically rich approaches (e.g., Ontolingua) but in this context anything against which the clients can ask queries will work (e.g., relational or object-oriented schemas). Note that the ontology describes information that is virtual, while the actual information is in the data sources.


The correspondence between the data in the sources and the virtual data in the ontology is described by one, or perhaps several, small MDL (Mediator Definition Language) programs. These are the high-level descriptions of the transformation and integration process mentioned earlier. Different MDLs have been proposed in the literature and one example is given in the attached paper on K2. The MDL descriptions are "composed" by the mediator with incoming client queries, and the result is decomposed into queries for each data source, plus some common final integration/transformation to be done in the mediator on the answers of these data source queries. All this results in an answer to the original client query.
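The compose/decompose/integrate cycle just described can be sketched in a few lines of Python. This is only a toy under stated assumptions: the MDL dictionary below is a hypothetical stand-in for an MDL program (it is not the syntax of K2 or of any other system), the two sources are in-memory lists with invented contents, and client queries are restricted to equality filters on ontology attributes.

```python
from typing import Dict, List

Row = Dict[str, str]

# Two hypothetical data sources, each with its own local schema.
SOURCES: Dict[str, List[Row]] = {
    "genes": [
        {"gene_id": "g1", "symbol": "TP53"},
        {"gene_id": "g2", "symbol": "BRCA1"},
    ],
    "pathways": [
        {"gene": "g1", "pathway": "apoptosis"},
        {"gene": "g2", "pathway": "DNA repair"},
        {"gene": "g1", "pathway": "cell cycle"},
    ],
}

# Toy stand-in for an MDL program: it declares how each attribute of the
# virtual (ontology-level) data maps to a source column, and how the two
# sources are joined.
MDL = {
    "join": ("gene_id", "gene"),
    "attributes": {
        "symbol": ("genes", "symbol"),
        "pathway": ("pathways", "pathway"),
    },
}


def query_source(name: str, filters: Row) -> List[Row]:
    """Stand-in for sending a selection query to one data source."""
    return [row for row in SOURCES[name]
            if all(row[col] == val for col, val in filters.items())]


def answer(client_filters: Row) -> List[Row]:
    """Answer a client query given as equality filters on ontology attributes."""
    # Compose: rewrite ontology-level filters into source-level filters.
    per_source: Dict[str, Row] = {"genes": {}, "pathways": {}}
    for attr, value in client_filters.items():
        source, column = MDL["attributes"][attr]
        per_source[source][column] = value

    # Decompose: one query per data source, with its filters pushed down.
    genes = query_source("genes", per_source["genes"])
    pathways = query_source("pathways", per_source["pathways"])

    # Final integration in the mediator: join the answers and project
    # onto the ontology attributes.
    left_key, right_key = MDL["join"]
    result = []
    for g in genes:
        for p in pathways:
            if g[left_key] == p[right_key]:
                picked = {"genes": g, "pathways": p}
                result.append({attr: picked[src][col]
                               for attr, (src, col) in MDL["attributes"].items()})
    return result
```

For example, answer({"symbol": "TP53"}) sends the filter on symbol only to the genes source, fetches the pathways source unfiltered, and joins the two answers in the mediator. A real mediator would of course handle a much richer query class, which is where query optimization comes in.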

It is important to point out that this approach requires a good understanding of the class of queries that are supported and especially a certain amount of query optimization performed within each mediator.

Warehousing also has its place in this approach. Once the requirements for a portion of the data are stable enough, warehousing can be used to improve availability and performance. This paradigm permits such a transition and makes it transparent to the clients.

Figure 2 below suggests such a scenario. Suppose that some of the data that is integrated/transformed by the MDL program A (see Figure 1) is warehoused. This can be done without changes to the ontology (and therefore client queries need not change either). The MDL program A is decomposed into an MDL program B that describes a new mediator for which the warehouse is yet another data source and an MDL program C whose "output" is the data to be warehoused, so in fact it can describe a mediator used for loading the warehouse.
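The decomposition of A into B and C can be illustrated with a small Python sketch. The function names program_a, program_b, and program_c, as well as the gene and pathway data, are hypothetical; each function stands in for what the corresponding MDL program would describe. The assumption here is that the gene data is stable enough to be warehoused while the pathway data is still queried on demand.

```python
from typing import Dict, List

Row = Dict[str, str]

# Hypothetical source contents.
GENES: List[Row] = [
    {"gene_id": "g1", "symbol": "TP53"},
    {"gene_id": "g2", "symbol": "BRCA1"},
]
PATHWAYS: List[Row] = [
    {"gene": "g1", "pathway": "apoptosis"},
    {"gene": "g2", "pathway": "DNA repair"},
]


def program_a(genes: List[Row], pathways: List[Row]) -> List[Row]:
    """Original mediator: integrates both sources on demand."""
    symbol_of = {g["gene_id"]: g["symbol"] for g in genes}
    return [{"symbol": symbol_of[p["gene"]], "pathway": p["pathway"]}
            for p in pathways if p["gene"] in symbol_of]


def program_c(genes: List[Row]) -> Dict[str, str]:
    """Loader: its output is the data to be warehoused."""
    return {g["gene_id"]: g["symbol"] for g in genes}


def program_b(warehouse: Dict[str, str], pathways: List[Row]) -> List[Row]:
    """New mediator: the warehouse is just another data source."""
    return [{"symbol": warehouse[p["gene"]], "pathway": p["pathway"]}
            for p in pathways if p["gene"] in warehouse]


# Load the warehouse once, then answer through B.
warehouse = program_c(GENES)
via_warehouse = program_b(warehouse, PATHWAYS)
direct = program_a(GENES, PATHWAYS)
```

In this sketch B applied to the output of C gives the same answer as A, so clients cannot tell which of the two arrangements is in use. The pathway data is still read on demand, so changes to it are visible without reloading the warehouse.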


I learned about integration on demand from Wiederhold's mediator architecture. The idea of describing the functioning of mediators in high-level languages appeared independently in several information integration projects, such as Stanford's Tsimmis and Penn's Kleisli. I am attaching a paper that gives some pointers for Kleisli, which evolved at Singapore's Kent Ridge Digital Labs into a mature system that in turn became part of a commercial product: the Discovery Hub(TM) System Data Integration Middleware Platform from GeneticXchange Inc. The main goal of the attached paper, however, is to summarize the features of K2, an information integration system that succeeded Kleisli at Penn and is used as part of a research collaboration with GlaxoSmithKline.

Some URLs:
http://sdmc.krdl.org.sg/bic/projects
http://sdmc.krdl.org.sg/kleisli
http://www.geneticxchange.com
http://www.cis.upenn.edu/~sharker/K2_site
