- VLDB Endowment
Answering Web Questions Using Structured Data – Dream or Reality?
Fernando Pereira (Google), Anand Rajaraman (Kosmix Inc), Sunita Sarawagi (IIT Bombay), William Tunstall-Pedoe (True Knowledge), Gerhard Weikum (MPI Saarbrücken)
Moderator: Alon Halevy (Google)
Abstract. The question of what role structured data can play in Web search has been raised since the early days of the Web. On the one hand, structured data can be used to answer factual queries. On the other, large amounts of structured data can be used to better organize web content and therefore to improve search on a wide range of queries. While the Information-Retrieval approach to web search has been the clear winner to date, in recent years there has been renewed activity in trying to leverage structured data in web search. These efforts have encompassed both research and commercial products. Perhaps the key aspect of these new efforts is that they consider the breadth of structured data, rather than individual verticals. This panel discussion will consider some of the fundamental questions regarding the role of structured data in search and will examine the significance of recent advances. In particular, some of the questions we will consider are the following:
- Is structured data even necessary or relevant in the context of web search? Search engines have thus far fared admirably well without doing much in the way of structured data. If so, why is it relevant now when it hasn't been relevant to date?
- Is it even feasible to harness structured data in a broad horizontal fashion? Are there applications apart from standard query answering? Most successful efforts to date around structured data have been in verticals with highly constrained query types.
- Exactly what do we mean when we say structured data anyway? How does this relate to the Semantic Web and the Deep Web?
- What techniques are used to build the large ontologies that underlie current systems, and what evidence do we have that they work? Why do they work? Why were previous attempts to build monolithic ontologies less successful?
- How do these systems collect the data for their knowledge bases? Is it important to make sure the data is high-quality and/or curated? What role does inference play in such systems? Can we combine evidence from multiple sources on the Web to give higher-confidence answers to queries? What level of quality does the knowledge base need to adhere to in order to offer useful search? Has information extraction had any success here or is it all manual work?
- How can structured data best be used to improve the results of search queries? In particular, would you integrate techniques for answering factual queries with a traditional search engine? Are there enough factual queries that current search engines don't answer well?
- What systems issues do we need to consider? How do you index such a large and varied collection of structured data? How do you process queries efficiently? Do you need offline reasoning/view computation?
- Why couldn't we build these systems 15 years ago when the idea was first proposed? Was there some critical technology that became mature? Is it all because we now have Wikipedia? What do we need to really make structured data a success on the Web?
How Best to Build Web-Scale Data Managers? A Panel Discussion.
Philip A. Bernstein (Microsoft), Daniel J. Abadi (Yale), Michael J. Cafarella (U. of Washington), Joseph M. Hellerstein (U.C. Berkeley), Donald Kossmann (ETH Zürich), Samuel Madden (M.I.T.)
Abstract. Many of the largest database-driven web sites use custom web-scale data managers (WDMs). On the surface, these WDMs are being applied to problems that are well-suited for relational database systems. Some examples are the following:
- Map-Reduce, Hadoop, and Dryad are used to process queries on large data sets using sequential scan and aggregation. Hive is a data warehouse built on Hadoop.
- Google’s Bigtable is used to store a replicated table of rows of semi-structured data.
- Amazon’s Dynamo is used to store partitioned, replicated databases of key-value pairs. Cassandra is similar.
- Object caching systems, such as memcached, Oracle’s Coherence, and Microsoft’s Velocity project, are used instead of a persistent store.
These WDMs have challenging requirements that are not met by current relational database products. They need to scale out to thousands of machines, offer high availability even on unreliable commodity hardware, and be completely self-managing. To make it easier to meet these requirements, these WDMs offer much less functionality than a relational database system. Yet the functionality is apparently enough to attract a wide following. The differences between these WDMs and relational database systems are striking. This panel will explore these differences. In particular, it addresses the following questions:
- What should the database field be doing to satisfy the needs of web-scale data management?
- Many web-scale WDMs were built primarily by systems groups whose specialty is not classical database management. (One exception is PNUTS.) What does this say about the database field? What should we be doing differently?
- Do web-scale data management problems require very limited functionality to satisfy other requirements? Or is this just a symptom of immature technology that will improve?
- Many of these WDMs abandon ACID transactions and require the application to deal with data consistency. Is this the only hope to achieve satisfactory scale-out?
- Many developers prefer these limited-functionality WDMs to classical DBMSs. Why? How do we increase functionality without sacrificing ease of use?
- Is it practical to obtain a competitive WDM by improving the scalability, availability and manageability of a classical DBMS?