name: inverse
layout: true
class: center, middle, inverse

---

# CrowdDB: Answering Queries with Crowdsourcing

Michael J. Franklin, Donald Kossmann, Tim Kraska, Sukriti Ramesh, Reynold Xin

Original slides by Anil Shanbhag

Updated by Pratyaksh Sharma

---
layout: false

# Outline

- Motivation
- Crowdsourcing
- Design Considerations
- Overview of CrowdDB
- CrowdSQL
- User Interface Generation
- Query Processing
- Experimental Results

---
template: inverse

# Motivation

---

## Limitations of traditional RDBMS - Missing Data

A key limitation of relational technology stems from the Closed World Assumption: information that is not in the database is considered to be false or non-existent.

For example, the query:

```SQL
SELECT market_capitalization FROM company WHERE name = "I.B.M.";
```

will return an empty answer if the company table instance in the database at that time does not contain a record for "I.B.M."

There are many reasons why such a record may be missing:

- a data entry mistake may have omitted the I.B.M. record, or
- the record may have been deleted, or
- the record was entered incorrectly, say, as "I.B.N.", or
- the record was entered correctly, but the name used was "International Business Machines".

---

## Limitations of traditional RDBMS - Fuzzy Comparisons

Consider a query to find the best among a collection of images to use in a motivational slide show:

```SQL
SELECT image FROM picture WHERE topic = "Business Success" ORDER BY relevance LIMIT 1;
```

Unless the relevance of pictures to specific topics has been previously obtained and stored, there is simply no good way to ask this question of a standard RDBMS.

The issue here is one of judgment: one cannot reliably answer the question simply by applying relational operators to the database.

---

## In contrast...

- Finding new data

  People, aided by tools such as search engines and reference sources, are quite capable of finding information that they do not have readily at hand.

- Comparing data

  People are skilled at making comparisons that are difficult or impossible to encode in a computer algorithm.

**Solution**: Harness human computation to solve problems that are impossible or too expensive to answer correctly using computers.

### Is it possible to leverage such human resources to extend the capabilities of database systems?

---
template: inverse

# Introduction to Crowdsourcing

---

## Crowdsourcing

As defined by Merriam-Webster:

> the process of obtaining needed services, ideas, or content by soliciting contributions
> from a large group of people, and especially from an online community,
> rather than from traditional employees or suppliers

Let us consider a few simple examples:

*Example 1*: Given an image containing text, get the text from it (aka CAPTCHA)

*Example 2*: Given a university and department name, find the link to the department webpage

*Example 3*: Given a set of photographs of people from a disaster, and pictures submitted by family members of lost individuals, perform a fuzzy join across both sets, using humans to determine equality

All these examples are trivial for a human but difficult or expensive for a computer.

---

.left-column[
## Crowdsourcing
### - Platform
]
.right-column[
- A crowdsourcing platform creates a marketplace on which requesters offer tasks and workers accept and work on the tasks
- The largest among these is Amazon Mechanical Turk (AMT)
- The platform had 500,000 workers in 2011 and is continuously growing.
]

---
.left-column[
## Crowdsourcing
### - Platform
### - Definitions
]
.right-column[
- *HIT*: A Human Intelligence Task, or HIT, is the smallest entity of work a worker can accept to do. HITs contain one or more jobs. Eg: Given a university and department name, find the URL of the department webpage
- *Assignment*: Every HIT can be replicated into multiple assignments. Any particular worker processes at most a single assignment for each HIT. It is typical to have 3 or 5 assignments per HIT.
- *HIT Group (or HIT Type)*: A group of similar HITs, grouped for the convenience of 'turkers'. Eg: We can post 1000 HITs to get the URLs of 1000 different departments spread across universities.
]

---
.left-column[
## Crowdsourcing
### - Platform
### - Definitions
### - Workflow
]
.right-column[
- The requester packages the jobs into HITs, determines the number of assignments required for each HIT, and posts the HITs
- The requester can optionally specify requirements that workers must meet in order to accept the HIT
- AMT groups compatible HITs into HIT Groups and posts them so that they are searchable by workers
- A worker accepts and processes assignments
- Requesters then collect all the completed assignments for their HITs and apply whatever quality control methods they deem necessary. More on this later.
]

---
.left-column[
## Crowdsourcing
### - Platform
### - Definitions
### - Workflow
### - API
]
.right-column[
A requester can automate his or her workflow of publishing HITs, etc. by using AMT's web service or REST APIs. The interfaces relevant to CrowdDB are (a usage sketch follows the list):

- *createHIT(title, description, question, keywords, reward, duration, maxAssignments, lifetime) → HitID*
- *getAssignmentsForHIT(HitID) → list(asnId, workerId, answer)*
- *approveAssignment(asnID) / rejectAssignment(asnID)*
- *forceExpireHIT(HitID)*
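To make the workflow concrete, here is a minimal sketch (not AMT's real SDK) of how a requester might drive these calls, assuming a hypothetical `amt` client object that wraps the four interfaces above:

```python
# Sketch only: poll a HIT until all assignments are returned, using a
# hypothetical `amt` client exposing the four calls listed above.
import time

def crowdsource(amt, title, question, reward_cents=2, assignments=3):
    hit_id = amt.createHIT(title=title, description=title, question=question,
                           keywords="data collection", reward=reward_cents,
                           duration=30 * 60, maxAssignments=assignments,
                           lifetime=24 * 3600)
    seen = {}                                   # asn_id -> answer
    while len(seen) < assignments:              # poll until all assignments return
        time.sleep(60)
        for asn_id, worker_id, answer in amt.getAssignmentsForHIT(hit_id):
            if asn_id not in seen:
                seen[asn_id] = answer
                amt.approveAssignment(asn_id)   # or rejectAssignment after checks
    amt.forceExpireHIT(hit_id)                  # stop collecting further work
    return list(seen.values())                  # feed answers into quality control
```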
]

---

## What is CrowdDB?

- Crowd-enabled DBMS, or hybrid human-machine DBMS
- **Query Frontend**
  - **Declarative interface** to the crowd: supports SQL with minimal extensions (CrowdSQL)
  - **Physical data independence**: application developers can write SQL queries without having to focus on which operations will be done in the database and which will be done by the crowd
  - Existing SQL queries can be run on CrowdDB and, in many cases, will return more complete and correct answers than if run on a traditional DBMS
- **Backend**
  - New crowdsourced query operators and plan-generation techniques that combine crowdsourced and traditional query operators
  - Methods for automatically generating effective user interfaces for crowdsourced tasks

---
template: inverse

# Design Considerations

---

# Design Considerations

- **Performance and Variability**

  Humans and machines differ greatly in the type, speed, and cost of the work they do.
  People show tremendous variability from person to person and over time.
  This requires appropriate query planning, fault tolerance, and answer-quality control.

- **Task Design and Ambiguity**

  Carefully designed user interfaces with human-readable instructions are needed.

- **Open vs Closed World**

  In CrowdDB, the closed-world assumption does not hold: any one query operator could conceivably return an unlimited number of answers.
  This has implications for query planning, cost, and answer quality.

---

# Design Considerations

- **Affinity and Learning**

  Workers develop relationships with requesters and skills for certain types of HITs; it is not uncommon to find workers doing only image classification.
  They are hesitant to do tasks from requesters who do not provide well-defined tasks or pay appropriately.
  CrowdDB is designed to take a longer-term view of task and worker-community development.

- **Relatively Small Worker Pool**

  The pool of workers available to any one requester is surprisingly small, for a number of reasons: some relate to the design of the AMT web site, others to the affinity and learning issues above.

---
template: inverse

# Overview of CrowdDB

---

## Overview of CrowdDB

The left side of the figure shows the traditional query-processing components: parsing, optimization, and execution. The right side contains the components used to extend the traditional database system to obtain human-generated input.

---
template: inverse

# CrowdSQL

---

# CrowdSQL

CrowdSQL is a SQL extension that supports crowdsourcing. It supports two main use cases:

- crowdsourcing missing data
- supporting subjective comparisons

---
.left-column[
## CrowdSQL
### Incomplete Data
]
.right-column[
*SQL DDL Extensions*

Two scenarios:

- Specific attributes could be crowdsourced
- Entire tuples could be crowdsourced

A CROWD keyword is introduced to handle both.

Let us revisit the initial example of finding the department webpage URL. In CrowdSQL it translates into:

```
CREATE TABLE Department (
  university STRING,
  name STRING,
  url CROWD STRING,
  phone STRING,
  PRIMARY KEY (university, name)
);
```

The CROWD keyword indicates that the url attribute will be obtained via crowdsourcing.
]

---
.left-column[
## CrowdSQL
### Incomplete Data
]
.right-column[
Let's go one step further.
We want to get the details of all the professors in a department. This translates into:

```
CREATE CROWD TABLE Professor (
  name STRING PRIMARY KEY,
  email STRING UNIQUE,
  university STRING,
  department STRING,
  FOREIGN KEY (university, department) REFERENCES Department(university, name)
);
```

Notice the CROWD keyword on the table. It indicates that all tuples in this relation will be crowdsourced.

CrowdDB does not impose any limitations with regard to SQL types and integrity constraints: referential integrity constraints can be defined between two CROWD tables, two regular tables, and between a regular and a CROWD table in any direction.

There is one exception, however: CROWD tables must have a primary key so that CrowdDB can determine whether two workers have entered the same tuple.
]

---
.left-column[
## CrowdSQL
### Incomplete Data
]
.right-column[
**_SQL DML Extensions_**

A special CNULL value indicates data in CROWD columns that should be crowdsourced when needed as part of processing a query.

During INSERT/UPDATE, crowd columns can also be initialised. All non-initialised crowd columns are set to CNULL.

```
INSERT INTO Department(university, name) VALUES ("UC Berkeley", "EECS");
```

In the example above, the uninitialised crowd column `url` is set to CNULL.
]

---
.left-column[
## CrowdSQL
### Incomplete Data
]
.right-column[
**_Query Semantics_**

CrowdDB supports any kind of SQL query on CROWD tables and columns. CrowdSQL specifies that tables are updated as a side effect of crowdsourcing.

Let us take two examples based on the tables created previously:

```
SELECT url FROM Department WHERE name = "Math";
```

Here, the url column would be implicitly updated with the crowdsourced URL.

```
SELECT * FROM Professor WHERE email LIKE "%berkeley%" AND department = "Math";
```

In this query, missing values in the email column would be implicitly populated and new professors of the Math department would be implicitly inserted as a side effect of processing.
]

---
.left-column[
## CrowdSQL
### Incomplete Data
### Subjective Comparisons
]
.right-column[
Beyond finding missing data, subjective comparisons are another important use of CrowdDB.

Two new operators are introduced: CROWDEQUAL and CROWDORDER.

**CROWDEQUAL (~=)** takes two parameters (an lvalue and an rvalue) and asks the crowd to decide whether the two values are equal.

```
SELECT * FROM department WHERE name ~= "CS";
```

Here the query writer asks the crowd to do entity resolution against the possibly different names given for Computer Science in the database, such as 'Computer Science', 'CSE', etc.
]

---
.left-column[
## CrowdSQL
### Incomplete Data
### Subjective Comparisons
]
.right-column[
**CROWDORDER** is used whenever the help of the crowd is needed to rank or order results.

The CrowdSQL query below asks for a ranking of pictures with regard to how well these pictures depict the Golden Gate Bridge:

```
CREATE TABLE picture (
  p IMAGE,
  subject STRING
);

SELECT p FROM picture
WHERE subject = "Golden Gate Bridge"
ORDER BY CROWDORDER(p, "Which picture visualizes better %subject");
```

As with missing data, CrowdDB stores the results of CROWDEQUAL and CROWDORDER calls so that the crowd is only asked once for each comparison. This caching is equivalent to the caching of expensive functions in traditional SQL databases.
]

---
template: inverse

# User Interface Generation

---
.left-column[
## User Interface Generation
### - Overview
]
.right-column[
### UI is important!

A clear, unambiguous user interface helps greatly in improving accuracy.

It is a two-step process:

- **Compile-time**: CrowdDB creates templates to crowdsource missing information for all CROWD tables and all regular tables which have CROWD columns. JS is generated in addition to HTML to do type checking.
- **Runtime**: these templates are instantiated at runtime by filling known field values from a tuple into the HTML form (a sketch follows).
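To illustrate the runtime step, here is a minimal sketch (not CrowdDB's actual code) of instantiating a compile-time HTML template with the known fields of a tuple; the `DEPARTMENT_FORM` template and `instantiate` helper are illustrative assumptions:

```python
# Illustrative only: fill a compile-time HTML template with the known values of
# a tuple; CROWD columns are left as empty inputs for the worker to complete.
from string import Template

DEPARTMENT_FORM = Template("""
<form>
  <p>University: $university</p>
  <p>Department: $name</p>
  <label>URL: <input name="url" value=""></label> <!-- CNULL, to be crowdsourced -->
  <input type="submit" value="Submit">
</form>
""")

def instantiate(form_template, known_values):
    # Runtime step: substitute the fields CrowdDB already knows.
    return form_template.substitute(**known_values)

html = instantiate(DEPARTMENT_FORM, {"university": "UC Berkeley", "name": "EECS"})
```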
]

---
.left-column[
## User Interface Generation
### - Overview
### - Basic Interfaces
]
.right-column[
Basic UIs for crowd tasks:

- (a) is our earlier example where we want to crowdsource the url
- (b) does entity resolution using CROWDEQUAL
- (c) is our earlier example to rank a set of images based on how well they visualize the subject (here, the Golden Gate Bridge)
]

---
.left-column[
## User Interface Generation
### - Overview
### - Basic Interfaces
]
.right-column[
The following optimizations are used:

- Batching: get information for several tuples at once (e.g., the URLs of Elec, CS, and EP of UC Berkeley). The assumption is that it is cheaper to input two pieces of information of the same kind in a single form rather than in separate forms.
- Prefetching: if, say, both the department and the email of a professor are unknown but only the email is required to process a query, it might still make sense to source the department at the same time.

Interfaces for CROWDEQUAL (Fig. (b)) and CROWDORDER (Fig. (c)) can also be batched.
]

---
.left-column[
## User Interface Generation
### - Overview
### - Basic Interfaces
### - Multi-Relation Interfaces
]
.right-column[
Crowdsourcing relations with foreign keys requires special consideration:

- If the foreign key references a non-CROWD table, the generated user interface shows a select box and, for larger lists, an AJAX-based suggest method
- If the foreign key references a CROWD table, two types of interfaces are used:
  - **Normalised Interface**: the worker inputs the value of the foreign key but no other attributes of the referenced tuple
]

---
.left-column[
## User Interface Generation
### - Overview
### - Basic Interfaces
### - Multi-Relation Interfaces
]
.right-column[
- **Denormalised Interface**: there is a select box and an add button which allows the worker to input a new department
- To source entirely new tuples, the non-key attributes can be preset via the WHERE clause, autosuggest is offered while typing, and workers get an option to state that no new professor exists. If many workers report that no new professor exists, we can stop.
]

---
template: inverse

# Query Processing

---
.left-column[
## Query Processing
]
.right-column[
The traditional database model is extended:

- SQL is extended to CrowdSQL
- Crowd operators are added for crowdsourcing
- The optimizer handles crowd operators

**CPU time << time taken by the crowd to answer**, so the goal of the optimizer is to find the plan that results in the fewest queries to the crowd.
]

---
.left-column[
## Query Processing
### Example
]
.right-column[
]

---
.left-column[
## Query Processing
### Example
### Crowd Operators
]
.right-column[
### CROWDPROBE

- Crowdsources missing information of CROWD columns (i.e., CNULL values) and new tuples
- Quality control is carried out by majority vote; if no majority is achieved with the maximum number of assignments, a value is chosen at random from the most frequent answers (see the sketch below)
- In the case of new tuples, finding a majority is impossible, so the DB re-posts the tasks with only the primary key filled in
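As a concrete illustration of this quality-control step, here is a minimal sketch (not CrowdDB's actual implementation) of majority voting with a random tie-break among the most frequent answers; the `resolve` helper and its return convention are assumptions for illustration:

```python
# Sketch of per-value quality control: majority vote over worker answers,
# falling back to a random choice among the most frequent answers on a tie.
import random
from collections import Counter

def resolve(answers, max_assignments):
    counts = Counter(answers)
    value, votes = counts.most_common(1)[0]
    if votes > len(answers) // 2:              # strict majority reached
        return value
    if len(answers) < max_assignments:         # can still ask more workers
        return None
    top = [v for v, c in counts.items() if c == votes]
    return random.choice(top)                  # no majority at max assignments

resolve(["eecs.berkeley.edu", "www.eecs.berkeley.edu", "eecs.berkeley.edu"], 3)
```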
]

---
.left-column[
## Query Processing
### Example
### Crowd Operators
]
.right-column[
### CROWDJOIN

- Implements a nested-loop join where at least one of the tables is a CROWD table
- For every tuple of the outer relation, it creates HITs to find matching inner tuples

### CROWDCOMPARE

- Implements the CROWDEQUAL and CROWDORDER functions
- Typically used inside a traditional operator such as sorting
]

---
.left-column[
## Query Processing
### Example
### Crowd Operators
### Physical Plan Generation
]
.right-column[
The basic functionality of all crowd operators is the same:

- Initialized with a user interface template and the standard HIT parameters
- At runtime, they consume a set of tuples
- Depending on the crowd operator, crowdsourcing can be used to source missing values of a tuple or to source new tuples
- HITs are batched and HIT Groups are created
- Tuples are consumed from the crowd and quality control is applied

Quality control is currently carried out by a majority vote. The number of workers assigned to each HIT is controlled by an Assignments parameter.
]

---
template: inverse

# Experiments and Results

---

## Microbenchmarks

- Simple tasks requiring workers to find and fill in missing data for a table with two crowdsourced columns:

```SQL
CREATE TABLE businesses (
  name VARCHAR PRIMARY KEY,
  phone_number CROWD VARCHAR(32),
  address CROWD VARCHAR(256)
);
```

- The table was populated with the names of 3607 businesses in 40 US cities
- Studied the sourcing of the phone_number and address columns using the following query:

```SQL
SELECT phone_number, address FROM businesses;
```

- Results are highly dependent on the properties of the AMT platform at a particular point in time

---

## Experiment 1A: Response Time vs HIT Group size

As the HIT group size increases, the time to receive the first x responses decreases. Larger HIT groups mean more tasks to attempt. HITs are repetitive tasks and there is an initial overhead of learning how to do the task. Hence larger HIT groups give higher payoffs and attract more turkers.

---

## Experiment 1B: %Completion vs HIT Group size

The percentage of HITs completed within 30 minutes first increases and then decreases. This exhibits a tradeoff between throughput and completion percentage.

---

## Experiment 2: %Completion vs Reward

Paying more than 1 cent per task attracts more workers. However, beyond 2 cents there is barely any difference.

---

## Experiment 3: Worker Affinity and Quality

The graph below shows the distribution of work across the 750 workers in total; it is highly skewed.

- Tasks acquire a community
- The authors expected workers who do more HITs to have lower error rates, but this behaviour was not observed in the experiments.

---

## Experiment 3: Worker Affinity and Quality

### Further observations:

- Reward had no noticeable impact on the error rate of the worker
- Results in the previous slide include HITs from the full reward range of 1 to 4 cents
- The error rates did not depend on the HIT Group size

---

## Complex Queries: Entity Resolution on Companies

- Used a simple company schema with two attributes, company name and headquarters address, and populated it with the Fortune 100 companies.
- Ran the following query to test entity resolution in CrowdDB:

```SQL
SELECT name FROM company WHERE name ~= "[a non-uniform name of the company]";
```

- Each HIT involved comparing ten company names with one of the four "non-uniform names". Furthermore, each HIT had three assignments.

---

## Complex Queries: Ordering Pictures

The query sorts the pictures of the Golden Gate Bridge. The crowdsourced rankings are close to the rankings given by experts.
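This ordering experiment exercises the CROWDCOMPARE path described earlier. As a rough illustration (not the paper's implementation), a sort operator can delegate pairwise comparisons to the crowd and cache them so each pair is asked only once; `ask_crowd` is a hypothetical helper that posts a CROWDORDER-style HIT and returns -1, 0, or 1:

```python
# Rough illustration: a sort that delegates pairwise comparisons to the crowd
# via a hypothetical ask_crowd(a, b, question) helper, caching each answer.
from functools import cmp_to_key

def crowd_sort(pictures, ask_crowd):
    cache = {}

    def compare(a, b):
        if (a, b) not in cache:
            cache[(a, b)] = ask_crowd(a, b, "Which picture visualizes better %subject")
            cache[(b, a)] = -cache[(a, b)]     # reuse the answer for the reverse pair
        return cache[(a, b)]

    return sorted(pictures, key=cmp_to_key(compare))
```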
---

# References

### CrowdDB: Answering Queries with Crowdsourcing
Michael J. Franklin, Donald Kossmann, Tim Kraska, Sukriti Ramesh, Reynold Xin

Suggested Reading:

### CrowdScreen: Algorithms for Filtering Data with Humans
Aditya Parameswaran, et al.

### Crowdsourced Databases: Query Processing with People
A. Marcus, et al.

### Deco: Declarative Crowdsourcing
Aditya Parameswaran, et al.