Members of the public sign up to be Data Detectives and then work with other Data Detectives to report and vet data sharing arrangements found on the Internet. Data Detectives are responsible for the content on theDataMap™.
To sign up, a person provides an email address (a name is optional). The email address is necessary to facilitate communication among Data Detectives, so the project validates a given address by sending it an email message containing a code and instructions to enter the code at thedatamap.org to complete enrollment. For those wishing to participate discreetly, many sites on the Internet provide free email addresses instantly.
Once enrolled, a Data Detective can submit a Report that claims a data sharing arrangement exists between two entities. On the data map, a Report describes an edge and two nodes involved in the sharing. The Report may require adding the edge and one or both nodes to the existing data map. The Report must include a URL that the Data Detective asserts as evidence of the reported sharing. That URL must be publicly available and be the most specific URL to directly access the content. For example, if the supporting documentation is accessible on website "throck.com" at "throck.com/details.html" then the URL on the Report would be "throck.com/details.html" and not "throck.com". After the Report is generated, it is added to a Repository of Reports maintained by the site.
When a Data Detective chooses to play the Vetting Game, he is given a randomly selected Report from the Repository, which he may accept or decline to review. A review means the Data Detective examines the claim made in the Report to determine whether the content found at the URL actually supports the data sharing claim (i.e., the edge between the two nodes) and whether the URL and its content seem credible.
Data Detectives vet Reports by voting to accept, reject, or question the Report. A vote of accept means the Data Detective reviewed the content and found it credible support for the sharing claim. A vote of reject means the Data Detective did not find the content sufficient support. A vote of question signals the Data Detective is uncertain and has concerns, which he notes in his response. (Note: the responses accept/reject/question may be discrete values or values on a continuous scale from -1 to +1, depending on the ranking system in use at the time.) Each Data Detective, node, and edge has a talk page on which any Data Detective can write comments to facilitate communication.
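As one illustration, the discrete votes could be related to the continuous scale like this (a minimal sketch; the encoding and the `band` threshold are assumptions, not the project's specification):

```python
# Hypothetical mapping between discrete votes and the -1...+1 scale.
VOTE_VALUES = {"accept": 1.0, "question": 0.0, "reject": -1.0}

def encode_vote(vote):
    """Map a discrete vote onto the continuous -1...+1 scale."""
    return VOTE_VALUES[vote]

def decode_score(score, band=0.25):
    """Map a continuous score back to a discrete response.

    Scores near zero (within +/- band) are treated as "question".
    """
    if score > band:
        return "accept"
    if score < -band:
        return "reject"
    return "question"

print(encode_vote("reject"))  # -1.0
print(decode_score(0.6))      # accept
```

Either representation carries the same three-way decision; the continuous form simply lets a ranking system express degrees of confidence.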
The Repository consists of Reports contributed by Data Detectives, as described above. It also contains seeded Reports, which an external expert team has already vetted as good or bad. A Data Detective may therefore be asked to review a Report known to be good (which he should accept), a Report known to be bad (which he should reject), or a Report whose quality is unknown (where his response helps classify the Report).
Data Detectives as Classifiers
In computer science terms, each Data Detective is a classifier. To determine how good a classifier he is, the system presents him with Reports to classify for which the proper classification (accept/reject) is already known. Based on his performance on these known Reports, the system learns to assess his accuracy, so that when he is presented with a new Report whose classification is unknown, the system can weight his response accordingly.
As classifiers, a natural way to measure the performance of a Data Detective is to compare his rate of giving the correct versus incorrect answers by computing the area under his personal ROC curve. Here are the mechanics of the computations. Each decision to accept/reject a Report known already to be good or bad has one of four outcomes. The Data Detective could: accept a good report (true positive), reject a bad report (true negative), accept a bad report (false positive), or reject a good report (false negative). The true positive rate is the number of true positives divided by the sum (true positives + false negatives). The false positive rate is the number of false positives divided by the sum (false positives + true negatives).
The ROC curve is the plot of his true positive rate against his false positive rate. The figure below provides an example. The larger the area under the curve, the better the classifier. While project researchers will compute and compare other measures of reliability for Data Detectives, it is initially expected that the area under the ROC curve will be the critical measure for accuracy.
[Figure: an example ROC curve]
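The bookkeeping above can be sketched in a few lines of Python (a minimal illustration; the function names and example counts are assumptions, not the project's code):

```python
def rates(tp, fn, fp, tn):
    """True positive rate and false positive rate from the four outcomes."""
    return tp / (tp + fn), fp / (fp + tn)

def auc(scores_good, scores_bad):
    """Area under the ROC curve, computed as the probability that a
    randomly chosen good Report is scored above a randomly chosen bad
    one (the Wilcoxon-Mann-Whitney formulation)."""
    wins = sum((g > b) + 0.5 * (g == b)
               for g in scores_good for b in scores_bad)
    return wins / (len(scores_good) * len(scores_bad))

# Example: a detective vetted 10 good and 10 bad seeded Reports,
# accepting 8 of the good ones and 2 of the bad ones.
tpr, fpr = rates(tp=8, fn=2, fp=2, tn=8)
print(tpr, fpr)  # 0.8 0.2

# With continuous vote scores, the AUC summarizes the whole curve:
print(auc([0.9, 0.7, 0.2], [0.1, -0.5]))  # 1.0 (perfect separation)
```

An AUC of 0.5 corresponds to guessing at random; an AUC of 1.0 means the detective's scores perfectly separate good Reports from bad ones.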
A critical goal is to have accurate information. Therefore, each edge has a credibility score, which is a number from 0 ("myth") to 1 ("concrete"). Viewers who visit thedatamap.org can select a map based on a credibility threshold. The credibility of an edge is computed from the weighted votes of Data Detectives, where the weights are the classifier scores computed for each Data Detective (described above).
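A minimal sketch of such a weighted credibility score, assuming votes are mapped to [0, 1] (reject = 0, question = 0.5, accept = 1) and weighted by each detective's classifier score (the vote mapping and fallback value are illustrative assumptions):

```python
def credibility(votes, weights):
    """Weighted average of votes in [0, 1]; each weight is the voting
    detective's classifier score (e.g., his ROC area)."""
    total = sum(weights)
    if total == 0:
        return 0.5  # no reliable voters yet: treat the edge as uncertain
    return sum(v * w for v, w in zip(votes, weights)) / total

# Two accepts from strong classifiers outweigh a reject from a weak one:
score = credibility([1.0, 1.0, 0.0], [0.9, 0.8, 0.5])
print(round(score, 2))  # 0.77
```

Under this scheme, votes from detectives with a poor vetting record move an edge's credibility very little, while accurate detectives dominate the score.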
The system will archive the contents of each reported URL near the time a Data Detective first submits the Report that uses it. During the Vetting Game, Data Detectives are expected to use their own browsers to investigate the contents of the URL at the time of vetting. The archived copy is used to resolve disputes.
Once submitted, Reports cannot be removed. Given that anyone can generate a Report, it is, of course, possible for biased, out-of-date, or incorrect information to appear. However, because so many other people vet Reports, incorrect information should be rejected, and a history is maintained of what was debunked, even though that history may not be publicly available. Thus, the overall accuracy of theDataMap™ improves over time.
Data Detectives may review content on theDataMap and find some data sharing arrangements more interesting than others. To help identify these, a Data Detective may assign an interest factor of 1, 2, or 3 "thumbs up" to any data sharing report regardless of whether he provided the Report or voted on it. It is likely that some Data Detectives will just provide interest factors and not engage in logging and vetting reports.
Finding the best incentives for participation is part of the research effort, but as a starting point, Patient Privacy Rights Foundation will provide the following rewards:
Copyright © 2012 President and Fellows of Harvard University.