Tracking the Trackers: Where Everybody Knows Your Username
Click the local Home Depot ad and your email address gets handed to a dozen companies monitoring you. Your web browsing, past, present, and future, is now associated with your identity. Swap photos with friends on Photobucket and clue a couple dozen more in on your username. Keep tabs on your favorite teams with Bleacher Report and you pass your full name to another dozen. This isn't a 1984-esque scaremongering hypothetical. This is what's happening today.
Background on Third-Party Web Tracking and Anonymity
In a post on the Stanford CIS blog two months ago, Arvind Narayanan explained how third-party web tracking is not at all anonymous.
In the language of computer science, clickstreams – browsing histories that companies collect – are not anonymous at all; rather, they are pseudonymous. The latter term is not only more technically appropriate, it is much more reflective of the fact that at any point after the data has been collected, the tracking company might try to attach an identity to the pseudonym (unique ID) that your data is labeled with. Thus, identification of a user affects not only future tracking, but also retroactively affects the data that's already been collected. Identification needs to happen only once, ever, per user.
Arvind noted five ways in which a user's identity may be associated with third-party web tracking data.
A third party is also a first party, e.g. Facebook, Twitter, or Google+.
A first party hands off ("leaks") identifying information to a third party.
A third party buys identifying information from a "matching service."
A third party exploits a security vulnerability to learn a user's identity.
A third party "deanonymizes" its data by matching it against identified data.
This post is an empirical study of identifying information leakage from first-party websites to third-party websites.
Web Information Leakage
Leakage most often occurs when a first-party website stuffs information into a URL. For example, suppose Example Website sends users to the following URL after they register:
http://example.com/register?
username=GoCardinal
&name=Leland%20Stanford
&email=leland%40stanford.edu
&...
Third parties embedded in the page will receive the URL in a referrer header or equivalent – and therefore Leland Stanford's username, name, and email.
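To make this concrete, here is a minimal Python sketch of what an embedded third party could recover from the referrer alone. It reuses the example URL above, leaving out the elided trailing parameters. The helper name `fields_from_referrer` is our own invention for illustration and does not describe any particular tracker's code.

```python
from urllib.parse import urlsplit, parse_qs

def fields_from_referrer(referrer: str) -> dict:
    """Return the query-string fields carried by a referrer URL."""
    query = urlsplit(referrer).query
    # parse_qs percent-decodes values and maps each key to a list of values;
    # we keep the first value per key for readability.
    return {key: values[0] for key, values in parse_qs(query).items()}

# The example registration URL, as it would arrive in the referrer header.
referrer = (
    "http://example.com/register?"
    "username=GoCardinal"
    "&name=Leland%20Stanford"
    "&email=leland%40stanford.edu"
)

leaked = fields_from_referrer(referrer)
print(leaked)
```

No cooperation from the first party is needed beyond embedding the third party's content: the browser sends the full URL on its own.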
Another common form of leakage is through the page title. Suppose a website's landing page includes a title tag of:
<title>Welcome, Leland Stanford!</title>
Embedded third-party scripts often report back with the page title; in this case, they'd include Leland Stanford's name.
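As a rough illustration of that flow, the sketch below builds the kind of reporting request an embedded script might send and then decodes it on the receiving side. The endpoint `tracker.example`, the path `/beacon`, and the parameter names `url` and `title` are all hypothetical; real trackers use their own formats.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def beacon_url(page_url: str, page_title: str) -> str:
    """Build a reporting request that carries the page URL and title."""
    # Hypothetical endpoint and parameter names, for illustration only.
    params = {"url": page_url, "title": page_title}
    return "http://tracker.example/beacon?" + urlencode(params)

# What the embedded script would send from the landing page.
request = beacon_url("http://example.com/home", "Welcome, Leland Stanford!")

# What the third party can read back out of the request it receives.
received = parse_qs(urlsplit(request).query)
print(received["title"][0])
```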
Leakage, in common parlance, implies unintentionality. In computer security, leakage is a term of art for an information flow – some instances of leakage are entirely intentional. For example, OkCupid, a free online dating website, appears to sell user information to the data providers BlueKai and Lotame, including gender, age, ZIP code, relationship status, and drug use frequency.
In a series of groundbreaking studies, Balachander Krishnamurthy, Craig Wills, and Konstantin Naryshkin have demonstrated that information leakage is a pervasive problem (1, 2, 3). In their most recent paper, the authors examined signup and interaction with 120 popular sites for information leakage to third parties. They found that 56% leaked some form of private information, and 48% leaked a user identifier.
We roughly followed the same methodology as Krishnamurthy, Wills, and Naryshkin, with 1) a focus on identifying information leakage, 2) a greater number of sites, and 3) a public dataset.
Usernames as Identifying Information
Given the sizeable role usernames play in web information leakage, it's worth taking a moment to note how a username is identifying information. In some cases a username is just a user's name – for example, @jonathanmayer on Twitter. Even when it isn't the user's name, a username is often more than adequate for identifying a user.
First, a username is likely sufficient to link accounts across websites. Users routinely reuse their usernames – after all, who's going to remember a new login for each site they use? In a paper at PETS 2011, Daniele Perito et al. examined a sample of public data from Google, eBay, and other sites to estimate how linkable usernames are. They found that the vast majority of usernames in their sample had high entropy, and that simple algorithms for linking usernames could achieve pairwise precision and recall of over 70%. (For further discussion of using usernames to link social profiles, see Arvind's blog posts "The Linkability of Usernames" and "Lendingclub.com: A De-anonymization Walkthrough," as well as "Modeling Unintended Personal-Information Leakage from Multiple Online Social Networks" and "Large Online Social Footprints - An Emerging Threat" by Danesh Irani et al.) Some companies are already linking usernames in their products, including social matching services (e.g. Infochimps), scraped profiles (e.g. Spokeo), and automated social network linkage (e.g. Google Social Search).
Second, combining data from multiple accounts often provides a sufficiently comprehensive mosaic to identify an individual. Arvind, for example, usually goes by the username "randomwalker." The first page of a Google search turned up his Y Combinator Hacker News account, which includes his job and links to his personal website, blog, and Twitter account.
Some websites (e.g. Quantcast) have responsibly recognized that a username is identifying information and have included username in their legal definition of "personally identifiable information" (PII).
Methodology
We examined each website in the Quantcast top 250, checking whether it
offered a sign up,
did not require a purchase or other qualification to sign up, and
did not include so many features as to be impractical for study.
For each of the 185 websites that met all three criteria, we used the FourthParty web measurement platform to create an account and interact with the site. We emphasized exploring content that dealt with a user's identity, such as profile and settings pages. After collecting data, we searched Request-URIs and Referrer headers for known personal information. We treated each public suffix + 1 (PS+1) as an independent entity, and we considered any PS+1 different from a first party's to be a third party.
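The search step can be sketched roughly as follows. This is an illustration under stated assumptions, not the FourthParty code: the function names, the log format (pairs of request URL and referrer), and the tracker domains are hypothetical, and the tiny suffix set stands in for the full Public Suffix List that a real analysis would need.

```python
from urllib.parse import urlsplit, unquote_plus

# Tiny stand-in for the Public Suffix List (assumption: a real analysis
# would load the complete list rather than hardcode a few entries).
ILLUSTRATIVE_SUFFIXES = {"com", "net", "org", "edu", "co.uk"}

def ps_plus_one(host: str) -> str:
    """Return public suffix + 1 label, e.g. 'www.example.com' -> 'example.com'."""
    labels = host.lower().rstrip(".").split(".")
    # Try the longest candidate suffix first, then keep one label to its left.
    for i in range(len(labels)):
        if ".".join(labels[i:]) in ILLUSTRATIVE_SUFFIXES:
            return ".".join(labels[max(i - 1, 0):])
    return host.lower()

def find_leaks(first_party_host, requests, known_pii):
    """Map each third-party PS+1 to the PII labels found in its requests.

    `requests` holds (request_url, referrer) pairs; `known_pii` maps a
    label such as 'email' to the value entered during signup.
    """
    first_party = ps_plus_one(first_party_host)
    leaks = {}
    for request_url, referrer in requests:
        recipient = ps_plus_one(urlsplit(request_url).hostname or "")
        if not recipient or recipient == first_party:
            continue  # same entity as the first party, so not a third party
        # Search raw and percent-decoded forms, since values are often encoded.
        texts = [request_url, referrer or ""]
        haystack = " ".join(texts + [unquote_plus(t) for t in texts]).lower()
        for label, value in known_pii.items():
            if value.lower() in haystack:
                leaks.setdefault(recipient, set()).add(label)
    return leaks

# Hypothetical log entries for a signup on example.com.
pii = {"username": "GoCardinal", "email": "leland@stanford.edu"}
log = [
    ("http://static.example.com/app.js",
     "http://example.com/register?username=GoCardinal"),
    ("http://metrics.tracker-a.net/b?c1=2",
     "http://example.com/register?username=GoCardinal&email=leland%40stanford.edu"),
    ("http://ads.tracker-b.org/pixel?page=%2Fuser%2FGoCardinal",
     "http://example.com/"),
]
leaks = find_leaks("www.example.com", log, pii)
print(leaks)
```

Plain substring matching like this finds only values sent in the clear or percent-encoded; hashed or otherwise transformed identifiers would need additional checks.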
Results
A complete spreadsheet of results is available in Excel format. We encourage interested readers to examine the results for themselves. Please email if you would like FourthParty logs for a specific site.
The most frequent type of leakage was a username or user ID. We identified username or user ID leakage to a third party on 113 websites, 61% of the websites in our sample. The top five PS+1 recipients of username and user ID leakage were:
- scorecardresearch.com (comScore), on 81 (44%) of the websites in our sample
- google-analytics.com (Google Analytics), on 78 (42%) of the websites in our sample
- quantserve.com (Quantcast), on 63 (34%) of the websites in our sample
- doubleclick.net (Google Advertising), on 62 (34%) of the websites in our sample
- facebook.com (Facebook), on 45 (24%) of the websites in our sample
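The percentages above follow directly from the counts and the 185-site sample. A quick check in Python (the helper name `share` is ours):

```python
SAMPLE_SIZE = 185  # websites that met all three selection criteria

# Number of sampled websites leaking a username or user ID to each recipient.
counts = {
    "any third party": 113,
    "scorecardresearch.com": 81,
    "google-analytics.com": 78,
    "quantserve.com": 63,
    "doubleclick.net": 62,
    "facebook.com": 45,
}

def share(count: int, total: int = SAMPLE_SIZE) -> int:
    """Return count as a whole-number percentage of the sample."""
    return round(100 * count / total)

percentages = {recipient: share(n) for recipient, n in counts.items()}
for recipient, pct in percentages.items():
    print(f"{recipient}: {counts[recipient]} of {SAMPLE_SIZE} ({pct}%)")
```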
Some websites leaked the username or user ID to dozens of third parties. For example, popular photo sharing website Photobucket embeds username in many of its URLs, and includes advertising on most of its pages; we observed the username get sent to 31 third-party PS+1s.
Other identifying information leaked in a number of instances. A sample:
Viewing a local ad on the Home Depot website sent the user's first name and email address to 13 companies.
Entering the wrong password on the Wall Street Journal website sent the user's email address to 7 companies.
Changing user settings on the video sharing site Metacafe sent first name, last name, birthday, email address, physical address, and phone numbers to 2 companies.
Signing up on the NBC website sent the user's email address to 7 companies.
Signing up on Weather Underground sent the user's email address to 22 companies.
The mandatory mailing list page during CNBC signup sent the user's email address to 2 companies.
Clicking the validation link in the Reuters signup email sent the user's email address to 5 companies.
Interacting with Bleacher Report sent the user's first and last names to 15 companies.
Interacting with classmates.com sent the user's first and last names to 22 companies.
Implications
From a legal perspective, identifying information leakage is a debacle. Many first-party websites make what would appear to be incorrect, or at minimum misleading, representations about not sharing PII. Here are some examples.
Personal Information Disclosure: The Home Depot will not trade, rent or sell your personal information, without your prior consent, except as otherwise set out herein. [Does not describe sharing with third-parties for advertising or analytics.]
We will not sell, rent, or share your Personal Information with these third parties for such parties' own marketing purposes, unless you choose in advance to have your Personal Information shared for this purpose. Information about your activities on our Online Services and other non-personally identifiable information about you may be used to limit the online ads you encounter to those we believe are consistent with your interests. Third-party advertising networks and advertisers may also use cookies and similar technologies to collect and track non-personally identifiable information such as demographic information, aggregated information, and Internet activity to assist them in delivering advertising on our Online Services that is more relevant to your interests.
Metacafe's Privacy Policy is to share personal information only with the owner's informed consent.
Likewise, a number of third-party trackers disclaim collection of personally identifiable information:
Does your beacon collect or store any personally identifiable information about me?
The tagging used by ScorecardResearch is unable to identify the user visiting a page.
We do not tie the information gathered by Quantcast Tags to the personally identifiable information of visitors to a Web site.
. . .
We do not link Log Data to any other Personally Identifiable Information about you or otherwise attempt to discover your identity.
We don't collect or serve ads based on personally identifying information without your permission.
The better practice for all first-party and third-party websites would be to acknowledge that identifying information leakage is a fact of life on the web, and that identifying information may be shared with third parties.
As for policy, some strands of the Do Not Track debate echo a sentiment of "it's all anonymous," and so, "where's the harm?" We believe there is now overwhelming evidence that third-party web tracking is not anonymous. It is a legitimate policy question whether, on balance, Do Not Track should be enforced by law. But the difficult weighing of competing privacy risks and economics can't be short-circuited by claims of anonymity.