While tech-savvy people are very concerned about privacy, knowing where to find metadata leaks can be nebulous even for developers. In this blog post, we will explore examples of unexpected user information leakage. We hope that the information shared in this blog will help developers assess and address potential privacy issues with their applications, as well as educate end-users about potential risks to their privacy that can result from information leaks.
We picked six examples based on design flaws that are often overlooked. We recognize that common vulnerabilities such as SQL injection and memory corruption often lead to confidentiality and privacy issues as well. We feel that these issues also pose a significant risk to privacy and can compromise personal data if not addressed properly.
#1 – Gravatar: Weak Encoding of Private Information
The Gravatar service is a centralized avatar service where users upload a personal image, or avatar, that will be used by all services integrating with Gravatar. It removes the burden from the website to handle image upload and manage the user’s avatars. From a performance and security perspective, this can be highly beneficial but in return, the identity of users can be exposed.
Gravatar integration on a blog |
MD5("[email protected]") => 06b856f7ee1266fbf86eaa018f5b0906
https://secure.gravatar.com/avatar/06b856f7ee1266fbf86eaa018f5b0906?s=96&d=identicon&r=G
Since 2007, Gravatar’s use has spread to hundreds of millions of websites. Being so widespread, it is easy to forget that it can negatively affect users. To demonstrate this, we sampled 91 million Gravatar hashes from the public WordPress.com API. From those, 60% were reversed easily into email form. To crack those hashes, we used some decent GPUs and hashcat. At this point, we disagree with Gravatar’s statement that no email addresses were exposed: A cryptographic hash function like MD5 applied to an email address does not provide sufficient protection to keep it private when it is exposed publicly as part of an API.
What is the risk for your users?
First, there are privacy concerns with the disclosure of user emails. A user might want to remain anonymous when commenting or using specific websites. Some organizations might need to protect their employees to avoid attacks targeting them.
Email disclosure of all users can also greatly improve brute force attack success rates. Since we live in an era of periodic password breaches, single-factor authentication is more vulnerable than ever to credential stuffing attacks. An attacker can correlate the list of a specific user’s email addresses against known passwords found in breaches.
Emails can serve as pivot points for effective credential stuffing |
Mitigations
- The first and most important mitigation would be to avoid using Gravatar if your users’ anonymity is required.
- If you want to make it much harder to deanonymize your users, you can download the image server-side rather than pointing to the Gravatar service. Image correlation might still be possible, but it will be much more time-consuming than targeting a specific hash. (See next section)
- As a complement to this, the implementation of a Multiple-Factor Authentication is your best ally to defend against credential stuffing.
#2 – Facebook Login and Profile Picture: Correlation
Like Gravatar, using Facebook’s profile photo provided by the Facebook API can lead to the deanonymization of your users.
When integrating Facebook Login, it is tempting for developers to use available properties to complete the profile of a newly onboarded user. Once the user is authenticated, it is possible to obtain the public profile picture from this user by using a temporary or permanent graph URL.
https://graph.facebook.com/<USER_ID>/picture?type=large
We can’t query the graph API with this user id unless we have specially requested authorization from the user. We can, however, correlate the associated Facebook profile by comparing the picture content.
As an example, we pick one website using Facebook Login to authenticate its users: Strava. Strava – a fitness tracker application – is making great efforts to provide different levels of anonymity. Here is a user who created their account with Facebook Login. This can be proven by the presence of the graph URL:
https://graph.facebook.com/3096958013955938/picture?height=256&width=256
Here is the same profile picture when visiting their Facebook profile directly: a different URL is used to render the user photo:
https://xxxx.fbcdn.net/v/t1.18169-1/11034273_1385063198478770_5760666877135868228_n.jpg
Although a different URL is used to render the user photo, if we compare both images’ content, we can see that both images are identical:
Comparer view in Burp showing that the content is identical |
What does this mean for an attacker?
A motivated attacker could index the signature of the picture for the different profiles they target. From this database, a quick lookup could reveal the Facebook profile associated with any profile with a Graph picture reference. The Graph URL will have to be adjusted to return an image that matches the database. (For example, the query string can be changed from ?height=256&width=256 to ?height=200&width=200.)
Potential defenses to reduce the risk
Re-encoding the image can increase the effort for a malicious actor trying to index profile images. Alternative methods could be used to match images. Exact pixel matching would, however, be prone to produce a false-negative due to compression artifacts and Convolutional Neural Network (CNN) would generate many false positives.
#3 – Metadata Embedded in Various File Format
Documents and files can contain small bits of information that can be helpful to profile their author. Metadata in images can reveal additional information like the software version and local paths (some image generators will include those). More importantly, most photos taken with a cellphone will include GPS coordinates. It is the website’s duty to strip that information out of the images they expect to be shared publicly.
Additionally, images downloaded from Facebook can possibly be linked to their source because of a unique ID present in the metadata. This ID has the format FBMD_____.
More than just images
While images are the prime targets for metadata, it is easy to forget that many file formats (ie. Word documents, PDFs, etc.) can also contain metadata. Let’s look at email disclosure in a less common document type: PGP public keys.
Keybase is a website that aims to provide a profile aggregation with mechanisms to validate and connect the user’s identity to external profiles. It is also a platform where users can chat using end-to-end encrypted communication. It also encourages the use of PGP for email exchange. Therefore, a unique pair of keys is generated for all users during their onboarding.
Here is a profile example on Keybase.io:
What the user sees on their profile |
Information available from the REST endpoint user/lookup |
User ID present in OpenPGP format |
Recommendations
To mitigate, you must first identify documents or files that are likely to contain metadata (eg. Docx, pptx, odt, ods, pgp, pdf). Then, a specific procedure should be applied to each document type to remove fields that could contain private information.
In the case of images, you can strip metadata by re-encoding the image. This process should obviously not be applied if your application is a storage hosting provider. In this case, you can potentially warn your users about the presence of specific metadata before they use sharing features.
#4 – Search Features Matching on Private Fields
Search features can be a source of information leakage. Confidential information might not be shown on your application but could be queried through search features. In the example below, the application is never going to show the address and phone number of the user. The application is, however, allowing those fields to be queried.
Here is a simple example of information disclosure through search. In SonarSource, a popular code analysis software, there is a public API to search users. The API’s documentation describes the email and other fields as private (see image below).
However, it is possible to brute force the email address of all users because it matches all fields from the entity, not just those being displayed.
The search with an email returns the associated user. |
Doubt the documentation
We have also stumbled upon a few API endpoints where the same principle applies. In one case, the application supports an endpoint with the URL pattern “api/users/<USER_ID>“. While the documentation gave examples with a user ID described as GUID, the field also supported username and email as identifiers. With some APIs, users can even be retrieved from a partial email address search. That said, even if it requires a full email match, an attacker could still feed in a large list of email addresses to reveal many user accounts registered on the system with their associated emails.
#5 – Verbose REST or GraphQL Endpoint
REST API and GraphQL are widely popular in dynamic applications making use of AJAX. It is important for developers to remember to show only what is needed by the front end. The code below demonstrates a simple API that reuses the same class for persistence and its public API. It might have been fine at the API inception, but over time new fields might be added to the UserEntity class. If that happens, fields such as physical address and IP address are now visible on the client-side.
@Entity public class UserEntity { @Id private Long id; private String username; private String fullName; private String address; private String ip; […] @Controller public class UsersController { @RequestMapping("/users") public UserEntity getU(@RequestParam(“id") String id) { return … } […]
To avoid such problems, it is important to keep a separation between the public API class – often called Data Transfer Object (DTO) – and the persistence model. While it seems like an unnecessary conceptual duplication, this same design choice can also help stabilize the API if the model is having frequent evolution.
#6 – Side-Channels
A side-channel attack is based on information gained from the implementation. Data is sent to trigger a specific behavior and another channel can provide an extra source of information. Here, we are referring to channels as different systems or protocols.
These attacks usually refer to physical attacks on hardware. It might be stretch to call the following scenarios as such.
Here is an example of side-channels observed on most Git hosting providers (eg. GitHub, Gitlab and BitBucket).
GitHub: Enriching Git commits
If an attacker wants to deanonymize multiple corporate emails, they will be interested in obtaining the full name of the owners of those emails and their personal email, if possible.
The deanonymization process can be performed in two simple steps:
- The attacker imports a git repository with commits containing spoofed emails (emails to deanonymize).
- Once imported, the attacker views those same commits through the web interface to see if the identities were enriched with additional information. (username, full name, organization)
Commit imported seen via the Git logs:
$ git log commit a152c6fc3ac19ebf0afe4b2da3e9414ef2c303fa (HEAD -> master, origin/master) Author: unknown <[email protected]> Date: Wed Mar 2 02:05:57 2022 -0500 Test commit
Account information associated with the email used in the commit |
SMTP: Enriching Emails Sender
We have seen multiple webmail services subject to this same design issue. You could import a series of crafted emails with email addresses to deanonymize as sources (ie. From: attribute) and view the author of the mail through the web interface. However, most of those email providers have simpler ways to deanonymize emails.
For Gmail or Google accounts, a tool named GHunt returns additional metadata based on any Google account email found. For Outlook/LinkedIn, rather than using the SMTP channel, we can use LinkedIn contacts import feature and the WebSocket API for outlook card as it provides a direct way to get the information.
Threat Modeling
Evaluating the risk of small information leakage can often be subtle. For this reason, a dedicated threat modeling exercise might be needed to properly evaluate the privacy implications of your APIs.
Here are three things you can identify in your next threat modeling exercise:
- APIs that could return sensitive information
- APIs that could be queried with sensitive information
- Third party integrations (ie: Facebook Login, Google Account, Gravatar, etc.) and their privacy implications
Conclusion
With the information provided, we are hopeful that developers can address some key challenges to privacy through improvements to their API designs. In addition, threat modeling exercises can help with the ongoing evaluation of the risks associated with applications.
This blog can also be beneficial to end-users. ‘Forewarned is forearmed’ – as the saying goes. We hope that a warning like this will result in end-users taking more precautions. As a user, you can reconsider how you share information publicly, such as your location, affiliations or even full name knowing that your Facebook account will be linked to multiple websites. Keep in mind that what you publish is likely to be linked with additional information on other websites integrating with third parties such as Facebook Login.
And be sure to check this blog regularly for research, threat intelligence and security updates from the team at GoSecure Titan Labs. You can also follow GoSecure on Twitter and LinkedIn.
References
- Gravatar public response to email disclosure: https://en.gravatar.com/support/data-privacy
- Email disclosure on WordPress.com and its impact: https://www.gosecure.net/blog/2021/03/02/emails-disclosure-on-wordpress/
- Graph API documentation: https://developers.facebook.com/docs/graph-api/overview/
- A presentation at Confoo about Privacy Pitfalls for Developers: https://gosecure.github.io/presentations/2022-02-25-confoo-privacy/Privacy_pitfalls_for_your_web_application.pdf