UpGuard can now report that a public Google Cloud Storage bucket containing approximately 1.5 terabytes of data used to administer funding programs for college students has been secured. The bucket belonged to SmarterSelect, a company that provides software for managing the application process for scholarships, grants, and awards. The more than 2.8 million files included documents like transcripts, resumes, personal essays, tax returns, and invoices for approximately 1.2 million applications to funding programs.
Discovery
On September 8, an UpGuard analyst detected the bucket hosted in Google Cloud Storage and began analysing the contents to determine the owner and affected entities. On September 15, UpGuard sent a breach notification email to the address listed in SmarterSelect’s privacy policy. On September 27, another notification was submitted through the support function on SmarterSelect’s site. On September 30, SmarterSelect replied. By October 5, public access to the bucket had been removed.
Significance
The contents of the bucket were organized into nine top-level directories. The directories with logos gave an indication of the number of organizations involved in this data exposure. The directories for “provider_logos” and “fund_logos” contained logos for a few hundred entities each, while “scholarship_logo” had over 15,000 logos. Two other directories contained the files with significant personal data: “file_attachment_file” and “exports,” which contained original files submitted as part of applications and exported summaries of applications, respectively.
The “exports” folder contained approximately 23,000 CSVs and 8,000 ZIP files. Inside the ZIP files were about 150,000 PDFs with dates ranging from November 5th, 2020 to September 29th, 2021. The PDF files were mostly “printed” copies of submitted applications and evaluations. The CSV files were categorized as “user,” “apps,” and “evals,” and had data about user accounts, application contents, and the evaluations of those applications by reviewers. Some of the data personally identified the evaluators, including their names, email addresses, organizations, and evaluation comments and results, but most of the information pertained to the applicants. The structures of the CSVs varied based on the data requested by each application process, making a wide variety of data points available in some but not all files. Across all the CSVs there were 1.98 million unique email addresses.
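Because the CSV structures varied from file to file, a count of unique email addresses cannot rely on a fixed column name. The following is a minimal sketch of one way such a count could be made, not a description of the tooling UpGuard used: it scans every cell for anything that looks like an email address and deduplicates case-insensitively. The function name, the regular expression, and the sample data are all assumptions made for illustration.

```python
import csv
import io
import re

# Simple illustrative pattern; it is not a full RFC 5322 validator.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def unique_emails(csv_texts):
    """Return the set of distinct, lowercased email addresses found in any cell."""
    seen = set()
    for text in csv_texts:
        for row in csv.reader(io.StringIO(text)):
            for cell in row:
                for match in EMAIL_RE.findall(cell):
                    seen.add(match.lower())
    return seen


# Hypothetical sample data with two different column layouts.
sample_a = "name,email\nAda,ada@example.org\nBob,BOB@example.org\n"
sample_b = "applicant,contact,reviewer_email\nAda,Ada@Example.org,rev@example.com\n"

emails = unique_emails([sample_a, sample_b])
print(len(emails))
```

Scanning every cell trades precision for coverage: it tolerates unknown headers but will also pick up addresses embedded in free-text fields.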
For applicants, these files contained contact information like name, email address, and phone number, as well as details probing into their lives and backgrounds, like their parents’ education and income, the students’ performance at school, and personal experiences like living in a foster home or abusive situations.
In addition to the structured data, some files also contained the text of longer documents that had been submitted and reviewed. These included intensely revealing statements like letters of recommendation and personal essays detailing poverty, physical and sexual abuse, domestic violence, and other personal information.
The “file_attachment_file” directory had 2.79 million files, the vast majority of the total collection. These files were organized into 1.2 million subfolders, each of which contained the original files submitted by an applicant for a given funding opportunity, often PDF and .DOCX files of applicants’ transcripts, letters of recommendation, and other academic documentation. Student photos were included when required as part of the application. Additionally, some applications included documents related to the applicant’s financial status, like FAFSA forms, which included the last four digits of the person’s social security number, and personal and parental tax returns, which included full social security numbers.
Manual review of a sampling of documents indicated they were largely the kinds of documents commonly used to apply for scholarships: transcripts, personal statements, letters of recommendation, and other documentation of university status. Searching across the names of files in “file_attachment_file” gave a conservative indication of the number of files of each type. (Many more documents were simply titled with the applicant’s name and could not be classified based on file name.) Other documents answered the specific requirements of particular programs and included information like proof of COVID-19 vaccinations and descriptions of hardships.
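A filename-based count of this kind can be sketched as follows. This is an illustration under stated assumptions rather than the analysis actually performed: the keyword lists, category labels, and sample file names are hypothetical. It also shows why the result is conservative, since any file whose name contains none of the keywords lands in an “unclassified” bucket regardless of its contents.

```python
from collections import Counter

# Hypothetical keyword lists; the search terms used in the real analysis are not stated.
KEYWORDS = {
    "transcript": ("transcript",),
    "recommendation": ("recommendation", "reference"),
    "essay": ("essay", "personal_statement", "personal statement"),
    "tax": ("1040", "tax_return", "tax return"),
}


def classify(filename):
    """Return the first category whose keyword appears in the file name, else None."""
    name = filename.lower()
    for label, terms in KEYWORDS.items():
        if any(term in name for term in terms):
            return label
    return None


def count_by_type(filenames):
    """Tally file names per category; names with no keyword count as 'unclassified'."""
    counts = Counter()
    for filename in filenames:
        counts[classify(filename) or "unclassified"] += 1
    return counts


# Hypothetical file names for illustration only.
sample = [
    "Jane_Doe_Transcript.pdf",
    "essay_final.docx",
    "Jane Doe.pdf",
    "letter_of_recommendation.pdf",
]
print(count_by_type(sample))
```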
Conclusion
While Amazon S3 buckets are better known for their history of data exposures due to cloud misconfiguration, Google Cloud Storage has fundamentally the same configuration options. Like S3, Google Cloud Storage includes a UI element in the console that indicates when buckets or files are public to help users avoid data exposures, but misconfigurations still happen. In this case, the bucket and its contents were configured to be publicly accessible.
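For readers who want to check their own buckets, public access in Google Cloud Storage is typically granted by an IAM binding to the special principals `allUsers` (anyone on the internet) or `allAuthenticatedUsers` (anyone signed in to a Google account). The sketch below is a minimal example, assuming the policy has already been fetched and parsed into a dictionary in the shape of an IAM policy JSON document; the function name and the example policy are hypothetical. It does not inspect legacy object or bucket ACLs, which can also grant public access, and it says nothing about how the SmarterSelect bucket in particular was configured.

```python
# Principals that Google Cloud IAM uses for anonymous and any-signed-in access.
PUBLIC_MEMBERS = {"allUsers", "allAuthenticatedUsers"}


def public_bindings(policy):
    """Return (role, member) pairs in an IAM policy dict that grant public access."""
    found = []
    for binding in policy.get("bindings", []):
        for member in binding.get("members", []):
            if member in PUBLIC_MEMBERS:
                found.append((binding.get("role"), member))
    return found


# Hypothetical policy, shaped like the JSON form of a bucket's IAM policy.
example_policy = {
    "bindings": [
        {"role": "roles/storage.objectViewer", "members": ["allUsers"]},
        {"role": "roles/storage.admin", "members": ["user:admin@example.com"]},
    ]
}

print(public_bindings(example_policy))
```

An empty result means no IAM binding names a public principal; a non-empty result lists the roles that anyone can exercise on the bucket.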
The contents of the bucket also serve as a reminder of the risks of collecting and retaining sensitive data, particularly for populations like college students. The process of applying to, attending, and securing funding for university education requires young people to provide detailed information about themselves to a complex institutional supply chain. Even well-intentioned programs aiming to assist students who have been disadvantaged by circumstances beyond their control, and especially those programs that seek to help those most in need, require a detailed accounting of the facts of one’s life.
Where all that data ultimately goes, how it is secured, and whether it is ever destroyed are not under the control of the applicants. For the companies, non-profits, and universities that make up that digital supply chain, destroying that information once it is no longer needed may provide a more foolproof path to privacy than retaining and protecting it. As data exposures like this one continue to occur, and ransomware groups target schools across the world, minimizing cybersecurity risk continues to be an important part of an overall security strategy.