Data classification is the process of organizing structured and unstructured data into categories, so it can be used and secured more efficiently. It makes data easier to locate and retrieve while facilitating better risk management, legal discovery, and regulatory compliance.
Data classification involves labeling sensitive data and personal information to make it searchable and trackable. This has the added benefit of eliminating duplicate data, reducing storage and backup costs, and helping minimize cybersecurity risk.
A written data classification policy will generally outline procedures and guidelines about what classification levels your organization uses to group data, as well as the specific roles and responsibilities of employees who act as data stewards.
Once a data classification scheme has been created and data has been grouped, you must determine the appropriate security standards for each category across its lifecycle.
While creating a data classification policy may sound like a purely technical exercise, people at every level of your organization need to understand their role in it.
Why is Data Classification Important?
Data classification is important because you have limited resources that you can invest in safeguarding your sensitive data. Knowing what types of data need protection means you can set priorities and allocate your budget and other resources to the most high-impact areas, minimizing data security and compliance costs.
Additionally, data classification can help you comply with regulations inside and outside the United States, including the Family Educational Rights and Privacy Act (FERPA), PCI DSS, HIPAA, CPS 234, ITAR, and many others.
What are the Essentials of Effective Data Classification?
Effective data classification requires an understanding of the following concepts:
- Data states: Data exists in three states (at rest, in use, and in transit). Regardless of state, sensitive information should remain encrypted and confidential.
- Data format: Data can be structured or unstructured. Classifying structured data is generally less complex, less time-consuming, and cheaper than classifying unstructured data.
  - Structured data: Organized in a predefined format that can be readily indexed, such as database objects and spreadsheets.
  - Unstructured data: Has no predefined format and is harder to index, such as source code, documents, and binaries.
- Data discovery: Before data can be classified, you need to know its location, volume, and context. This is true regardless of where it is hosted (on-premise, in the cloud, in legacy databases or with a service provider).
- Data sensitivity: To help with prioritization, data is generally classified into sensitivity levels.
  - High sensitivity data: Data is said to be high sensitivity if exposure or destruction would result in a catastrophic impact on your organization or customers. Common examples of high sensitivity data are credit card numbers, social security numbers, driver's license numbers, protected health information (PHI), personally identifiable information (PII), cardholder information, intellectual property, business processes, biometrics, and bank account numbers.
  - Medium sensitivity data: Data designed for internal use where unauthorized disclosure would not have a catastrophic impact on your organization or customers. Examples include emails and documents that do not include confidential data.
  - Low sensitivity data: Data designed to be public information. Examples include press releases, marketing material, website content, and other public data.
- Regulatory requirements: With the increasing number of data protection laws around the world, data classification is increasingly becoming a regulatory requirement. For example, the EU's General Data Protection Regulation (GDPR) calls for all personal data to be protected, as do PIPEDA, FIPA, the SHIELD Act, and LGPD.
- Industry-specific mandates: Alongside regulation, many industry-specific mandates now require the classification of different data attributes. For example, the Cloud Security Alliance (CSA) requires data and data objects to include data type, the jurisdiction of origin and domicile, context, legal constraints, and sensitivity.
What is the Purpose of Data Classification?
Beyond making data easier to locate, retrieve, manipulate, and track, a well-planned data classification system improves data security and regulatory compliance.
The true purpose of data classification is to safeguard sensitive corporate and customer data. To do this, you must be able to answer the following questions:
- What sensitive data do I store?
- Where does this sensitive data reside?
- Who can access, modify, and delete it?
- How will my business be affected if this data is leaked, destroyed, or altered?
Along with answering these questions, you also need to understand your regulatory requirements and what specific data you must protect. Examples include cardholder data (PCI DSS), health records (HIPAA), financial data (SOX) or personal data (GDPR, LGPD, PIPEDA, The SHIELD Act, and FIPA).
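One way to keep these answers in a consistent shape is a small inventory record per data set. The following is a minimal sketch, not a prescribed format: the `DataAsset` class, its field names, and the `unanswered` helper are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class DataAsset:
    """One inventory entry covering the four questions (hypothetical schema)."""
    name: str                                        # what sensitive data is stored
    location: str = ""                               # where it resides
    can_access: set = field(default_factory=set)     # who can access, modify, or delete it
    impact_if_exposed: str = "unknown"               # effect of a leak, loss, or alteration
    regulations: set = field(default_factory=set)    # e.g. {"PCI DSS"}


def unanswered(asset):
    """Return the questions that still lack an answer for this asset."""
    gaps = []
    if not asset.location:
        gaps.append("where")
    if not asset.can_access:
        gaps.append("who")
    if asset.impact_if_exposed == "unknown":
        gaps.append("impact")
    return gaps


cards = DataAsset(name="cardholder data", location="payments-db", regulations={"PCI DSS"})
print(unanswered(cards))  # ['who', 'impact']
```

An entry with open gaps is a signal that the data set is not yet ready to be classified with confidence.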
Beyond answering these questions, many data classification tools help you protect the confidentiality, integrity, and availability (the CIA triad) of sensitive or important data sets:
- Confidentiality: Data classification can help you understand the type of data that needs to be protected and an appropriate level of security for each category.
- Integrity: By making it easier to track individual data elements, data classification helps uphold integrity and makes unauthorized changes easier to detect.
- Availability: For important but unclassified data, data classification can help you focus on ensuring it is easily accessible to end-users, customers, and service providers.
What are the Typical Data Classification Categories?
Data classification generally involves a multitude of tags and labels that define the type of data, its sensitivity, as well as confidentiality, integrity, and availability requirements that are unique to your organization.
This means your data classification categories will largely depend on your information security policy, regulatory requirements, and risk appetite.
As a starting point, consider using a simple three-tiered approach:
- Public data: Data that is freely disclosed to the public, such as customer service email addresses and phone numbers.
- Internal data: Data that has minimal security requirements but is not intended for public disclosure. Examples include marketing research, directory information, and sales phone scripts.
- Restricted data: Highly sensitive internal data whose disclosure could negatively affect operations or put you or your customers at financial or legal risk. This requires the highest level of security protection.
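The three tiers can be modeled as an ordered enumeration so that labels can be compared. The sketch below is illustrative only. It assumes a common convention, not stated above, that a collection of data inherits the most restrictive label of anything it contains, and that an empty or unlabeled collection defaults to internal. The names `Classification` and `overall_classification` are hypothetical.

```python
from enum import IntEnum


class Classification(IntEnum):
    """The three tiers, ordered from least to most restrictive."""
    PUBLIC = 1
    INTERNAL = 2
    RESTRICTED = 3


def overall_classification(labels):
    """Return the most restrictive label in a collection.

    Assumption: unlabeled (empty) collections default to INTERNAL.
    """
    return max(labels, default=Classification.INTERNAL)


report_parts = [Classification.PUBLIC, Classification.RESTRICTED, Classification.INTERNAL]
print(overall_classification(report_parts).name)  # RESTRICTED
```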
What is the Data Classification Process?
There is no one-size-fits-all approach to data classification. Nor does all data need to be classified; some may even be better destroyed. That said, we can break down a general process you can tailor to your unique needs and regulatory requirements.
The first step is to define a data classification policy. The policy should be communicated to all employees who have access to sensitive data and include the following elements:
- Objectives: The reasons data classification has been put in place and the goals you expect to achieve from it.
- Processes: How the data classification process will be organized and how it will impact employees who use different types of sensitive data.
- Data classification scheme: The categories the data will be put into.
- Data owners: Describes who is directly responsible for which types of data, including how it is classified and who is granted access to it.
- Handling instructions: Security standards that specify how sensitive data will be secured for each category of data, who will have access to it, how it will be shared, and how long it will be retained for.
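Handling instructions are easier to apply consistently when they are written down as data and not only as prose. The following sketch shows one possible encoding. The category names, owner roles, and retention periods are placeholder assumptions for illustration, and `HandlingInstructions`, `POLICY`, and `retention_expired` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class HandlingInstructions:
    data_owner: str                 # role accountable for this category
    encryption_required: bool       # how the data must be secured
    external_sharing_allowed: bool  # how it may be shared
    retention_days: int             # how long it is retained


# Placeholder values only; a real policy sets these per organization.
POLICY = {
    "public": HandlingInstructions("communications_lead", False, True, 3650),
    "internal": HandlingInstructions("department_head", False, False, 1825),
    "restricted": HandlingInstructions("security_officer", True, False, 365),
}


def retention_expired(category, created_on, today):
    """True when a record has outlived the retention period for its category."""
    limit = timedelta(days=POLICY[category].retention_days)
    return today - created_on > limit


print(retention_expired("restricted", date(2023, 1, 1), date(2024, 6, 1)))  # True
```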
After the policy has been finalized, you need to run a data discovery process to determine the location, volume, and context of data that is hosted on-premises, in the cloud, in legacy databases, and with third-party vendors.
You may choose to skip this step and only classify new data. This would, however, leave current business-critical or confidential data insufficiently protected.
Data discovery can be done manually, by searching databases, file shares, and other systems that may contain sensitive information. It can also be done with a data discovery application that searches for sensitive information using metadata and other tags, quickly sorting information into groups.
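For a file share, a first pass at discovery can be as simple as walking the directory tree and recording where each file lives and how large it is. The sketch below does this with Python's standard library. It is a minimal illustration and not a substitute for a discovery tool: `discover` is a hypothetical helper, it covers only local files, and the demo runs against a throwaway directory.

```python
import tempfile
from collections import Counter
from pathlib import Path


def discover(root):
    """Record location, size, and file type for every file under root."""
    inventory = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            inventory.append({
                "location": str(path.relative_to(root)),
                "size_bytes": path.stat().st_size,
                "suffix": path.suffix.lower(),
            })
    return inventory


# Demo on a temporary directory so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "hr").mkdir()
    (Path(tmp) / "hr" / "payroll.csv").write_text("name,salary")
    (Path(tmp) / "press_release.txt").write_text("Launch day!")
    inventory = discover(tmp)

volume_by_type = Counter(item["suffix"] for item in inventory)
total_bytes = sum(item["size_bytes"] for item in inventory)
print(len(inventory), total_bytes)  # 2 22
```

The resulting inventory gives the location and volume of the data; context, such as which system or team produced a file, usually has to come from metadata or from the data owners.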
Regardless of whether you discover data by hand or with software, the next step is to categorize it. The categorization method can generally be broken down into three groups:
- Content-based classification: Inspects and interprets files to determine whether they contain sensitive information.
- Context-based classification: Looks at application, location, or creator among other variables as indirect indicators of sensitive information.
- User-based classification: Depends on a manual selection by the end user for each document. This relies on user knowledge and discretion at the creation, edit, review, and dissemination stages.
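The three methods can be combined. The sketch below is a simplified illustration under several assumptions: the regular expressions are deliberately naive and will produce false positives, the folder rules in `CONTEXT_RULES` are invented, the most restrictive signal wins, and files with no signal default to internal. All function and variable names are hypothetical.

```python
import re

LEVELS = ["public", "internal", "restricted"]  # least to most restrictive

# Naive patterns for illustration; real detectors add validation such as
# checksum tests to cut down on false positives.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CARD_PATTERN = re.compile(r"\b(?:\d[ -]?){13,16}\b")

# Hypothetical folder-to-label rules used as context signals.
CONTEXT_RULES = {"hr/": "restricted", "marketing/": "public"}


def classify_by_content(text):
    """Content-based: inspect the text itself for sensitive patterns."""
    if SSN_PATTERN.search(text) or CARD_PATTERN.search(text):
        return "restricted"
    return None


def classify_by_context(path):
    """Context-based: use the file's location as an indirect indicator."""
    for prefix, label in CONTEXT_RULES.items():
        if path.startswith(prefix):
            return label
    return None


def classify(path, text, user_label=None):
    """Combine content, context, and user signals; most restrictive wins."""
    signals = [classify_by_content(text), classify_by_context(path), user_label]
    found = [s for s in signals if s is not None]
    return max(found, key=LEVELS.index, default="internal")


print(classify("marketing/list.txt", "SSN 123-45-6789"))  # restricted
```

Letting the most restrictive signal win means a sensitive value found in a "public" folder is still treated as restricted, which errs on the side of over-protection.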
Once data has been discovered and categorized, sensitive data should be labeled according to your data classification scheme.
Once all the data is labeled, it's time to implement appropriate security controls to protect the sensitive data based on what your data classification policy outlines. This could include encryption, access control, the principle of least privilege, data leak detection tools, user monitoring, separation of duties and many other security controls.
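A simple way to check whether labeled data is adequately protected is to compare the controls a system already has against those the policy requires for its label. The mapping below is an invented example and not a recommended baseline. `REQUIRED_CONTROLS` and `missing_controls` are hypothetical names.

```python
# Illustrative mapping of labels to required controls (assumed values).
REQUIRED_CONTROLS = {
    "public": set(),
    "internal": {"access_control"},
    "restricted": {"access_control", "encryption", "leak_detection", "user_monitoring"},
}


def missing_controls(label, applied):
    """Return the required controls that are not yet in place, sorted by name."""
    return sorted(REQUIRED_CONTROLS[label] - set(applied))


print(missing_controls("restricted", ["access_control", "encryption"]))
# ['leak_detection', 'user_monitoring']
```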
Remember this is a continuous process. Files are created, moved, and deleted constantly.
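To keep up with that churn without rescanning everything, you can compare snapshots of file fingerprints between scans and reclassify only what changed. This is a minimal sketch of the idea, assuming each snapshot is a dictionary of path to content hash. `fingerprint` and `changes_since` are hypothetical helpers.

```python
import hashlib


def fingerprint(content):
    """Hash file content (bytes) so changes can be detected between scans."""
    return hashlib.sha256(content).hexdigest()


def changes_since(previous, current):
    """Compare two {path: fingerprint} snapshots.

    Returns (paths to classify again, paths whose labels can be retired).
    A moved file shows up as deleted at the old path and new at the new path.
    """
    new_or_changed = sorted(p for p, h in current.items() if previous.get(p) != h)
    deleted = sorted(p for p in previous if p not in current)
    return new_or_changed, deleted


last_scan = {"a.txt": fingerprint(b"v1"), "b.txt": fingerprint(b"same"), "old.txt": fingerprint(b"x")}
this_scan = {"a.txt": fingerprint(b"v2"), "b.txt": fingerprint(b"same"), "new.txt": fingerprint(b"y")}
print(changes_since(last_scan, this_scan))  # (['a.txt', 'new.txt'], ['old.txt'])
```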