PII Handling: Protecting Personal Data in Data Systems
Learn techniques for identifying, protecting, and managing personally identifiable information across your data platform.
Personally Identifiable Information (PII) is any data that can identify a specific individual: names, email addresses, phone numbers, social security numbers, IP addresses, biometric data. In an era of increasing privacy regulation and data breaches, handling PII correctly is both a legal obligation and a trust imperative.
Getting PII handling right requires understanding what PII is, where it lives in your systems, how to protect it, and what controls to put around it. This is not just a compliance exercise. It is a fundamental part of building data systems that people can trust.
What Counts as PII
PII is broadly defined. It is any information that, alone or combined with other data, can identify an individual.
Direct Identifiers
Direct identifiers uniquely identify a specific person. These are obviously PII:
- Full name
- Social security number
- Passport number
- Driver’s license number
- Email address
- Phone number
Indirect Identifiers
Indirect identifiers can identify an individual when combined with other information:
- Date of birth
- Zip code
- Gender
- Race or ethnicity
- Job title and employer
- Geographic indicators
A combination of date of birth, zip code, and gender can identify a surprisingly large share of individuals; Latanya Sweeney's well-known study estimated that these three fields alone uniquely identify roughly 87% of the US population.
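You can estimate this re-identification risk directly by counting how many records share each quasi-identifier combination, which is the basic k-anonymity check. A minimal sketch, assuming records are plain dicts (the helper name `k_anonymity` is illustrative):

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """Return the smallest group size (k) across all quasi-identifier
    combinations. A record in a group of size 1 is uniquely identifiable."""
    combos = Counter(
        tuple(r[q] for q in quasi_identifiers) for r in records
    )
    return min(combos.values())

records = [
    {"dob": "1980-01-01", "zip": "02139", "gender": "F"},
    {"dob": "1980-01-01", "zip": "02139", "gender": "F"},
    {"dob": "1975-06-15", "zip": "94110", "gender": "M"},
]
# The third record is unique on (dob, zip, gender), so k = 1
print(k_anonymity(records, ["dob", "zip", "gender"]))  # 1
```

A dataset is k-anonymous for k >= 2 only when every combination appears at least twice; anything with k = 1 contains records an attacker can single out.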
Sensitive Personal Data
Some PII requires additional protections due to its sensitivity:
- Medical/health information
- Financial account information
- Biometric data
- Genetic data
- Precise location data
- Sexual orientation
- Religious beliefs
This data is sometimes called “special category” data under regulations like GDPR.
When to Use Each Protection Technique
| Technique | When to Use | Key Risk |
|---|---|---|
| Encryption | Data at rest in storage, PII that must be recoverable | Key management complexity, key loss = data loss |
| Masking | Non-production environments, internal reports, analytical views | Masked data is not usable for joins or lookups |
| Tokenization | When original values needed downstream but must be protected in storage | Token vault is a single point of failure |
| Pseudonymization | When you need to link records across datasets without exposing real identities | Reversible with the salt; salt compromise breaks pseudonymity |
| Deletion | When retention period expires, user exercises right to be forgotten | Cannot recover deleted data; ensure backup policies align |
Apply the minimum protection needed. Not every PII field requires encryption—masking may be sufficient for internal analytics.
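The table above can be collapsed into a simple dispatch. This is a sketch only: the mapping from classification to technique is a policy choice your governance team owns, not a standard, and the level names follow this article's conventions:

```python
def choose_protection(pii_type, environment):
    """Pick the minimum protection technique for a field.
    pii_type: DIRECT, INDIRECT, SENSITIVE, NON_PII
    environment: 'production' or 'non_production'
    """
    if pii_type == "NON_PII":
        return "none"
    if environment == "non_production":
        return "synthetic"        # never real PII outside production
    if pii_type == "SENSITIVE":
        return "encryption"       # must be recoverable, strongest control
    if pii_type == "DIRECT":
        return "tokenization"     # protected in storage, linkable downstream
    return "masking"              # INDIRECT: analytics rarely need raw values

print(choose_protection("DIRECT", "production"))        # tokenization
print(choose_protection("INDIRECT", "non_production"))  # synthetic
```

Encoding the policy as a function makes it testable and keeps individual pipelines from improvising their own protection decisions.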
Classifying PII in Your Data
The first step is knowing what PII you have. This requires classifying data across your platform.
-- PII classification registry
CREATE TABLE pii_classification (
dataset_id VARCHAR(100) NOT NULL,
column_name VARCHAR(100) NOT NULL,
pii_type VARCHAR(50) NOT NULL, -- DIRECT, INDIRECT, SENSITIVE, NON_PII
pii_category VARCHAR(100), -- NAME, EMAIL, PHONE, SSN, etc.
protection_level VARCHAR(20) NOT NULL, -- RESTRICTED, CONFIDENTIAL, INTERNAL
retention_period_months INT,
last_reviewed_date DATE,
PRIMARY KEY (dataset_id, column_name)
);
-- Example classifications
INSERT INTO pii_classification (dataset_id, column_name, pii_type, pii_category, protection_level) VALUES
('warehouse.dim_customer', 'email', 'DIRECT', 'EMAIL', 'CONFIDENTIAL'),
('warehouse.dim_customer', 'phone', 'DIRECT', 'PHONE', 'CONFIDENTIAL'),
('warehouse.dim_customer', 'date_of_birth', 'INDIRECT', 'DOB', 'INTERNAL'),
('warehouse.dim_customer', 'city', 'INDIRECT', 'LOCATION', 'INTERNAL'),
('warehouse.fact_orders', 'order_id', 'NON_PII', 'ORDER_ID', 'PUBLIC'),
('warehouse.fact_orders', 'total_amount', 'NON_PII', 'AMOUNT', 'PUBLIC');
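Populating the registry by hand does not scale, so a first pass can propose classifications from column names for a human to review. A sketch, assuming columns arrive as `(dataset, column)` tuples; real scanners should also sample values, since names alone miss plenty:

```python
import re

# Pattern -> (pii_type, pii_category); extend per your naming conventions
PII_PATTERNS = [
    (re.compile(r"ssn|social_sec", re.I), ("DIRECT", "SSN")),
    (re.compile(r"e?mail", re.I), ("DIRECT", "EMAIL")),
    (re.compile(r"phone|mobile", re.I), ("DIRECT", "PHONE")),
    (re.compile(r"birth|dob", re.I), ("INDIRECT", "DOB")),
    (re.compile(r"zip|postal", re.I), ("INDIRECT", "LOCATION")),
]

def propose_classifications(columns):
    """Suggest (dataset, column, pii_type, category) rows for human review."""
    proposals = []
    for dataset, column in columns:
        for pattern, (pii_type, category) in PII_PATTERNS:
            if pattern.search(column):
                proposals.append((dataset, column, pii_type, category))
                break  # first matching pattern wins
    return proposals

cols = [("dim_customer", "email_address"), ("dim_customer", "date_of_birth"),
        ("fact_orders", "total_amount")]
print(propose_classifications(cols))
# total_amount matches nothing and is left for manual review
```

Treat the output as candidates to insert into pii_classification, never as final classifications: a column named `notes` can hold an SSN just as easily as one named `ssn`.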
Techniques for PII Protection
Data Masking
Masking replaces sensitive data with a non-sensitive equivalent. Unlike tokenization or encryption, masking is generally not reversible.
def mask_pii(value, pii_type):
"""Mask PII based on its type."""
if value is None:
return None
if pii_type == 'EMAIL':
# john.doe@example.com -> j***e@example.com
parts = value.split('@')
if len(parts) == 2:
name = parts[0]
domain = parts[1]
masked_name = name[0] + '***' + name[-1] if len(name) > 2 else '***'
return f"{masked_name}@{domain}"
return '***@***'
elif pii_type == 'PHONE':
        # (555) 123-4567 -> ***-***-4567
        return '***-***-' + value[-4:] if len(value) >= 4 else '***'
elif pii_type == 'SSN':
# 123-45-6789 -> ***-**-6789
return '***-**-' + value[-4:] if len(value) >= 4 else '***'
elif pii_type == 'NAME':
# John Doe -> J*** D***
        parts = value.split(' ')
        # Skip empty parts so double spaces don't raise an IndexError
        masked = ' '.join(p[0] + '***' for p in parts if p)
        return masked
return '***'
Tokenization
Tokenization replaces sensitive data with a non-sensitive token. The token maps back to the original value in a secure token vault.
import hashlib
import secrets
from datetime import datetime
class TokenVault:
"""Secure token vault for PII tokenization."""
def __init__(self):
self.tokens = {} # token -> original value
self.reverse = {} # hash(original) -> token
def tokenize(self, value, pii_type):
"""Replace PII with a token."""
# Check if already tokenized
value_hash = hashlib.sha256(value.encode()).hexdigest()
if value_hash in self.reverse:
return self.reverse[value_hash]
# Generate new token
token = secrets.token_hex(16)
self.tokens[token] = {
'original': value,
'pii_type': pii_type,
'created_at': datetime.now()
}
self.reverse[value_hash] = token
return token
def detokenize(self, token):
"""Retrieve original value from token."""
return self.tokens.get(token, {}).get('original')
# Usage in ETL
def tokenize_pii_columns(df, token_vault):
"""Replace PII values with tokens."""
tokenized = df.copy()
for col in df.columns:
pii_info = get_pii_classification(col)
if pii_info and pii_info['pii_type'] != 'NON_PII':
tokenized[col] = df[col].apply(
lambda x: token_vault.tokenize(str(x), pii_info['pii_type'])
)
return tokenized
Pseudonymization
Pseudonymization replaces identifying fields with artificial identifiers. Unlike tokenization, the pseudonym here is derived deterministically (a salted hash) rather than stored in a vault, so anyone holding the salt can re-link the pseudonyms. This is a key concept in GDPR, which distinguishes between pseudonymized data (still personal data, but protected) and anonymized data (no longer personal data).
def pseudonymize(df, identifying_columns, salt):
"""Replace identifying fields with pseudonyms using a salt."""
pseudonymized = df.copy()
for col in identifying_columns:
# Create deterministic pseudonym based on salt
def pseudonymize_value(value):
if value is None:
return None
raw = f"{salt}-{value}"
return hashlib.sha256(raw.encode()).hexdigest()[:16]
pseudonymized[col] = df[col].apply(pseudonymize_value)
return pseudonymized
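Because the pseudonym is deterministic for a given salt, the same person hashes to the same value in every dataset, so joins across systems still work without exposing the identifier. A self-contained illustration of that property (the salt string is a placeholder; in practice it lives in a secrets manager):

```python
import hashlib

def pseudonym(value, salt):
    """Deterministic salted pseudonym, as in the pseudonymize() function above."""
    return hashlib.sha256(f"{salt}-{value}".encode()).hexdigest()[:16]

salt = "load-from-secrets-manager"  # placeholder, never hardcode in production
orders_key = pseudonym("alice@example.com", salt)
tickets_key = pseudonym("alice@example.com", salt)

# Same salt + same value -> same pseudonym: datasets remain joinable
print(orders_key == tickets_key)  # True
# A different salt breaks linkability entirely
print(orders_key == pseudonym("alice@example.com", "other-salt"))  # False
```

This is also why the salt is the crown jewel: rotating it severs every existing join key, and leaking it lets an attacker confirm guesses against the pseudonyms by brute force.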
Encryption
Encryption renders data unreadable without a decryption key. For PII at rest, encryption is essential.
from cryptography.fernet import Fernet
class PIIEncryption:
"""Encrypt and decrypt PII at the field level."""
def __init__(self, key=None):
if key is None:
key = Fernet.generate_key()
self.cipher = Fernet(key)
def encrypt(self, value):
"""Encrypt a PII value."""
if value is None:
return None
if isinstance(value, str):
value = value.encode()
return self.cipher.encrypt(value).decode()
def decrypt(self, encrypted_value):
"""Decrypt a PII value."""
if encrypted_value is None:
return None
return self.cipher.decrypt(encrypted_value.encode()).decode()
# Usage: encrypt PII before storing
def encrypt_pii_columns(df, encryption_service, columns_to_encrypt):
"""Encrypt specified PII columns."""
encrypted = df.copy()
for col in columns_to_encrypt:
encrypted[col] = df[col].apply(
lambda x: encryption_service.encrypt(str(x)) if x is not None else None
)
return encrypted
PII in Different Environments
flowchart TB
subgraph "PII Data Flow"
S[("Source System\nCRM, ERP, etc.")]
C[("Classification\nRegistry")]
P[("Protection Layer\nEncrypt / Mask / Tokenize")]
D[("De-identified\nData Store")]
A[("Access Control\nRBAC / Column Security")]
R[("Reporting /\nAnalytics")]
end
S --> C
C --> P
P --> D
D --> A
A --> R
Development and Testing
Never use real PII in non-production environments. Use synthetic data instead.
import random

def generate_synthetic_pii(pii_type, count=100):
    """Generate synthetic PII for testing."""
    synthetic_generators = {
        'EMAIL': lambda: f"user{random.randint(1, 100000)}@example.com",
'PHONE': lambda: f"+1-555-{random.randint(100,999)}-{random.randint(1000,9999)}",
'NAME': lambda: f"TestUser{random.randint(1, 1000)}",
'SSN': lambda: f"{random.randint(100,999)}-{random.randint(10,99)}-{random.randint(1000,9999)}",
'ADDRESS': lambda: f"{random.randint(1,9999)} Test Street",
'DOB': lambda: f"{random.randint(1950,2000)}-{random.randint(1,12):02d}-{random.randint(1,28):02d}"
}
generator = synthetic_generators.get(pii_type, lambda: f"SYNTHETIC_{pii_type}")
return [generator() for _ in range(count)]
Data Lake and Warehouse
In analytical environments, minimize PII exposure. Store only what is necessary for analysis.
-- Create a de-identified view of customer data
CREATE VIEW v_dim_customer_deidentified AS
SELECT
customer_key, -- Pseudonymized key, not the original ID
customer_name, -- Masked
    CASE
        WHEN date_of_birth IS NOT NULL
        THEN DATE_FORMAT(date_of_birth, '%m-%d')
        -- Keep month-day only for birthday analysis
    END AS birthday_md,
CASE
WHEN city IS NOT NULL AND country = 'US'
THEN city -- Keep city, it's not uniquely identifying
ELSE 'Other'
END AS city,
CASE
WHEN income_bracket IS NOT NULL
THEN income_bracket -- bucketed, not exact
ELSE 'Unknown'
END AS income_bracket
FROM dim_customer;
-- Never expose these in reporting views
-- original customer_id (only for joins with tokenized key)
-- full date_of_birth (too identifying)
-- exact address
-- SSN (never store in warehouse if avoidable)
Access Controls for PII
PII should be protected by access controls that limit who can see it.
Role-Based Access
-- Define roles with PII access
CREATE TABLE roles (
role_id INT PRIMARY KEY,
role_name VARCHAR(50),
pii_access_level VARCHAR(20) -- NONE, MASKED, FULL
);
-- Assign PII column access by role
CREATE TABLE pii_column_access (
role_id INT,
dataset_id VARCHAR(100),
column_name VARCHAR(100),
access_level VARCHAR(20), -- NONE, MASKED, FULL
FOREIGN KEY (role_id) REFERENCES roles(role_id)
);
-- Example: Analyst role gets masked access to most PII
INSERT INTO pii_column_access (role_id, dataset_id, column_name, access_level)
SELECT r.role_id, 'warehouse.dim_customer', 'email', 'MASKED'
FROM roles r WHERE r.role_name = 'analyst';
-- Example: Data engineer role gets full access for debugging
INSERT INTO pii_column_access (role_id, dataset_id, column_name, access_level)
SELECT r.role_id, 'warehouse.dim_customer', 'email', 'FULL'
FROM roles r WHERE r.role_name = 'data_engineer';
Column-Level Security Implementation
def apply_pii_access_controls(df, user_role, dataset_id):
"""Apply PII access controls based on user role."""
# Get access level for each column
access_levels = get_pii_access_levels(user_role, dataset_id)
controlled = df.copy()
for col, access_level in access_levels.items():
if col not in controlled.columns:
continue
if access_level == 'MASKED':
controlled[col] = controlled[col].apply(
lambda x: mask_pii(x, get_pii_type(col))
)
elif access_level == 'NONE':
# Remove column entirely
controlled = controlled.drop(columns=[col])
return controlled
Data Retention and Deletion
PII should not be kept longer than necessary. This is both a privacy principle and a legal requirement.
def apply_retention_policy(df, dataset_name, current_date):
"""Apply retention policy to dataset."""
retention = get_retention_policy(dataset_name)
if retention is None:
return df # No policy defined
retention_months = retention['retention_months']
date_column = retention.get('date_column', 'created_date')
cutoff_date = subtract_months(current_date, retention_months)
    # Filter to keep only recent records (pandas-style row selection)
    retained = df[df[date_column] >= cutoff_date]
    # Log deletion
    deleted_count = len(df) - len(retained)
if deleted_count > 0:
log_data_deletion(
dataset=dataset_name,
deleted_count=deleted_count,
cutoff_date=cutoff_date,
retention_months=retention_months
)
return retained
The Right to be Forgotten
Under GDPR and similar regulations, individuals can request deletion of their data. Systems must be able to honor these requests.
def handle_deletion_request(user_id, dataset_ids):
"""Handle a data deletion request (right to be forgotten)."""
request_id = create_deletion_request(user_id)
for dataset_id in dataset_ids:
# Find records for this user
records = find_records_by_user(dataset_id, user_id)
# Delete or anonymize
if can_fully_delete(dataset_id):
hard_delete(dataset_id, user_id)
log_deletion(request_id, dataset_id, len(records), 'FULL')
else:
# Pseudonymize instead of deleting
pseudonymize_records(dataset_id, user_id)
log_deletion(request_id, dataset_id, len(records), 'PSEUDONYMIZED')
complete_deletion_request(request_id)
Monitoring PII Access
Track who accesses PII and when.
-- PII access audit log
CREATE TABLE pii_access_log (
log_id BIGINT PRIMARY KEY AUTO_INCREMENT,
user_id VARCHAR(100),
dataset_id VARCHAR(100),
column_name VARCHAR(100),
access_type VARCHAR(20), -- READ, EXPORT, DELETE
access_timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
ip_address VARCHAR(50),
session_id VARCHAR(100)
);
-- Note: SQL databases do not support SELECT triggers, so read access
-- cannot be captured this way. Populate pii_access_log from the
-- database's native audit facility (e.g. pgAudit for PostgreSQL or the
-- MySQL audit log plugin) or from the application layer that executes
-- the queries.
Compliance Frameworks
GDPR (European Union)
GDPR applies to EU residents’ personal data regardless of where your systems are located.
Key requirements:
- Lawful basis for processing (consent, contract, legitimate interest)
- Purpose limitation (only use data for stated purposes)
- Data minimization (only collect what is necessary)
- Accuracy (keep data accurate)
- Storage limitation (do not keep longer than necessary)
- Security (protect personal data)
- Accountability (demonstrate compliance)
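Purpose limitation and data minimization can be enforced mechanically at extraction time: each registered processing purpose gets an allow-list of fields, and anything outside it is dropped. A sketch with hypothetical purposes and field names; in practice the mapping is governance-owned configuration, not code:

```python
# Hypothetical purpose -> allowed fields mapping (illustrative names)
ALLOWED_FIELDS = {
    "order_fulfilment": {"customer_key", "shipping_city", "order_id"},
    "marketing_analytics": {"customer_key", "city", "income_bracket"},
}

def minimize(record, purpose):
    """Keep only the fields allowed for this processing purpose."""
    allowed = ALLOWED_FIELDS.get(purpose)
    if allowed is None:
        raise ValueError(f"No registered basis for purpose: {purpose}")
    return {k: v for k, v in record.items() if k in allowed}

record = {"customer_key": "a1b2", "email": "x@example.com",
          "city": "Boston", "income_bracket": "50-75k"}
print(minimize(record, "marketing_analytics"))
# {'customer_key': 'a1b2', 'city': 'Boston', 'income_bracket': '50-75k'}
```

Raising on an unregistered purpose, rather than passing data through, is the safer default: a new use of personal data should fail loudly until someone documents its lawful basis.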
CCPA (California)
CCPA gives California residents rights to know what personal data is collected, request deletion, and opt out of data sales.
Key requirements:
- Disclosure of data collection practices
- Right to know what data is held
- Right to delete
- Right to opt out of data sales
- Non-discrimination for exercising rights
Building a PII Governance Program
1. Discover PII across your platform. Scan databases and data lakes to find where PII lives.
2. Classify PII by type and sensitivity. Not all PII requires the same protection level.
3. Define protection requirements by classification. Encryption for sensitive, masking for internal use.
4. Implement technical controls. Access controls, encryption, masking, tokenization.
5. Monitor and audit. Track PII access, detect anomalous patterns.
6. Train teams. Ensure everyone handling PII understands their responsibilities.
7. Respond to incidents. Have a breach response plan ready.
PII Handling Production Failure Scenarios
Token vault becomes single point of failure
A token vault holds tokens for 50 million customer records. The vault runs on a single node without replication. The node fails and the vault cannot be recovered. All tokenized PII across 200 tables becomes permanently inaccessible. The business cannot link orders to customers.
Mitigation: Treat the token vault like production infrastructure — replication, backups, and failover. Consider vaultless tokenization approaches (deterministic encryption) for lower-stakes use cases. Never put the token vault on the critical path for real-time reads.
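One vaultless approach is keyed deterministic hashing: an HMAC over the value yields a stable token with no vault to fail over, at the cost of irreversibility. A sketch; the key must come from a KMS, never from code, and the helper name is illustrative:

```python
import hmac
import hashlib

def vaultless_token(value, key):
    """Deterministic, irreversible token: same input -> same token,
    but no vault holds the original value."""
    return hmac.new(key, value.encode(), hashlib.sha256).hexdigest()[:32]

key = b"replace-with-kms-managed-key"  # placeholder only
t1 = vaultless_token("alice@example.com", key)
t2 = vaultless_token("alice@example.com", key)
print(t1 == t2)  # True: still usable as a join key
```

Since there is no detokenize step, this fits join keys and deduplication, not use cases that must recover the original value; for those, a replicated vault remains necessary.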
Synthetic data generation creates accidental re-identification
A developer generates synthetic test data by running a SQL query that pseudonymizes real customer names with a simple algorithm. The “synthetic” dataset still contains the original customer_id values, zip codes, and exact birthdates. This dataset is shared with a vendor for testing. An attacker correlates the synthetic records with public records and re-identifies individuals.
Mitigation: Never use real identifiers (customer_id, exact DOB) in synthetic data. Use fully generated identifiers and bucketed/dithered values. Treat synthetic PII the same as real PII.
Retention policy and backup policy conflict
A deletion job runs per the retention policy, removing customer records older than 3 years. However, backups are retained for 5 years. A regulatory audit requests data from 4 years ago. The production data is gone but backups still contain it. This creates ambiguity about what “deletion” means for compliance.
Mitigation: Define deletion as cryptographic erasure (overwrite) or physical destruction, not just logical deletion. Align backup retention with the strictest retention requirement across all PII types. Document your deletion and backup policies in the same place.
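Cryptographic erasure can be implemented by encrypting each user's PII under a per-user key; deleting the key renders every copy, including old backups, unreadable. A sketch using the same `cryptography` package as the encryption section above (assumption: it is installed; the in-memory dict stands in for a real KMS):

```python
from cryptography.fernet import Fernet

class CryptoShredder:
    """Per-user encryption keys: deleting a key = erasing that user's data."""
    def __init__(self):
        self.user_keys = {}  # in production: a KMS, not a dict

    def encrypt_for_user(self, user_id, value):
        if user_id not in self.user_keys:
            self.user_keys[user_id] = Fernet.generate_key()
        return Fernet(self.user_keys[user_id]).encrypt(value.encode())

    def decrypt_for_user(self, user_id, token):
        return Fernet(self.user_keys[user_id]).decrypt(token).decode()

    def shred(self, user_id):
        """Right to be forgotten: drop the key, and all ciphertext for
        this user (including copies in backups) becomes unrecoverable."""
        self.user_keys.pop(user_id, None)

shredder = CryptoShredder()
token = shredder.encrypt_for_user("user-42", "alice@example.com")
print(shredder.decrypt_for_user("user-42", token))  # alice@example.com
shredder.shred("user-42")
# decrypt_for_user("user-42", token) now raises KeyError: data is erased
```

This sidesteps the backup conflict entirely: backups can keep their own retention schedule because they only ever hold ciphertext.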
Role-based masking applied inconsistently
A row-level security policy is supposed to mask customer data based on region. The policy is implemented as a WHERE clause that filters rows, but the COUNT(*) aggregate still returns the full count. An analyst notices they can infer individual customer behavior by comparing the masked view with other reports. This is a classic statistical disclosure failure.
Mitigation: Test masking implementations for inference attacks before deploying. Ensure aggregates are also masked or rounded. Have a privacy engineer review analytical views before they go into production.
PII Handling Observability Hooks
Track these metrics for PII governance:
-- PII classification coverage: % of columns in sensitive datasets that are classified
SELECT
COUNT(DISTINCT dataset_id || '.' || column_name) AS classified_columns,
COUNT(DISTINCT dataset_id || '.' || column_name) * 100.0 /
(SELECT COUNT(*) FROM information_schema.columns
WHERE table_schema = 'warehouse') AS coverage_pct
FROM pii_classification;
-- Detect new PII columns added without classification
SELECT table_name, column_name
FROM information_schema.columns
WHERE table_schema = 'warehouse'
  AND (column_name LIKE '%ssn%'
       OR column_name LIKE '%password%'
       OR column_name LIKE '%secret%'
       OR column_name LIKE '%token%')
EXCEPT
SELECT dataset_id, column_name FROM pii_classification;
-- Access to sensitive PII outside business hours
SELECT
user_id,
dataset_id,
COUNT(*) AS access_count,
MAX(access_timestamp) AS last_access
FROM pii_access_log
WHERE EXTRACT(HOUR FROM access_timestamp) NOT BETWEEN 8 AND 18
GROUP BY user_id, dataset_id;
Alert on: new columns matching PII patterns that are not yet classified, access to RESTRICTED PII by non-approved roles, bulk exports of PII columns.
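The bulk-export condition can be checked by a scheduled job against the access log. A sketch, assuming log rows arrive as dicts and a hypothetical `row_count` field recorded per export (the threshold is a placeholder to tune per dataset):

```python
BULK_EXPORT_THRESHOLD = 1000  # rows per session; tune per dataset

def detect_bulk_exports(access_rows):
    """Flag (user, session) pairs whose exported rows exceed the threshold."""
    totals = {}
    for row in access_rows:
        if row["access_type"] == "EXPORT":
            key = (row["user_id"], row["session_id"])
            totals[key] = totals.get(key, 0) + row["row_count"]
    return [k for k, n in totals.items() if n > BULK_EXPORT_THRESHOLD]

log = [
    {"user_id": "ana", "session_id": "s1", "access_type": "EXPORT", "row_count": 900},
    {"user_id": "ana", "session_id": "s1", "access_type": "EXPORT", "row_count": 400},
    {"user_id": "bo", "session_id": "s2", "access_type": "READ", "row_count": 5000},
]
print(detect_bulk_exports(log))  # [('ana', 's1')]
```

Aggregating per session rather than per query matters: an exfiltration attempt split across many small exports looks innocent query by query.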
PII Handling Anti-Patterns
Storing PII you do not need. If you do not need an email address to serve the business function, do not store it. Every PII field you keep is a liability.
Encryption without key management. Encrypting PII is only as strong as your key management. Encryption with keys stored in the same database or in code is not real encryption.
Masking only in the final layer. Masking PII in the data warehouse but keeping it raw in the staging layer defeats the purpose. Apply protection consistently across all environments and pipeline stages.
Treating anonymization as permanent. Data that seems anonymized may become re-identifiable with new datasets. Treat anonymized data with the same care as PII, or pseudonymize with a reversible mechanism if re-identification might be needed later.
PII Handling Quick Recap
- PII is any data that can identify an individual: direct identifiers (SSN, email) and indirect identifiers (DOB + zip code combinations).
- Classify PII across your platform before applying protection. Not all PII needs the same level of protection.
- Encryption for data that must be recoverable. Masking for analytical environments. Tokenization for linking across systems. Pseudonymization for GDPR compliance.
- Never use real PII in non-production environments. Generate fully synthetic data instead.
- Align deletion policies with backup retention. Define deletion as cryptographic erasure when possible.
- Monitor PII access for anomalous patterns. Alert on bulk exports and off-hours access to sensitive columns.
For related reading on data protection, see Data Governance for the broader framework of data management. For audit trails, see Audit Trails for tracking data access and changes.