PII Handling: Protecting Personal Data in Data Systems

Learn techniques for identifying, protecting, and managing personally identifiable information across your data platform.

Reading time: 14 min


Personally identifiable information (PII) is any data that can identify a specific individual: names, email addresses, phone numbers, social security numbers, IP addresses, biometric data. In an era of increasing privacy regulation and data breaches, handling PII correctly is both a legal obligation and a trust imperative.

Getting PII handling right requires understanding what PII is, where it lives in your systems, how to protect it, and what controls to put around it. This is not just a compliance exercise. It is a fundamental part of building data systems that people can trust.

What Counts as PII

PII is broadly defined. It is any information that, alone or combined with other data, can identify an individual.

Direct Identifiers

Direct identifiers uniquely identify a specific person. These are obviously PII:

  • Full name
  • Social security number
  • Passport number
  • Driver’s license number
  • Email address
  • Phone number

Indirect Identifiers

Indirect identifiers can identify an individual when combined with other information:

  • Date of birth
  • Zip code
  • Gender
  • Race or ethnicity
  • Job title and employer
  • Geographic indicators

A combination of date of birth, zip code, and gender is surprisingly identifying: one widely cited study estimated that these three fields alone uniquely identify roughly 87% of the US population.
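One way to make this concrete is to measure what fraction of records in a table is unique on a set of quasi-identifiers. A minimal sketch in plain Python; the helper name and sample records are hypothetical:

```python
from collections import Counter

def quasi_identifier_risk(rows, quasi_identifiers):
    """Estimate re-identification risk: the share of records that are
    unique on a given combination of quasi-identifier fields."""
    combos = Counter(tuple(row[f] for f in quasi_identifiers) for row in rows)
    unique = sum(1 for count in combos.values() if count == 1)
    return unique / len(rows) if rows else 0.0

# Hypothetical sample: two of four records are unique on (dob, zip, gender)
records = [
    {"dob": "1985-03-12", "zip": "94103", "gender": "F"},
    {"dob": "1985-03-12", "zip": "94103", "gender": "F"},
    {"dob": "1990-07-01", "zip": "10001", "gender": "M"},
    {"dob": "1972-11-30", "zip": "60614", "gender": "F"},
]
print(quasi_identifier_risk(records, ["dob", "zip", "gender"]))  # 0.5
```

If this ratio is high, the fields together behave like a direct identifier and deserve the corresponding protection level.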

Sensitive Personal Data

Some PII requires additional protections due to its sensitivity:

  • Medical/health information
  • Financial account information
  • Biometric data
  • Genetic data
  • Precise location data
  • Sexual orientation
  • Religious beliefs

This data is sometimes called “special category” data under regulations like GDPR.

When to Use Each Protection Technique

| Technique | When to Use | Key Risk |
|---|---|---|
| Encryption | Data at rest in storage; PII that must be recoverable | Key management complexity; key loss = data loss |
| Masking | Non-production environments, internal reports, analytical views | Masked data is not usable for joins or lookups |
| Tokenization | When original values are needed downstream but must be protected in storage | Token vault is a single point of failure |
| Pseudonymization | When you need to track records across datasets while preserving identity separability | Reversible with the salt; salt compromise breaks pseudonymity |
| Deletion | When the retention period expires or a user exercises the right to be forgotten | Cannot recover deleted data; ensure backup policies align |

Apply the minimum protection needed. Not every PII field requires encryption; masking may be sufficient for internal analytics.

Classifying PII in Your Data

The first step is knowing what PII you have. This requires classifying data across your platform.

-- PII classification registry
CREATE TABLE pii_classification (
    dataset_id VARCHAR(100) NOT NULL,
    column_name VARCHAR(100) NOT NULL,
    pii_type VARCHAR(50) NOT NULL,  -- DIRECT, INDIRECT, SENSITIVE, NON_PII
    pii_category VARCHAR(100),        -- NAME, EMAIL, PHONE, SSN, etc.
    protection_level VARCHAR(20) NOT NULL,  -- RESTRICTED, CONFIDENTIAL, INTERNAL
    retention_period_months INT,
    last_reviewed_date DATE,
    PRIMARY KEY (dataset_id, column_name)
);

-- Example classifications
INSERT INTO pii_classification (dataset_id, column_name, pii_type, pii_category, protection_level) VALUES
    ('warehouse.dim_customer', 'email', 'DIRECT', 'EMAIL', 'CONFIDENTIAL'),
    ('warehouse.dim_customer', 'phone', 'DIRECT', 'PHONE', 'CONFIDENTIAL'),
    ('warehouse.dim_customer', 'date_of_birth', 'INDIRECT', 'DOB', 'INTERNAL'),
    ('warehouse.dim_customer', 'city', 'INDIRECT', 'LOCATION', 'INTERNAL'),
    ('warehouse.fact_orders', 'order_id', 'NON_PII', 'ORDER_ID', 'PUBLIC'),
    ('warehouse.fact_orders', 'total_amount', 'NON_PII', 'AMOUNT', 'PUBLIC');

Techniques for PII Protection

Data Masking

Masking replaces sensitive data with an obscured equivalent. Most masking is irreversible: the original value cannot be recovered from the masked output.

def mask_pii(value, pii_type):
    """Mask PII based on its type."""

    if value is None:
        return None

    if pii_type == 'EMAIL':
        # john.doe@example.com -> j***e@example.com
        parts = value.split('@')
        if len(parts) == 2:
            name = parts[0]
            domain = parts[1]
            masked_name = name[0] + '***' + name[-1] if len(name) > 2 else '***'
            return f"{masked_name}@{domain}"
        return '***@***'

    elif pii_type == 'PHONE':
        # 555-123-4567 -> ***-***-4567 (keep last four digits)
        return '***-***-' + value[-4:] if len(value) >= 4 else '***'

    elif pii_type == 'SSN':
        # 123-45-6789 -> ***-**-6789
        return '***-**-' + value[-4:] if len(value) >= 4 else '***'

    elif pii_type == 'NAME':
        # John Doe -> J*** D***
        parts = value.split(' ')
        return ' '.join(p[0] + '***' for p in parts if p)

    return '***'

Tokenization

Tokenization replaces sensitive data with a non-sensitive token. The token maps back to the original value in a secure token vault.

import hashlib
import secrets
from datetime import datetime

class TokenVault:
    """Token vault for PII tokenization.

    In-memory sketch; a production vault needs encrypted, replicated
    storage and strict access controls.
    """

    def __init__(self):
        self.tokens = {}   # token -> original value and metadata
        self.reverse = {}  # hash(original) -> token

    def tokenize(self, value, pii_type):
        """Replace PII with a token."""

        # Check if already tokenized
        value_hash = hashlib.sha256(value.encode()).hexdigest()
        if value_hash in self.reverse:
            return self.reverse[value_hash]

        # Generate new token
        token = secrets.token_hex(16)
        self.tokens[token] = {
            'original': value,
            'pii_type': pii_type,
            'created_at': datetime.now()
        }
        self.reverse[value_hash] = token

        return token

    def detokenize(self, token):
        """Retrieve original value from token."""
        return self.tokens.get(token, {}).get('original')


# Usage in ETL
def tokenize_pii_columns(df, token_vault):
    """Replace PII values with tokens."""

    tokenized = df.copy()

    for col in df.columns:
        pii_info = get_pii_classification(col)
        if pii_info and pii_info['pii_type'] != 'NON_PII':
            tokenized[col] = df[col].apply(
                lambda x: token_vault.tokenize(str(x), pii_info['pii_type'])
                if x is not None else None
            )

    return tokenized

Pseudonymization

Pseudonymization replaces identifying fields with artificial identifiers that stay consistent across datasets. The mapping back to individuals remains possible, but only with additional information (here, a secret salt) kept separately from the data. This is a key concept in GDPR, which treats pseudonymized data as still personal data (but protected), while truly anonymized data is no longer personal data.

def pseudonymize(df, identifying_columns, salt):
    """Replace identifying fields with pseudonyms using a salt."""

    pseudonymized = df.copy()

    for col in identifying_columns:
        # Create deterministic pseudonym based on salt
        def pseudonymize_value(value):
            if value is None:
                return None
            raw = f"{salt}-{value}"
            return hashlib.sha256(raw.encode()).hexdigest()[:16]

        pseudonymized[col] = df[col].apply(pseudonymize_value)

    return pseudonymized

Encryption

Encryption renders data unreadable without a decryption key. For PII at rest, encryption is essential.

from cryptography.fernet import Fernet

class PIIEncryption:
    """Encrypt and decrypt PII at the field level."""

    def __init__(self, key=None):
        if key is None:
            # A generated key must be persisted in a key management
            # system; losing it makes the encrypted data unrecoverable.
            key = Fernet.generate_key()
        self.cipher = Fernet(key)

    def encrypt(self, value):
        """Encrypt a PII value."""
        if value is None:
            return None
        if isinstance(value, str):
            value = value.encode()
        return self.cipher.encrypt(value).decode()

    def decrypt(self, encrypted_value):
        """Decrypt a PII value."""
        if encrypted_value is None:
            return None
        return self.cipher.decrypt(encrypted_value.encode()).decode()


# Usage: encrypt PII before storing
def encrypt_pii_columns(df, encryption_service, columns_to_encrypt):
    """Encrypt specified PII columns."""

    encrypted = df.copy()

    for col in columns_to_encrypt:
        encrypted[col] = df[col].apply(
            lambda x: encryption_service.encrypt(str(x)) if x is not None else None
        )

    return encrypted

PII in Different Environments

flowchart TB
    subgraph "PII Data Flow"
        S[("Source System\nCRM, ERP, etc.")]
        C[("Classification\nRegistry")]
        P[("Protection Layer\nEncrypt / Mask / Tokenize")]
        D[("De-identified\nData Store")]
        A[("Access Control\nRBAC / Column Security")]
        R[("Reporting /\nAnalytics")]
    end
    S --> C
    C --> P
    P --> D
    D --> A
    A --> R

Development and Testing

Never use real PII in non-production environments. Use synthetic data instead.

import random

def generate_synthetic_pii(pii_type, count=100):
    """Generate synthetic PII for testing."""

    synthetic_generators = {
        'EMAIL': lambda: f"user{random.randint(1, 999999)}@example.com",
        'PHONE': lambda: f"+1-555-{random.randint(100, 999)}-{random.randint(1000, 9999)}",
        'NAME': lambda: f"TestUser{random.randint(1, 1000)}",
        'SSN': lambda: f"{random.randint(100, 999)}-{random.randint(10, 99)}-{random.randint(1000, 9999)}",
        'ADDRESS': lambda: f"{random.randint(1, 9999)} Test Street",
        'DOB': lambda: f"{random.randint(1950, 2000)}-{random.randint(1, 12):02d}-{random.randint(1, 28):02d}"
    }

    generator = synthetic_generators.get(pii_type, lambda: f"SYNTHETIC_{pii_type}")
    return [generator() for _ in range(count)]

Data Lake and Warehouse

In analytical environments, minimize PII exposure. Store only what is necessary for analysis.

-- Create a de-identified view of customer data
CREATE VIEW v_dim_customer_deidentified AS
SELECT
    customer_key,  -- Pseudonymized key, not the original ID
    CONCAT(LEFT(customer_name, 1), '***') AS customer_name,  -- Masked
    CASE
        WHEN date_of_birth IS NOT NULL
        THEN DATE_FORMAT(date_of_birth, '%m-%d')
        -- Keep month-day only for birthday analysis
    END AS birthday_md,
    CASE
        WHEN city IS NOT NULL AND country = 'US'
        THEN city  -- Keep city; on its own it is rarely uniquely identifying
        ELSE 'Other'
    END AS city,
    CASE
        WHEN income_bracket IS NOT NULL
        THEN income_bracket  -- bucketed, not exact
        ELSE 'Unknown'
    END AS income_bracket
FROM dim_customer;

-- Never expose these in reporting views
-- original customer_id (only for joins with tokenized key)
-- full date_of_birth (too identifying)
-- exact address
-- SSN (never store in warehouse if avoidable)

Access Controls for PII

PII should be protected by access controls that limit who can see it.

Role-Based Access

-- Define roles with PII access
CREATE TABLE roles (
    role_id INT PRIMARY KEY,
    role_name VARCHAR(50),
    pii_access_level VARCHAR(20)  -- NONE, MASKED, FULL
);

-- Assign PII column access by role
CREATE TABLE pii_column_access (
    role_id INT,
    dataset_id VARCHAR(100),
    column_name VARCHAR(100),
    access_level VARCHAR(20),  -- NONE, MASKED, FULL
    FOREIGN KEY (role_id) REFERENCES roles(role_id)
);

-- Example: Analyst role gets masked access to most PII
INSERT INTO pii_column_access (role_id, dataset_id, column_name, access_level)
SELECT r.role_id, 'warehouse.dim_customer', 'email', 'MASKED'
FROM roles r WHERE r.role_name = 'analyst';

-- Example: Data engineer role gets full access for debugging
INSERT INTO pii_column_access (role_id, dataset_id, column_name, access_level)
SELECT r.role_id, 'warehouse.dim_customer', 'email', 'FULL'
FROM roles r WHERE r.role_name = 'data_engineer';

Column-Level Security Implementation

def apply_pii_access_controls(df, user_role, dataset_id):
    """Apply PII access controls based on user role."""

    # Get access level for each column
    access_levels = get_pii_access_levels(user_role, dataset_id)

    controlled = df.copy()

    for col, access_level in access_levels.items():
        if col not in controlled.columns:
            continue

        if access_level == 'MASKED':
            controlled[col] = controlled[col].apply(
                lambda x: mask_pii(x, get_pii_type(col))
            )
        elif access_level == 'NONE':
            # Remove column entirely
            controlled = controlled.drop(columns=[col])

    return controlled

Data Retention and Deletion

PII should not be kept longer than necessary. This is both a privacy principle and a legal requirement.

def apply_retention_policy(df, dataset_name, current_date):
    """Apply retention policy to dataset."""

    retention = get_retention_policy(dataset_name)

    if retention is None:
        return df  # No policy defined

    retention_months = retention['retention_months']
    date_column = retention.get('date_column', 'created_date')

    cutoff_date = subtract_months(current_date, retention_months)

    # Filter to keep only recent records
    retained = df.filter(df[date_column] >= cutoff_date)

    # Log deletion
    deleted_count = len(df) - len(retained)
    if deleted_count > 0:
        log_data_deletion(
            dataset=dataset_name,
            deleted_count=deleted_count,
            cutoff_date=cutoff_date,
            retention_months=retention_months
        )

    return retained

The Right to be Forgotten

Under GDPR and similar regulations, individuals can request deletion of their data. Systems must be able to honor these requests.

def handle_deletion_request(user_id, dataset_ids):
    """Handle a data deletion request (right to be forgotten)."""

    request_id = create_deletion_request(user_id)

    for dataset_id in dataset_ids:
        # Find records for this user
        records = find_records_by_user(dataset_id, user_id)

        # Delete or anonymize
        if can_fully_delete(dataset_id):
            hard_delete(dataset_id, user_id)
            log_deletion(request_id, dataset_id, len(records), 'FULL')
        else:
            # Pseudonymize instead of deleting
            pseudonymize_records(dataset_id, user_id)
            log_deletion(request_id, dataset_id, len(records), 'PSEUDONYMIZED')

    complete_deletion_request(request_id)

Monitoring PII Access

Track who accesses PII and when.

-- PII access audit log
CREATE TABLE pii_access_log (
    log_id BIGINT PRIMARY KEY AUTO_INCREMENT,
    user_id VARCHAR(100),
    dataset_id VARCHAR(100),
    column_name VARCHAR(100),
    access_type VARCHAR(20),  -- READ, EXPORT, DELETE
    access_timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    ip_address VARCHAR(50),
    session_id VARCHAR(100)
);

-- Most databases cannot fire triggers on SELECT, so read access cannot
-- be captured at the table level. Log reads from the access layer
-- (application, proxy, or query gateway) instead:
INSERT INTO pii_access_log (user_id, dataset_id, column_name, access_type, ip_address, session_id)
VALUES (CURRENT_USER(), 'warehouse.dim_customer', 'email', 'READ', :ip_address, :session_id);

Compliance Frameworks

GDPR (European Union)

GDPR applies to EU residents’ personal data regardless of where your systems are located.

Key requirements:

  • Lawful basis for processing (consent, contract, legitimate interest)
  • Purpose limitation (only use data for stated purposes)
  • Data minimization (only collect what is necessary)
  • Accuracy (keep data accurate)
  • Storage limitation (do not keep longer than necessary)
  • Security (protect personal data)
  • Accountability (demonstrate compliance)
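Purpose limitation and lawful basis can be enforced in code, not just in policy documents. A hedged sketch, assuming a hypothetical in-memory processing register; in practice this mapping would live in your governance catalog:

```python
# Hypothetical register: dataset -> allowed purposes and their lawful basis.
PROCESSING_REGISTER = {
    "warehouse.dim_customer": {
        "order_fulfillment": "contract",
        "marketing_email": "consent",
    }
}

def check_purpose(dataset_id, purpose):
    """Enforce purpose limitation: processing is allowed only if the
    purpose was registered with a lawful basis for this dataset."""
    basis = PROCESSING_REGISTER.get(dataset_id, {}).get(purpose)
    if basis is None:
        raise PermissionError(
            f"No lawful basis registered for {purpose!r} on {dataset_id!r}"
        )
    return basis

print(check_purpose("warehouse.dim_customer", "order_fulfillment"))  # contract
```

Calling this gate at the start of every pipeline makes "only use data for stated purposes" an auditable runtime check rather than a written intention.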

CCPA (California)

CCPA gives California residents rights to know what personal data is collected, request deletion, and opt out of data sales.

Key requirements:

  • Disclosure of data collection practices
  • Right to know what data is held
  • Right to delete
  • Right to opt out of data sales
  • Non-discrimination for exercising rights

Building a PII Governance Program

  1. Discover PII across your platform. Scan databases and data lakes to find where PII lives.

  2. Classify PII by type and sensitivity. Not all PII requires the same protection level.

  3. Define protection requirements by classification. Encryption for sensitive, masking for internal use.

  4. Implement technical controls. Access controls, encryption, masking, tokenization.

  5. Monitor and audit. Track PII access, detect anomalous patterns.

  6. Train teams. Ensure everyone handling PII understands their responsibilities.

  7. Respond to incidents. Have a breach response plan ready.
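Step 1, discovery, can start as a simple value-pattern scan over column samples. A minimal sketch with a few illustrative regexes; real scanners use far more patterns plus checksum validation to cut false positives:

```python
import re

# Illustrative value patterns only; the names and thresholds are assumptions.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def scan_column(values, sample_threshold=0.5):
    """Guess a column's PII category from sampled values: return the
    first category whose pattern matches at least the threshold share."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return None
    for category, pattern in PII_PATTERNS.items():
        hits = sum(1 for v in non_null if pattern.search(str(v)))
        if hits / len(non_null) >= sample_threshold:
            return category
    return None

print(scan_column(["a@example.com", "b@example.org", None]))  # EMAIL
```

Feeding hits from a scan like this into the `pii_classification` registry keeps discovery continuous instead of a one-time project.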

PII Handling Production Failure Scenarios

Token vault becomes single point of failure

A token vault holds tokens for 50 million customer records. The vault runs on a single node without replication. The node fails and the vault cannot be recovered. All tokenized PII across 200 tables becomes permanently inaccessible. The business cannot link orders to customers.

Mitigation: Treat the token vault like production infrastructure — replication, backups, and failover. Consider vaultless tokenization approaches (deterministic encryption) for lower-stakes use cases. Never put the token vault on the critical path for real-time reads.
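A keyed hash gives a minimal flavor of the vaultless idea: the same input always yields the same token, so joins keep working with no vault lookup. Note this sketch is one-way; true deterministic encryption would additionally allow recovering the original. The function name is illustrative:

```python
import hmac
import hashlib

def deterministic_token(value, key, length=16):
    """Vaultless tokenization sketch: an HMAC of the value under a
    secret key yields a stable token without any vault state.
    Irreversible, so suitable only where detokenization is never needed."""
    digest = hmac.new(key, value.encode(), hashlib.sha256).hexdigest()
    return digest[:length]

key = b"example-secret-key"  # in practice, fetched from a key management service
t1 = deterministic_token("john.doe@example.com", key)
t2 = deterministic_token("john.doe@example.com", key)
print(t1 == t2)  # True: same input, same token, so cross-table joins still work
```

The trade-off: there is no vault to fail, but the key becomes the secret that must be rotated and protected, and equal plaintexts are visibly equal in storage.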

Synthetic data generation creates accidental re-identification

A developer generates synthetic test data by running a SQL query that pseudonymizes real customer names with a simple algorithm. The “synthetic” dataset still contains the original customer_id values, zip codes, and exact birthdates. This dataset is shared with a vendor for testing. An attacker correlates the synthetic records with public records and re-identifies individuals.

Mitigation: Never use real identifiers (customer_id, exact DOB) in synthetic data. Use fully generated identifiers and bucketed/dithered values. Treat synthetic PII the same as real PII.

Retention policy and backup policy conflict

A deletion job runs per the retention policy, removing customer records older than 3 years. However, backups are retained for 5 years. A regulatory audit requests data from 4 years ago. The production data is gone but backups still contain it. This creates ambiguity about what “deletion” means for compliance.

Mitigation: Define deletion as cryptographic erasure (overwrite) or physical destruction, not just logical deletion. Align backup retention with the strictest retention requirement across all PII types. Document your deletion and backup policies in the same place.
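Cryptographic erasure can be implemented by encrypting each user's PII under a per-user key and destroying the key on deletion; backups keep the ciphertext but it is unreadable. A sketch using Fernet as in the encryption section; the class and storage layout are hypothetical, and a real system would hold keys in a hardened KMS:

```python
from cryptography.fernet import Fernet

class CryptoShredder:
    """Crypto-shredding sketch: each user's PII is encrypted under a
    per-user key; deleting the key renders every copy, including
    backups, unreadable without touching the data itself."""

    def __init__(self):
        self.user_keys = {}  # in practice: a hardened key management service

    def encrypt_for_user(self, user_id, value):
        if user_id not in self.user_keys:
            self.user_keys[user_id] = Fernet.generate_key()
        return Fernet(self.user_keys[user_id]).encrypt(value.encode())

    def decrypt_for_user(self, user_id, ciphertext):
        key = self.user_keys.get(user_id)
        if key is None:
            return None  # key shredded: data is effectively erased
        return Fernet(key).decrypt(ciphertext).decode()

    def shred(self, user_id):
        """Honor a deletion request by destroying the user's key."""
        self.user_keys.pop(user_id, None)

shredder = CryptoShredder()
blob = shredder.encrypt_for_user("u42", "john.doe@example.com")
shredder.shred("u42")
print(shredder.decrypt_for_user("u42", blob))  # None: ciphertext survives, key does not
```

This sidesteps the backup conflict above: you never have to rewrite old backup media, only prove the key is gone.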

Role-based masking applied inconsistently

A row-level security policy is supposed to mask customer data based on region. The policy is implemented as a WHERE clause that filters rows, but the COUNT(*) aggregate still returns the full count. An analyst notices they can infer individual customer behavior by comparing the masked view with other reports. This is a classic statistical disclosure failure.

Mitigation: Test masking implementations for inference attacks before deploying. Ensure aggregates are also masked or rounded. Have a privacy engineer review analytical views before they go into production.
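Rounding and small-cell suppression are the standard first defenses against differencing attacks on published aggregates. A minimal sketch; the thresholds are illustrative and should be set with your privacy team:

```python
def safe_count(n, base=10, suppress_below=5):
    """Statistical disclosure control sketch: suppress very small counts
    entirely and round the rest to a fixed base, so comparing two
    reports cannot isolate a single individual."""
    if n < suppress_below:
        return None  # suppressed: too few individuals to publish
    return base * round(n / base)

print(safe_count(3))     # None (suppressed)
print(safe_count(1012))  # 1010
```

Applying this at the view layer means every report shares the same rounding, so cross-report differences stay within the rounding noise.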

PII Handling Observability Hooks

Track these metrics for PII governance:

-- PII classification coverage: % of columns in sensitive datasets that are classified
SELECT
    COUNT(DISTINCT dataset_id || '.' || column_name) AS classified_columns,
    COUNT(DISTINCT dataset_id || '.' || column_name) * 100.0 /
        (SELECT COUNT(*) FROM information_schema.columns
         WHERE table_schema = 'warehouse') AS coverage_pct
FROM pii_classification;

-- Detect new PII-looking columns added without classification
SELECT CONCAT(table_schema, '.', table_name) AS dataset_id, column_name
FROM information_schema.columns
WHERE table_schema = 'warehouse'
  AND (column_name LIKE '%ssn%'
       OR column_name LIKE '%password%'
       OR column_name LIKE '%secret%'
       OR column_name LIKE '%token%')
EXCEPT
SELECT dataset_id, column_name FROM pii_classification;

-- Access to sensitive PII outside business hours
SELECT
    user_id,
    dataset_id,
    COUNT(*) AS access_count,
    MAX(access_timestamp) AS last_access
FROM pii_access_log
WHERE EXTRACT(HOUR FROM access_timestamp) NOT BETWEEN 8 AND 18
GROUP BY user_id, dataset_id;

Alert on: new columns matching PII patterns that are not yet classified, access to RESTRICTED PII by non-approved roles, bulk exports of PII columns.

PII Handling Anti-Patterns

Storing PII you do not need. If you do not need an email address to serve the business function, do not store it. Every PII field you keep is a liability.

Encryption without key management. Encrypting PII is only as strong as your key management. Encryption with keys stored in the same database or in code is not real encryption.

Masking only in production. Masking PII in the data warehouse but keeping it raw in the staging layer defeats the purpose. Apply protection consistently across all environments.

Treating anonymization as permanent. Data that seems anonymized may become re-identifiable with new datasets. Treat anonymized data with the same care as PII, or pseudonymize with a reversible mechanism if re-identification might be needed later.

PII Handling Quick Recap

  • PII is any data that can identify an individual: direct identifiers (SSN, email) and indirect identifiers (DOB + zip code combinations).
  • Classify PII across your platform before applying protection. Not all PII needs the same level of protection.
  • Encryption for data that must be recoverable. Masking for analytical environments. Tokenization for linking across systems. Pseudonymization for GDPR compliance.
  • Never use real PII in non-production environments. Generate fully synthetic data instead.
  • Align deletion policies with backup retention. Define deletion as cryptographic erasure when possible.
  • Monitor PII access for anomalous patterns. Alert on bulk exports and off-hours access to sensitive columns.

For related reading on data protection, see Data Governance for the broader framework of data management, and Audit Trails for tracking data access and changes.


Related Posts

Data Masking Strategies for Non-Production Environments

Learn static and dynamic data masking: nulling, shuffling, hashing, and range techniques. Understand GDPR and PII considerations for PostgreSQL and Oracle.

#database #data-masking #security

Audit Trails: Building Complete Data Accountability

Learn how to implement comprehensive audit trails that track data changes, access, and lineage for compliance and debugging.

#data-engineering #audit #audit-trails

Data Governance: Practical Implementation Guide

Learn the essential framework for data governance including data ownership, quality standards, policy enforcement, and organizational alignment.

#data-engineering #data-governance #data-quality