# Data Governance for AI
Data governance is fundamental to AI system quality, fairness, and compliance. This lesson covers comprehensive data governance practices aligned with ISO 42001 requirements.
## Why Data Governance Matters for AI
- **Quality**: AI systems are only as good as their training data
- **Fairness**: Biased data leads to discriminatory AI systems
- **Compliance**: GDPR and other data protection laws require proper data handling
- **Trust**: Transparent data practices build stakeholder confidence
- **Risk Management**: Poor data governance creates significant risks
- **Reproducibility**: Proper governance enables repeatable results
## ISO 42001 Data Requirements
ISO/IEC 42001 addresses data in Annex A.7 (Data for AI systems). In practice, its controls expect:
- A documented data management process for AI systems (A.7.2)
- Defined data acquisition criteria and recorded provenance (A.7.3, A.7.5)
- Defined data quality requirements (A.7.4)
- Documented data preparation (A.7.6)
- Data security measures, supported by ISO/IEC 27001 controls
For training data specifically, this means:
- Training data selection criteria
- Data representativeness assessment
- Bias identification and mitigation
- Data versioning and traceability
## Data Governance Framework
### 1. Data Governance Organization
Data Governance Board:
- Strategic oversight of data practices
- Policy approval and updates
- Resource allocation
- Risk oversight
Chief Data Officer (CDO):
- Overall accountability for data governance
- Strategy and roadmap
- Cross-functional coordination
- Compliance oversight
Data Stewards:
- Domain-specific data ownership
- Quality monitoring
- Access control management
- Metadata management
Data Engineers:
- Data pipeline development
- Quality automation
- Infrastructure management
- Performance optimization
Data Scientists:
- Data requirements definition
- Quality assessment
- Bias analysis
- Feature engineering
Privacy/Legal Team:
- Compliance verification
- Legal basis validation
- Privacy impact assessments
- Contract review
### 2. Data Quality Standards
Dimensions of Data Quality:
| Dimension | Definition | Measurement | Target |
|---|---|---|---|
| Accuracy | Data correctly represents reality | Validation against ground truth | >95% |
| Completeness | All required data is present | Missing value rate | <5% |
| Consistency | Data is consistent across sources | Cross-reference checks | 100% |
| Timeliness | Data is current and up-to-date | Age of data | Defined per use case |
| Validity | Data conforms to defined formats | Format validation rate | 100% |
| Uniqueness | No inappropriate duplication | Duplicate detection | <1% |
| Integrity | Relationships maintained correctly | Referential integrity checks | 100% |
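These dimensions can be measured automatically in ingestion pipelines. The following is a minimal sketch in pandas; the column names and the positive-amount validity rule are illustrative assumptions rather than part of any standard, and the targets come from the table above.

```python
import pandas as pd

# Hypothetical transaction extract; in practice this comes from the ingestion pipeline.
df = pd.DataFrame({
    "transaction_id": ["t1", "t2", "t2", "t4"],
    "amount": [19.99, None, 5.00, -3.50],
    "category": ["books", "books", None, "toys"],
})

# Completeness: share of non-missing values per column (target: <5% missing).
completeness = df.notna().mean()

# Uniqueness: share of duplicated keys (target: <1% duplicates).
duplicate_rate = df["transaction_id"].duplicated().mean()

# Validity: share of rows passing a simple business rule (amounts must be positive).
validity = (df["amount"] > 0).mean()

print(completeness)
print(f"duplicate_rate={duplicate_rate:.1%}, validity={validity:.1%}")
```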
Data Quality Policy Template:
# DATA QUALITY POLICY
## 1. DATA QUALITY STANDARDS
### Accuracy Requirements
- All training data must be validated against authoritative sources
- Minimum 95% accuracy required for training
- Validation sampling: minimum 10% of dataset
- Discrepancies must be investigated and resolved
### Completeness Requirements
- Critical fields: 0% missing values allowed
- Important fields: <5% missing values
- Optional fields: <20% missing values
- Missing value handling must be documented
### Consistency Requirements
- Data must be consistent across all sources
- Conflicts must be resolved before use
- Reconciliation process documented
- Audit trail maintained
### Timeliness Requirements
- Maximum data age defined per use case
- Staleness checks automated
- Refresh frequency documented
- Change data capture implemented
### Validity Requirements
- All data must pass schema validation
- Format checks automated
- Range validation performed
- Business rule validation applied
## 2. DATA QUALITY PROCESSES
### Data Profiling
- Automated profiling on ingestion
- Statistical analysis performed
- Anomaly detection applied
- Quality reports generated
### Data Validation
- Multi-level validation (format, business rules, cross-reference)
- Validation rules documented and versioned
- Failed validations logged and investigated
- Remediation tracked
### Data Monitoring
- Continuous quality monitoring
- Quality metrics tracked
- Alerts for quality degradation
- Dashboard for visibility
### Data Quality Improvement
- Root cause analysis for issues
- Corrective actions implemented
- Preventive measures applied
- Continuous improvement tracked
## 3. ROLES AND RESPONSIBILITIES
- **Data Owners**: Define quality requirements
- **Data Stewards**: Monitor and enforce quality
- **Data Engineers**: Implement quality checks
- **Data Scientists**: Validate fitness for purpose
- **Quality Team**: Audit and report
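One way to implement the multi-level validation described in the policy is to express each rule as a predicate and log failures for investigation rather than dropping them silently. This is an illustrative sketch; the rule names, field names, and ranges are assumptions, not part of the policy.

```python
import logging
from typing import Callable

import pandas as pd

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("data_validation")

# Each rule is a row-level predicate; names, fields, and ranges are illustrative.
RULES: dict[str, Callable[[pd.DataFrame], pd.Series]] = {
    "amount_is_numeric": lambda df: pd.to_numeric(df["amount"], errors="coerce").notna(),
    "amount_in_range": lambda df: pd.to_numeric(df["amount"], errors="coerce").between(0, 100_000),
    "date_not_in_future": lambda df: pd.to_datetime(df["transaction_date"]) <= pd.Timestamp.now(),
}

def validate(df: pd.DataFrame) -> pd.DataFrame:
    """Return the rows that pass every rule; log the rest for investigation."""
    passing = pd.Series(True, index=df.index)
    for name, rule in RULES.items():
        ok = rule(df).fillna(False).astype(bool)
        failed = int((~ok).sum())
        if failed:
            log.warning("rule %s failed for %d rows", name, failed)
        passing &= ok
    return df[passing]
```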
### 3. Data Lineage
Purpose:
- Track data from source to consumption
- Enable impact analysis
- Support reproducibility
- Facilitate auditing
- Enable compliance
Lineage Components:
- **Source Tracking**
  - Original data sources
  - Collection methods
  - Collection dates
  - Legal basis for collection
- **Transformation Tracking**
  - All transformations applied
  - Transformation logic
  - Transformation dates
  - Version of transformation code
- **Usage Tracking**
  - Which models use which data
  - Purpose of usage
  - Access history
  - Retention periods
- **Dependency Mapping** (an impact-analysis sketch follows this list)
  - Upstream dependencies
  - Downstream consumers
  - Impact relationships
  - Data flow diagrams
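Dependency mapping is what makes impact analysis possible: given a lineage graph, a change to an upstream source can be traced to every downstream consumer. The sketch below shows this as a breadth-first traversal over an illustrative edge list; the asset names reuse the example record that follows.

```python
from collections import defaultdict, deque

# Directed edges: upstream asset -> downstream consumer (illustrative lineage graph).
EDGES = [
    ("crm_database", "customer_features_v2.3"),
    ("web_analytics", "customer_features_v2.3"),
    ("customer_features_v2.3", "churn_prediction_v2.1"),
    ("customer_features_v2.3", "lifetime_value_v1.3"),
]

def downstream_impact(asset: str) -> set[str]:
    """Breadth-first traversal: everything affected if `asset` changes."""
    graph = defaultdict(list)
    for upstream, downstream in EDGES:
        graph[upstream].append(downstream)
    impacted, queue = set(), deque([asset])
    while queue:
        for nxt in graph[queue.popleft()]:
            if nxt not in impacted:
                impacted.add(nxt)
                queue.append(nxt)
    return impacted

print(downstream_impact("crm_database"))
# Contains customer_features_v2.3, churn_prediction_v2.1, lifetime_value_v1.3
```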
Lineage Documentation Example:
# Data Lineage Record
dataset_id: customer_features_v2.3
dataset_name: Customer Features for Churn Prediction
source_data:
  - source_id: crm_database
    tables: [customers, transactions, interactions]
    extraction_date: 2025-12-01
    extraction_method: daily_batch_job
    legal_basis: legitimate_interest
  - source_id: web_analytics
    tables: [sessions, events]
    extraction_date: 2025-12-01
    extraction_method: streaming_pipeline
    legal_basis: consent
transformations:
  - step: 1
    transformation: data_cleaning
    description: Remove invalid records, handle missing values
    code_version: v1.2.3
    date: 2025-12-01
  - step: 2
    transformation: feature_engineering
    description: Create aggregated features, behavioral indicators
    code_version: v2.1.0
    date: 2025-12-01
  - step: 3
    transformation: normalization
    description: Scale numerical features, encode categoricals
    code_version: v1.0.5
    date: 2025-12-01
consumers:
  - model: churn_prediction_v2.1
    purpose: predict customer churn
    access_date: 2025-12-02
  - model: lifetime_value_v1.3
    purpose: estimate customer lifetime value
    access_date: 2025-12-02
quality_metrics:
  accuracy: 0.97
  completeness: 0.99
  consistency: 1.0
  timeliness: current
retention:
  retention_period: 2_years
  deletion_date: 2027-12-01
  legal_requirement: gdpr_article_5
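A record in this form can be checked automatically before a dataset is approved for use. The sketch below assumes the record is stored as YAML with the field names shown above; the specific checks (legal basis present, code versions recorded, retention arithmetic) are examples rather than a required rule set.

```python
import yaml  # PyYAML

def check_lineage(path: str) -> list[str]:
    """Return findings for a lineage record; an empty list means no issues found."""
    with open(path) as fh:
        record = yaml.safe_load(fh)

    findings = []

    # Every source should declare a legal basis for collection.
    for src in record.get("source_data", []):
        if not src.get("legal_basis"):
            findings.append(f"source {src.get('source_id')} has no legal_basis")

    # Every transformation should record a code version for reproducibility.
    for step in record.get("transformations", []):
        if not step.get("code_version"):
            findings.append(f"transformation step {step.get('step')} lacks code_version")

    # Illustrative retention check: a 2-year period should match the deletion date.
    retention = record.get("retention", {})
    if retention.get("retention_period") == "2_years":
        extracted = record["source_data"][0]["extraction_date"]
        expected = extracted.replace(year=extracted.year + 2)
        if retention.get("deletion_date") != expected:
            findings.append("deletion_date does not match the 2-year retention period")

    return findings
```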
### 4. Data Cataloging
Purpose:
- Centralized metadata repository
- Discoverability of data assets
- Understanding of data characteristics
- Compliance documentation
Catalog Components:
- **Dataset Metadata**
  - Name, description, purpose
  - Owner and steward
  - Creation and update dates
  - Size and structure
  - Location and access
- **Schema Information**
  - Field names and types
  - Field descriptions
  - Constraints and relationships
  - Sample values
- **Quality Metrics**
  - Quality dimensions and scores
  - Known issues
  - Validation rules
  - Monitoring status
- **Usage Information**
  - Who is using the data
  - How it's being used
  - Usage frequency
  - Dependencies
- **Compliance Information**
  - Sensitivity classification
  - Legal basis for processing
  - Retention requirements
  - Privacy considerations
Data Catalog Entry Template:
# Dataset: Customer Transaction History
## Basic Information
- **ID**: ds_customer_transactions_v3
- **Owner**: Sales Analytics Team
- **Steward**: John Smith ([email protected])
- **Created**: 2024-01-15
- **Last Updated**: 2025-12-08
- **Update Frequency**: Daily
## Description
Customer transaction data including purchases, returns, and refunds. Used for customer behavior analysis, churn prediction, and personalization.
## Schema
| Field | Type | Description | Required | PII |
|-------|------|-------------|----------|-----|
| customer_id | string | Unique customer identifier | Yes | No |
| transaction_id | string | Unique transaction identifier | Yes | No |
| transaction_date | date | Date of transaction | Yes | No |
| amount | decimal | Transaction amount in USD | Yes | No |
| category | string | Product category | Yes | No |
| payment_method | string | Payment method used | No | No |
| email | string | Customer email | No | Yes |
## Data Quality
- **Accuracy**: 98.5%
- **Completeness**: 99.2%
- **Timeliness**: <24 hours lag
- **Validity**: 100%
- **Last Quality Check**: 2025-12-08
## Access
- **Location**: s3://data-lake/customer-transactions/
- **Format**: Parquet
- **Access Level**: Confidential
- **Access Request**: Submit to [email protected]
## Usage
- **Primary Use**: Customer analytics and ML models
- **Current Consumers**: Churn model, LTV model, recommendation engine
- **Query Frequency**: ~500 queries/day
## Compliance
- **Classification**: Personal Data
- **Legal Basis**: Contractual necessity
- **Retention**: 7 years (legal requirement)
- **Geographic Restrictions**: None
- **GDPR Considerations**: Subject to DSR requests
## Lineage
- **Source**: Point-of-sale systems, E-commerce platform
- **Transformations**: Data cleaning, enrichment
- **Related Datasets**: customer_profile, product_catalog
## Known Issues
- Payment method field has 15% null values (legacy data)
- Refund transactions have negative amounts
- Some historical data lacks category information
## Contact
- **Questions**: [email protected]
- **Issues**: jira.company.com/data-quality
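Catalog entries are easiest to keep complete when they are machine-checkable. The sketch below models a minimal entry as a Python dataclass with a validation step; a real deployment would typically use the catalog tool's own SDK (DataHub, Amundsen, etc.), which is not shown here, and the field set is a simplification of the template above.

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    """Minimal catalog record; field names loosely mirror the template above."""
    dataset_id: str
    owner: str
    steward: str
    classification: str           # Public / Internal / Confidential / Restricted
    legal_basis: str
    retention: str
    description: str = ""
    known_issues: list[str] = field(default_factory=list)

    def validate(self) -> None:
        allowed = {"Public", "Internal", "Confidential", "Restricted"}
        if self.classification not in allowed:
            raise ValueError(f"classification must be one of {sorted(allowed)}")
        if not self.legal_basis:
            raise ValueError("legal_basis is required")

entry = CatalogEntry(
    dataset_id="ds_customer_transactions_v3",
    owner="Sales Analytics Team",
    steward="[email protected]",
    classification="Confidential",
    legal_basis="Contractual necessity",
    retention="7 years",
    known_issues=["payment_method has ~15% nulls in legacy data"],
)
entry.validate()
```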
### 5. Data Access Controls
Principles:
- Least privilege access
- Need-to-know basis
- Segregation of duties
- Regular access reviews
Access Control Framework:
- **Data Classification**
  - Public: No restrictions
  - Internal: Employees only
  - Confidential: Specific roles only
  - Restricted: Explicit approval required
- **Access Levels**
  - Read: View data only
  - Query: Run approved queries
  - Write: Modify data
  - Admin: Full control
- **Access Request Process**
  - Request submission with justification
  - Manager approval
  - Data owner approval (for confidential/restricted)
  - Compliance review (for sensitive data)
  - Time-limited access
  - Regular re-certification
- **Technical Controls**
  - Authentication (MFA required)
  - Authorization (RBAC/ABAC)
  - Encryption (in transit and at rest)
  - Audit logging
  - Data masking for non-production
  - Network segregation
Access Control Matrix:
| Data Type | Public | Internal User | Data Scientist | Data Engineer | Data Admin |
|---|---|---|---|---|---|
| Public datasets | Read | Read | Read, Query | Read, Query, Write | Full |
| Internal analytics | - | Read | Read, Query | Read, Query, Write | Full |
| Customer PII | - | - | Masked Read | Masked Read, Write | Full |
| Training datasets | - | - | Read, Query | Read, Query, Write | Full |
| Model artifacts | - | Read | Read, Write | Read, Write | Full |
| Sensitive PII | - | - | - | - | Audit logged |
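A matrix like this can be enforced as a deny-by-default lookup before any data access is served. The sketch below is a simplified RBAC check whose role names and permission sets mirror the matrix; production systems would normally delegate this to a policy engine such as Apache Ranger or Immuta rather than hand-rolled code.

```python
# Permissions per (data_type, role); "masked_read" means PII columns are masked.
ACCESS_MATRIX: dict[tuple[str, str], set[str]] = {
    ("internal_analytics", "internal_user"): {"read"},
    ("internal_analytics", "data_scientist"): {"read", "query"},
    ("customer_pii", "data_scientist"): {"masked_read"},
    ("customer_pii", "data_engineer"): {"masked_read", "write"},
    ("training_datasets", "data_scientist"): {"read", "query"},
    ("training_datasets", "data_engineer"): {"read", "query", "write"},
}

def is_allowed(data_type: str, role: str, action: str) -> bool:
    """Least-privilege check: anything not explicitly granted is denied."""
    if role == "data_admin":            # full control per the matrix
        return True
    return action in ACCESS_MATRIX.get((data_type, role), set())

assert is_allowed("customer_pii", "data_scientist", "masked_read")
assert not is_allowed("customer_pii", "data_scientist", "read")   # unmasked read denied
```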
### 6. Training Data Governance
Special Considerations for Training Data:
- **Representativeness** (a simple check is sketched after this list)
  - Must represent target population
  - Balanced across key demographics
  - Edge cases included
  - Sufficient sample size
- **Bias Assessment**
  - Historical bias identification
  - Sampling bias analysis
  - Label bias detection
  - Proxy attribute analysis
- **Documentation Requirements**
  - Datasheet for datasets
  - Collection methodology
  - Known limitations
  - Recommended uses
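The representativeness check referenced above can start very simply: compare group shares in the training data against the target population and flag large gaps. The sketch below assumes a hypothetical `group` column, known population shares, and an arbitrary 10-percentage-point tolerance; real bias assessments go further (label bias, proxy attributes).

```python
import pandas as pd

# Hypothetical training data and assumed target-population shares.
train = pd.DataFrame({"group": ["A", "A", "A", "B", "C", "A", "B", "A"]})
population_share = {"A": 0.50, "B": 0.30, "C": 0.20}

observed = train["group"].value_counts(normalize=True)

# Flag groups whose training share deviates from the population by more than 10 points.
for group, expected in population_share.items():
    actual = observed.get(group, 0.0)
    if abs(actual - expected) > 0.10:
        print(f"group {group}: {actual:.0%} in training vs {expected:.0%} in population")
```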
Training Data Checklist:
## TRAINING DATA GOVERNANCE CHECKLIST
### Data Selection
☐ Clear selection criteria defined
☐ Relevance to problem verified
☐ Representativeness assessed
☐ Sample size justified
☐ Class balance evaluated
### Bias Assessment
☐ Demographic representation analyzed
☐ Historical bias identified
☐ Sampling bias evaluated
☐ Label quality assessed
☐ Proxy variables checked
### Quality Verification
☐ Accuracy validated
☐ Completeness checked
☐ Consistency verified
☐ Outliers identified and handled
☐ Noise level assessed
### Documentation
☐ Datasheet created
☐ Collection process documented
☐ Known limitations listed
☐ Recommended uses specified
☐ Prohibited uses identified
### Legal and Ethical
☐ Legal basis for processing verified
☐ Consent obtained where required
☐ Privacy assessment completed
☐ Ethical review conducted
☐ Third-party data agreements checked
### Technical Preparation
☐ Data cleaned and preprocessed
☐ Train/validation/test split created
☐ Data versioned
☐ Data stored securely
☐ Access controls applied
### Monitoring
☐ Quality monitoring setup
☐ Drift detection configured
☐ Regular reviews scheduled
☐ Update process defined
☐ Retirement criteria established
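For the monitoring items at the end of the checklist, drift detection can begin with a scheduled statistic such as the Population Stability Index (PSI) between the training-time distribution of a feature and recent production data. The sketch below uses synthetic numbers; the 10-bin layout and the 0.2 alert threshold are common rules of thumb, not requirements from the standard.

```python
import numpy as np

def psi(reference: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between a reference and a current sample."""
    edges = np.histogram_bin_edges(reference, bins=bins)
    ref_counts, _ = np.histogram(reference, bins=edges)
    cur_counts, _ = np.histogram(current, bins=edges)
    # Convert counts to proportions; add a small epsilon to avoid division by zero.
    ref_pct = ref_counts / ref_counts.sum() + 1e-6
    cur_pct = cur_counts / cur_counts.sum() + 1e-6
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))

rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, 10_000)   # training-time feature distribution
current = rng.normal(0.5, 1.0, 10_000)     # shifted production distribution
score = psi(reference, current)
print(f"PSI={score:.3f} -> {'investigate drift' if score > 0.2 else 'stable'}")
```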
Datasheet for Datasets Template:
# DATASHEET FOR DATASET
## Motivation
**Purpose**: Why was the dataset created?
**Creator**: Who created the dataset?
**Funding**: Who funded the creation?
## Composition
**What**: What do the instances represent?
**Number**: How many instances?
**Missing Data**: Is any information missing?
**Confidentiality**: Does the dataset contain confidential data?
## Collection Process
**How**: How was the data collected?
**Who**: Who was involved in collection?
**Timeframe**: Over what timeframe was data collected?
**Ethical Review**: Was ethical review conducted?
## Preprocessing
**Preprocessing**: What preprocessing was performed?
**Raw Data**: Is raw data saved?
## Uses
**Prior Uses**: Has the dataset been used already?
**Repository**: Is there a repository for the dataset?
**Impact**: What impacts might use have?
## Distribution
**Distribution**: Will the dataset be distributed?
**How**: How will it be distributed?
**When**: When will it be distributed?
**License**: What license applies?
## Maintenance
**Maintainer**: Who is maintaining the dataset?
**Contact**: How can they be contacted?
**Updates**: Will the dataset be updated?
**Retention**: How long will it be retained?
**Versioning**: How are versions managed?
### 7. Synthetic Data Considerations
When to Use Synthetic Data:
- Privacy protection required
- Insufficient real data available
- Testing and development
- Data augmentation
- Bias mitigation
Synthetic Data Governance:
- **Generation Process**
  - Generation method documented
  - Statistical properties preserved
  - Privacy guarantees verified
  - Quality assessment performed
- **Validation Requirements**
  - Similarity to real data measured
  - Utility for intended purpose verified
  - Privacy protection confirmed
  - Limitations documented
- **Usage Controls**
  - Appropriate use cases defined
  - Limitations communicated
  - Mixture with real data controlled
  - Separate tracking and versioning
Synthetic Data Quality Metrics:
| Metric | Description | Target |
|---|---|---|
| Fidelity | Statistical similarity to real data | >0.9 |
| Utility | Performance on downstream tasks | >90% of real data |
| Privacy | Distance to nearest real record | >threshold |
| Diversity | Coverage of feature space | Similar to real |
| Fairness | Demographic representation | Balanced |
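Fidelity and privacy from the table above can be approximated with standard tools: a per-feature Kolmogorov–Smirnov comparison for fidelity and the distance to the nearest real record as a privacy proxy. The sketch below uses scipy and scikit-learn on stand-in arrays; it is an illustration, not a complete synthetic-data evaluation, and the interpretation thresholds are assumptions.

```python
import numpy as np
from scipy.stats import ks_2samp
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(42)
real = rng.normal(size=(1_000, 3))                          # stand-in for real records
synthetic = real + rng.normal(scale=0.3, size=real.shape)   # stand-in for generated records

# Fidelity: per-feature KS statistic (0 = identical marginal distributions).
ks_stats = [ks_2samp(real[:, j], synthetic[:, j]).statistic for j in range(real.shape[1])]
print("mean KS statistic:", round(float(np.mean(ks_stats)), 3))

# Privacy proxy: distance from each synthetic record to its nearest real record.
nn = NearestNeighbors(n_neighbors=1).fit(real)
distances, _ = nn.kneighbors(synthetic)
print("min nearest-real distance:", round(float(distances.min()), 3))
# Very small distances suggest the generator may be memorizing real records.
```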
### 8. Data Privacy and Protection
Privacy Principles:
- Data minimization
- Purpose limitation
- Storage limitation
- Accuracy
- Integrity and confidentiality
- Accountability
Privacy Controls:
- **Privacy by Design**
  - Privacy requirements in design phase
  - Default privacy settings
  - End-to-end privacy
  - Proactive, not reactive
- **Data Minimization**
  - Collect only necessary data
  - Aggregate when possible
  - Anonymize where feasible
  - Delete when no longer needed
- **De-identification** (a k-anonymity check is sketched after this list)
  - Remove direct identifiers
  - Suppress quasi-identifiers
  - Generalization and suppression
  - K-anonymity, l-diversity, t-closeness
- **Technical Measures**
  - Encryption (at rest and in transit)
  - Access controls
  - Pseudonymization
  - Differential privacy (for aggregates)
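The k-anonymity check mentioned above can be run before any de-identified dataset is released: every combination of quasi-identifiers must occur at least k times, otherwise those rows need further generalization or suppression. The column names and k=5 below are illustrative assumptions.

```python
import pandas as pd

QUASI_IDENTIFIERS = ["zip_code", "age_band", "gender"]   # hypothetical columns
K = 5

def k_anonymity_violations(df: pd.DataFrame, k: int = K) -> pd.DataFrame:
    """Return quasi-identifier combinations that occur fewer than k times."""
    group_sizes = df.groupby(QUASI_IDENTIFIERS, dropna=False).size().reset_index(name="count")
    return group_sizes[group_sizes["count"] < k]

# Example: flagged rows must be generalized (coarser zip/age bands) or suppressed.
df = pd.DataFrame({
    "zip_code": ["941**", "941**", "100**", "100**", "100**"],
    "age_band": ["30-39", "30-39", "40-49", "40-49", "40-49"],
    "gender":   ["F", "F", "M", "M", "M"],
})
print(k_anonymity_violations(df))
```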
Privacy Impact Assessment:
## PRIVACY IMPACT ASSESSMENT
### 1. DATA PROCESSING DESCRIPTION
- What data is being processed?
- Why is it being processed?
- Who will access the data?
- How long will it be retained?
### 2. NECESSITY AND PROPORTIONALITY
- Is data processing necessary?
- Is it proportionate to the purpose?
- Can the purpose be achieved with less data?
### 3. INDIVIDUAL RIGHTS
- How will individuals exercise rights?
- Right to access
- Right to rectification
- Right to erasure
- Right to restrict processing
- Right to data portability
- Right to object
### 4. RISK ASSESSMENT
- Risk of unauthorized access
- Risk of data breach
- Risk of discrimination
- Risk to fundamental rights
### 5. MITIGATION MEASURES
- Technical measures (encryption, access controls)
- Organizational measures (policies, training)
- Privacy by design features
- Monitoring and audit
### 6. COMPLIANCE
- Legal basis for processing
- GDPR compliance
- Local law compliance
- Contractual requirements
## Data Governance Technology Stack
- **Data Catalog**: Alation, Collibra, DataHub
- **Data Quality**: Great Expectations, Soda, Deequ
- **Data Lineage**: Apache Atlas, Marquez, DataHub
- **Access Control**: Immuta, Privacera, Apache Ranger
- **Privacy**: Privitar, Immuta, Gretel
- **Metadata Management**: DataHub, Amundsen, Atlas
## Best Practices
- Governance First: Establish governance before scaling AI
- Automate Quality: Automated quality checks in pipelines
- Document Everything: Comprehensive metadata and lineage
- Privacy by Design: Build in privacy from the start
- Continuous Monitoring: Real-time quality and compliance
- Regular Audits: Periodic governance audits
- Training: Educate all data users
- Culture: Data quality is everyone's responsibility
- Iterative: Start simple, improve continuously
- Stakeholder Engagement: Involve business and compliance
## Integration with ISO 42001
| Data Governance Area | ISO 42001 Reference |
|---|---|
| Data Quality | Annex A.7.4 (Quality of data for AI systems) |
| Data Lineage | Annex A.7.5 (Data provenance) |
| Data Cataloging | Annex A.7.3 (Acquisition of data), A.7.6 (Data preparation) |
| Access Controls | ISO/IEC 27001 information security controls |
| Training Data | Annex A.7.2 (Data for development and enhancement of AI systems) |
| Privacy | GDPR and local data protection regulations |
| Monitoring | Annex A.6.2.6 (AI system operation and monitoring) |
## Next Steps
- Assess current data governance maturity
- Identify gaps against ISO 42001 requirements
- Establish data governance organization
- Implement data quality standards
- Deploy data catalog and lineage tools
- Implement access controls
- Create training data procedures
- Train teams on data governance
- Monitor and improve continuously
Next Lesson: Model Development Controls - Ensuring AI models are developed with appropriate rigor and controls.