Skip to content
Journal
databricksmedallion-architectureorange-moneydelta-lakesenegalmobile-moneycdp-compliancedata-engineering

Medallion Architecture for Orange Money: Data Pipelines on Azure Databricks

How to ingest, normalize, and analyze millions of Orange Money transactions with a Medallion architecture (Bronze → Silver → Gold) compliant with Senegal's CDP privacy law.

June 16, 20258 minBy Ibrahim Faye

The Real Problem Isn't the API

Orange Money provides a REST API for partners. Getting the transactions isn't the challenge. The challenge is what comes after:

  1. Idempotence: The same transaction can appear twice in two successive API calls. Without deduplication, your aggregates are wrong.
  2. Quality: 2-5% of transactions have zero amounts, invalid statuses, or missing timestamps. Without quality checks, your Gold Layer is contaminated.
  3. CDP Compliance: Phone numbers are personal data under Article 4 of Senegalese law. Storing them in plaintext in a Data Lake is immediate non-compliance.
  4. Volume: 5,000+ transactions/day for an average merchant, 500,000+ for a bank. Without partitioning, Gold queries take 30 seconds instead of 2.

The Medallion Architecture: Why Three Layers?

The Medallion Architecture (Bronze → Silver → Gold), popularized by Databricks, isn't a buzzword. It's a direct answer to the four problems above.

┌──────────────────────────────────────────────┐
│           ORANGE MONEY API                   │
│        (or Mock Generator for dev)           │
└─────────────────┬────────────────────────────┘
                  │
                  ▼
┌──────────────────────────────────────────────┐
│ 🥉 BRONZE — raw_transactions                │
│ • Append-only, never modified               │
│ • PII tokenization at ingestion (CDP)       │
│ • Format: Delta Lake                        │
└─────────────────┬────────────────────────────┘
                  │
                  ▼
┌──────────────────────────────────────────────┐
│ 🥈 SILVER — normalized_transactions         │
│ • Deduplication (ROW_NUMBER)                │
│ • 7 automated quality rules                 │
│ • Enrichment: buckets, peak hours           │
└─────────────────┬────────────────────────────┘
                  │
                  ▼
┌──────────────────────────────────────────────┐
│ 🥇 GOLD — daily_summary, fraud_patterns     │
│ • Dashboard-ready KPIs                      │
│ • Anomaly detection                         │
│ • Partitioned, aggregated, documented       │
└──────────────────────────────────────────────┘
📐
The golden rule: Bronze is immutable. Silver is reliable. Gold is actionable.

Bronze: Ingestion That Protects Personal Data

The Bronze layer receives raw transactions from the Orange Money API. But before writing a single row to Delta Lake, a critical step occurs: phone number tokenization.

Python
# CDP tokenization at ingestion — Art. 44
class PhoneTokenizer:
    """HMAC-SHA256 for CDP compliance."""

    def tokenize(self, phone: str) -> str:
        return hmac.new(
            self.secret, phone.encode(), hashlib.sha256
        ).hexdigest()[:16]

# Before writing to Bronze
if self._tokenizer:
    tx.sender_phone = self._tokenizer.tokenize(tx.sender_phone)
    tx.recipient_phone = self._tokenizer.tokenize(tx.recipient_phone)

Why this matters: Senegal's CDP (Article 44) requires pseudonymization of personal data. By tokenizing at ingestion — before data reaches storage — you eliminate the risk of plaintext PII in your Data Lake. Even unauthorized access to the Bronze table reveals no real phone numbers.

Silver: The Normalization That Makes the Difference

This is where most projects fail. They jump straight from Bronze to Gold — and end up with dashboards showing incorrect totals due to duplicates and corrupted data.

1. Deduplication

Python
window = Window.partitionBy("transaction_id") \
    .orderBy(F.col("_ingested_at").desc())

deduped = bronze \
    .withColumn("_row_num", F.row_number().over(window)) \
    .filter(F.col("_row_num") == 1) \
    .drop("_row_num")

2. Quality Checks

Python
QUALITY_RULES = {
    "amount_positive": F.col("amount") > 0,
    "amount_reasonable": F.col("amount") < 50_000_000,
    "valid_status": F.col("status").isin(
        "SUCCESS", "PENDING", "FAILED", "REVERSED"
    ),
    "fee_non_negative": F.col("fee") >= 0,
}

Rows that fail are not rejected — they are kept with a FAIL flag, enabling audit without data loss.

3. Enrichment

Enriched ColumnExamplePurpose
_date2025-01-15Partitioning
_hour14Hourly analysis
_is_peak_hourTruePeak hours (9am-12pm, 3pm-6pm)
_amount_bucketMEDIUMSegmentation (MICRO → XLARGE)
_is_merchant_txnTrueMerchant vs P2P
_processing_latency_seconds3Initiation → completion delay

Gold: Aggregates That Speak to Decision-Makers

Fraud Detection

PatternRuleExample
VELOCITY_ANOMALY> 20 txns/day per senderAccount sends 45 transfers in one day
AMOUNT_SPIKETxn > 5× 7-day moving average500K FCFA txn (avg: 45K)
RAPID_TRANSFERMultiple transfers to same recipient < 10 min3 transfers of 50K in 4 minutes
OFF_HOURS_LARGETxn > 100K outside peak hours200K payment at 3am

What This Architecture Delivers

BeforeAfter
Ad-hoc Python scripts per employeeSingle, versioned, tested pipeline
Phone numbers in plaintextHMAC-SHA256 tokenization at ingestion
Undetected duplicates → wrong KPIsDeterministic dedup in Silver
2-5% corrupt data in reportsQuality flag on every row, full audit trail
Dashboards: 30s load timePartitioning → 2s load time
No fraud detection4 automated anomaly patterns
Manual deploymentterraform apply → full infrastructure

Source Code

The complete codebase is open-source (MIT):

Working on Orange Money, Wave, or Free Money?

XamXam Graph deploys turnkey Medallion architectures for mobile money data in West Africa — with built-in CDP compliance.

Schedule a Diagnostic