Medallion Architecture for Orange Money: Data Pipelines on Azure Databricks

The Real Problem Isn't the API

Orange Money provides a REST API for partners. Getting the transactions isn't the challenge. The challenge is what comes after:

Idempotence: The same transaction can appear twice in two successive API calls. Without deduplication, your aggregates are wrong.
Quality: 2-5% of transactions have zero amounts, invalid statuses, or missing timestamps. Without quality checks, your Gold Layer is contaminated.
CDP Compliance: Phone numbers are personal data under Article 4 of Senegalese law. Storing them in plaintext in a Data Lake is immediate non-compliance.
Volume: 5,000+ transactions/day for an average merchant, 500,000+ for a bank. Without partitioning, Gold queries take 30 seconds instead of 2.

The Medallion Architecture: Why Three Layers?

The Medallion Architecture (Bronze → Silver → Gold), popularized by Databricks, isn't a buzzword. It's a direct answer to the four problems above.

┌──────────────────────────────────────────────┐
│           ORANGE MONEY API                   │
│        (or Mock Generator for dev)           │
└─────────────────┬────────────────────────────┘
                  │
                  ▼
┌──────────────────────────────────────────────┐
│ 🥉 BRONZE — raw_transactions                │
│ • Append-only, never modified               │
│ • PII tokenization at ingestion (CDP)       │
│ • Format: Delta Lake                        │
└─────────────────┬────────────────────────────┘
                  │
                  ▼
┌──────────────────────────────────────────────┐
│ 🥈 SILVER — normalized_transactions         │
│ • Deduplication (ROW_NUMBER)                │
│ • 7 automated quality rules                 │
│ • Enrichment: buckets, peak hours           │
└─────────────────┬────────────────────────────┘
                  │
                  ▼
┌──────────────────────────────────────────────┐
│ 🥇 GOLD — daily_summary, fraud_patterns     │
│ • Dashboard-ready KPIs                      │
│ • Anomaly detection                         │
│ • Partitioned, aggregated, documented       │
└──────────────────────────────────────────────┘

📐

The golden rule: Bronze is immutable. Silver is reliable. Gold is actionable.

Bronze: Ingestion That Protects Personal Data

The Bronze layer receives raw transactions from the Orange Money API. But before writing a single row to Delta Lake, a critical step occurs: phone number tokenization.

Python

# CDP tokenization at ingestion — Art. 44
class PhoneTokenizer:
    """HMAC-SHA256 for CDP compliance."""

    def tokenize(self, phone: str) -> str:
        return hmac.new(
            self.secret, phone.encode(), hashlib.sha256
        ).hexdigest()[:16]

# Before writing to Bronze
if self._tokenizer:
    tx.sender_phone = self._tokenizer.tokenize(tx.sender_phone)
    tx.recipient_phone = self._tokenizer.tokenize(tx.recipient_phone)

Why this matters: Senegal's CDP (Article 44) requires pseudonymization of personal data. By tokenizing at ingestion — before data reaches storage — you eliminate the risk of plaintext PII in your Data Lake. Even unauthorized access to the Bronze table reveals no real phone numbers.

Silver: The Normalization That Makes the Difference

This is where most projects fail. They jump straight from Bronze to Gold — and end up with dashboards showing incorrect totals due to duplicates and corrupted data.

1. Deduplication

Python

window = Window.partitionBy("transaction_id") \
    .orderBy(F.col("_ingested_at").desc())

deduped = bronze \
    .withColumn("_row_num", F.row_number().over(window)) \
    .filter(F.col("_row_num") == 1) \
    .drop("_row_num")

2. Quality Checks

Python

QUALITY_RULES = {
    "amount_positive": F.col("amount") > 0,
    "amount_reasonable": F.col("amount") < 50_000_000,
    "valid_status": F.col("status").isin(
        "SUCCESS", "PENDING", "FAILED", "REVERSED"
    ),
    "fee_non_negative": F.col("fee") >= 0,
}

Rows that fail are not rejected — they are kept with a FAIL flag, enabling audit without data loss.

3. Enrichment

Enriched Column	Example	Purpose
_date	2025-01-15	Partitioning
_hour	14	Hourly analysis
_is_peak_hour	True	Peak hours (9am-12pm, 3pm-6pm)
_amount_bucket	MEDIUM	Segmentation (MICRO → XLARGE)
_is_merchant_txn	True	Merchant vs P2P
_processing_latency_seconds	3	Initiation → completion delay

Gold: Aggregates That Speak to Decision-Makers

Fraud Detection

Pattern	Rule	Example
VELOCITY_ANOMALY	> 20 txns/day per sender	Account sends 45 transfers in one day
AMOUNT_SPIKE	Txn > 5× 7-day moving average	500K FCFA txn (avg: 45K)
RAPID_TRANSFER	Multiple transfers to same recipient < 10 min	3 transfers of 50K in 4 minutes
OFF_HOURS_LARGE	Txn > 100K outside peak hours	200K payment at 3am

What This Architecture Delivers

Before	After
Ad-hoc Python scripts per employee	Single, versioned, tested pipeline
Phone numbers in plaintext	HMAC-SHA256 tokenization at ingestion
Undetected duplicates → wrong KPIs	Deterministic dedup in Silver
2-5% corrupt data in reports	Quality flag on every row, full audit trail
Dashboards: 30s load time	Partitioning → 2s load time
No fraud detection	4 automated anomaly patterns
Manual deployment	terraform apply → full infrastructure

Source Code

The complete codebase is open-source (MIT):

github.com/ibfaye/orange-money-pipeline

The Real Problem Isn't the API

The Medallion Architecture: Why Three Layers?

Bronze: Ingestion That Protects Personal Data

Silver: The Normalization That Makes the Difference

1. Deduplication

2. Quality Checks

3. Enrichment

Gold: Aggregates That Speak to Decision-Makers

Fraud Detection

What This Architecture Delivers

Source Code

Working on Orange Money, Wave, or Free Money?