The Real Problem Isn't the API
Orange Money provides a REST API for partners. Getting the transactions isn't the challenge. The challenge is what comes after:
- Idempotence: The same transaction can appear twice in two successive API calls. Without deduplication, your aggregates are wrong.
- Quality: 2-5% of transactions have zero amounts, invalid statuses, or missing timestamps. Without quality checks, your Gold Layer is contaminated.
- CDP Compliance: Phone numbers are personal data under Article 4 of Senegalese law. Storing them in plaintext in a Data Lake is immediate non-compliance.
- Volume: 5,000+ transactions/day for an average merchant, 500,000+ for a bank. Without partitioning, Gold queries take 30 seconds instead of 2.
The Medallion Architecture: Why Three Layers?
The Medallion Architecture (Bronze → Silver → Gold), popularized by Databricks, isn't a buzzword. It's a direct answer to the four problems above.
┌──────────────────────────────────────────────┐
│ ORANGE MONEY API │
│ (or Mock Generator for dev) │
└─────────────────┬────────────────────────────┘
│
▼
┌──────────────────────────────────────────────┐
│ 🥉 BRONZE — raw_transactions │
│ • Append-only, never modified │
│ • PII tokenization at ingestion (CDP) │
│ • Format: Delta Lake │
└─────────────────┬────────────────────────────┘
│
▼
┌──────────────────────────────────────────────┐
│ 🥈 SILVER — normalized_transactions │
│ • Deduplication (ROW_NUMBER) │
│ • 7 automated quality rules │
│ • Enrichment: buckets, peak hours │
└─────────────────┬────────────────────────────┘
│
▼
┌──────────────────────────────────────────────┐
│ 🥇 GOLD — daily_summary, fraud_patterns │
│ • Dashboard-ready KPIs │
│ • Anomaly detection │
│ • Partitioned, aggregated, documented │
└──────────────────────────────────────────────┘Bronze: Ingestion That Protects Personal Data
The Bronze layer receives raw transactions from the Orange Money API. But before writing a single row to Delta Lake, a critical step occurs: phone number tokenization.
# CDP tokenization at ingestion — Art. 44
class PhoneTokenizer:
"""HMAC-SHA256 for CDP compliance."""
def tokenize(self, phone: str) -> str:
return hmac.new(
self.secret, phone.encode(), hashlib.sha256
).hexdigest()[:16]
# Before writing to Bronze
if self._tokenizer:
tx.sender_phone = self._tokenizer.tokenize(tx.sender_phone)
tx.recipient_phone = self._tokenizer.tokenize(tx.recipient_phone)Why this matters: Senegal's CDP (Article 44) requires pseudonymization of personal data. By tokenizing at ingestion — before data reaches storage — you eliminate the risk of plaintext PII in your Data Lake. Even unauthorized access to the Bronze table reveals no real phone numbers.
Silver: The Normalization That Makes the Difference
This is where most projects fail. They jump straight from Bronze to Gold — and end up with dashboards showing incorrect totals due to duplicates and corrupted data.
1. Deduplication
window = Window.partitionBy("transaction_id") \
.orderBy(F.col("_ingested_at").desc())
deduped = bronze \
.withColumn("_row_num", F.row_number().over(window)) \
.filter(F.col("_row_num") == 1) \
.drop("_row_num")2. Quality Checks
QUALITY_RULES = {
"amount_positive": F.col("amount") > 0,
"amount_reasonable": F.col("amount") < 50_000_000,
"valid_status": F.col("status").isin(
"SUCCESS", "PENDING", "FAILED", "REVERSED"
),
"fee_non_negative": F.col("fee") >= 0,
}Rows that fail are not rejected — they are kept with a FAIL flag, enabling audit without data loss.
3. Enrichment
| Enriched Column | Example | Purpose |
|---|---|---|
| _date | 2025-01-15 | Partitioning |
| _hour | 14 | Hourly analysis |
| _is_peak_hour | True | Peak hours (9am-12pm, 3pm-6pm) |
| _amount_bucket | MEDIUM | Segmentation (MICRO → XLARGE) |
| _is_merchant_txn | True | Merchant vs P2P |
| _processing_latency_seconds | 3 | Initiation → completion delay |
Gold: Aggregates That Speak to Decision-Makers
Fraud Detection
| Pattern | Rule | Example |
|---|---|---|
| VELOCITY_ANOMALY | > 20 txns/day per sender | Account sends 45 transfers in one day |
| AMOUNT_SPIKE | Txn > 5× 7-day moving average | 500K FCFA txn (avg: 45K) |
| RAPID_TRANSFER | Multiple transfers to same recipient < 10 min | 3 transfers of 50K in 4 minutes |
| OFF_HOURS_LARGE | Txn > 100K outside peak hours | 200K payment at 3am |
What This Architecture Delivers
| Before | After |
|---|---|
| Ad-hoc Python scripts per employee | Single, versioned, tested pipeline |
| Phone numbers in plaintext | HMAC-SHA256 tokenization at ingestion |
| Undetected duplicates → wrong KPIs | Deterministic dedup in Silver |
| 2-5% corrupt data in reports | Quality flag on every row, full audit trail |
| Dashboards: 30s load time | Partitioning → 2s load time |
| No fraud detection | 4 automated anomaly patterns |
| Manual deployment | terraform apply → full infrastructure |
Source Code
The complete codebase is open-source (MIT):