Docs init
This commit is contained in:
135
docs/05_etl_pipeline.md
Normal file
135
docs/05_etl_pipeline.md
Normal file
@@ -0,0 +1,135 @@
|
||||
# ETL Pipeline
|
||||
|
||||
## Overview
|
||||
|
||||
The ETL (Extract, Transform, Load) pipeline is built in **Apache NiFi** and moves data from the MySQL OLTP database into the Oracle Data Mart. It runs 8 sequential pipelines — one per target table — each following the same processor chain.
|
||||
|
||||
For the detailed step-by-step NiFi configuration (which buttons to click, which properties to set), see `nifi/NIFI_SETUP.md`.
|
||||
|
||||
---
|
||||
|
||||
## Two Phases Before NiFi
|
||||
|
||||
### Phase 1 — Seed the OLTP (one-time)
|
||||
|
||||
Before NiFi runs, the MySQL database must be populated. This is done by the C# seed script:
|
||||
|
||||
```bash
|
||||
./docker/start.sh # start MySQL container
|
||||
dotnet run ./scripts/seed.cs
|
||||
```
|
||||
|
||||
The script reads all 10 CSV files, resolves foreign key relationships (game names, country lookups, organization cross-references), and inserts everything in the correct dependency order.
|
||||
|
||||
### Phase 2 — Create the Data Mart schema in Oracle (one-time)
|
||||
|
||||
Run `sql/datamart_schema.sql` against your Oracle lab schema before the first NiFi run. This creates all 8 tables. Oracle SQL Developer or any SQL client works for this.
|
||||
|
||||
---
|
||||
|
||||
## NiFi Processor Chain
|
||||
|
||||
Every pipeline follows this identical 5-processor pattern:
|
||||
|
||||
```
|
||||
ExecuteSQL ConvertAvroToJSON SplitJson
|
||||
(MySQL source) ──► (Avro → JSON) ──► (array → 1 FlowFile per row)
|
||||
│
|
||||
▼
|
||||
EvaluateJsonPath
|
||||
(JSON fields → FlowFile attributes)
|
||||
│
|
||||
▼
|
||||
PutSQL
|
||||
(Oracle target)
|
||||
```
|
||||
|
||||
| Processor | Role |
|
||||
|---|---|
|
||||
| **ExecuteSQL** | Runs the extract SQL on MySQL, produces Avro-encoded records |
|
||||
| **ConvertAvroToJSON** | Converts the Avro binary to a JSON array |
|
||||
| **SplitJson** | Splits the JSON array into one FlowFile per record |
|
||||
| **EvaluateJsonPath** | Reads each field from the JSON record and stores it as a named FlowFile attribute |
|
||||
| **PutSQL** | Runs the Oracle INSERT statement, substituting `${attribute}` placeholders with the FlowFile attribute values |
|
||||
|
||||
---
|
||||
|
||||
## The 8 Pipelines
|
||||
|
||||
Pipelines must run in this order because facts depend on all dimensions being loaded first.
|
||||
|
||||
### Dimensions (run first, in any order among themselves)
|
||||
|
||||
| # | Pipeline | Extract source | Rows |
|
||||
|---|---|---|---|
|
||||
| 1 | DIM_DATE | MySQL CTE (generates date range) | 48 |
|
||||
| 2 | DIM_MEDAL | No extract — 3 static rows inserted directly to Oracle | 3 |
|
||||
| 3 | DIM_GAME | `game` table | 25 |
|
||||
| 4 | DIM_COUNTRY | `country` table | 36+ |
|
||||
| 5 | DIM_ORGANIZATION | `organization` JOIN `country` | 60+ |
|
||||
|
||||
### Facts (run after all dimensions)
|
||||
|
||||
| # | Pipeline | Extract source | Rows |
|
||||
|---|---|---|---|
|
||||
| 6 | FACT_TOURNAMENT | `tournament` JOIN `schedule` JOIN `organization` | 27 |
|
||||
| 7 | FACT_MEDAL_AWARD | `medalist` JOIN `tournament` | 257 |
|
||||
| 8 | FACT_CLUB_STANDING | `club_championship_standing` | 24 |
|
||||
|
||||
---
|
||||
|
||||
## How Key Resolution Works
|
||||
|
||||
The OLTP stores natural keys (e.g. `game_id = 3`). The Oracle Data Mart uses surrogate keys generated by `GENERATED ALWAYS AS IDENTITY` (e.g. `game_key = 3`). These can differ if there is ever a gap or reorder.
|
||||
|
||||
The fact load SQL handles this by embedding a sub-SELECT inside each `INSERT...SELECT...FROM DUAL`:
|
||||
|
||||
```sql
|
||||
INSERT INTO FACT_TOURNAMENT (game_key, ...)
|
||||
SELECT
|
||||
(SELECT game_key FROM DIM_GAME WHERE game_id = ${game_id}),
|
||||
...
|
||||
FROM DUAL
|
||||
```
|
||||
|
||||
This means the Oracle database itself resolves the surrogate key at insert time, using the natural key that was extracted from MySQL and carried through as a FlowFile attribute. No transformation processor is needed.
|
||||
|
||||
---
|
||||
|
||||
## SQL File Layout
|
||||
|
||||
```
|
||||
nifi/sql/
|
||||
├── extract/
|
||||
│ ├── 01_dim_date.sql Recursive CTE to generate calendar rows
|
||||
│ ├── 02_dim_game.sql Simple SELECT from game
|
||||
│ ├── 03_dim_country.sql Simple SELECT from country
|
||||
│ ├── 04_dim_organization.sql SELECT with LEFT JOIN to country for country name
|
||||
│ ├── 05_fact_tournament.sql JOIN with schedule and org; computes date keys
|
||||
│ ├── 06_fact_medal_award.sql JOIN with tournament; computes medal_points
|
||||
│ └── 07_fact_club_standing.sql
|
||||
└── load/
|
||||
├── 01_dim_date.sql INSERT with TO_DATE conversion for Oracle
|
||||
├── 02_dim_medal.sql Static 3-row INSERT, run once directly
|
||||
├── 03_dim_game.sql
|
||||
├── 04_dim_country.sql
|
||||
├── 05_dim_organization.sql NULL-safe EL expressions for optional fields
|
||||
├── 06_fact_tournament.sql Sub-SELECT key lookups via game_id, org_id
|
||||
├── 07_fact_medal_award.sql Sub-SELECT key lookups; medal_type lookup
|
||||
└── 08_fact_club_standing.sql Sub-SELECT key lookup via org_id
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Transformations Performed
|
||||
|
||||
Most of the "transformation" work happens in the extract SQL rather than in NiFi processors, which keeps the NiFi flow simple.
|
||||
|
||||
| Transformation | Where it happens |
|
||||
|---|---|
|
||||
| Date → YYYYMMDD integer key | `CAST(DATE_FORMAT(..., '%Y%m%d') AS UNSIGNED)` in extract SQL |
|
||||
| Medal → medal_points (Gold=3, Silver=2, Bronze=1) | `CASE` expression in extract SQL |
|
||||
| Boolean → 0/1 | `CASE WHEN ... = 1 THEN 1 ELSE 0 END` in extract SQL |
|
||||
| Natural key → surrogate key | Sub-SELECT in Oracle load SQL |
|
||||
| Duration calculation fallback | `COALESCE(s.duration_days, DATEDIFF(...) + 1)` in extract SQL |
|
||||
| NULL handling for optional fields | NiFi Expression Language `:isEmpty():ifElse(...)` in load SQL |
|
||||
Reference in New Issue
Block a user