Docs init
83
docs/01_overview.md
Normal file
@@ -0,0 +1,83 @@
# Project Overview

## What This Project Is

This project builds a complete data warehousing pipeline for the **Esports World Cup 2025 (EWC 2025)**, the world's largest esports event, held in Riyadh, Saudi Arabia, across 27 tournaments from July to August 2025 with a total prize pool exceeding $100 million.

The goal is to take raw event data, load it into a structured transactional database, and then transform it into a Data Mart optimized for analytical reporting. The final output is a star schema in Oracle that can be connected to Power BI to answer business questions about prize distribution, country performance, club rankings, and more.

---

## Technology Stack

| Layer | Tool |
|---|---|
| Source data | Kaggle CSV dataset (10 files) |
| OLTP database | MySQL 8.4 (Docker container) |
| ETL pipeline | Apache NiFi |
| Data Mart | Oracle (university lab schema) |
| Reporting | Microsoft Power BI |
| Infrastructure | Docker / Podman |
| Seed script | .NET 10 (single-file C# script) |

---

## Architecture

```
┌─────────────────────┐
│  Kaggle CSV files   │  10 files, ~750 rows total
│  (./data/)          │
└────────┬────────────┘
         │  dotnet run ./scripts/seed.cs
         ▼
┌─────────────────────┐
│  MySQL 8.4 OLTP     │  Normalized relational schema
│  port 13306         │  14 tables, 3NF
│  (Docker)           │
└────────┬────────────┘
         │  Apache NiFi ETL
         │  ExecuteSQL → ConvertAvroToJSON → SplitJson
         │  → EvaluateJsonPath → PutSQL
         ▼
┌─────────────────────┐
│  Oracle Data Mart   │  Star schema
│  (university lab)   │  3 fact tables, 5 dimension tables
└────────┬────────────┘
         │  Import / Live connection
         ▼
┌─────────────────────┐
│  Power BI Reports   │  OLAP analytics, 2 dashboards
└─────────────────────┘
```

---

## Project Structure

```
IPZ_1/
├── data/                       Raw Kaggle CSV files (source data)
├── sql/
│   ├── schema.sql              MySQL OLTP schema DDL
│   └── datamart_schema.sql     Oracle Data Mart DDL
├── scripts/
│   └── seed.cs                 .NET 10 script to populate MySQL from CSVs
├── docker/
│   ├── start.sh / stop.sh      Linux (Docker or Podman)
│   └── start.ps1 / stop.ps1    Windows
├── nifi/
│   ├── sql/extract/            MySQL queries (one per ETL pipeline)
│   ├── sql/load/               Oracle INSERT statements (one per ETL pipeline)
│   └── NIFI_SETUP.md           Step-by-step NiFi configuration guide
└── docs/                       This documentation
```

---

## Data Flow Summary

1. **Raw data** lives as 10 CSV files exported from Kaggle covering EWC 2025.
2. **Seeding**: a single C# script reads all CSVs, resolves foreign key relationships, and populates the MySQL OLTP database in dependency order.
3. **ETL**: Apache NiFi runs 8 pipelines, one per Data Mart table. Each pipeline except the static `DIM_MEDAL` load reads from MySQL and inserts the extracted rows into an Oracle dimension or fact table.
4. **Reporting**: Power BI connects to Oracle and queries the star schema for OLAP analysis.
50
docs/02_dataset.md
Normal file
@@ -0,0 +1,50 @@
# Dataset

## Source

The data comes from a Kaggle dataset titled **"Esports World Cup 2025 — Complete Dataset"**, released under the CC BY 4.0 license. It was compiled from publicly available tournament results, official brackets, and club partnership announcements.

The dataset covers all 27 title tournaments of EWC 2025, held at Boulevard City, Riyadh, Saudi Arabia from **July 8 to August 24, 2025**. The event featured a **$100M+ total prize pool** and competitors from **36+ countries**.

---

## Files

| File | Rows | Description |
|---|---|---|
| `01_EWC2025_Event_Tournament_Summary.csv` | 27 | One row per tournament: dates, prize pool, winner, game type |
| `02_EWC2025_Medalists.csv` | 257 | Every gold/silver/bronze medalist with country and role |
| `03_EWC2025_Club_Championship_Standings.csv` | 24 | Final Club Championship rankings: points and prize money |
| `04_EWC2025_Club_Partner_Program.csv` | 40 | Official partner organizations: region, founding year, social following |
| `05_EWC2025_Player_Roster.csv` | 272 | Full player roster: age, experience, prize earned, social followers |
| `06_EWC2025_Prize_Pool_Distribution.csv` | 5 | How the $100M+ prize pool was split across categories |
| `07_EWC2025_Calendar_Schedule.csv` | 27 | Weekly tournament schedule with venue and timezone |
| `08_EWC2025_Country_Results.csv` | 36 | Medal tally by country with player counts |
| `09_EWC2025_Point_System.csv` | 16 | Club Championship point system by placement |
| `10_EWC2025_Game_by_Game_Results.csv` | 50 | Match results: scores, map, duration, MVP |

---

## Key Entities in the Data

**Games**: 25 unique titles spanning 8 genres (MOBA, FPS, Battle Royale, Fighting, RTS, Sports, Auto Battler, Strategy) across PC, mobile, and console platforms.

**Organizations**: 60+ esports clubs. 40 are official EWC Club Partners with full metadata (founding year, HQ, social following). The rest appear only through match results and medalist records.

**Players**: 272 players in the roster with demographics (age, country, region), performance (tournament placement, prize earned), and social media reach.

**Tournaments**: 27 events across the 25 titles. Two titles account for two events each: Mobile Legends: Bang Bang ran separate Men and Women brackets, and Naraka: Bladepoint ran Solo and Trios formats simultaneously.

**Club Championship**: A meta-competition running across all 27 tournaments. Clubs accumulate points based on their placements in each event. The top 24 clubs share a $27M prize pool.

---

## Data Quality Notes

A few things to be aware of when working with this data:

- **Game name inconsistencies across files.** Files 02, 07, and 10 use expanded names like `"Mobile Legends: Bang Bang - Men"` and `"Naraka: Bladepoint - Solo"`, while file 01 uses the base game name with a separate Gender column. The seed script handles this normalization automatically.
- **Abbreviated game names in file 04.** The `Games_Competing` column uses shorthand like `"Mobile Legends"` instead of `"Mobile Legends: Bang Bang"`, and `"PUBG"`, which is ambiguous. These are resolved via an alias map in the seed script.
- **Mixed winner types in file 01.** For individual-format games (Chess, StarCraft II, Street Fighter 6, Tekken 8, EA Sports FC 25), the `Winner` column contains a player name rather than a team name. This is why `winner` and `runner_up` are stored as plain text in the OLTP rather than as foreign keys.
- **`Battlegrounds Mobile India`** appears in file 04 as a game one organization competes in, but it is not present as an EWC 2025 tournament. This entry is skipped during seeding.
- **Prize earnings for players** (file 05) are all zero in the dataset, likely because individual prize splits were not publicly available at the time of compilation.
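
The name normalization described in the notes above can be pictured as a small lookup step. The following is only a sketch, written in Python for brevity even though the project's seed script is C#. The names `ALIASES`, `SKIPPED`, `VARIANT_SUFFIX`, and `normalize_game` are hypothetical, and treating the `" - Men"` style suffix as a separate variant value is an assumption. How the ambiguous `"PUBG"` entry is resolved is not stated here, so it is left out.

```python
import re

# Hypothetical alias map: shorthand used in file 04 -> canonical game name.
ALIASES = {
    "Mobile Legends": "Mobile Legends: Bang Bang",
}

# Entries that are not EWC 2025 tournaments and are skipped during seeding.
SKIPPED = {"Battlegrounds Mobile India"}

# Expanded names in files 02, 07, and 10 carry a " - <variant>" suffix.
VARIANT_SUFFIX = re.compile(r"\s+-\s+(Men|Women|Solo|Trios)$")


def normalize_game(raw):
    """Return (base_name, variant), or None for entries that are skipped."""
    name = raw.strip()
    if name in SKIPPED:
        return None
    variant = None
    match = VARIANT_SUFFIX.search(name)
    if match:
        variant = match.group(1)
        name = name[: match.start()]
    return ALIASES.get(name, name), variant


print(normalize_game("Mobile Legends: Bang Bang - Men"))
print(normalize_game("Mobile Legends"))
print(normalize_game("Battlegrounds Mobile India"))
```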
140
docs/03_oltp_schema.md
Normal file
@@ -0,0 +1,140 @@
# OLTP Database

## Overview

The OLTP (Online Transaction Processing) database is a normalized relational schema implemented in **MySQL 8.4**. It serves as the authoritative source of record for all EWC 2025 data and as the source for the ETL pipeline that populates the Data Mart.

The schema is in **Third Normal Form (3NF)**: no transitive dependencies, no repeating groups, and every non-key attribute depends on the whole key.

The DDL is in `sql/schema.sql`. The database runs locally in Docker on port **13306**.

---

## Tables

### Lookup Tables

These have no foreign keys and are loaded first.

**`game`**: The 25 unique game titles that appear at EWC 2025.

| Column | Type | Notes |
|---|---|---|
| game_id | INT UNSIGNED | Auto-increment PK |
| name | VARCHAR(100) | Unique |
| game_type | VARCHAR(50) | MOBA, FPS, Battle Royale, Fighting, etc. |
| platform | VARCHAR(50) | PC, Mobile, Console/PC, etc. |

---

**`country`**: All countries represented at the event, enriched with medal tallies from file 08.

| Column | Type | Notes |
|---|---|---|
| country_id | INT UNSIGNED | Auto-increment PK |
| name | VARCHAR(100) | Unique |
| region | VARCHAR(50) | Asia, Europe, North America, etc. |
| gold_medals | TINYINT UNSIGNED | From file 08 (0 if not in file 08) |
| silver_medals | TINYINT UNSIGNED | |
| bronze_medals | TINYINT UNSIGNED | |
| total_medals | TINYINT UNSIGNED | |
| total_players | SMALLINT UNSIGNED | |
| top_game | VARCHAR(100) | Game the country performed best in |

---

**`point_system`**: The Club Championship scoring table (16 rows covering placements 1–8 under Standard and Co-Placement rules).

**`prize_pool_category`**: The 5 high-level prize pool categories (Game Championships, Club Championship, Qualifiers, MVP Awards, Club Partner Support).

---

### Core Entities

**`organization`**: All esports clubs and teams that appear anywhere in the data. 40 rows come from the Club Partner Program file with full metadata; the remaining organizations are inserted with NULL for partner-specific fields.

| Column | Notes |
|---|---|
| club_partner_status | ENUM: Current / New / None |
| top_8_2024 | Whether the club finished top-8 at EWC 2024 |
| social_media_followers_m | Total following in millions |

---

**`tournament`**: One row per tournament event (27 total). The `winner` and `runner_up` columns are stored as plain VARCHAR rather than foreign keys to `organization` because individual-format games (Chess, StarCraft II, etc.) list a player name as the winner, not a team.

| Column | Notes |
|---|---|
| gender | ENUM: Open / Men / Women |
| club_championship_points | Whether this tournament awarded Club Championship points |

---

**`schedule`**: A 1:1 extension of `tournament` holding the schedule metadata from file 07 (week number, venue, timezone, duration). Kept separate to avoid widening the tournament row.

---

**`player`**: 272 players from the official roster. Uses the natural key from the dataset (`EWC2025_001`, etc.) as the primary key rather than an auto-increment, since the source data provides stable identifiers.

---

**`medalist`**: One row per player-medal. A solo player who wins Gold contributes one row. A five-player team winning Gold contributes five rows. This is the most granular performance record in the OLTP.

---

**`match_result`**: 50 match records from file 10. The `team_1`, `team_2`, and `winner` columns are VARCHAR for the same reason as `tournament.winner`: individual-format games list player names here.

---

**`club_championship_standing`**: Final standings for the 24 clubs that earned Club Championship points. Each row belongs to exactly one `organization`, and an organization has at most one standing row.

---

### Junction Tables

Two multi-valued columns from the source data are normalized into junction tables:

**`organization_game_competing`**: Resolves the comma-separated `Games_Competing` column from the Club Partner Program (e.g. `"Dota 2, Chess, EA Sports FC 25, Counter-Strike 2"`).

**`organization_game_won`**: Resolves the comma-separated `Games_Won` column from the Club Championship standings.
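
As a rough illustration of how a comma-separated cell becomes junction rows, the sketch below splits the cell and pairs each resolved game with the owning organization. It is written in Python rather than the seed script's C#, and `split_games`, `junction_rows`, the organization ID `7`, and the game IDs are all hypothetical.

```python
def split_games(cell):
    """Split a comma-separated cell into trimmed, de-duplicated game names."""
    seen = []
    for part in cell.split(","):
        name = part.strip()
        if name and name not in seen:
            seen.append(name)
    return seen


def junction_rows(org_id, cell, game_ids):
    """Build (organization_id, game_id) pairs for names that resolve to a game."""
    return [(org_id, game_ids[name]) for name in split_games(cell) if name in game_ids]


# Hypothetical IDs; the real values come from the `game` table.
game_ids = {"Dota 2": 1, "Chess": 2, "EA Sports FC 25": 3, "Counter-Strike 2": 4}
rows = junction_rows(7, "Dota 2, Chess, EA Sports FC 25, Counter-Strike 2", game_ids)
print(rows)
```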

---

## Entity Relationships

```
game ──────────────────────────────────────────────┐
  │                                                │
  ├──► tournament ──► schedule                     │
  │        │                                       │
  │        ├──► medalist ◄── organization ◄────────┤
  │        │         └──► country                  │
  │        └──► match_result                       │
  │                                                │
  └──► player ◄── organization                     │
          └──► country                             │
                                                   │
organization ──► club_championship_standing        │
  │                                                │
  ├──► organization_game_competing ────────────────┤
  └──► organization_game_won ──────────────────────┘

point_system        (standalone lookup)
prize_pool_category (standalone lookup)
```

---

## Design Decisions

**Why is `country` a table rather than a VARCHAR column?**
Country appears in players, medalists, and organizations. Storing it as a table avoids duplicating the region attribute and allows medal stats (from file 08) to be joined in without repeating them on every player row.

**Why does `tournament.winner` stay as VARCHAR?**
Enforcing a foreign key to `organization` would require creating dummy organization rows for individual players like "Magnus Carlsen" or "Serral". That would pollute the organization table with data that isn't an organization. The clean solution is to keep it as text and resolve it at query time when needed.
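
One way to picture "resolve it at query time" is a `LEFT JOIN` on the winner text. The sketch below uses Python's built-in SQLite as a stand-in for MySQL, with simplified tables and illustrative rows (the event name is a placeholder, and the `organization_id` and `tournament_id` column names are assumptions).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE organization (organization_id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE tournament (tournament_id INTEGER PRIMARY KEY, event_name TEXT, winner TEXT);
    -- Illustrative rows only.
    INSERT INTO organization (name) VALUES ('Team Falcons');
    INSERT INTO tournament (event_name, winner) VALUES
        ('Team event (placeholder)', 'Team Falcons'),
        ('Chess', 'Magnus Carlsen');
""")

# LEFT JOIN keeps individual winners; their organization_id is simply NULL.
rows = conn.execute("""
    SELECT t.event_name, t.winner, o.organization_id
    FROM tournament AS t
    LEFT JOIN organization AS o ON o.name = t.winner
    ORDER BY t.tournament_id
""").fetchall()
print(rows)
```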

**Why is `schedule` a separate table from `tournament`?**
A 1:1 split is justified here because the schedule data comes from a completely different source file (file 07) and is conceptually distinct: it describes the logistics of the event, not the competitive outcome. Keeping it separate makes the ETL cleaner and the tournament table less wide.

**Why use `player_id VARCHAR(20)` instead of AUTO_INCREMENT?**
The source dataset provides stable IDs (`EWC2025_001` through `EWC2025_272`). Using the natural key preserves traceability back to the source without adding a meaningless surrogate.
198
docs/04_datamart.md
Normal file
@@ -0,0 +1,198 @@
# Data Mart

## What a Data Mart Is

A Data Mart is a database optimized for reading and analysis rather than for recording transactions. While the OLTP schema is normalized to avoid redundancy, the Data Mart is deliberately denormalized into a **star schema** (a central fact table surrounded by dimension tables) so that analytical queries are fast and simple to write.

In a star schema:
- **Fact tables** hold measurable events with numeric metrics (prize money, medal count, points)
- **Dimension tables** hold descriptive context that you slice and filter by (game type, country, organization region)

The Data Mart is stored in the **Oracle university lab schema** and populated by Apache NiFi reading from the MySQL OLTP.

The DDL is in `sql/datamart_schema.sql`.

---

## Dimensions

### DIM_DATE

A standard calendar dimension covering every date in the EWC 2025 event window (July 8 – August 24, 2025). Using a dedicated date dimension allows Power BI to filter by week, group by month, or compare by quarter with no extra calculation.

| Column | Example |
|---|---|
| date_key | 20250708 (YYYYMMDD integer) |
| full_date | 2025-07-08 |
| year | 2025 |
| quarter | 3 |
| month / month_name | 7 / July |
| week_number | 28 |
| day_of_month / day_name | 8 / Tuesday |
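
To make the column definitions concrete, here is a minimal Python sketch that derives the same attributes for every day in the event window. The project itself generates these rows with a MySQL recursive CTE; `dim_date_rows` is a hypothetical helper, and using the ISO 8601 week for `week_number` is an assumption that happens to agree with the example value above.

```python
from datetime import date, timedelta


def dim_date_rows(start, end):
    """Yield one dict per calendar day from start to end, inclusive."""
    day = start
    while day <= end:
        yield {
            "date_key": int(day.strftime("%Y%m%d")),
            "full_date": day.isoformat(),
            "year": day.year,
            "quarter": (day.month - 1) // 3 + 1,
            "month": day.month,
            "month_name": day.strftime("%B"),
            "week_number": day.isocalendar()[1],  # ISO 8601 week (assumption)
            "day_of_month": day.day,
            "day_name": day.strftime("%A"),
        }
        day += timedelta(days=1)


rows = list(dim_date_rows(date(2025, 7, 8), date(2025, 8, 24)))
print(len(rows), rows[0])
```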

---

### DIM_GAME

Describes each of the 25 game titles. Enables slicing facts by genre (MOBA vs FPS vs Battle Royale) and by platform (PC vs Mobile vs Console).

| Column | Example |
|---|---|
| name | Counter-Strike 2 |
| game_type | FPS |
| platform | PC |

---

### DIM_COUNTRY

Countries with their geographic region. Intentionally kept lean: the medal counts that live in the OLTP `country` table are not carried into this dimension because they are derived facts, not descriptive attributes.

| Column | Example |
|---|---|
| name | South Korea |
| region | Asia |

---

### DIM_ORGANIZATION

All esports clubs and teams. Includes partner metadata to enable analysis by partner tier (Current partner vs non-partner) and social reach.

| Column | Example |
|---|---|
| name | Team Falcons |
| region | Middle East |
| country | Saudi Arabia |
| club_partner_status | Current |
| founded_year | 2017 |
| social_media_followers_m | 4.0 |

---

### DIM_MEDAL

A simple three-row table representing the medal types. Includes `medal_rank` (1/2/3) so reports can sort Gold → Silver → Bronze correctly without relying on alphabetical ordering.

| medal_type | medal_rank |
|---|---|
| Gold | 1 |
| Silver | 2 |
| Bronze | 3 |

---

## Fact Tables

### FACT_TOURNAMENT

**Grain:** one row per tournament (27 rows).

This is the primary financial fact table. It answers questions about prize money distribution across games, genres, platforms, and time.

| Column | Type | Description |
|---|---|---|
| game_key | FK → DIM_GAME | What game |
| start_date_key | FK → DIM_DATE | When it started |
| end_date_key | FK → DIM_DATE | When it ended |
| winner_org_key | FK → DIM_ORGANIZATION | Winning organization (NULL for individual-winner events) |
| event_name | text | Degenerate dimension |
| gender | text | Open / Men / Women |
| **prize_pool_usd** | measure | Total prize pool in USD |
| **num_participants** | measure | Number of competing teams/players |
| **duration_days** | measure | Tournament length in days |
| **has_club_points** | measure | 1 if tournament awarded Club Championship points |

**Example questions this enables:**
- What was the total prize money awarded to MOBA tournaments vs FPS tournaments?
- Which platform (PC or Mobile) had higher average prize pools?
- How did prize pools vary across the roughly seven weeks of the event (July 8 – August 24)?

---

### FACT_MEDAL_AWARD

**Grain:** one row per player-medal (257 rows).

This fact table captures individual competitive performance. Each medalist player contributes one row with a `medal_count` of 1 and a `medal_points` of 3/2/1. Both columns are additive: you can SUM them freely to get team medal totals, country medal totals, etc.

| Column | Type | Description |
|---|---|---|
| game_key | FK → DIM_GAME | Game the medal was won in |
| medal_key | FK → DIM_MEDAL | Gold / Silver / Bronze |
| country_key | FK → DIM_COUNTRY | Player's nationality |
| org_key | FK → DIM_ORGANIZATION | Player's team |
| date_key | FK → DIM_DATE | Tournament start date |
| player_name | text | Degenerate dimension |
| **medal_count** | measure | Always 1, additive for totals |
| **medal_points** | measure | Gold=3, Silver=2, Bronze=1 |

**Example questions this enables:**
- Which country won the most medals overall? By region?
- Which game genre produced the most medals for Asian countries?
- Which organization accumulated the most medal points across all events?
- Did South Korea dominate PC games while Southeast Asia dominated mobile games?
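
Questions like these reduce to a join from the fact table to one dimension followed by a `SUM`. The sketch below shows the shape of such a query using Python's built-in SQLite as a stand-in for Oracle. The tables are trimmed to the columns needed, and the rows are made up for illustration rather than taken from EWC 2025 results.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE DIM_COUNTRY (country_key INTEGER PRIMARY KEY, name TEXT, region TEXT);
    CREATE TABLE DIM_MEDAL (medal_key INTEGER PRIMARY KEY, medal_type TEXT, medal_rank INTEGER);
    CREATE TABLE FACT_MEDAL_AWARD (
        country_key  INTEGER REFERENCES DIM_COUNTRY (country_key),
        medal_key    INTEGER REFERENCES DIM_MEDAL (medal_key),
        medal_count  INTEGER,
        medal_points INTEGER
    );
    -- Made-up rows for illustration only.
    INSERT INTO DIM_COUNTRY VALUES (1, 'Country A', 'Asia'), (2, 'Country B', 'Europe');
    INSERT INTO DIM_MEDAL VALUES (1, 'Gold', 1), (2, 'Silver', 2), (3, 'Bronze', 3);
    INSERT INTO FACT_MEDAL_AWARD VALUES (1, 1, 1, 3), (1, 1, 1, 3), (1, 3, 1, 1), (2, 2, 1, 2);
""")

# Both measures are additive, so a plain SUM per country is enough.
totals = conn.execute("""
    SELECT c.name, SUM(f.medal_count) AS medals, SUM(f.medal_points) AS points
    FROM FACT_MEDAL_AWARD AS f
    JOIN DIM_COUNTRY AS c ON c.country_key = f.country_key
    GROUP BY c.name
    ORDER BY points DESC
""").fetchall()
print(totals)
```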

---

### FACT_CLUB_STANDING

**Grain:** one row per club in the Club Championship (24 rows). This is a snapshot: it represents the final standings at the end of EWC 2025.

| Column | Type | Description |
|---|---|---|
| org_key | FK → DIM_ORGANIZATION | The club |
| **final_rank** | measure | Final position (1 = best) |
| **total_points** | measure | Total Club Championship points earned |
| **prize_money_usd** | measure | Prize money from Club Championship |
| **tournament_wins** | measure | Number of tournaments the club won |
| **top_8_finishes** | measure | Total top-8 tournament finishes |
| **eligible_to_win** | measure | 1 if the club was eligible for the grand prize |

**Example questions this enables:**
- How does prize money correlate with tournament wins vs breadth of top-8 finishes?
- Do Middle Eastern clubs outperform European clubs in the Club Championship?
- What is the average total_points for Current club partners vs non-partners?

---

## Star Schema Diagram

```
                        DIM_DATE
                      ┌──────────┐
                      │ date_key │
                      └────┬─────┘
                           │ start/end
                           │
DIM_GAME ────────── FACT_TOURNAMENT ────────── DIM_ORGANIZATION
(game_key)          (prize_pool_usd            (org_key)
                     num_participants
                     duration_days
                     has_club_points)


DIM_COUNTRY ──┐
DIM_ORGAN.  ──┼── FACT_MEDAL_AWARD ──── DIM_GAME
DIM_MEDAL   ──┤   (medal_count          (game_key)
DIM_DATE ─────┘    medal_points)


DIM_ORGANIZATION ── FACT_CLUB_STANDING
                    (total_points
                     prize_money_usd
                     tournament_wins
                     top_8_finishes)
```

---

## Why Three Fact Tables

A single fact table would require choosing one grain, which would make some analyses awkward or impossible.

- `FACT_TOURNAMENT` is at tournament grain, so you cannot get per-player medal counts from it.
- `FACT_MEDAL_AWARD` is at player-medal grain, so you cannot get prize pool totals from it without denormalizing tournament data into it.
- `FACT_CLUB_STANDING` captures a snapshot that has no natural place in the other two tables.

Keeping them separate means each fact table has a clean, single grain. Power BI can build relationships between them through the shared dimensions.
135
docs/05_etl_pipeline.md
Normal file
@@ -0,0 +1,135 @@
# ETL Pipeline

## Overview

The ETL (Extract, Transform, Load) pipeline is built in **Apache NiFi** and moves data from the MySQL OLTP database into the Oracle Data Mart. It runs 8 pipelines, one per target table. All of them follow the same processor chain except `DIM_MEDAL`, whose three static rows are inserted directly into Oracle.

For the detailed step-by-step NiFi configuration (which buttons to click, which properties to set), see `nifi/NIFI_SETUP.md`.

---

## Two Phases Before NiFi

### Phase 1: Seed the OLTP (one-time)

Before NiFi runs, the MySQL database must be populated. This is done by the C# seed script:

```bash
./docker/start.sh             # start MySQL container
dotnet run ./scripts/seed.cs
```

The script reads all 10 CSV files, resolves foreign key relationships (game names, country lookups, organization cross-references), and inserts everything in the correct dependency order.

### Phase 2: Create the Data Mart schema in Oracle (one-time)

Run `sql/datamart_schema.sql` against your Oracle lab schema before the first NiFi run. This creates all 8 tables. Oracle SQL Developer or any SQL client works for this.

---

## NiFi Processor Chain

Every pipeline that extracts from MySQL follows the same 5-processor pattern:

```
  ExecuteSQL     ConvertAvroToJSON             SplitJson
(MySQL source) ──► (Avro → JSON) ──► (array → 1 FlowFile per row)
                                                   │
                                                   ▼
                                           EvaluateJsonPath
                                  (JSON fields → FlowFile attributes)
                                                   │
                                                   ▼
                                                PutSQL
                                            (Oracle target)
```

| Processor | Role |
|---|---|
| **ExecuteSQL** | Runs the extract SQL on MySQL, produces Avro-encoded records |
| **ConvertAvroToJSON** | Converts the Avro binary to a JSON array |
| **SplitJson** | Splits the JSON array into one FlowFile per record |
| **EvaluateJsonPath** | Reads each field from the JSON record and stores it as a named FlowFile attribute |
| **PutSQL** | Runs the Oracle INSERT statement, substituting `${attribute}` placeholders with the FlowFile attribute values |
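
The data shapes that move through this chain can be imitated in a few lines of Python. This is not NiFi code, only a sketch of what each stage hands to the next. The payload rows, the `LOAD_SQL` statement, and its column list are hypothetical (the real statements live in `nifi/sql/load/`). Python's `string.Template` is used because it happens to share the `${name}` placeholder syntax. Because the substitution is plain text, a value containing a quote character would need escaping before it reaches the statement.

```python
import json
from string import Template

# Stand-in for the JSON array that ConvertAvroToJSON would emit (made-up rows).
payload = json.dumps([
    {"game_id": 1, "name": "Counter-Strike 2", "game_type": "FPS", "platform": "PC"},
    {"game_id": 2, "name": "Example Game", "game_type": "MOBA", "platform": "Mobile"},
])

# Hypothetical load statement with ${attribute} placeholders.
LOAD_SQL = Template(
    "INSERT INTO DIM_GAME (game_id, name, game_type, platform) "
    "VALUES (${game_id}, '${name}', '${game_type}', '${platform}')"
)

statements = []
for record in json.loads(payload):  # SplitJson: one record per FlowFile
    # EvaluateJsonPath: each JSON field becomes a string attribute.
    attributes = {key: str(value) for key, value in record.items()}
    # PutSQL: placeholders are replaced by attribute values.
    statements.append(LOAD_SQL.substitute(attributes))

for statement in statements:
    print(statement)
```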

---

## The 8 Pipelines

All dimension pipelines must finish before any fact pipeline starts, because the fact loads look up dimension keys at insert time.

### Dimensions (run first, in any order among themselves)

| # | Pipeline | Extract source | Rows |
|---|---|---|---|
| 1 | DIM_DATE | MySQL CTE (generates date range) | 48 |
| 2 | DIM_MEDAL | No extract: 3 static rows inserted directly into Oracle | 3 |
| 3 | DIM_GAME | `game` table | 25 |
| 4 | DIM_COUNTRY | `country` table | 36+ |
| 5 | DIM_ORGANIZATION | `organization` JOIN `country` | 60+ |

### Facts (run after all dimensions)

| # | Pipeline | Extract source | Rows |
|---|---|---|---|
| 6 | FACT_TOURNAMENT | `tournament` JOIN `schedule` JOIN `organization` | 27 |
| 7 | FACT_MEDAL_AWARD | `medalist` JOIN `tournament` | 257 |
| 8 | FACT_CLUB_STANDING | `club_championship_standing` | 24 |

---

## How Key Resolution Works

The OLTP identifies rows by its own keys (e.g. `game_id = 3`), and the Data Mart carries these along as natural keys. The Oracle Data Mart additionally assigns surrogate keys generated by `GENERATED ALWAYS AS IDENTITY` (e.g. `game_key = 3`). The two values can differ whenever there is a gap or a different load order.

The fact load SQL handles this by embedding a sub-SELECT inside each `INSERT...SELECT...FROM DUAL`:

```sql
INSERT INTO FACT_TOURNAMENT (game_key, ...)
SELECT
    (SELECT game_key FROM DIM_GAME WHERE game_id = ${game_id}),
    ...
FROM DUAL
```

This means the Oracle database itself resolves the surrogate key at insert time, using the natural key that was extracted from MySQL and carried through as a FlowFile attribute. No transformation processor is needed.
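
The same lookup pattern can be tried out locally. The sketch below uses Python's built-in SQLite in place of Oracle, so `AUTOINCREMENT` stands in for the identity column and the outer `SELECT` has no `FROM DUAL`. The table definitions are reduced to a minimum, and the game names, the event name, and the `attributes` dict are placeholders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE DIM_GAME (
        game_key INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
        game_id  INTEGER UNIQUE,                     -- key carried over from the OLTP
        name     TEXT
    );
    CREATE TABLE FACT_TOURNAMENT (game_key INTEGER, event_name TEXT);
    -- Load order differs from OLTP order, so game_key and game_id diverge.
    INSERT INTO DIM_GAME (game_id, name) VALUES (3, 'Game C'), (1, 'Game A');
""")

# Stand-in for the FlowFile attributes produced by EvaluateJsonPath.
attributes = {"game_id": 3, "event_name": "Example event"}

# The sub-SELECT translates the OLTP key into the surrogate key at insert time.
conn.execute(
    """
    INSERT INTO FACT_TOURNAMENT (game_key, event_name)
    SELECT (SELECT game_key FROM DIM_GAME WHERE game_id = :game_id), :event_name
    """,
    attributes,
)
fact_rows = conn.execute("SELECT game_key, event_name FROM FACT_TOURNAMENT").fetchall()
print(fact_rows)
```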

---

## SQL File Layout

```
nifi/sql/
├── extract/
│   ├── 01_dim_date.sql             Recursive CTE to generate calendar rows
│   ├── 02_dim_game.sql             Simple SELECT from game
│   ├── 03_dim_country.sql          Simple SELECT from country
│   ├── 04_dim_organization.sql     SELECT with LEFT JOIN to country for country name
│   ├── 05_fact_tournament.sql      JOIN with schedule and org; computes date keys
│   ├── 06_fact_medal_award.sql     JOIN with tournament; computes medal_points
│   └── 07_fact_club_standing.sql
└── load/
    ├── 01_dim_date.sql             INSERT with TO_DATE conversion for Oracle
    ├── 02_dim_medal.sql            Static 3-row INSERT, run once directly
    ├── 03_dim_game.sql
    ├── 04_dim_country.sql
    ├── 05_dim_organization.sql     NULL-safe EL expressions for optional fields
    ├── 06_fact_tournament.sql      Sub-SELECT key lookups via game_id, org_id
    ├── 07_fact_medal_award.sql     Sub-SELECT key lookups; medal_type lookup
    └── 08_fact_club_standing.sql   Sub-SELECT key lookup via org_id
```

---

## Transformations Performed

Most of the "transformation" work happens in the extract SQL rather than in NiFi processors, which keeps the NiFi flow simple.

| Transformation | Where it happens |
|---|---|
| Date → YYYYMMDD integer key | `CAST(DATE_FORMAT(..., '%Y%m%d') AS UNSIGNED)` in extract SQL |
| Medal → medal_points (Gold=3, Silver=2, Bronze=1) | `CASE` expression in extract SQL |
| Boolean → 0/1 | `CASE WHEN ... = 1 THEN 1 ELSE 0 END` in extract SQL |
| Natural key → surrogate key | Sub-SELECT in Oracle load SQL |
| Duration calculation fallback | `COALESCE(s.duration_days, DATEDIFF(...) + 1)` in extract SQL |
| NULL handling for optional fields | NiFi Expression Language `:isEmpty():ifElse(...)` in load SQL |
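
For readers who want to check the arithmetic behind these expressions, the sketch below restates four of them as plain Python functions. They are illustrative equivalents, not code from the project. The function names are hypothetical, and the assumption that the fallback duration counts both the start and end day follows from the `+ 1` in the SQL.

```python
from datetime import date

# Same mapping as the CASE expression in the medal extract.
MEDAL_POINTS = {"Gold": 3, "Silver": 2, "Bronze": 1}


def date_key(day):
    """Date -> YYYYMMDD integer, like CAST(DATE_FORMAT(...) AS UNSIGNED)."""
    return day.year * 10000 + day.month * 100 + day.day


def duration_days(stored, start, end):
    """Mirror COALESCE(duration_days, DATEDIFF(end, start) + 1)."""
    return stored if stored is not None else (end - start).days + 1


def to_flag(value):
    """Collapse a boolean-like value to 0 or 1, treating NULL as 0."""
    return 1 if value == 1 else 0


print(date_key(date(2025, 7, 8)))
print(duration_days(None, date(2025, 7, 8), date(2025, 7, 13)))
print(MEDAL_POINTS["Silver"], to_flag(True), to_flag(None))
```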
109
docs/06_infrastructure.md
Normal file
@@ -0,0 +1,109 @@
# Infrastructure

## MySQL Container

The MySQL 8.4 OLTP database runs in Docker (or Podman) for easy setup and teardown without requiring a local MySQL installation.

### Starting and stopping

**Linux (Docker):**
```bash
./docker/start.sh   # start
./docker/stop.sh    # stop (data is preserved)
```

**Linux (Podman):**
```bash
./docker/start.sh --podman
./docker/stop.sh --podman
```

**Windows (PowerShell):**
```powershell
./docker/start.ps1
./docker/stop.ps1
```
|
||||||
|
|
||||||
|
### Connection details
|
||||||
|
|
||||||
|
| Property | Value |
|
||||||
|
|---|---|
|
||||||
|
| Host | 127.0.0.1 |
|
||||||
|
| Port | **13306** (non-standard to avoid conflicts with any local MySQL) |
|
||||||
|
| Database | ewc2025 |
|
||||||
|
| User | root |
|
||||||
|
| Password | ewc2025root |
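
Before running the seed script or pointing NiFi at the database, it can help to confirm that something is listening on the mapped port. The following is a minimal sketch using only the Python standard library. The helper name `port_is_open` is hypothetical, and an open port only shows that the container accepts TCP connections, not that MySQL has finished initializing.

```python
import socket


def port_is_open(host: str = "127.0.0.1", port: int = 13306,
                 timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        # create_connection raises OSError (refused, timed out, ...) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    state = "reachable" if port_is_open() else "not reachable"
    print(f"MySQL port 13306 on 127.0.0.1 is {state}")
```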
|
||||||
|
|
||||||
|
### What happens on first start
|
||||||
|
|
||||||
|
When the container is created for the first time, MySQL automatically executes any `.sql` files found in `/docker-entrypoint-initdb.d/`. The start script mounts `sql/schema.sql` there, so the database schema is created automatically — you do not need to run the DDL manually.
|
||||||
|
|
||||||
|
### Data persistence
|
||||||
|
|
||||||
|
Container data is stored in a named Docker/Podman volume (`ewc2025-mysql-data`). Stopping and restarting the container does not lose any data.
|
||||||
|
|
||||||
|
To fully reset the database (drop all data and re-create from scratch):
|
||||||
|
```bash
|
||||||
|
docker rm -f ewc2025-mysql            # -f removes the container even if it is still running
|
||||||
|
docker volume rm ewc2025-mysql-data
|
||||||
|
./docker/start.sh # fresh container, empty schema
|
||||||
|
dotnet run ./scripts/seed.cs
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Seed Script
|
||||||
|
|
||||||
|
The seed script (`scripts/seed.cs`) is a single-file **.NET 10 C# script** — no project file or solution needed.
|
||||||
|
|
||||||
|
### Requirements
|
||||||
|
|
||||||
|
- .NET 10 SDK installed
|
||||||
|
- MySQL container running
|
||||||
|
|
||||||
|
### Running it
|
||||||
|
|
||||||
|
```bash
|
||||||
|
dotnet run ./scripts/seed.cs
|
||||||
|
```
|
||||||
|
|
||||||
|
The script can be run from any directory — it walks up the directory tree to find the `data/` folder automatically.
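
The lookup described above can be sketched as follows. This is a Python illustration of the idea, not the C# code in `scripts/seed.cs`. `find_data_dir` is a hypothetical name, and starting the search from the current working directory is an assumption.

```python
from pathlib import Path
from typing import Optional


def find_data_dir(start: Path, name: str = "data") -> Optional[Path]:
    """Walk from `start` up to the filesystem root and return the first
    directory called `name`, or None if no ancestor contains one."""
    current = start.resolve()
    # Check the starting directory first, then each parent in turn.
    for candidate in (current, *current.parents):
        target = candidate / name
        if target.is_dir():
            return target
    return None


if __name__ == "__main__":
    found = find_data_dir(Path.cwd())
    print(found if found is not None else "no data/ folder found")
```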
|
||||||
|
|
||||||
|
### What it does
|
||||||
|
|
||||||
|
Reads all 10 CSV files and inserts data into MySQL in this order:
|
||||||
|
|
||||||
|
1. `game` — deduplicates game names from file 01
|
||||||
|
2. `country` — from file 08; additional countries auto-inserted as encountered
|
||||||
|
3. `point_system` — from file 09
|
||||||
|
4. `prize_pool_category` — from file 06
|
||||||
|
5. `organization` — collects all org names across files 02–05, 10; enriches with file 04 details
|
||||||
|
6. `tournament` — from file 01
|
||||||
|
7. `schedule` — from file 07; matched to tournaments by game name + start date
|
||||||
|
8. `player` — from file 05
|
||||||
|
9. `medalist` — from file 02
|
||||||
|
10. `match_result` — from file 10
|
||||||
|
11. `club_championship_standing` — from file 03
|
||||||
|
12. `organization_game_competing` — from file 04 (multi-value column split)
|
||||||
|
13. `organization_game_won` — from file 03 (multi-value column split)
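
Steps 12 and 13 turn one multi-value cell into several junction-table rows. Below is a minimal Python sketch of that split. It assumes the games are comma-separated within the cell (the actual delimiter in the CSV files is not stated here), and the helper names and sample row are hypothetical, not taken from the dataset.

```python
from typing import List, Tuple


def split_multi_value(cell: str, sep: str = ",") -> List[str]:
    """Split a multi-value cell, trimming whitespace and dropping
    empty entries and duplicates while keeping the original order."""
    seen = set()
    values = []
    for part in cell.split(sep):
        name = part.strip()
        if name and name not in seen:
            seen.add(name)
            values.append(name)
    return values


def to_junction_rows(org: str, cell: str) -> List[Tuple[str, str]]:
    """Produce one (organization, game) pair per value in the cell."""
    return [(org, game) for game in split_multi_value(cell)]


if __name__ == "__main__":
    # Made-up sample row for illustration only.
    for row in to_junction_rows("Example Org", "Chess, StarCraft II, Chess"):
        print(row)
```

Deduplicating before insert keeps the junction table's composite key from being violated if a game name is repeated in the source cell.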
|
||||||
|
|
||||||
|
Expected output on a clean database:
|
||||||
|
```
|
||||||
|
Connected to MySQL.
|
||||||
|
[1/13] game
|
||||||
|
[2/13] country
|
||||||
|
...
|
||||||
|
[13/13] organization_game_won
|
||||||
|
Done. Database seeded.
|
||||||
|
```
|
||||||
|
|
||||||
|
A `WARN:` line is printed for `Battlegrounds Mobile India` (not an EWC 2025 tournament — expected and harmless).
|
||||||
|
|
||||||
|
### NuGet packages used
|
||||||
|
|
||||||
|
| Package | Purpose |
|
||||||
|
|---|---|
|
||||||
|
| `MySqlConnector 2.3.7` | MySQL database driver |
|
||||||
|
| `CsvHelper 33.0.1` | CSV parsing (handles quoted fields with commas) |
|
||||||
|
|
||||||
|
These are declared at the top of the script with `#:package` directives and restored automatically by `dotnet run`.
|
||||||
97
docs/07_conclusion.md
Normal file
@@ -0,0 +1,97 @@
|
|||||||
|
# Conclusion
|
||||||
|
|
||||||
|
## What Was Built
|
||||||
|
|
||||||
|
This project delivers a complete data warehousing pipeline from raw CSV files to an analytics-ready Data Mart:
|
||||||
|
|
||||||
|
| Deliverable | Details |
|
||||||
|
|---|---|
|
||||||
|
| OLTP schema | 14 tables, 3NF, MySQL 8.4 |
|
||||||
|
| Seed script | .NET 10 single-file C# script, loads ~700 rows across all tables |
|
||||||
|
| Docker setup | One-command start/stop for MySQL, Linux and Windows |
|
||||||
|
| Data Mart schema | 3 fact tables, 5 dimension tables, star schema, Oracle DDL |
|
||||||
|
| NiFi ETL | 8 pipelines, extract SQL + load SQL, documented step-by-step |
|
||||||
|
| Documentation | This docs folder |
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Analytical Potential
|
||||||
|
|
||||||
|
The Data Mart enables a wide range of OLAP analyses. Below are the most interesting ones, mapped to the fact and dimension tables that support them.
|
||||||
|
|
||||||
|
### Prize Money Distribution
|
||||||
|
|
||||||
|
> *"Where did the $100M go?"*
|
||||||
|
|
||||||
|
Using `FACT_TOURNAMENT` sliced by `DIM_GAME`:
|
||||||
|
- Total prize pool by game genre (MOBA, FPS, Battle Royale, etc.)
|
||||||
|
- Average prize pool per participant by platform (PC vs Mobile)
|
||||||
|
- Prize concentration — what percentage of total prize money went to the top 5 tournaments
|
||||||
|
- Relationship between tournament duration and prize pool size
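
The prize concentration measure in the list above is simple arithmetic: the sum of the N largest prize pools divided by the total. Below is a hedged Python sketch. The function name is hypothetical and the figures are invented for illustration, not values from the dataset. In Power BI the same calculation would be a measure over `FACT_TOURNAMENT`.

```python
from typing import Sequence


def top_n_share(prize_pools: Sequence[float], n: int = 5) -> float:
    """Fraction of total prize money held by the n largest pools."""
    total = sum(prize_pools)
    if total == 0:
        return 0.0  # avoid division by zero on an empty or all-zero input
    largest = sorted(prize_pools, reverse=True)[:n]
    return sum(largest) / total


if __name__ == "__main__":
    # Invented figures, for illustration only.
    pools = [50, 30, 10, 5, 3, 2]
    print(top_n_share(pools, n=2))  # 80 of 100 -> 0.8
```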
|
||||||
|
|
||||||
|
### Country and Regional Dominance
|
||||||
|
|
||||||
|
> *"Which part of the world ruled EWC 2025?"*
|
||||||
|
|
||||||
|
Using `FACT_MEDAL_AWARD` sliced by `DIM_COUNTRY` and `DIM_GAME`:
|
||||||
|
- Medal tally by country (gold/silver/bronze breakdown)
|
||||||
|
- Medal points by region (Asia vs Europe vs North America vs Middle East)
|
||||||
|
- Genre specialization — do Asian countries dominate MOBAs? Does Europe lead in FPS?
|
||||||
|
- Countries with the most medals per player (an efficiency metric: medal_count / total_players)
|
||||||
|
|
||||||
|
### Club Championship Performance
|
||||||
|
|
||||||
|
> *"Which clubs built the best all-around teams?"*
|
||||||
|
|
||||||
|
Using `FACT_CLUB_STANDING` sliced by `DIM_ORGANIZATION`:
|
||||||
|
- Points vs prize money correlation — are points a good predictor of earnings?
|
||||||
|
- Tournament breadth vs depth — do clubs that win fewer tournaments but make more top-8 finishes rank higher?
|
||||||
|
- Performance by region — Middle Eastern clubs (Team Falcons, Twisted Minds) vs European clubs
|
||||||
|
- Club partner ROI — do Current club partners finish higher than non-partners on average?
|
||||||
|
|
||||||
|
### Event Timeline Analysis
|
||||||
|
|
||||||
|
> *"How did the event unfold week by week?"*
|
||||||
|
|
||||||
|
Using `FACT_TOURNAMENT` sliced by `DIM_DATE`:
|
||||||
|
- Prize money at stake each week
|
||||||
|
- Which weeks had the most high-value tournaments running simultaneously
|
||||||
|
- Tournament density (how many events overlapped)
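
Tournament density can be computed with a small sweep over start and end dates. The Python sketch below assumes each tournament is a (start, end) pair with both days inclusive. The function name and the sample dates are hypothetical and do not come from the dataset.

```python
from datetime import date, timedelta
from typing import List, Tuple


def max_overlap(tournaments: List[Tuple[date, date]]) -> int:
    """Return the largest number of tournaments running on the same day."""
    events = []
    for start, end in tournaments:
        events.append((start, 1))                       # tournament begins
        events.append((end + timedelta(days=1), -1))    # day after it ends
    running = peak = 0
    # Sorting puts -1 before +1 on the same date, so a tournament that
    # ended yesterday is not counted alongside one that starts today.
    for _, delta in sorted(events):
        running += delta
        peak = max(peak, running)
    return peak


if __name__ == "__main__":
    # Invented schedule, for illustration only.
    sample = [
        (date(2025, 7, 8), date(2025, 7, 10)),
        (date(2025, 7, 10), date(2025, 7, 12)),
        (date(2025, 7, 13), date(2025, 7, 14)),
    ]
    print(max_overlap(sample))
```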
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Suggested Power BI Reports
|
||||||
|
|
||||||
|
### Report 1 — Prize & Tournament Analysis
|
||||||
|
|
||||||
|
A financial overview dashboard with:
|
||||||
|
- **KPI cards:** Total prize pool, number of tournaments, average prize per tournament
|
||||||
|
- **Bar chart:** Prize pool by game_type (filter: platform)
|
||||||
|
- **Treemap:** Prize pool breakdown by individual game
|
||||||
|
- **Line chart:** Cumulative prize money awarded by week (using DIM_DATE.week_number)
|
||||||
|
- **Scatter plot:** Prize pool vs num_participants (are bigger tournaments better funded?)
|
||||||
|
|
||||||
|
Slicers: Platform, Gender, Club Championship Points (Yes/No)
|
||||||
|
|
||||||
|
### Report 2 — Performance & Medal Analysis
|
||||||
|
|
||||||
|
A competitive performance dashboard with:
|
||||||
|
- **Map visual:** Medal points by country (filled map using country name)
|
||||||
|
- **Stacked bar chart:** Gold/Silver/Bronze medals by region
|
||||||
|
- **Matrix:** Organizations × Game genres (medal_count as values — shows which orgs are specialists vs all-rounders)
|
||||||
|
- **Bar chart:** Top 10 organizations by total medal_points
|
||||||
|
- **Table:** Club Championship standings with conditional formatting on total_points and prize_money_usd
|
||||||
|
|
||||||
|
Slicers: Region, Game Type, Medal Type
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Limitations
|
||||||
|
|
||||||
|
**Dataset size** — With only 27 tournaments and 257 medalists, the dataset is small by data warehousing standards. The analyses are valid but a real production data mart would have years of historical data for trend analysis.
|
||||||
|
|
||||||
|
**Player earnings** — The `prize_earned_usd` column in the player roster is zero for all players. Individual prize splits were not publicly available at the time the dataset was compiled, so per-player financial analysis is not possible.
|
||||||
|
|
||||||
|
**Individual vs team events** — Games like Chess, StarCraft II, and the fighting games are individual competitions. Their medal and match data is structured the same way as team events, but the "organization" in those cases is the player's sponsoring team rather than a competing unit. This is a nuance that Power BI visuals should label clearly.
|
||||||
|
|
||||||
|
**Static snapshot** — This is a point-in-time dataset for EWC 2025. The Data Mart has no slowly changing dimension (SCD) logic or historical tracking. It reflects the final state of the event.
|
||||||