From 571c749e25d3a6460632c467569f93a9b5bd57f0 Mon Sep 17 00:00:00 2001 From: StewKI Date: Sun, 17 May 2026 21:27:42 +0200 Subject: [PATCH] docs --- docs/01-overview.md | 98 ++++++++++++++++ docs/02-oltp.md | 258 ++++++++++++++++++++++++++++++++++++++++++ docs/03-datamart.md | 255 +++++++++++++++++++++++++++++++++++++++++ docs/04-setup.md | 181 +++++++++++++++++++++++++++++ docs/05-conclusion.md | 83 ++++++++++++++ 5 files changed, 875 insertions(+) create mode 100644 docs/01-overview.md create mode 100644 docs/02-oltp.md create mode 100644 docs/03-datamart.md create mode 100644 docs/04-setup.md create mode 100644 docs/05-conclusion.md diff --git a/docs/01-overview.md b/docs/01-overview.md new file mode 100644 index 0000000..af82013 --- /dev/null +++ b/docs/01-overview.md @@ -0,0 +1,98 @@ +# Hotel Reservations — Data Warehouse Project + +## Project Summary + +This project implements a complete **Data Warehousing pipeline** for a hotel reservation system, covering all standard DW layers: + +``` +MySQL OLTP ──► Apache NiFi ETL ──► Oracle Data Mart ──► Power BI Reports +(source) (transform) (analytical store) (OLAP queries) +``` + +The system is built around the **A.24 Hotel Reservations** domain from the course specification. The OLTP database was populated with **~635,000 synthetically generated rows** covering 200 hotels, 100,000 guests, 500,000 bookings, and 531,000 room bookings across a 4-year period (2022–2025). + +--- + +## Business Context + +A hotel chain needs to answer questions like: + +- Which countries generate the most revenue per quarter? +- How does occupancy differ between peak and off-peak seasons? +- What is the revenue contribution of 5-star vs 3-star hotels? +- How has a hotel's revenue changed after upgrading its star rating? + +These questions require **historical, multi-dimensional analysis** that a normalized OLTP database cannot serve efficiently. The data mart provides pre-modelled, denormalized data optimized for analytical queries. 
+ +--- + +## Architecture + +``` +┌─────────────────────────────────────────────────────────┐ +│ SOURCE LAYER │ +│ MySQL 8.4 (Docker/Podman, port 13306) │ +│ Database: hotel_reservations │ +│ 13 normalized tables, ~635K rows │ +└───────────────────────┬─────────────────────────────────┘ + │ JDBC (MySQL Connector/J) + ▼ +┌─────────────────────────────────────────────────────────┐ +│ ETL LAYER │ +│ Apache NiFi │ +│ 5 Process Groups: Date Dim / Static Dims / │ +│ SCD2 Hotel / SCD1 Guest / Incremental Fact │ +└───────────────────────┬─────────────────────────────────┘ + │ JDBC (Oracle JDBC) + ▼ +┌─────────────────────────────────────────────────────────┐ +│ DATA MART LAYER │ +│ Oracle (university lab schema) │ +│ Star schema: 6 dimensions + 1 fact table │ +│ SCD Type 2 on DIM_HOTEL │ +└───────────────────────┬─────────────────────────────────┘ + │ + ▼ +┌─────────────────────────────────────────────────────────┐ +│ PRESENTATION LAYER │ +│ Power BI Desktop │ +│ OLAP reports via DirectQuery / Import │ +└─────────────────────────────────────────────────────────┘ +``` + +--- + +## Technology Stack + +| Component | Technology | Version | +|-----------|-----------|---------| +| OLTP Database | MySQL | 8.4 | +| Container runtime | Docker / Podman | — | +| Data generator | C# (.NET) | 10 | +| ETL tool | Apache NiFi | — | +| Data Mart | Oracle RDBMS | university lab | +| Reporting | Power BI Desktop | — | + +--- + +## Repository Structure + +``` +IPZ_1/ +├── docker/ +│ ├── start.sh # Start MySQL container (Linux/macOS) +│ ├── stop.sh # Stop MySQL container +│ ├── start.ps1 # Start MySQL container (Windows) +│ └── stop.ps1 # Stop MySQL container +├── sql/ +│ ├── schema.sql # MySQL OLTP DDL +│ └── datamart_schema.sql # Oracle Data Mart DDL +├── generator/ +│ └── generate.cs # .NET 10 data generator script +└── docs/ + ├── 01-overview.md # This file + ├── 02-oltp.md # OLTP database design + ├── 03-datamart.md # Data mart design + ├── 04-setup.md # Setup and run guide + └── 
nifi-flow.md # NiFi ETL flow reference +``` diff --git a/docs/02-oltp.md b/docs/02-oltp.md new file mode 100644 index 0000000..234e09f --- /dev/null +++ b/docs/02-oltp.md @@ -0,0 +1,258 @@ +# OLTP Database — Design & Details + +## Overview + +The OLTP (Online Transaction Processing) database models a **hotel reservation system** using a fully normalized relational schema in **MySQL 8.4**. It follows 3NF and enforces referential integrity via foreign keys. + +- **Database:** `hotel_reservations` +- **Character set:** `utf8mb4` / `utf8mb4_unicode_ci` +- **Tables:** 13 +- **Total rows:** ~635,000 + +--- + +## Entity-Relationship Model + +The schema covers five entity groups: + +``` +hotel_chain ──┐ +country ───────┼──► hotel ──► hotel_room ──► room_booking ──► booking ──► guest +star_rating ──┘ │ + └──► country +hotel_characteristic ◄──► hotel (M:N via hotel_hotel_characteristic) + +room_type ◄──── hotel_room +room_type ◄──┐ +rate_period ◄─┴── period_room_rate (price per room type per season) +``` + +--- + +## Table Descriptions + +### Reference / Lookup Tables + +#### `hotel_chain` +International hotel chains (Hilton, Marriott, Accor, etc.). + +| Column | Type | Description | +|--------|------|-------------| +| `hotel_chain_id` | INT UNSIGNED PK | Surrogate key | +| `code` | VARCHAR(10) UNIQUE | Short code (e.g. `HLT`) | +| `name` | VARCHAR(100) | Full name | + +**Rows:** 10 + +--- + +#### `country` +Countries from which guests come and where hotels are located. + +| Column | Type | Description | +|--------|------|-------------| +| `country_id` | INT UNSIGNED PK | Surrogate key | +| `code` | CHAR(2) UNIQUE | ISO 3166-1 alpha-2 (e.g. `GB`) | +| `name` | VARCHAR(100) | Country name | +| `currency` | VARCHAR(10) | ISO currency code (e.g. `EUR`) | + +**Rows:** 40 (Europe, Americas, Asia, Africa, Oceania) + +--- + +#### `star_rating` +Hotel classification from 1★ to 5★. 
+ +| Column | Type | Description | +|--------|------|-------------| +| `star_rating_id` | INT UNSIGNED PK | Surrogate key | +| `code` | TINYINT UNIQUE | 1–5 | +| `description` | VARCHAR(20) | e.g. `3 Star` | + +**Rows:** 5 + +--- + +#### `hotel_characteristic` +Amenities and features a hotel may offer. + +| Column | Type | Description | +|--------|------|-------------| +| `characteristic_id` | INT UNSIGNED PK | Surrogate key | +| `code` | VARCHAR(20) UNIQUE | e.g. `POOL`, `SPA`, `WIFI` | +| `description` | VARCHAR(100) | Human-readable label | + +**Rows:** 12 (WiFi, Pool, Gym, Spa, Restaurant, Bar, Parking, Valet, Conference, Shuttle, Room Service, Pet Friendly) + +--- + +#### `room_type` +Types of rooms a hotel can offer, with a standard (base) rate. + +| Column | Type | Description | +|--------|------|-------------| +| `room_type_id` | INT UNSIGNED PK | Surrogate key | +| `code` | VARCHAR(20) UNIQUE | e.g. `SINGLE`, `SUITE` | +| `description` | VARCHAR(100) | e.g. `Junior Suite` | +| `standard_rate` | DECIMAL(10,2) | Base nightly rate (EUR) | +| `smoking_yn` | BOOLEAN | Smoking allowed flag | + +**Rows:** 7 (Single €80, Double €120, Twin €115, Deluxe €180, Suite €280, Executive €450, Family €200) + +--- + +#### `rate_period` +Seasonal pricing periods. Each period maps to a month range and applies a rate multiplier. + +| Column | Type | Description | +|--------|------|-------------| +| `rate_period_id` | INT UNSIGNED PK | Surrogate key | +| `code` | VARCHAR(20) UNIQUE | e.g. 
`PEAK`, `WINTER` | +| `description` | VARCHAR(50) | Human-readable label | +| `month_from` | TINYINT | Start month (1–12) | +| `month_to` | TINYINT | End month (1–12) | + +**Rows:** 4 + +| Code | Period | Months | Multiplier | +|------|--------|--------|-----------| +| PEAK | Peak Season | Jun–Aug | ×1.5 | +| HIGH | High Season | Mar–May | ×1.2 | +| AUTUMN | Autumn Season | Sep–Nov | ×1.1 | +| WINTER | Winter Season | Dec–Feb | ×0.9 | + +--- + +### Junction Tables + +#### `period_room_rate` +The effective nightly rate for each (room_type, rate_period) combination. +Rate = `standard_rate × season_multiplier`. + +| Column | Type | Description | +|--------|------|-------------| +| `room_type_id` | INT UNSIGNED PK/FK | | +| `rate_period_id` | INT UNSIGNED PK/FK | | +| `rate` | DECIMAL(10,2) | Effective nightly rate | + +**Rows:** 28 (7 room types × 4 seasons) + +--- + +#### `hotel_hotel_characteristic` +M:N junction between hotels and their amenities. + +| Column | Type | +|--------|------| +| `hotel_id` | INT UNSIGNED PK/FK | +| `characteristic_id` | INT UNSIGNED PK/FK | + +**Rows:** ~1,415 + +--- + +### Core Entity Tables + +#### `hotel` +Individual hotel properties. + +| Column | Type | Description | +|--------|------|-------------| +| `hotel_id` | INT UNSIGNED PK | | +| `hotel_chain_id` | INT UNSIGNED FK | NULL for independent hotels | +| `country_id` | INT UNSIGNED FK | | +| `star_rating_id` | INT UNSIGNED FK | | +| `code` | VARCHAR(20) UNIQUE | e.g. `HTL0001` | +| `name` | VARCHAR(150) | | +| `address` | VARCHAR(200) | | +| `postcode` | VARCHAR(20) | | +| `city` | VARCHAR(100) | | +| `url` | VARCHAR(200) | | + +**Rows:** 200 (50 cities, star distribution: 5% 1★, 10% 2★, 40% 3★, 30% 4★, 15% 5★) + +--- + +#### `hotel_room` +Individual rooms within each hotel. 
+ +| Column | Type | Description | +|--------|------|-------------| +| `room_id` | INT UNSIGNED PK | | +| `hotel_id` | INT UNSIGNED FK | | +| `room_type_id` | INT UNSIGNED FK | | +| `room_number` | VARCHAR(10) | Format: `{floor}{number}`, e.g. `101` | +| `floor` | TINYINT UNSIGNED | | + +**Rows:** 5,334 (5–60 rooms per hotel depending on star rating) + +--- + +#### `guest` +Hotel guests. + +| Column | Type | Description | +|--------|------|-------------| +| `guest_id` | INT UNSIGNED PK | | +| `country_id` | INT UNSIGNED FK | Guest's home country | +| `name` | VARCHAR(150) | Full name | +| `email` | VARCHAR(150) | Unique synthetic email | +| `address` | VARCHAR(200) | | +| `city` | VARCHAR(100) | | + +**Rows:** 100,000 + +--- + +#### `booking` +A reservation made by a guest at a hotel. One booking can cover multiple rooms. + +| Column | Type | Description | +|--------|------|-------------| +| `booking_id` | INT UNSIGNED PK | | +| `guest_id` | INT UNSIGNED FK | | +| `hotel_id` | INT UNSIGNED FK | | +| `date_from` | DATE | Check-in | +| `date_to` | DATE | Check-out | +| `status` | ENUM | `confirmed`, `cancelled`, `completed`, `no_show` | +| `created_at` | DATETIME | When booking was made | + +**Rows:** 500,000 +**Status distribution:** 80% completed, 10% confirmed, 7% cancelled, 3% no_show +**Date range:** 2022-01-01 – 2025-12-31 +**Seasonal distribution:** June–August heaviest (peak), December–February lightest + +--- + +#### `room_booking` +A specific room assigned within a booking. Stores the rate **as it was at booking time** (snapshot), independent of any future rate changes. 
+ +| Column | Type | Description | +|--------|------|-------------| +| `room_booking_id` | INT UNSIGNED PK | | +| `booking_id` | INT UNSIGNED FK | | +| `room_id` | INT UNSIGNED FK | | +| `date_from` | DATE | | +| `date_to` | DATE | | +| `nightly_rate` | DECIMAL(10,2) | Rate at time of booking | +| `total_amount` | DECIMAL(10,2) | `nightly_rate × nights` | + +**Rows:** 531,382 +**Room count per booking:** 90% single room, 8% two rooms, 2% three rooms + +--- + +## Data Generation + +The database was populated using a **single-file C# script** (`generator/generate.cs`) running on .NET 10, using `MySqlConnector` as the only dependency. + +Key generation decisions: +- **Seasonal booking distribution** via rejection sampling — months Jun–Aug are ~2.7× more likely than Jan–Feb +- **Rate snapshot** — each `room_booking.nightly_rate` is looked up from `period_room_rate` at insert time and stored, not re-computed later +- **Realistic stay lengths** — 30% one night, 25% two nights, 20% three nights, tapering off to 14-night stays +- **Cancelled/no-show bookings** partially skip room assignment (60% of cancellations have no room_booking) + +```bash +# Run generator +dotnet run generator/generate.cs +``` diff --git a/docs/03-datamart.md b/docs/03-datamart.md new file mode 100644 index 0000000..6f05a05 --- /dev/null +++ b/docs/03-datamart.md @@ -0,0 +1,255 @@ +# Data Mart — Design & Details + +## Overview + +The data mart uses a **star schema** stored in an Oracle database (university lab schema). It is optimized for analytical queries against hotel reservation data — revenue analysis, occupancy trends, seasonal patterns, and guest origin breakdowns. 
+ +- **Schema type:** Star schema +- **Dimensions:** 6 (+ date dimension) +- **Fact table:** `FACT_ROOM_BOOKING` +- **Grain:** One row per room_booking (one room, one stay) +- **SCD strategy:** Type 2 on DIM_HOTEL, Type 1 on all others + +--- + +## Star Schema Diagram + +``` + DIM_DATE + (date_key) + │ + ┌───────────┴───────────┐ + │ checkin / checkout │ + │ │ +DIM_HOTEL_CHAIN ◄─ DIM_HOTEL ─► DIM_STAR_RATING + │ │ + │ FACT_ROOM_BOOKING ◄──── DIM_ROOM + │ │ + └───────► DIM_COUNTRY ◄───── DIM_GUEST +``` + +--- + +## Dimension Tables + +### DIM_DATE +Populated once for the range 2020–2030. Used for both check-in and check-out date lookups. + +| Column | Type | Description | +|--------|------|-------------| +| `date_key` | NUMBER(8) PK | YYYYMMDD integer key | +| `full_date` | DATE | Actual date value | +| `year` | NUMBER(4) | | +| `quarter` | NUMBER(1) | 1–4 | +| `month` | NUMBER(2) | 1–12 | +| `month_name` | VARCHAR2(10) | e.g. `January` | +| `week_number` | NUMBER(2) | ISO week number | +| `day_of_month` | NUMBER(2) | | +| `day_name` | VARCHAR2(10) | e.g. `Monday` | +| `is_weekend` | NUMBER(1) | 0/1 | +| `is_business_day` | NUMBER(1) | 0/1 | +| `season` | VARCHAR2(10) | Peak / High / Autumn / Winter | + +Using an integer date key (YYYYMMDD) instead of a DATE FK allows efficient range predicates: `checkin_date_key BETWEEN 20240601 AND 20240831`. + +--- + +### DIM_COUNTRY (SCD Type 1) +Country attributes are stable. If a name or currency ever changes, the row is simply overwritten (no history needed). + +| Column | Type | Description | +|--------|------|-------------| +| `country_key` | NUMBER(10) PK | Surrogate (IDENTITY) | +| `country_id` | NUMBER(10) UNIQUE | Natural key from MySQL | +| `code` | CHAR(2) | ISO alpha-2 | +| `name` | VARCHAR2(100) | | +| `currency` | VARCHAR2(10) | ISO currency code | + +--- + +### DIM_STAR_RATING (SCD Type 1) +Static lookup. Star rating codes 1–5 never change. 
+ +| Column | Type | Description | +|--------|------|-------------| +| `star_rating_key` | NUMBER(10) PK | Surrogate (IDENTITY) | +| `star_rating_id` | NUMBER(10) UNIQUE | Natural key | +| `code` | NUMBER(1) | 1–5 | +| `description` | VARCHAR2(20) | e.g. `4 Star` | + +--- + +### DIM_HOTEL_CHAIN (SCD Type 1) +Chain name/code may be updated (e.g. corporate rebranding), but we do not need a historical record of chain name changes. + +| Column | Type | Description | +|--------|------|-------------| +| `hotel_chain_key` | NUMBER(10) PK | Surrogate (IDENTITY) | +| `hotel_chain_id` | NUMBER(10) UNIQUE | Natural key | +| `code` | VARCHAR2(10) | e.g. `HLT` | +| `name` | VARCHAR2(100) | | + +--- + +### DIM_HOTEL (SCD Type 2) + +This is the most analytically significant dimension and the only one implemented as **Slowly Changing Dimension Type 2**. + +**Why SCD Type 2 here?** + +A hotel's star rating or chain affiliation can change over time — a property gets renovated and reclassified from 3★ to 4★, or switches from one international chain to another. These changes directly affect revenue analysis: a 3★ hotel charges different rates than a 4★ hotel, and grouping all historical bookings under the current star rating would produce misleading averages. + +SCD Type 2 preserves history by creating a **new row** for each version of a hotel, while expiring the old row with an `expiry_date`. The fact table's `hotel_key` always points to the version that was active **at check-in date**, never to the current version if it changed. 
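The expire-and-insert behaviour described above can be sketched in a few lines of Python. This is only an illustration of the SCD Type 2 idea, not the project's NiFi/SQL implementation: the `HotelVersion` class and the `apply_change` and `resolve_hotel_key` helpers are hypothetical names, and the sketch assumes `expiry_date` holds the last day a version was active. The sample values model a hotel with source id 42 that is reclassified from 3★ to 4★ on 2024-06-01.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional


@dataclass
class HotelVersion:
    """One DIM_HOTEL row (hypothetical in-memory stand-in)."""
    hotel_key: int                      # surrogate key
    source_hotel_id: int                # natural key from MySQL
    star_rating: int
    effective_date: date
    expiry_date: Optional[date] = None  # None = current version
    is_current: int = 1


def apply_change(versions: List[HotelVersion], source_hotel_id: int,
                 new_star_rating: int, change_date: date) -> HotelVersion:
    """Expire the current version and append a new one (SCD Type 2)."""
    current = next(v for v in versions
                   if v.source_hotel_id == source_hotel_id and v.is_current == 1)
    if current.star_rating == new_star_rating:
        return current  # attribute unchanged, no new version needed
    # Assumption: the old version stays valid up to the day before the change.
    current.expiry_date = change_date - timedelta(days=1)
    current.is_current = 0
    new_version = HotelVersion(
        hotel_key=max(v.hotel_key for v in versions) + 1,
        source_hotel_id=source_hotel_id,
        star_rating=new_star_rating,
        effective_date=change_date,
    )
    versions.append(new_version)
    return new_version


def resolve_hotel_key(versions: List[HotelVersion], source_hotel_id: int,
                      checkin: date) -> int:
    """Return the hotel_key of the version that was active on the check-in date."""
    for v in versions:
        if (v.source_hotel_id == source_hotel_id
                and v.effective_date <= checkin
                and (v.expiry_date is None or checkin <= v.expiry_date)):
            return v.hotel_key
    raise LookupError(f"no version of hotel {source_hotel_id} covers {checkin}")


versions = [HotelVersion(1, 42, 3, date(2022, 1, 1))]
apply_change(versions, 42, 4, date(2024, 6, 1))
print(resolve_hotel_key(versions, 42, date(2023, 7, 15)))  # 1 (the 3-star version)
print(resolve_hotel_key(versions, 42, date(2024, 6, 1)))   # 2 (the 4-star version)
```

The lookup by check-in date is what keeps historical fact rows attached to the old version, so revenue grouped by star rating is not rewritten when a hotel is reclassified.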
+ +| Column | Type | Description | +|--------|------|-------------| +| `hotel_key` | NUMBER(10) PK | Surrogate (IDENTITY) | +| `source_hotel_id` | NUMBER(10) | Natural key from MySQL | +| `hotel_chain_key` | NUMBER(10) FK | NULL for independent hotels | +| `country_key` | NUMBER(10) FK | | +| `star_rating_key` | NUMBER(10) FK | | +| `code` | VARCHAR2(20) | | +| `name` | VARCHAR2(150) | | +| `city` | VARCHAR2(100) | | +| `effective_date` | DATE | When this version became active | +| `expiry_date` | DATE | Last day this version was active (NULL = current) | +| `is_current` | NUMBER(1) | 1 = current version | + +**SCD2 example:** + +| hotel_key | source_hotel_id | star_rating | effective_date | expiry_date | is_current | +|-----------|----------------|-------------|----------------|-------------|-----------| +| 1 | 42 | 3★ | 2022-01-01 | 2024-05-31 | 0 | +| 2 | 42 | 4★ | 2024-06-01 | NULL | 1 | + +Bookings with a check-in date on or before 2024-05-31 point to `hotel_key=1`; bookings with a check-in date of 2024-06-01 or later point to `hotel_key=2`. Revenue by star category remains historically correct. + +--- + +### DIM_ROOM (SCD Type 1) +Room type is stable for our dataset. Rows are updated via MERGE if room details ever change. + +| Column | Type | Description | +|--------|------|-------------| +| `room_key` | NUMBER(10) PK | Surrogate (IDENTITY) | +| `room_id` | NUMBER(10) UNIQUE | Natural key | +| `hotel_key` | NUMBER(10) FK | Points to current DIM_HOTEL version | +| `room_number` | VARCHAR2(10) | | +| `floor` | NUMBER(3) | | +| `room_type_code` | VARCHAR2(20) | e.g. `SUITE` | +| `room_type_desc` | VARCHAR2(100) | | +| `smoking_yn` | NUMBER(1) | | +| `standard_rate` | NUMBER(10,2) | Base rate from OLTP | + +--- + +### DIM_GUEST (SCD Type 1) +Guest personal data (city, country) may change, but tracking historical addresses has no analytical value for this domain. MERGE (upsert) is used. 
+ +| Column | Type | Description | +|--------|------|-------------| +| `guest_key` | NUMBER(10) PK | Surrogate (IDENTITY) | +| `guest_id` | NUMBER(10) UNIQUE | Natural key | +| `country_key` | NUMBER(10) FK | Home country | +| `name` | VARCHAR2(150) | | +| `city` | VARCHAR2(100) | | + +--- + +## Fact Table: FACT_ROOM_BOOKING + +**Grain:** One row per room_booking — one specific room, for one stay. + +| Column | Type | Description | +|--------|------|-------------| +| `fact_id` | NUMBER(10) PK | Surrogate (IDENTITY) | +| `source_rb_id` | NUMBER(10) UNIQUE | Natural key — used for idempotent incremental loads | +| `hotel_key` | NUMBER(10) FK | SCD2-resolved hotel version at check-in | +| `hotel_chain_key` | NUMBER(10) FK | Denormalized from DIM_HOTEL for convenience | +| `room_key` | NUMBER(10) FK | | +| `guest_key` | NUMBER(10) FK | | +| `country_key` | NUMBER(10) FK | Guest's country — denormalized | +| `star_rating_key` | NUMBER(10) FK | Denormalized from DIM_HOTEL for convenience | +| `checkin_date_key` | NUMBER(8) FK | YYYYMMDD | +| `checkout_date_key` | NUMBER(8) FK | YYYYMMDD | +| `booking_status` | VARCHAR2(20) | Degenerate dimension: confirmed/completed/cancelled/no_show | +| `nights_stayed` | NUMBER(4) | checkout − checkin in days | +| `nightly_rate` | NUMBER(10,2) | Rate per night at time of booking | +| `total_amount` | NUMBER(12,2) | `nightly_rate × nights_stayed` | + +### Measures + +| Measure | Type | Aggregation | +|---------|------|-------------| +| `nights_stayed` | Additive | SUM, AVG | +| `nightly_rate` | Non-additive | AVG only (a per-night rate cannot be meaningfully summed across rows) | +| `total_amount` | Additive | SUM (main revenue measure) | + +### Degenerate Dimensions +`booking_status` is stored directly on the fact row. Splitting it into a separate dimension table would add a table with only 4 rows and no other attributes — not worth the JOIN overhead. 
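The derived columns of `FACT_ROOM_BOOKING` described above (the YYYYMMDD date keys, `nights_stayed`, and `total_amount`) follow from simple arithmetic. The Python sketch below shows that arithmetic for a single fact row. The function names `date_key` and `fact_measures` are hypothetical, and the sample stay and rate are illustrative values, not rows from the dataset.

```python
from datetime import date
from decimal import Decimal


def date_key(d: date) -> int:
    """Encode a date as a YYYYMMDD integer, e.g. 2024-06-28 -> 20240628."""
    return d.year * 10000 + d.month * 100 + d.day


def fact_measures(checkin: date, checkout: date, nightly_rate: Decimal) -> dict:
    """Compute the derived columns of one fact row from its source values."""
    nights = (checkout - checkin).days  # checkout minus checkin, in days
    return {
        "checkin_date_key": date_key(checkin),
        "checkout_date_key": date_key(checkout),
        "nights_stayed": nights,
        "nightly_rate": nightly_rate,
        "total_amount": nightly_rate * nights,  # nightly_rate x nights_stayed
    }


# Illustrative stay: three nights at 180.00 per night.
row = fact_measures(date(2024, 6, 28), date(2024, 7, 1), Decimal("180.00"))
print(row["checkin_date_key"], row["nights_stayed"], row["total_amount"])
# 20240628 3 540.00
```

`Decimal` is used instead of `float` so that the amount keeps exact two-decimal precision, in the same spirit as the `NUMBER(10,2)` and `NUMBER(12,2)` columns.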
+ +--- + +## ETL Control Tables + +### ETL_WATERMARK +Tracks the highest `room_booking_id` already loaded into the fact table, enabling incremental loads without re-reading the entire source. + +| Column | Description | +|--------|-------------| +| `entity_name` | Logical entity name (e.g. `FACT_ROOM_BOOKING`) | +| `last_key` | Highest PK value loaded so far | +| `last_run_ts` | Timestamp of the last ETL run | + +### STG_HOTEL +Staging table used by the SCD2 ETL process. NiFi loads raw hotel data from MySQL here, then SQL applies the expire-and-insert SCD2 logic in a single transaction. Truncated at the start of each ETL run. + +--- + +## Sample Analytical Queries + +### Revenue by country and quarter +```sql +SELECT + c.name AS country, + d.year, + d.quarter, + SUM(f.total_amount) AS revenue, + SUM(f.nights_stayed) AS room_nights +FROM FACT_ROOM_BOOKING f +JOIN DIM_DATE d ON d.date_key = f.checkin_date_key +JOIN DIM_GUEST g ON g.guest_key = f.guest_key +JOIN DIM_COUNTRY c ON c.country_key = g.country_key +WHERE f.booking_status = 'completed' +GROUP BY c.name, d.year, d.quarter +ORDER BY revenue DESC; +``` + +### Average nightly rate and revenue per star category (correct because of SCD2) +```sql +SELECT + sr.code AS stars, + d.season, + AVG(f.nightly_rate) AS avg_nightly_rate, + SUM(f.total_amount) AS total_revenue +FROM FACT_ROOM_BOOKING f +JOIN DIM_HOTEL h ON h.hotel_key = f.hotel_key +JOIN DIM_STAR_RATING sr ON sr.star_rating_key = h.star_rating_key +JOIN DIM_DATE d ON d.date_key = f.checkin_date_key +GROUP BY sr.code, d.season +ORDER BY sr.code, d.season; +``` + +### Top 10 cities by occupancy (room-nights) +```sql +SELECT + h.city, + SUM(f.nights_stayed) AS room_nights, + SUM(f.total_amount) AS revenue +FROM FACT_ROOM_BOOKING f +JOIN DIM_HOTEL h ON h.hotel_key = f.hotel_key +WHERE f.booking_status IN ('completed','confirmed') +GROUP BY h.city +ORDER BY room_nights DESC +FETCH FIRST 10 ROWS ONLY; +``` diff --git a/docs/04-setup.md b/docs/04-setup.md new file mode 100644 index 
0000000..c80e1bf --- /dev/null +++ b/docs/04-setup.md @@ -0,0 +1,181 @@ +# Setup Guide + +## Prerequisites + +| Tool | Required for | Notes | +|------|-------------|-------| +| Docker or Podman | MySQL container | Use `--podman` flag on Linux | +| .NET 10 SDK | Data generator | `dotnet run file.cs` support | +| Apache NiFi | ETL | Running instance with Oracle + MySQL JDBC drivers | +| Oracle JDBC driver | NiFi | `ojdbc11.jar` in NiFi's lib directory | +| MySQL JDBC driver | NiFi | `mysql-connector-j-*.jar` in NiFi's lib directory | +| Oracle DB access | Data mart target | University lab credentials | + +--- + +## Step 1 — Start MySQL Container + +**Linux / macOS (Docker):** +```bash +bash docker/start.sh +``` + +**Linux / macOS (Podman):** +```bash +bash docker/start.sh --podman +``` + +**Windows (PowerShell):** +```powershell +.\docker\start.ps1 +``` + +The script: +- Creates a named container `hotel-mysql` with a persistent data volume +- Mounts `sql/schema.sql` as an init script — all 13 tables are created automatically on first start +- Waits until MySQL is ready before exiting + +**Connection details:** +``` +Host: 127.0.0.1 +Port: 13306 +Database: hotel_reservations +User: root +Password: hotel2025root +``` + +--- + +## Step 2 — Generate OLTP Data + +```bash +dotnet run generator/generate.cs +``` + +**Runtime:** ~3 minutes +**Output:** 635,000+ rows across 13 tables + +The generator is deterministic (fixed seed `42`) — running it twice on an empty database produces the same data. + +> **Important:** Run the generator only once on an empty database. If you need to restart, truncate all tables first (respecting FK order) or drop and recreate the container + volume. 
+ +### Quick table verification after generation: +```bash +# Note: for InnoDB, table_rows is an estimate; run SELECT COUNT(*) on a table for an exact figure. +# Docker +docker exec hotel-mysql mysql -uroot -photel2025root hotel_reservations \ + -e "SELECT table_name, table_rows FROM information_schema.tables WHERE table_schema='hotel_reservations';" + +# Podman +podman exec hotel-mysql mysql -uroot -photel2025root hotel_reservations \ + -e "SELECT table_name, table_rows FROM information_schema.tables WHERE table_schema='hotel_reservations';" +``` + +--- + +## Step 3 — Prepare Oracle Data Mart + +Connect to the Oracle schema (university lab) and execute `sql/datamart_schema.sql`. + +The script creates: +- `ETL_WATERMARK` (with initial row for `FACT_ROOM_BOOKING`) +- `STG_HOTEL` (staging) +- All 7 dimension tables +- `FACT_ROOM_BOOKING` + +```sql +-- Run in SQL*Plus or SQL Developer: +@datamart_schema.sql +``` + +--- + +## Step 4 — Configure NiFi + +### 4.1 Add JDBC drivers to NiFi + +Copy the following JARs to `$NIFI_HOME/lib/` (or the NiFi extensions directory): +- `mysql-connector-j-8.x.jar` +- `ojdbc11.jar` + +Restart NiFi after adding drivers. + +### 4.2 Create Controller Services + +In NiFi UI → Controller Settings → Controller Services: + +**MySQL connection:** +- Type: `DBCPConnectionPool` +- Database Driver Class Name: `com.mysql.cj.jdbc.Driver` +- Database Connection URL: `jdbc:mysql://127.0.0.1:13306/hotel_reservations` +- Database User: `root` +- Password: `hotel2025root` + +**Oracle connection:** +- Type: `DBCPConnectionPool` +- Database Driver Class Name: `oracle.jdbc.OracleDriver` +- Database Connection URL: `jdbc:oracle:thin:@<host>:1521:<SID>` +- Database User: `<username>` +- Password: `<password>` + +Enable both services. + +### 4.3 Build Process Groups + +Follow the detailed processor configuration in `docs/nifi-flow.md`. + +**Recommended build order:** +1. PG-1: Date Dimension (simplest, test first) +2. PG-2: Static Dimensions (verify MERGE logic) +3. PG-3: DIM_HOTEL SCD2 (most complex; check the staging table after the run) +4. PG-4: DIM_GUEST SCD1 +5. 
PG-5: Fact Incremental Load + +--- + +## Step 5 — Run ETL + +### First full load + +1. Run **PG-1** (Date Dimension) manually — run once +2. Start **PG-2, PG-3, PG-4** — these are idempotent, safe to re-run +3. Start **PG-5** — runs incrementally; first run loads all 531k room_bookings + +### Verify load + +```sql +-- Oracle +SELECT COUNT(*) FROM DIM_HOTEL; -- should be 200 (+ more after SCD2 changes) +SELECT COUNT(*) FROM DIM_GUEST; -- 100,000 +SELECT COUNT(*) FROM FACT_ROOM_BOOKING; -- 531,382 +SELECT last_key FROM ETL_WATERMARK WHERE entity_name = 'FACT_ROOM_BOOKING'; -- 531,382 +``` + +### Verify SCD2 is working + +```sql +-- Should show 1 current version per hotel on initial load +SELECT is_current, COUNT(*) FROM DIM_HOTEL GROUP BY is_current; +-- Expected: IS_CURRENT=1, COUNT=200 +``` + +--- + +## Stop / Restart + +**Stop MySQL (preserves data):** +```bash +bash docker/stop.sh [--podman] +``` + +**Restart MySQL:** +```bash +bash docker/start.sh [--podman] +``` + +**Full reset (delete all data):** +```bash +bash docker/stop.sh --podman +podman volume rm hotel-mysql-data +bash docker/start.sh --podman +dotnet run generator/generate.cs +``` diff --git a/docs/05-conclusion.md b/docs/05-conclusion.md new file mode 100644 index 0000000..1ccdc9c --- /dev/null +++ b/docs/05-conclusion.md @@ -0,0 +1,83 @@ +# Conclusion + +## What Was Built + +This project delivers a complete, working **Data Warehouse pipeline** for the Hotel Reservations domain: + +| Layer | What was built | Scale | +|-------|---------------|-------| +| OLTP | MySQL 8.4, 13-table normalized schema | ~635,000 rows | +| Data generation | .NET 10 C# script, realistic seasonal distribution | 500K bookings in ~3 min | +| ETL | Apache NiFi, 5 process groups | full + incremental loads | +| Data Mart | Oracle star schema, SCD Type 2 on DIM_HOTEL | 1 fact + 6 dims | + +--- + +## Design Decisions + +### Synthetic data generation instead of a Kaggle dataset + +The decision to generate data rather than use a 
pre-existing dataset was deliberate. Publicly available hotel datasets are either too small (thousands of rows) or lack the normalized relational structure needed to demonstrate a realistic OLTP-to-DW pipeline. The generator produces statistically realistic data: + +- Seasonal booking distribution (summer peak, winter trough) +- Realistic stay-length distribution (30% one-night stays) +- Varied status distribution (80% completed, 10% confirmed, 7% cancelled, 3% no-show) +- Revenue rates tied to actual seasonal pricing periods + +### SCD Type 2 on DIM_HOTEL only + +SCD Type 2 adds operational complexity — it requires staging tables, a two-phase SQL update, and SCD2-aware fact inserts. Applying it to every dimension would make the ETL unnecessarily complex for the analytical benefit gained. + +DIM_HOTEL is the right candidate because: +- Star rating changes (3★→4★ after renovation) directly affect revenue benchmarks +- Chain affiliation changes (hotel joins or leaves a franchise) affect chain-level reporting +- Tracking these historically is the core value proposition of dimensional modelling + +Guests, countries, room types, and hotel chains all change rarely or in ways that don't affect historical analysis — SCD Type 1 (overwrite) is appropriate. + +### Watermark-based incremental fact loading + +The fact table uses `source_rb_id` (the MySQL `room_booking_id`) as a natural key and applies a `NOT EXISTS` guard on every insert. Combined with the `ETL_WATERMARK` table, this makes PG-5 both **incremental** (only processes new rows) and **idempotent** (safe to re-run without creating duplicates). This pattern is production-standard and would scale cleanly to a real operational system. + +### Integer date keys in DIM_DATE + +`date_key` is stored as `NUMBER(8)` in YYYYMMDD format rather than a FK to a DATE column. 
This allows: +- Fast range predicates: `WHERE checkin_date_key BETWEEN 20240601 AND 20240831` +- No JOIN to get the date value when it's used directly in GROUP BY +- Human-readable values in query results without formatting + +--- + +## Analytical Capabilities + +The data mart enables the following categories of OLAP queries: + +**Revenue analysis:** +- Total revenue by country, city, hotel chain, star category +- Revenue trend over time (monthly, quarterly, yearly) +- Revenue split by booking status and room type + +**Occupancy analysis:** +- Room-nights sold per hotel, per season +- Average stay duration by guest country +- Cancellation rates by period and hotel category + +**SCD2-specific analysis:** +- Compare revenue performance of hotels before and after star rating upgrade +- Identify which hotel version (chain affiliation) was more profitable + +**Guest origin analysis:** +- Which countries generate the most bookings and revenue +- Cross-country booking patterns (guest country vs hotel country) + +--- + +## Limitations and Possible Extensions + +| Limitation | Possible extension | +|------------|-------------------| +| Static OLTP data (no live updates) | Add a NiFi timer to simulate ongoing bookings | +| No SCD2 on DIM_ROOM | Add room type tracking for renovation analysis | +| Single fact table | Add a second fact table for daily hotel occupancy (snapshot fact) | +| No data quality checks in NiFi | Add RouteOnAttribute + dead-letter queue for failed records | +| Oracle target is university lab | Package with Oracle XE Docker container for self-contained demo |