This commit is contained in:
2026-05-17 21:27:42 +02:00
parent 718407d709
commit 571c749e25
5 changed files with 875 additions and 0 deletions

255
docs/03-datamart.md Normal file
View File

@@ -0,0 +1,255 @@
# Data Mart — Design & Details
## Overview
The data mart uses a **star schema** stored in an Oracle database (university lab schema). It is optimized for analytical queries against hotel reservation data — revenue analysis, occupancy trends, seasonal patterns, and guest origin breakdowns.
- **Schema type:** Star schema
- **Dimensions:** 6 (+ date dimension)
- **Fact table:** `FACT_ROOM_BOOKING`
- **Grain:** One row per room_booking (one room, one stay)
- **SCD strategy:** Type 2 on DIM_HOTEL, Type 1 on all others
---
## Star Schema Diagram
```
DIM_DATE
(date_key)
┌───────────┴───────────┐
│ checkin / checkout │
│ │
DIM_HOTEL_CHAIN ◄─ DIM_HOTEL ─► DIM_STAR_RATING
│ │
│ FACT_ROOM_BOOKING ◄──── DIM_ROOM
│ │
└───────► DIM_COUNTRY ◄───── DIM_GUEST
```
---
## Dimension Tables
### DIM_DATE
Populated once for the range 20202030. Used for both check-in and check-out date lookups.
| Column | Type | Description |
|--------|------|-------------|
| `date_key` | NUMBER(8) PK | YYYYMMDD integer key |
| `full_date` | DATE | Actual date value |
| `year` | NUMBER(4) | |
| `quarter` | NUMBER(1) | 14 |
| `month` | NUMBER(2) | 112 |
| `month_name` | VARCHAR2(10) | e.g. `January` |
| `week_number` | NUMBER(2) | ISO week number |
| `day_of_month` | NUMBER(2) | |
| `day_name` | VARCHAR2(10) | e.g. `Monday` |
| `is_weekend` | NUMBER(1) | 0/1 |
| `is_business_day` | NUMBER(1) | 0/1 |
| `season` | VARCHAR2(10) | Peak / High / Autumn / Winter |
Using an integer date key (YYYYMMDD) instead of a DATE FK allows efficient range predicates: `checkin_date_key BETWEEN 20240601 AND 20240831`.
---
### DIM_COUNTRY (SCD Type 1)
Country attributes are stable. If a name or currency ever changes, the row is simply overwritten (no history needed).
| Column | Type | Description |
|--------|------|-------------|
| `country_key` | NUMBER(10) PK | Surrogate (IDENTITY) |
| `country_id` | NUMBER(10) UNIQUE | Natural key from MySQL |
| `code` | CHAR(2) | ISO alpha-2 |
| `name` | VARCHAR2(100) | |
| `currency` | VARCHAR2(10) | ISO currency code |
---
### DIM_STAR_RATING (SCD Type 1)
Static lookup. Star rating codes 15 never change.
| Column | Type | Description |
|--------|------|-------------|
| `star_rating_key` | NUMBER(10) PK | Surrogate (IDENTITY) |
| `star_rating_id` | NUMBER(10) UNIQUE | Natural key |
| `code` | NUMBER(1) | 15 |
| `description` | VARCHAR2(20) | e.g. `4 Star` |
---
### DIM_HOTEL_CHAIN (SCD Type 1)
Chain name/code may be updated (e.g. corporate rebranding), but we do not need a historical record of chain name changes.
| Column | Type | Description |
|--------|------|-------------|
| `hotel_chain_key` | NUMBER(10) PK | Surrogate (IDENTITY) |
| `hotel_chain_id` | NUMBER(10) UNIQUE | Natural key |
| `code` | VARCHAR2(10) | e.g. `HLT` |
| `name` | VARCHAR2(100) | |
---
### DIM_HOTEL (SCD Type 2)
This is the most analytically significant dimension and the only one implemented as **Slowly Changing Dimension Type 2**.
**Why SCD Type 2 here?**
A hotel's star rating or chain affiliation can change over time — a property gets renovated and reclassified from 3★ to 4★, or switches from one international chain to another. These changes directly affect revenue analysis: a 3★ hotel charges different rates than a 4★ hotel, and grouping all historical bookings under the current star rating would produce misleading averages.
SCD Type 2 preserves history by creating a **new row** for each version of a hotel, while expiring the old row with an `expiry_date`. The fact table's `hotel_key` always points to the version that was active **at check-in date**, never to the current version if it changed.
| Column | Type | Description |
|--------|------|-------------|
| `hotel_key` | NUMBER(10) PK | Surrogate (IDENTITY) |
| `source_hotel_id` | NUMBER(10) | Natural key from MySQL |
| `hotel_chain_key` | NUMBER(10) FK | NULL for independent hotels |
| `country_key` | NUMBER(10) FK | |
| `star_rating_key` | NUMBER(10) FK | |
| `code` | VARCHAR2(20) | |
| `name` | VARCHAR2(150) | |
| `city` | VARCHAR2(100) | |
| `effective_date` | DATE | When this version became active |
| `expiry_date` | DATE | When this version was superseded (NULL = current) |
| `is_current` | NUMBER(1) | 1 = current version |
**SCD2 example:**
| hotel_key | source_hotel_id | star_rating | effective_date | expiry_date | is_current |
|-----------|----------------|-------------|----------------|-------------|-----------|
| 1 | 42 | 3★ | 2022-01-01 | 2024-05-31 | 0 |
| 2 | 42 | 4★ | 2024-06-01 | NULL | 1 |
Bookings from 20222024 point to `hotel_key=1`, bookings from 2024 onward point to `hotel_key=2`. Revenue by star category remains historically correct.
---
### DIM_ROOM (SCD Type 1)
Room type is stable for our dataset. Updated via MERGE if room details ever change.
| Column | Type | Description |
|--------|------|-------------|
| `room_key` | NUMBER(10) PK | Surrogate (IDENTITY) |
| `room_id` | NUMBER(10) UNIQUE | Natural key |
| `hotel_key` | NUMBER(10) FK | Points to current DIM_HOTEL version |
| `room_number` | VARCHAR2(10) | |
| `floor` | NUMBER(3) | |
| `room_type_code` | VARCHAR2(20) | e.g. `SUITE` |
| `room_type_desc` | VARCHAR2(100) | |
| `smoking_yn` | NUMBER(1) | |
| `standard_rate` | NUMBER(10,2) | Base rate from OLTP |
---
### DIM_GUEST (SCD Type 1)
Guest personal data (city, country) may change, but tracking historical addresses has no analytical value for this domain. MERGE (upsert) is used.
| Column | Type | Description |
|--------|------|-------------|
| `guest_key` | NUMBER(10) PK | Surrogate (IDENTITY) |
| `guest_id` | NUMBER(10) UNIQUE | Natural key |
| `country_key` | NUMBER(10) FK | Home country |
| `name` | VARCHAR2(150) | |
| `city` | VARCHAR2(100) | |
---
## Fact Table: FACT_ROOM_BOOKING
**Grain:** One row per room_booking — one specific room, for one stay.
| Column | Type | Description |
|--------|------|-------------|
| `fact_id` | NUMBER(10) PK | Surrogate (IDENTITY) |
| `source_rb_id` | NUMBER(10) UNIQUE | Natural key — used for idempotent incremental loads |
| `hotel_key` | NUMBER(10) FK | SCD2-resolved hotel version at check-in |
| `hotel_chain_key` | NUMBER(10) FK | Denormalized from DIM_HOTEL for convenience |
| `room_key` | NUMBER(10) FK | |
| `guest_key` | NUMBER(10) FK | |
| `country_key` | NUMBER(10) FK | Guest's country — denormalized |
| `star_rating_key` | NUMBER(10) FK | Denormalized from DIM_HOTEL for convenience |
| `checkin_date_key` | NUMBER(8) FK | YYYYMMDD |
| `checkout_date_key` | NUMBER(8) FK | YYYYMMDD |
| `booking_status` | VARCHAR2(20) | Degenerate dimension: confirmed/completed/cancelled/no_show |
| `nights_stayed` | NUMBER(4) | checkout checkin in days |
| `nightly_rate` | NUMBER(10,2) | Rate per night at time of booking |
| `total_amount` | NUMBER(12,2) | `nightly_rate × nights_stayed` |
### Measures
| Measure | Type | Aggregation |
|---------|------|-------------|
| `nights_stayed` | Additive | SUM, AVG |
| `nightly_rate` | Semi-additive | AVG (not SUM — rate doesn't add across rooms meaningfully) |
| `total_amount` | Additive | SUM (main revenue measure) |
### Degenerate Dimensions
`booking_status` is stored directly on the fact row. Splitting it into a separate dimension table would add a table with only 4 rows and no other attributes — not worth the JOIN overhead.
---
## ETL Control Tables
### ETL_WATERMARK
Tracks the highest `room_booking_id` already loaded into the fact table, enabling incremental loads without re-reading the entire source.
| Column | Description |
|--------|-------------|
| `entity_name` | Logical entity name (e.g. `FACT_ROOM_BOOKING`) |
| `last_key` | Highest PK value loaded so far |
| `last_run_ts` | Timestamp of the last ETL run |
### STG_HOTEL
Staging table used by the SCD2 ETL process. NiFi loads raw hotel data from MySQL here, then SQL applies the expire-and-insert SCD2 logic in a single transaction. Truncated at the start of each ETL run.
---
## Sample Analytical Queries
### Revenue by country and quarter
```sql
SELECT
c.name AS country,
d.year,
d.quarter,
SUM(f.total_amount) AS revenue,
COUNT(*) AS room_nights
FROM FACT_ROOM_BOOKING f
JOIN DIM_DATE d ON d.date_key = f.checkin_date_key
JOIN DIM_GUEST g ON g.guest_key = f.guest_key
JOIN DIM_COUNTRY c ON c.country_key = g.country_key
WHERE f.booking_status = 'completed'
GROUP BY c.name, d.year, d.quarter
ORDER BY revenue DESC;
```
### Average revenue per star category (correct because of SCD2)
```sql
SELECT
sr.code AS stars,
d.season,
AVG(f.nightly_rate) AS avg_nightly_rate,
SUM(f.total_amount) AS total_revenue
FROM FACT_ROOM_BOOKING f
JOIN DIM_HOTEL h ON h.hotel_key = f.hotel_key
JOIN DIM_STAR_RATING sr ON sr.star_rating_key = f.star_rating_key
JOIN DIM_DATE d ON d.date_key = f.checkin_date_key
GROUP BY sr.code, d.season
ORDER BY sr.code, d.season;
```
### Top 10 cities by occupancy (room-nights)
```sql
SELECT
h.city,
SUM(f.nights_stayed) AS room_nights,
SUM(f.total_amount) AS revenue
FROM FACT_ROOM_BOOKING f
JOIN DIM_HOTEL h ON h.hotel_key = f.hotel_key
WHERE f.booking_status IN ('completed','confirmed')
GROUP BY h.city
ORDER BY room_nights DESC
FETCH FIRST 10 ROWS ONLY;
```