Files
IPZ_1/docs/03-datamart.md
2026-05-17 21:27:42 +02:00

256 lines
9.4 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Data Mart — Design & Details
## Overview
The data mart uses a **star schema** stored in an Oracle database (university lab schema). It is optimized for analytical queries against hotel reservation data — revenue analysis, occupancy trends, seasonal patterns, and guest origin breakdowns.
- **Schema type:** Star schema
- **Dimensions:** 6 (+ date dimension)
- **Fact table:** `FACT_ROOM_BOOKING`
- **Grain:** One row per room_booking (one room, one stay)
- **SCD strategy:** Type 2 on DIM_HOTEL, Type 1 on all others
---
## Star Schema Diagram
```
DIM_DATE
(date_key)
┌───────────┴───────────┐
│ checkin / checkout │
│ │
DIM_HOTEL_CHAIN ◄─ DIM_HOTEL ─► DIM_STAR_RATING
│ │
│ FACT_ROOM_BOOKING ◄──── DIM_ROOM
│ │
└───────► DIM_COUNTRY ◄───── DIM_GUEST
```
---
## Dimension Tables
### DIM_DATE
Populated once for the range 20202030. Used for both check-in and check-out date lookups.
| Column | Type | Description |
|--------|------|-------------|
| `date_key` | NUMBER(8) PK | YYYYMMDD integer key |
| `full_date` | DATE | Actual date value |
| `year` | NUMBER(4) | |
| `quarter` | NUMBER(1) | 14 |
| `month` | NUMBER(2) | 112 |
| `month_name` | VARCHAR2(10) | e.g. `January` |
| `week_number` | NUMBER(2) | ISO week number |
| `day_of_month` | NUMBER(2) | |
| `day_name` | VARCHAR2(10) | e.g. `Monday` |
| `is_weekend` | NUMBER(1) | 0/1 |
| `is_business_day` | NUMBER(1) | 0/1 |
| `season` | VARCHAR2(10) | Peak / High / Autumn / Winter |
Using an integer date key (YYYYMMDD) instead of a DATE FK allows efficient range predicates: `checkin_date_key BETWEEN 20240601 AND 20240831`.
---
### DIM_COUNTRY (SCD Type 1)
Country attributes are stable. If a name or currency ever changes, the row is simply overwritten (no history needed).
| Column | Type | Description |
|--------|------|-------------|
| `country_key` | NUMBER(10) PK | Surrogate (IDENTITY) |
| `country_id` | NUMBER(10) UNIQUE | Natural key from MySQL |
| `code` | CHAR(2) | ISO alpha-2 |
| `name` | VARCHAR2(100) | |
| `currency` | VARCHAR2(10) | ISO currency code |
---
### DIM_STAR_RATING (SCD Type 1)
Static lookup. Star rating codes 15 never change.
| Column | Type | Description |
|--------|------|-------------|
| `star_rating_key` | NUMBER(10) PK | Surrogate (IDENTITY) |
| `star_rating_id` | NUMBER(10) UNIQUE | Natural key |
| `code` | NUMBER(1) | 15 |
| `description` | VARCHAR2(20) | e.g. `4 Star` |
---
### DIM_HOTEL_CHAIN (SCD Type 1)
Chain name/code may be updated (e.g. corporate rebranding), but we do not need a historical record of chain name changes.
| Column | Type | Description |
|--------|------|-------------|
| `hotel_chain_key` | NUMBER(10) PK | Surrogate (IDENTITY) |
| `hotel_chain_id` | NUMBER(10) UNIQUE | Natural key |
| `code` | VARCHAR2(10) | e.g. `HLT` |
| `name` | VARCHAR2(100) | |
---
### DIM_HOTEL (SCD Type 2)
This is the most analytically significant dimension and the only one implemented as **Slowly Changing Dimension Type 2**.
**Why SCD Type 2 here?**
A hotel's star rating or chain affiliation can change over time — a property gets renovated and reclassified from 3★ to 4★, or switches from one international chain to another. These changes directly affect revenue analysis: a 3★ hotel charges different rates than a 4★ hotel, and grouping all historical bookings under the current star rating would produce misleading averages.
SCD Type 2 preserves history by creating a **new row** for each version of a hotel, while expiring the old row with an `expiry_date`. The fact table's `hotel_key` always points to the version that was active **at check-in date**, never to the current version if it changed.
| Column | Type | Description |
|--------|------|-------------|
| `hotel_key` | NUMBER(10) PK | Surrogate (IDENTITY) |
| `source_hotel_id` | NUMBER(10) | Natural key from MySQL |
| `hotel_chain_key` | NUMBER(10) FK | NULL for independent hotels |
| `country_key` | NUMBER(10) FK | |
| `star_rating_key` | NUMBER(10) FK | |
| `code` | VARCHAR2(20) | |
| `name` | VARCHAR2(150) | |
| `city` | VARCHAR2(100) | |
| `effective_date` | DATE | When this version became active |
| `expiry_date` | DATE | When this version was superseded (NULL = current) |
| `is_current` | NUMBER(1) | 1 = current version |
**SCD2 example:**
| hotel_key | source_hotel_id | star_rating | effective_date | expiry_date | is_current |
|-----------|----------------|-------------|----------------|-------------|-----------|
| 1 | 42 | 3★ | 2022-01-01 | 2024-05-31 | 0 |
| 2 | 42 | 4★ | 2024-06-01 | NULL | 1 |
Bookings from 20222024 point to `hotel_key=1`, bookings from 2024 onward point to `hotel_key=2`. Revenue by star category remains historically correct.
---
### DIM_ROOM (SCD Type 1)
Room type is stable for our dataset. Updated via MERGE if room details ever change.
| Column | Type | Description |
|--------|------|-------------|
| `room_key` | NUMBER(10) PK | Surrogate (IDENTITY) |
| `room_id` | NUMBER(10) UNIQUE | Natural key |
| `hotel_key` | NUMBER(10) FK | Points to current DIM_HOTEL version |
| `room_number` | VARCHAR2(10) | |
| `floor` | NUMBER(3) | |
| `room_type_code` | VARCHAR2(20) | e.g. `SUITE` |
| `room_type_desc` | VARCHAR2(100) | |
| `smoking_yn` | NUMBER(1) | |
| `standard_rate` | NUMBER(10,2) | Base rate from OLTP |
---
### DIM_GUEST (SCD Type 1)
Guest personal data (city, country) may change, but tracking historical addresses has no analytical value for this domain. MERGE (upsert) is used.
| Column | Type | Description |
|--------|------|-------------|
| `guest_key` | NUMBER(10) PK | Surrogate (IDENTITY) |
| `guest_id` | NUMBER(10) UNIQUE | Natural key |
| `country_key` | NUMBER(10) FK | Home country |
| `name` | VARCHAR2(150) | |
| `city` | VARCHAR2(100) | |
---
## Fact Table: FACT_ROOM_BOOKING
**Grain:** One row per room_booking — one specific room, for one stay.
| Column | Type | Description |
|--------|------|-------------|
| `fact_id` | NUMBER(10) PK | Surrogate (IDENTITY) |
| `source_rb_id` | NUMBER(10) UNIQUE | Natural key — used for idempotent incremental loads |
| `hotel_key` | NUMBER(10) FK | SCD2-resolved hotel version at check-in |
| `hotel_chain_key` | NUMBER(10) FK | Denormalized from DIM_HOTEL for convenience |
| `room_key` | NUMBER(10) FK | |
| `guest_key` | NUMBER(10) FK | |
| `country_key` | NUMBER(10) FK | Guest's country — denormalized |
| `star_rating_key` | NUMBER(10) FK | Denormalized from DIM_HOTEL for convenience |
| `checkin_date_key` | NUMBER(8) FK | YYYYMMDD |
| `checkout_date_key` | NUMBER(8) FK | YYYYMMDD |
| `booking_status` | VARCHAR2(20) | Degenerate dimension: confirmed/completed/cancelled/no_show |
| `nights_stayed` | NUMBER(4) | checkout checkin in days |
| `nightly_rate` | NUMBER(10,2) | Rate per night at time of booking |
| `total_amount` | NUMBER(12,2) | `nightly_rate × nights_stayed` |
### Measures
| Measure | Type | Aggregation |
|---------|------|-------------|
| `nights_stayed` | Additive | SUM, AVG |
| `nightly_rate` | Semi-additive | AVG (not SUM — rate doesn't add across rooms meaningfully) |
| `total_amount` | Additive | SUM (main revenue measure) |
### Degenerate Dimensions
`booking_status` is stored directly on the fact row. Splitting it into a separate dimension table would add a table with only 4 rows and no other attributes — not worth the JOIN overhead.
---
## ETL Control Tables
### ETL_WATERMARK
Tracks the highest `room_booking_id` already loaded into the fact table, enabling incremental loads without re-reading the entire source.
| Column | Description |
|--------|-------------|
| `entity_name` | Logical entity name (e.g. `FACT_ROOM_BOOKING`) |
| `last_key` | Highest PK value loaded so far |
| `last_run_ts` | Timestamp of the last ETL run |
### STG_HOTEL
Staging table used by the SCD2 ETL process. NiFi loads raw hotel data from MySQL here, then SQL applies the expire-and-insert SCD2 logic in a single transaction. Truncated at the start of each ETL run.
---
## Sample Analytical Queries
### Revenue by country and quarter
```sql
SELECT
c.name AS country,
d.year,
d.quarter,
SUM(f.total_amount) AS revenue,
COUNT(*) AS room_nights
FROM FACT_ROOM_BOOKING f
JOIN DIM_DATE d ON d.date_key = f.checkin_date_key
JOIN DIM_GUEST g ON g.guest_key = f.guest_key
JOIN DIM_COUNTRY c ON c.country_key = g.country_key
WHERE f.booking_status = 'completed'
GROUP BY c.name, d.year, d.quarter
ORDER BY revenue DESC;
```
### Average revenue per star category (correct because of SCD2)
```sql
SELECT
sr.code AS stars,
d.season,
AVG(f.nightly_rate) AS avg_nightly_rate,
SUM(f.total_amount) AS total_revenue
FROM FACT_ROOM_BOOKING f
JOIN DIM_HOTEL h ON h.hotel_key = f.hotel_key
JOIN DIM_STAR_RATING sr ON sr.star_rating_key = f.star_rating_key
JOIN DIM_DATE d ON d.date_key = f.checkin_date_key
GROUP BY sr.code, d.season
ORDER BY sr.code, d.season;
```
### Top 10 cities by occupancy (room-nights)
```sql
SELECT
h.city,
SUM(f.nights_stayed) AS room_nights,
SUM(f.total_amount) AS revenue
FROM FACT_ROOM_BOOKING f
JOIN DIM_HOTEL h ON h.hotel_key = f.hotel_key
WHERE f.booking_status IN ('completed','confirmed')
GROUP BY h.city
ORDER BY room_nights DESC
FETCH FIRST 10 ROWS ONLY;
```