256 lines
9.4 KiB
Markdown
256 lines
9.4 KiB
Markdown
# Data Mart — Design & Details
|
||
|
||
## Overview
|
||
|
||
The data mart uses a **star schema** stored in an Oracle database (university lab schema). It is optimized for analytical queries against hotel reservation data — revenue analysis, occupancy trends, seasonal patterns, and guest origin breakdowns.
|
||
|
||
- **Schema type:** Star schema
|
||
- **Dimensions:** 6 (+ date dimension)
|
||
- **Fact table:** `FACT_ROOM_BOOKING`
|
||
- **Grain:** One row per room_booking (one room, one stay)
|
||
- **SCD strategy:** Type 2 on DIM_HOTEL, Type 1 on all others
|
||
|
||
---
|
||
|
||
## Star Schema Diagram
|
||
|
||
```
|
||
DIM_DATE
|
||
(date_key)
|
||
│
|
||
┌───────────┴───────────┐
|
||
│ checkin / checkout │
|
||
│ │
|
||
DIM_HOTEL_CHAIN ◄─ DIM_HOTEL ─► DIM_STAR_RATING
|
||
│ │
|
||
│ FACT_ROOM_BOOKING ◄──── DIM_ROOM
|
||
│ │
|
||
└───────► DIM_COUNTRY ◄───── DIM_GUEST
|
||
```
|
||
|
||
---
|
||
|
||
## Dimension Tables
|
||
|
||
### DIM_DATE
|
||
Populated once for the range 2020–2030. Used for both check-in and check-out date lookups.
|
||
|
||
| Column | Type | Description |
|
||
|--------|------|-------------|
|
||
| `date_key` | NUMBER(8) PK | YYYYMMDD integer key |
|
||
| `full_date` | DATE | Actual date value |
|
||
| `year` | NUMBER(4) | |
|
||
| `quarter` | NUMBER(1) | 1–4 |
|
||
| `month` | NUMBER(2) | 1–12 |
|
||
| `month_name` | VARCHAR2(10) | e.g. `January` |
|
||
| `week_number` | NUMBER(2) | ISO week number |
|
||
| `day_of_month` | NUMBER(2) | |
|
||
| `day_name` | VARCHAR2(10) | e.g. `Monday` |
|
||
| `is_weekend` | NUMBER(1) | 0/1 |
|
||
| `is_business_day` | NUMBER(1) | 0/1 |
|
||
| `season` | VARCHAR2(10) | Peak / High / Autumn / Winter |
|
||
|
||
Using an integer date key (YYYYMMDD) instead of a DATE FK allows efficient range predicates: `checkin_date_key BETWEEN 20240601 AND 20240831`.
|
||
|
||
---
|
||
|
||
### DIM_COUNTRY (SCD Type 1)
|
||
Country attributes are stable. If a name or currency ever changes, the row is simply overwritten (no history needed).
|
||
|
||
| Column | Type | Description |
|
||
|--------|------|-------------|
|
||
| `country_key` | NUMBER(10) PK | Surrogate (IDENTITY) |
|
||
| `country_id` | NUMBER(10) UNIQUE | Natural key from MySQL |
|
||
| `code` | CHAR(2) | ISO alpha-2 |
|
||
| `name` | VARCHAR2(100) | |
|
||
| `currency` | VARCHAR2(10) | ISO currency code |
|
||
|
||
---
|
||
|
||
### DIM_STAR_RATING (SCD Type 1)
|
||
Static lookup. Star rating codes 1–5 never change.
|
||
|
||
| Column | Type | Description |
|
||
|--------|------|-------------|
|
||
| `star_rating_key` | NUMBER(10) PK | Surrogate (IDENTITY) |
|
||
| `star_rating_id` | NUMBER(10) UNIQUE | Natural key |
|
||
| `code` | NUMBER(1) | 1–5 |
|
||
| `description` | VARCHAR2(20) | e.g. `4 Star` |
|
||
|
||
---
|
||
|
||
### DIM_HOTEL_CHAIN (SCD Type 1)
|
||
Chain name/code may be updated (e.g. corporate rebranding), but we do not need a historical record of chain name changes.
|
||
|
||
| Column | Type | Description |
|
||
|--------|------|-------------|
|
||
| `hotel_chain_key` | NUMBER(10) PK | Surrogate (IDENTITY) |
|
||
| `hotel_chain_id` | NUMBER(10) UNIQUE | Natural key |
|
||
| `code` | VARCHAR2(10) | e.g. `HLT` |
|
||
| `name` | VARCHAR2(100) | |
|
||
|
||
---
|
||
|
||
### DIM_HOTEL (SCD Type 2)
|
||
|
||
This is the most analytically significant dimension and the only one implemented as **Slowly Changing Dimension Type 2**.
|
||
|
||
**Why SCD Type 2 here?**
|
||
|
||
A hotel's star rating or chain affiliation can change over time — a property gets renovated and reclassified from 3★ to 4★, or switches from one international chain to another. These changes directly affect revenue analysis: a 3★ hotel charges different rates than a 4★ hotel, and grouping all historical bookings under the current star rating would produce misleading averages.
|
||
|
||
SCD Type 2 preserves history by creating a **new row** for each version of a hotel, while expiring the old row with an `expiry_date`. The fact table's `hotel_key` always points to the version that was active **at check-in date**, never to the current version if it changed.
|
||
|
||
| Column | Type | Description |
|
||
|--------|------|-------------|
|
||
| `hotel_key` | NUMBER(10) PK | Surrogate (IDENTITY) |
|
||
| `source_hotel_id` | NUMBER(10) | Natural key from MySQL |
|
||
| `hotel_chain_key` | NUMBER(10) FK | NULL for independent hotels |
|
||
| `country_key` | NUMBER(10) FK | |
|
||
| `star_rating_key` | NUMBER(10) FK | |
|
||
| `code` | VARCHAR2(20) | |
|
||
| `name` | VARCHAR2(150) | |
|
||
| `city` | VARCHAR2(100) | |
|
||
| `effective_date` | DATE | When this version became active |
|
||
| `expiry_date` | DATE | When this version was superseded (NULL = current) |
|
||
| `is_current` | NUMBER(1) | 1 = current version |
|
||
|
||
**SCD2 example:**
|
||
|
||
| hotel_key | source_hotel_id | star_rating | effective_date | expiry_date | is_current |
|
||
|-----------|----------------|-------------|----------------|-------------|-----------|
|
||
| 1 | 42 | 3★ | 2022-01-01 | 2024-05-31 | 0 |
|
||
| 2 | 42 | 4★ | 2024-06-01 | NULL | 1 |
|
||
|
||
Bookings from 2022–2024 point to `hotel_key=1`, bookings from 2024 onward point to `hotel_key=2`. Revenue by star category remains historically correct.
|
||
|
||
---
|
||
|
||
### DIM_ROOM (SCD Type 1)
|
||
Room type is stable for our dataset. Updated via MERGE if room details ever change.
|
||
|
||
| Column | Type | Description |
|
||
|--------|------|-------------|
|
||
| `room_key` | NUMBER(10) PK | Surrogate (IDENTITY) |
|
||
| `room_id` | NUMBER(10) UNIQUE | Natural key |
|
||
| `hotel_key` | NUMBER(10) FK | Points to current DIM_HOTEL version |
|
||
| `room_number` | VARCHAR2(10) | |
|
||
| `floor` | NUMBER(3) | |
|
||
| `room_type_code` | VARCHAR2(20) | e.g. `SUITE` |
|
||
| `room_type_desc` | VARCHAR2(100) | |
|
||
| `smoking_yn` | NUMBER(1) | |
|
||
| `standard_rate` | NUMBER(10,2) | Base rate from OLTP |
|
||
|
||
---
|
||
|
||
### DIM_GUEST (SCD Type 1)
|
||
Guest personal data (city, country) may change, but tracking historical addresses has no analytical value for this domain. MERGE (upsert) is used.
|
||
|
||
| Column | Type | Description |
|
||
|--------|------|-------------|
|
||
| `guest_key` | NUMBER(10) PK | Surrogate (IDENTITY) |
|
||
| `guest_id` | NUMBER(10) UNIQUE | Natural key |
|
||
| `country_key` | NUMBER(10) FK | Home country |
|
||
| `name` | VARCHAR2(150) | |
|
||
| `city` | VARCHAR2(100) | |
|
||
|
||
---
|
||
|
||
## Fact Table: FACT_ROOM_BOOKING
|
||
|
||
**Grain:** One row per room_booking — one specific room, for one stay.
|
||
|
||
| Column | Type | Description |
|
||
|--------|------|-------------|
|
||
| `fact_id` | NUMBER(10) PK | Surrogate (IDENTITY) |
|
||
| `source_rb_id` | NUMBER(10) UNIQUE | Natural key — used for idempotent incremental loads |
|
||
| `hotel_key` | NUMBER(10) FK | SCD2-resolved hotel version at check-in |
|
||
| `hotel_chain_key` | NUMBER(10) FK | Denormalized from DIM_HOTEL for convenience |
|
||
| `room_key` | NUMBER(10) FK | |
|
||
| `guest_key` | NUMBER(10) FK | |
|
||
| `country_key` | NUMBER(10) FK | Guest's country — denormalized |
|
||
| `star_rating_key` | NUMBER(10) FK | Denormalized from DIM_HOTEL for convenience |
|
||
| `checkin_date_key` | NUMBER(8) FK | YYYYMMDD |
|
||
| `checkout_date_key` | NUMBER(8) FK | YYYYMMDD |
|
||
| `booking_status` | VARCHAR2(20) | Degenerate dimension: confirmed/completed/cancelled/no_show |
|
||
| `nights_stayed` | NUMBER(4) | checkout − checkin in days |
|
||
| `nightly_rate` | NUMBER(10,2) | Rate per night at time of booking |
|
||
| `total_amount` | NUMBER(12,2) | `nightly_rate × nights_stayed` |
|
||
|
||
### Measures
|
||
|
||
| Measure | Type | Aggregation |
|
||
|---------|------|-------------|
|
||
| `nights_stayed` | Additive | SUM, AVG |
|
||
| `nightly_rate` | Semi-additive | AVG (not SUM — rate doesn't add across rooms meaningfully) |
|
||
| `total_amount` | Additive | SUM (main revenue measure) |
|
||
|
||
### Degenerate Dimensions
|
||
`booking_status` is stored directly on the fact row. Splitting it into a separate dimension table would add a table with only 4 rows and no other attributes — not worth the JOIN overhead.
|
||
|
||
---
|
||
|
||
## ETL Control Tables
|
||
|
||
### ETL_WATERMARK
|
||
Tracks the highest `room_booking_id` already loaded into the fact table, enabling incremental loads without re-reading the entire source.
|
||
|
||
| Column | Description |
|
||
|--------|-------------|
|
||
| `entity_name` | Logical entity name (e.g. `FACT_ROOM_BOOKING`) |
|
||
| `last_key` | Highest PK value loaded so far |
|
||
| `last_run_ts` | Timestamp of the last ETL run |
|
||
|
||
### STG_HOTEL
|
||
Staging table used by the SCD2 ETL process. NiFi loads raw hotel data from MySQL here, then SQL applies the expire-and-insert SCD2 logic in a single transaction. Truncated at the start of each ETL run.
|
||
|
||
---
|
||
|
||
## Sample Analytical Queries
|
||
|
||
### Revenue by country and quarter
|
||
```sql
|
||
SELECT
|
||
c.name AS country,
|
||
d.year,
|
||
d.quarter,
|
||
SUM(f.total_amount) AS revenue,
|
||
COUNT(*) AS room_nights
|
||
FROM FACT_ROOM_BOOKING f
|
||
JOIN DIM_DATE d ON d.date_key = f.checkin_date_key
|
||
JOIN DIM_GUEST g ON g.guest_key = f.guest_key
|
||
JOIN DIM_COUNTRY c ON c.country_key = g.country_key
|
||
WHERE f.booking_status = 'completed'
|
||
GROUP BY c.name, d.year, d.quarter
|
||
ORDER BY revenue DESC;
|
||
```
|
||
|
||
### Average revenue per star category (correct because of SCD2)
|
||
```sql
|
||
SELECT
|
||
sr.code AS stars,
|
||
d.season,
|
||
AVG(f.nightly_rate) AS avg_nightly_rate,
|
||
SUM(f.total_amount) AS total_revenue
|
||
FROM FACT_ROOM_BOOKING f
|
||
JOIN DIM_HOTEL h ON h.hotel_key = f.hotel_key
|
||
JOIN DIM_STAR_RATING sr ON sr.star_rating_key = f.star_rating_key
|
||
JOIN DIM_DATE d ON d.date_key = f.checkin_date_key
|
||
GROUP BY sr.code, d.season
|
||
ORDER BY sr.code, d.season;
|
||
```
|
||
|
||
### Top 10 cities by occupancy (room-nights)
|
||
```sql
|
||
SELECT
|
||
h.city,
|
||
SUM(f.nights_stayed) AS room_nights,
|
||
SUM(f.total_amount) AS revenue
|
||
FROM FACT_ROOM_BOOKING f
|
||
JOIN DIM_HOTEL h ON h.hotel_key = f.hotel_key
|
||
WHERE f.booking_status IN ('completed','confirmed')
|
||
GROUP BY h.city
|
||
ORDER BY room_nights DESC
|
||
FETCH FIRST 10 ROWS ONLY;
|
||
```
|