9.4 KiB
Data Mart — Design & Details
Overview
The data mart uses a star schema stored in an Oracle database (university lab schema). It is optimized for analytical queries against hotel reservation data — revenue analysis, occupancy trends, seasonal patterns, and guest origin breakdowns.
- Schema type: Star schema
- Dimensions: 6 (+ date dimension)
- Fact table:
FACT_ROOM_BOOKING - Grain: One row per room_booking (one room, one stay)
- SCD strategy: Type 2 on DIM_HOTEL, Type 1 on all others
Star Schema Diagram
DIM_DATE
(date_key)
│
┌───────────┴───────────┐
│ checkin / checkout │
│ │
DIM_HOTEL_CHAIN ◄─ DIM_HOTEL ─► DIM_STAR_RATING
│ │
│ FACT_ROOM_BOOKING ◄──── DIM_ROOM
│ │
└───────► DIM_COUNTRY ◄───── DIM_GUEST
Dimension Tables
DIM_DATE
Populated once for the range 2020–2030. Used for both check-in and check-out date lookups.
| Column | Type | Description |
|---|---|---|
date_key |
NUMBER(8) PK | YYYYMMDD integer key |
full_date |
DATE | Actual date value |
year |
NUMBER(4) | |
quarter |
NUMBER(1) | 1–4 |
month |
NUMBER(2) | 1–12 |
month_name |
VARCHAR2(10) | e.g. January |
week_number |
NUMBER(2) | ISO week number |
day_of_month |
NUMBER(2) | |
day_name |
VARCHAR2(10) | e.g. Monday |
is_weekend |
NUMBER(1) | 0/1 |
is_business_day |
NUMBER(1) | 0/1 |
season |
VARCHAR2(10) | Peak / High / Autumn / Winter |
Using an integer date key (YYYYMMDD) instead of a DATE FK allows efficient range predicates: checkin_date_key BETWEEN 20240601 AND 20240831.
DIM_COUNTRY (SCD Type 1)
Country attributes are stable. If a name or currency ever changes, the row is simply overwritten (no history needed).
| Column | Type | Description |
|---|---|---|
country_key |
NUMBER(10) PK | Surrogate (IDENTITY) |
country_id |
NUMBER(10) UNIQUE | Natural key from MySQL |
code |
CHAR(2) | ISO alpha-2 |
name |
VARCHAR2(100) | |
currency |
VARCHAR2(10) | ISO currency code |
DIM_STAR_RATING (SCD Type 1)
Static lookup. Star rating codes 1–5 never change.
| Column | Type | Description |
|---|---|---|
star_rating_key |
NUMBER(10) PK | Surrogate (IDENTITY) |
star_rating_id |
NUMBER(10) UNIQUE | Natural key |
code |
NUMBER(1) | 1–5 |
description |
VARCHAR2(20) | e.g. 4 Star |
DIM_HOTEL_CHAIN (SCD Type 1)
Chain name/code may be updated (e.g. corporate rebranding), but we do not need a historical record of chain name changes.
| Column | Type | Description |
|---|---|---|
hotel_chain_key |
NUMBER(10) PK | Surrogate (IDENTITY) |
hotel_chain_id |
NUMBER(10) UNIQUE | Natural key |
code |
VARCHAR2(10) | e.g. HLT |
name |
VARCHAR2(100) |
DIM_HOTEL (SCD Type 2)
This is the most analytically significant dimension and the only one implemented as Slowly Changing Dimension Type 2.
Why SCD Type 2 here?
A hotel's star rating or chain affiliation can change over time — a property gets renovated and reclassified from 3★ to 4★, or switches from one international chain to another. These changes directly affect revenue analysis: a 3★ hotel charges different rates than a 4★ hotel, and grouping all historical bookings under the current star rating would produce misleading averages.
SCD Type 2 preserves history by creating a new row for each version of a hotel, while expiring the old row with an expiry_date. The fact table's hotel_key always points to the version that was active at check-in date, never to the current version if it changed.
| Column | Type | Description |
|---|---|---|
hotel_key |
NUMBER(10) PK | Surrogate (IDENTITY) |
source_hotel_id |
NUMBER(10) | Natural key from MySQL |
hotel_chain_key |
NUMBER(10) FK | NULL for independent hotels |
country_key |
NUMBER(10) FK | |
star_rating_key |
NUMBER(10) FK | |
code |
VARCHAR2(20) | |
name |
VARCHAR2(150) | |
city |
VARCHAR2(100) | |
effective_date |
DATE | When this version became active |
expiry_date |
DATE | When this version was superseded (NULL = current) |
is_current |
NUMBER(1) | 1 = current version |
SCD2 example:
| hotel_key | source_hotel_id | star_rating | effective_date | expiry_date | is_current |
|---|---|---|---|---|---|
| 1 | 42 | 3★ | 2022-01-01 | 2024-05-31 | 0 |
| 2 | 42 | 4★ | 2024-06-01 | NULL | 1 |
Bookings from 2022–2024 point to hotel_key=1, bookings from 2024 onward point to hotel_key=2. Revenue by star category remains historically correct.
DIM_ROOM (SCD Type 1)
Room type is stable for our dataset. Updated via MERGE if room details ever change.
| Column | Type | Description |
|---|---|---|
room_key |
NUMBER(10) PK | Surrogate (IDENTITY) |
room_id |
NUMBER(10) UNIQUE | Natural key |
hotel_key |
NUMBER(10) FK | Points to current DIM_HOTEL version |
room_number |
VARCHAR2(10) | |
floor |
NUMBER(3) | |
room_type_code |
VARCHAR2(20) | e.g. SUITE |
room_type_desc |
VARCHAR2(100) | |
smoking_yn |
NUMBER(1) | |
standard_rate |
NUMBER(10,2) | Base rate from OLTP |
DIM_GUEST (SCD Type 1)
Guest personal data (city, country) may change, but tracking historical addresses has no analytical value for this domain. MERGE (upsert) is used.
| Column | Type | Description |
|---|---|---|
guest_key |
NUMBER(10) PK | Surrogate (IDENTITY) |
guest_id |
NUMBER(10) UNIQUE | Natural key |
country_key |
NUMBER(10) FK | Home country |
name |
VARCHAR2(150) | |
city |
VARCHAR2(100) |
Fact Table: FACT_ROOM_BOOKING
Grain: One row per room_booking — one specific room, for one stay.
| Column | Type | Description |
|---|---|---|
fact_id |
NUMBER(10) PK | Surrogate (IDENTITY) |
source_rb_id |
NUMBER(10) UNIQUE | Natural key — used for idempotent incremental loads |
hotel_key |
NUMBER(10) FK | SCD2-resolved hotel version at check-in |
hotel_chain_key |
NUMBER(10) FK | Denormalized from DIM_HOTEL for convenience |
room_key |
NUMBER(10) FK | |
guest_key |
NUMBER(10) FK | |
country_key |
NUMBER(10) FK | Guest's country — denormalized |
star_rating_key |
NUMBER(10) FK | Denormalized from DIM_HOTEL for convenience |
checkin_date_key |
NUMBER(8) FK | YYYYMMDD |
checkout_date_key |
NUMBER(8) FK | YYYYMMDD |
booking_status |
VARCHAR2(20) | Degenerate dimension: confirmed/completed/cancelled/no_show |
nights_stayed |
NUMBER(4) | checkout − checkin in days |
nightly_rate |
NUMBER(10,2) | Rate per night at time of booking |
total_amount |
NUMBER(12,2) | nightly_rate × nights_stayed |
Measures
| Measure | Type | Aggregation |
|---|---|---|
nights_stayed |
Additive | SUM, AVG |
nightly_rate |
Semi-additive | AVG (not SUM — rate doesn't add across rooms meaningfully) |
total_amount |
Additive | SUM (main revenue measure) |
Degenerate Dimensions
booking_status is stored directly on the fact row. Splitting it into a separate dimension table would add a table with only 4 rows and no other attributes — not worth the JOIN overhead.
ETL Control Tables
ETL_WATERMARK
Tracks the highest room_booking_id already loaded into the fact table, enabling incremental loads without re-reading the entire source.
| Column | Description |
|---|---|
entity_name |
Logical entity name (e.g. FACT_ROOM_BOOKING) |
last_key |
Highest PK value loaded so far |
last_run_ts |
Timestamp of the last ETL run |
STG_HOTEL
Staging table used by the SCD2 ETL process. NiFi loads raw hotel data from MySQL here, then SQL applies the expire-and-insert SCD2 logic in a single transaction. Truncated at the start of each ETL run.
Sample Analytical Queries
Revenue by country and quarter
SELECT
c.name AS country,
d.year,
d.quarter,
SUM(f.total_amount) AS revenue,
COUNT(*) AS room_nights
FROM FACT_ROOM_BOOKING f
JOIN DIM_DATE d ON d.date_key = f.checkin_date_key
JOIN DIM_GUEST g ON g.guest_key = f.guest_key
JOIN DIM_COUNTRY c ON c.country_key = g.country_key
WHERE f.booking_status = 'completed'
GROUP BY c.name, d.year, d.quarter
ORDER BY revenue DESC;
Average revenue per star category (correct because of SCD2)
SELECT
sr.code AS stars,
d.season,
AVG(f.nightly_rate) AS avg_nightly_rate,
SUM(f.total_amount) AS total_revenue
FROM FACT_ROOM_BOOKING f
JOIN DIM_HOTEL h ON h.hotel_key = f.hotel_key
JOIN DIM_STAR_RATING sr ON sr.star_rating_key = f.star_rating_key
JOIN DIM_DATE d ON d.date_key = f.checkin_date_key
GROUP BY sr.code, d.season
ORDER BY sr.code, d.season;
Top 10 cities by occupancy (room-nights)
SELECT
h.city,
SUM(f.nights_stayed) AS room_nights,
SUM(f.total_amount) AS revenue
FROM FACT_ROOM_BOOKING f
JOIN DIM_HOTEL h ON h.hotel_key = f.hotel_key
WHERE f.booking_status IN ('completed','confirmed')
GROUP BY h.city
ORDER BY room_nights DESC
FETCH FIRST 10 ROWS ONLY;