Files
IPZ_1/docs/03-datamart.md
2026-05-17 21:27:42 +02:00

9.4 KiB
Raw Permalink Blame History

Data Mart — Design & Details

Overview

The data mart uses a star schema stored in an Oracle database (university lab schema). It is optimized for analytical queries against hotel reservation data — revenue analysis, occupancy trends, seasonal patterns, and guest origin breakdowns.

  • Schema type: Star schema
  • Dimensions: 6 (+ date dimension)
  • Fact table: FACT_ROOM_BOOKING
  • Grain: One row per room_booking (one room, one stay)
  • SCD strategy: Type 2 on DIM_HOTEL, Type 1 on all others

Star Schema Diagram

                        DIM_DATE
                       (date_key)
                           │
               ┌───────────┴───────────┐
               │ checkin / checkout    │
               │                       │
DIM_HOTEL_CHAIN ◄─ DIM_HOTEL ─► DIM_STAR_RATING
       │               │
       │      FACT_ROOM_BOOKING ◄──── DIM_ROOM
       │               │
       └───────►   DIM_COUNTRY ◄───── DIM_GUEST

Dimension Tables

DIM_DATE

Populated once for the range 20202030. Used for both check-in and check-out date lookups.

Column Type Description
date_key NUMBER(8) PK YYYYMMDD integer key
full_date DATE Actual date value
year NUMBER(4)
quarter NUMBER(1) 14
month NUMBER(2) 112
month_name VARCHAR2(10) e.g. January
week_number NUMBER(2) ISO week number
day_of_month NUMBER(2)
day_name VARCHAR2(10) e.g. Monday
is_weekend NUMBER(1) 0/1
is_business_day NUMBER(1) 0/1
season VARCHAR2(10) Peak / High / Autumn / Winter

Using an integer date key (YYYYMMDD) instead of a DATE FK allows efficient range predicates: checkin_date_key BETWEEN 20240601 AND 20240831.


DIM_COUNTRY (SCD Type 1)

Country attributes are stable. If a name or currency ever changes, the row is simply overwritten (no history needed).

Column Type Description
country_key NUMBER(10) PK Surrogate (IDENTITY)
country_id NUMBER(10) UNIQUE Natural key from MySQL
code CHAR(2) ISO alpha-2
name VARCHAR2(100)
currency VARCHAR2(10) ISO currency code

DIM_STAR_RATING (SCD Type 1)

Static lookup. Star rating codes 15 never change.

Column Type Description
star_rating_key NUMBER(10) PK Surrogate (IDENTITY)
star_rating_id NUMBER(10) UNIQUE Natural key
code NUMBER(1) 15
description VARCHAR2(20) e.g. 4 Star

DIM_HOTEL_CHAIN (SCD Type 1)

Chain name/code may be updated (e.g. corporate rebranding), but we do not need a historical record of chain name changes.

Column Type Description
hotel_chain_key NUMBER(10) PK Surrogate (IDENTITY)
hotel_chain_id NUMBER(10) UNIQUE Natural key
code VARCHAR2(10) e.g. HLT
name VARCHAR2(100)

DIM_HOTEL (SCD Type 2)

This is the most analytically significant dimension and the only one implemented as Slowly Changing Dimension Type 2.

Why SCD Type 2 here?

A hotel's star rating or chain affiliation can change over time — a property gets renovated and reclassified from 3★ to 4★, or switches from one international chain to another. These changes directly affect revenue analysis: a 3★ hotel charges different rates than a 4★ hotel, and grouping all historical bookings under the current star rating would produce misleading averages.

SCD Type 2 preserves history by creating a new row for each version of a hotel, while expiring the old row with an expiry_date. The fact table's hotel_key always points to the version that was active at check-in date, never to the current version if it changed.

Column Type Description
hotel_key NUMBER(10) PK Surrogate (IDENTITY)
source_hotel_id NUMBER(10) Natural key from MySQL
hotel_chain_key NUMBER(10) FK NULL for independent hotels
country_key NUMBER(10) FK
star_rating_key NUMBER(10) FK
code VARCHAR2(20)
name VARCHAR2(150)
city VARCHAR2(100)
effective_date DATE When this version became active
expiry_date DATE When this version was superseded (NULL = current)
is_current NUMBER(1) 1 = current version

SCD2 example:

hotel_key source_hotel_id star_rating effective_date expiry_date is_current
1 42 3★ 2022-01-01 2024-05-31 0
2 42 4★ 2024-06-01 NULL 1

Bookings from 20222024 point to hotel_key=1, bookings from 2024 onward point to hotel_key=2. Revenue by star category remains historically correct.


DIM_ROOM (SCD Type 1)

Room type is stable for our dataset. Updated via MERGE if room details ever change.

Column Type Description
room_key NUMBER(10) PK Surrogate (IDENTITY)
room_id NUMBER(10) UNIQUE Natural key
hotel_key NUMBER(10) FK Points to current DIM_HOTEL version
room_number VARCHAR2(10)
floor NUMBER(3)
room_type_code VARCHAR2(20) e.g. SUITE
room_type_desc VARCHAR2(100)
smoking_yn NUMBER(1)
standard_rate NUMBER(10,2) Base rate from OLTP

DIM_GUEST (SCD Type 1)

Guest personal data (city, country) may change, but tracking historical addresses has no analytical value for this domain. MERGE (upsert) is used.

Column Type Description
guest_key NUMBER(10) PK Surrogate (IDENTITY)
guest_id NUMBER(10) UNIQUE Natural key
country_key NUMBER(10) FK Home country
name VARCHAR2(150)
city VARCHAR2(100)

Fact Table: FACT_ROOM_BOOKING

Grain: One row per room_booking — one specific room, for one stay.

Column Type Description
fact_id NUMBER(10) PK Surrogate (IDENTITY)
source_rb_id NUMBER(10) UNIQUE Natural key — used for idempotent incremental loads
hotel_key NUMBER(10) FK SCD2-resolved hotel version at check-in
hotel_chain_key NUMBER(10) FK Denormalized from DIM_HOTEL for convenience
room_key NUMBER(10) FK
guest_key NUMBER(10) FK
country_key NUMBER(10) FK Guest's country — denormalized
star_rating_key NUMBER(10) FK Denormalized from DIM_HOTEL for convenience
checkin_date_key NUMBER(8) FK YYYYMMDD
checkout_date_key NUMBER(8) FK YYYYMMDD
booking_status VARCHAR2(20) Degenerate dimension: confirmed/completed/cancelled/no_show
nights_stayed NUMBER(4) checkout checkin in days
nightly_rate NUMBER(10,2) Rate per night at time of booking
total_amount NUMBER(12,2) nightly_rate × nights_stayed

Measures

Measure Type Aggregation
nights_stayed Additive SUM, AVG
nightly_rate Semi-additive AVG (not SUM — rate doesn't add across rooms meaningfully)
total_amount Additive SUM (main revenue measure)

Degenerate Dimensions

booking_status is stored directly on the fact row. Splitting it into a separate dimension table would add a table with only 4 rows and no other attributes — not worth the JOIN overhead.


ETL Control Tables

ETL_WATERMARK

Tracks the highest room_booking_id already loaded into the fact table, enabling incremental loads without re-reading the entire source.

Column Description
entity_name Logical entity name (e.g. FACT_ROOM_BOOKING)
last_key Highest PK value loaded so far
last_run_ts Timestamp of the last ETL run

STG_HOTEL

Staging table used by the SCD2 ETL process. NiFi loads raw hotel data from MySQL here, then SQL applies the expire-and-insert SCD2 logic in a single transaction. Truncated at the start of each ETL run.


Sample Analytical Queries

Revenue by country and quarter

SELECT
    c.name          AS country,
    d.year,
    d.quarter,
    SUM(f.total_amount)     AS revenue,
    COUNT(*)                AS room_nights
FROM FACT_ROOM_BOOKING f
JOIN DIM_DATE    d ON d.date_key    = f.checkin_date_key
JOIN DIM_GUEST   g ON g.guest_key  = f.guest_key
JOIN DIM_COUNTRY c ON c.country_key = g.country_key
WHERE f.booking_status = 'completed'
GROUP BY c.name, d.year, d.quarter
ORDER BY revenue DESC;

Average revenue per star category (correct because of SCD2)

SELECT
    sr.code         AS stars,
    d.season,
    AVG(f.nightly_rate)     AS avg_nightly_rate,
    SUM(f.total_amount)     AS total_revenue
FROM FACT_ROOM_BOOKING f
JOIN DIM_HOTEL       h  ON h.hotel_key       = f.hotel_key
JOIN DIM_STAR_RATING sr ON sr.star_rating_key = f.star_rating_key
JOIN DIM_DATE        d  ON d.date_key         = f.checkin_date_key
GROUP BY sr.code, d.season
ORDER BY sr.code, d.season;

Top 10 cities by occupancy (room-nights)

SELECT
    h.city,
    SUM(f.nights_stayed)   AS room_nights,
    SUM(f.total_amount)    AS revenue
FROM FACT_ROOM_BOOKING f
JOIN DIM_HOTEL h ON h.hotel_key = f.hotel_key
WHERE f.booking_status IN ('completed','confirmed')
GROUP BY h.city
ORDER BY room_nights DESC
FETCH FIRST 10 ROWS ONLY;