# Data Mart — Design & Details ## Overview The data mart uses a **star schema** stored in an Oracle database (university lab schema). It is optimized for analytical queries against hotel reservation data — revenue analysis, occupancy trends, seasonal patterns, and guest origin breakdowns. - **Schema type:** Star schema - **Dimensions:** 6 (+ date dimension) - **Fact table:** `FACT_ROOM_BOOKING` - **Grain:** One row per room_booking (one room, one stay) - **SCD strategy:** Type 2 on DIM_HOTEL, Type 1 on all others --- ## Star Schema Diagram ``` DIM_DATE (date_key) │ ┌───────────┴───────────┐ │ checkin / checkout │ │ │ DIM_HOTEL_CHAIN ◄─ DIM_HOTEL ─► DIM_STAR_RATING │ │ │ FACT_ROOM_BOOKING ◄──── DIM_ROOM │ │ └───────► DIM_COUNTRY ◄───── DIM_GUEST ``` --- ## Dimension Tables ### DIM_DATE Populated once for the range 2020–2030. Used for both check-in and check-out date lookups. | Column | Type | Description | |--------|------|-------------| | `date_key` | NUMBER(8) PK | YYYYMMDD integer key | | `full_date` | DATE | Actual date value | | `year` | NUMBER(4) | | | `quarter` | NUMBER(1) | 1–4 | | `month` | NUMBER(2) | 1–12 | | `month_name` | VARCHAR2(10) | e.g. `January` | | `week_number` | NUMBER(2) | ISO week number | | `day_of_month` | NUMBER(2) | | | `day_name` | VARCHAR2(10) | e.g. `Monday` | | `is_weekend` | NUMBER(1) | 0/1 | | `is_business_day` | NUMBER(1) | 0/1 | | `season` | VARCHAR2(10) | Peak / High / Autumn / Winter | Using an integer date key (YYYYMMDD) instead of a DATE FK allows efficient range predicates: `checkin_date_key BETWEEN 20240601 AND 20240831`. --- ### DIM_COUNTRY (SCD Type 1) Country attributes are stable. If a name or currency ever changes, the row is simply overwritten (no history needed). | Column | Type | Description | |--------|------|-------------| | `country_key` | NUMBER(10) PK | Surrogate (IDENTITY) | | `country_id` | NUMBER(10) UNIQUE | Natural key from MySQL | | `code` | CHAR(2) | ISO alpha-2 | | `name` | VARCHAR2(100) | | | `currency` | VARCHAR2(10) | ISO currency code | --- ### DIM_STAR_RATING (SCD Type 1) Static lookup. Star rating codes 1–5 never change. | Column | Type | Description | |--------|------|-------------| | `star_rating_key` | NUMBER(10) PK | Surrogate (IDENTITY) | | `star_rating_id` | NUMBER(10) UNIQUE | Natural key | | `code` | NUMBER(1) | 1–5 | | `description` | VARCHAR2(20) | e.g. `4 Star` | --- ### DIM_HOTEL_CHAIN (SCD Type 1) Chain name/code may be updated (e.g. corporate rebranding), but we do not need a historical record of chain name changes. | Column | Type | Description | |--------|------|-------------| | `hotel_chain_key` | NUMBER(10) PK | Surrogate (IDENTITY) | | `hotel_chain_id` | NUMBER(10) UNIQUE | Natural key | | `code` | VARCHAR2(10) | e.g. `HLT` | | `name` | VARCHAR2(100) | | --- ### DIM_HOTEL (SCD Type 2) This is the most analytically significant dimension and the only one implemented as **Slowly Changing Dimension Type 2**. **Why SCD Type 2 here?** A hotel's star rating or chain affiliation can change over time — a property gets renovated and reclassified from 3★ to 4★, or switches from one international chain to another. These changes directly affect revenue analysis: a 3★ hotel charges different rates than a 4★ hotel, and grouping all historical bookings under the current star rating would produce misleading averages. SCD Type 2 preserves history by creating a **new row** for each version of a hotel, while expiring the old row with an `expiry_date`. The fact table's `hotel_key` always points to the version that was active **at check-in date**, never to the current version if it changed. | Column | Type | Description | |--------|------|-------------| | `hotel_key` | NUMBER(10) PK | Surrogate (IDENTITY) | | `source_hotel_id` | NUMBER(10) | Natural key from MySQL | | `hotel_chain_key` | NUMBER(10) FK | NULL for independent hotels | | `country_key` | NUMBER(10) FK | | | `star_rating_key` | NUMBER(10) FK | | | `code` | VARCHAR2(20) | | | `name` | VARCHAR2(150) | | | `city` | VARCHAR2(100) | | | `effective_date` | DATE | When this version became active | | `expiry_date` | DATE | When this version was superseded (NULL = current) | | `is_current` | NUMBER(1) | 1 = current version | **SCD2 example:** | hotel_key | source_hotel_id | star_rating | effective_date | expiry_date | is_current | |-----------|----------------|-------------|----------------|-------------|-----------| | 1 | 42 | 3★ | 2022-01-01 | 2024-05-31 | 0 | | 2 | 42 | 4★ | 2024-06-01 | NULL | 1 | Bookings from 2022–2024 point to `hotel_key=1`, bookings from 2024 onward point to `hotel_key=2`. Revenue by star category remains historically correct. --- ### DIM_ROOM (SCD Type 1) Room type is stable for our dataset. Updated via MERGE if room details ever change. | Column | Type | Description | |--------|------|-------------| | `room_key` | NUMBER(10) PK | Surrogate (IDENTITY) | | `room_id` | NUMBER(10) UNIQUE | Natural key | | `hotel_key` | NUMBER(10) FK | Points to current DIM_HOTEL version | | `room_number` | VARCHAR2(10) | | | `floor` | NUMBER(3) | | | `room_type_code` | VARCHAR2(20) | e.g. `SUITE` | | `room_type_desc` | VARCHAR2(100) | | | `smoking_yn` | NUMBER(1) | | | `standard_rate` | NUMBER(10,2) | Base rate from OLTP | --- ### DIM_GUEST (SCD Type 1) Guest personal data (city, country) may change, but tracking historical addresses has no analytical value for this domain. MERGE (upsert) is used. | Column | Type | Description | |--------|------|-------------| | `guest_key` | NUMBER(10) PK | Surrogate (IDENTITY) | | `guest_id` | NUMBER(10) UNIQUE | Natural key | | `country_key` | NUMBER(10) FK | Home country | | `name` | VARCHAR2(150) | | | `city` | VARCHAR2(100) | | --- ## Fact Table: FACT_ROOM_BOOKING **Grain:** One row per room_booking — one specific room, for one stay. | Column | Type | Description | |--------|------|-------------| | `fact_id` | NUMBER(10) PK | Surrogate (IDENTITY) | | `source_rb_id` | NUMBER(10) UNIQUE | Natural key — used for idempotent incremental loads | | `hotel_key` | NUMBER(10) FK | SCD2-resolved hotel version at check-in | | `hotel_chain_key` | NUMBER(10) FK | Denormalized from DIM_HOTEL for convenience | | `room_key` | NUMBER(10) FK | | | `guest_key` | NUMBER(10) FK | | | `country_key` | NUMBER(10) FK | Guest's country — denormalized | | `star_rating_key` | NUMBER(10) FK | Denormalized from DIM_HOTEL for convenience | | `checkin_date_key` | NUMBER(8) FK | YYYYMMDD | | `checkout_date_key` | NUMBER(8) FK | YYYYMMDD | | `booking_status` | VARCHAR2(20) | Degenerate dimension: confirmed/completed/cancelled/no_show | | `nights_stayed` | NUMBER(4) | checkout − checkin in days | | `nightly_rate` | NUMBER(10,2) | Rate per night at time of booking | | `total_amount` | NUMBER(12,2) | `nightly_rate × nights_stayed` | ### Measures | Measure | Type | Aggregation | |---------|------|-------------| | `nights_stayed` | Additive | SUM, AVG | | `nightly_rate` | Semi-additive | AVG (not SUM — rate doesn't add across rooms meaningfully) | | `total_amount` | Additive | SUM (main revenue measure) | ### Degenerate Dimensions `booking_status` is stored directly on the fact row. Splitting it into a separate dimension table would add a table with only 4 rows and no other attributes — not worth the JOIN overhead. --- ## ETL Control Tables ### ETL_WATERMARK Tracks the highest `room_booking_id` already loaded into the fact table, enabling incremental loads without re-reading the entire source. | Column | Description | |--------|-------------| | `entity_name` | Logical entity name (e.g. `FACT_ROOM_BOOKING`) | | `last_key` | Highest PK value loaded so far | | `last_run_ts` | Timestamp of the last ETL run | ### STG_HOTEL Staging table used by the SCD2 ETL process. NiFi loads raw hotel data from MySQL here, then SQL applies the expire-and-insert SCD2 logic in a single transaction. Truncated at the start of each ETL run. --- ## Sample Analytical Queries ### Revenue by country and quarter ```sql SELECT c.name AS country, d.year, d.quarter, SUM(f.total_amount) AS revenue, COUNT(*) AS room_nights FROM FACT_ROOM_BOOKING f JOIN DIM_DATE d ON d.date_key = f.checkin_date_key JOIN DIM_GUEST g ON g.guest_key = f.guest_key JOIN DIM_COUNTRY c ON c.country_key = g.country_key WHERE f.booking_status = 'completed' GROUP BY c.name, d.year, d.quarter ORDER BY revenue DESC; ``` ### Average revenue per star category (correct because of SCD2) ```sql SELECT sr.code AS stars, d.season, AVG(f.nightly_rate) AS avg_nightly_rate, SUM(f.total_amount) AS total_revenue FROM FACT_ROOM_BOOKING f JOIN DIM_HOTEL h ON h.hotel_key = f.hotel_key JOIN DIM_STAR_RATING sr ON sr.star_rating_key = f.star_rating_key JOIN DIM_DATE d ON d.date_key = f.checkin_date_key GROUP BY sr.code, d.season ORDER BY sr.code, d.season; ``` ### Top 10 cities by occupancy (room-nights) ```sql SELECT h.city, SUM(f.nights_stayed) AS room_nights, SUM(f.total_amount) AS revenue FROM FACT_ROOM_BOOKING f JOIN DIM_HOTEL h ON h.hotel_key = f.hotel_key WHERE f.booking_status IN ('completed','confirmed') GROUP BY h.city ORDER BY room_nights DESC FETCH FIRST 10 ROWS ONLY; ```