Conclusion

What Was Built

This project delivers a complete, working Data Warehouse pipeline for the Hotel Reservations domain:

| Layer | What was built | Scale |
|---|---|---|
| OLTP | MySQL 8.4, 13-table normalized schema | ~635,000 rows |
| Data generation | .NET 10 C# script, realistic seasonal distribution | 500K bookings in ~3 min |
| ETL | Apache NiFi, 5 process groups | full + incremental loads |
| Data Mart | Oracle star schema, SCD Type 2 on DIM_HOTEL | 1 fact + 6 dims |

Design Decisions

Synthetic data generation instead of a Kaggle dataset

The decision to generate data rather than use a pre-existing dataset was deliberate. Publicly available hotel datasets are either too small (thousands of rows) or lack the normalized relational structure needed to demonstrate a realistic OLTP-to-DW pipeline. The generator produces statistically realistic data:

  • Seasonal booking distribution (summer peak, winter trough)
  • Realistic stay-length distribution (30% one-night stays)
  • Varied status distribution (80% completed, 10% confirmed, 7% cancelled, 3% no-show)
  • Revenue rates tied to actual seasonal pricing periods
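The distributions listed above can be reproduced with weighted sampling. The following is a minimal Python sketch, not the project's .NET generator: the status shares and the 30% one-night share come from the list above, while the monthly weights, the 2 to 14 night range, and all names (`sample_booking`, `MONTH_WEIGHTS`) are assumptions made for illustration.

```python
import random
from collections import Counter

# Status shares as stated in the list above.
STATUS_WEIGHTS = {"completed": 80, "confirmed": 10, "cancelled": 7, "no-show": 3}

# Hypothetical relative booking weight per month (Jan..Dec):
# summer peak, winter trough. These numbers are illustrative only.
MONTH_WEIGHTS = [4, 4, 6, 8, 10, 13, 15, 15, 10, 7, 4, 4]


def sample_booking(rng: random.Random) -> dict:
    """Draw one synthetic booking: check-in month, stay length, and status."""
    month = rng.choices(range(1, 13), weights=MONTH_WEIGHTS, k=1)[0]
    # 30% one-night stays; the rest is spread uniformly over 2-14 nights (assumed).
    nights = 1 if rng.random() < 0.30 else rng.randint(2, 14)
    status = rng.choices(
        list(STATUS_WEIGHTS), weights=list(STATUS_WEIGHTS.values()), k=1
    )[0]
    return {"month": month, "nights": nights, "status": status}


rng = random.Random(42)  # fixed seed so repeated runs produce the same data
bookings = [sample_booking(rng) for _ in range(10_000)]

status_counts = Counter(b["status"] for b in bookings)
status_share = {s: c / len(bookings) for s, c in status_counts.items()}
one_night_share = sum(b["nights"] == 1 for b in bookings) / len(bookings)
```

With a sample this large, `status_share` and `one_night_share` should land close to the configured percentages, which gives a quick sanity check on the generator's output.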

SCD Type 2 on DIM_HOTEL only

SCD Type 2 adds operational complexity: it requires staging tables, a two-phase SQL update, and SCD2-aware fact inserts. Applying it to every dimension would complicate the ETL more than the analytical benefit justifies.

DIM_HOTEL is the right candidate because:

  • Star rating changes (3★→4★ after renovation) directly affect revenue benchmarks
  • Chain affiliation changes (hotel joins or leaves a franchise) affect chain-level reporting
  • Tracking such changes historically is a core benefit of dimensional modelling

Guests, countries, room types, and hotel chains change rarely, or in ways that do not affect historical analysis, so SCD Type 1 (overwrite) is appropriate for them.
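The two-phase update mentioned above can be sketched as follows. This is an illustration under stated assumptions rather than the project's Oracle SQL: SQLite stands in for Oracle, and the table layout (`stg_hotel`, `valid_from`, `valid_to`, `is_current`) and the function `apply_scd2` are hypothetical names chosen for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stands in for the Oracle target
conn.executescript("""
CREATE TABLE stg_hotel (hotel_id INTEGER PRIMARY KEY, star_rating INTEGER, chain_name TEXT);
CREATE TABLE dim_hotel (
    hotel_key   INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
    hotel_id    INTEGER,                            -- source (OLTP) key
    star_rating INTEGER,
    chain_name  TEXT,
    valid_from  TEXT,
    valid_to    TEXT,                               -- NULL while the version is current
    is_current  INTEGER
);
""")


def load_staging(conn, rows):
    """Replace the staging snapshot with the latest extract from the source."""
    conn.execute("DELETE FROM stg_hotel")
    conn.executemany("INSERT INTO stg_hotel VALUES (?, ?, ?)", rows)


def apply_scd2(conn, load_date):
    params = {"load_date": load_date}
    # Phase 1: close the current version of every hotel whose tracked
    # attributes differ from staging (NULL handling omitted for brevity).
    conn.execute("""
        UPDATE dim_hotel
        SET valid_to = :load_date, is_current = 0
        WHERE is_current = 1
          AND EXISTS (
              SELECT 1 FROM stg_hotel s
              WHERE s.hotel_id = dim_hotel.hotel_id
                AND (s.star_rating <> dim_hotel.star_rating
                     OR s.chain_name <> dim_hotel.chain_name))
    """, params)
    # Phase 2: insert a new current version for every staged hotel that has
    # no current row (brand-new hotels and those just closed in phase 1).
    conn.execute("""
        INSERT INTO dim_hotel (hotel_id, star_rating, chain_name, valid_from, valid_to, is_current)
        SELECT s.hotel_id, s.star_rating, s.chain_name, :load_date, NULL, 1
        FROM stg_hotel s
        WHERE NOT EXISTS (
            SELECT 1 FROM dim_hotel d
            WHERE d.hotel_id = s.hotel_id AND d.is_current = 1)
    """, params)
    conn.commit()


load_staging(conn, [(1, 3, "Alpha"), (2, 4, "Beta")])
apply_scd2(conn, "2024-01-01")
load_staging(conn, [(1, 4, "Alpha"), (2, 4, "Beta")])  # hotel 1 upgraded from 3 to 4 stars
apply_scd2(conn, "2024-06-01")

versions = conn.execute("""
    SELECT hotel_id, star_rating, valid_from, valid_to, is_current
    FROM dim_hotel ORDER BY hotel_id, valid_from
""").fetchall()
```

After the second load, hotel 1 has two versions (a closed 3-star row and a current 4-star row), while hotel 2 keeps its single current row. An SCD2-aware fact insert would then look up the `hotel_key` whose validity interval covers the booking date.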

Watermark-based incremental fact loading

The fact table uses source_rb_id (the MySQL room_booking_id) as a natural key and applies a NOT EXISTS guard on every insert. Combined with the ETL_WATERMARK table, this makes PG-5 both incremental (it processes only new rows) and idempotent (it can be re-run without creating duplicates). The watermark-plus-guard pattern is common in production ETL, although at larger volumes the NOT EXISTS check relies on an index on source_rb_id to stay cheap.
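A minimal sketch of this load pattern is shown below. It assumes an ID-based watermark and uses a single in-memory SQLite database in place of both MySQL and Oracle. Only `source_rb_id`, `room_booking_id`, and the ETL_WATERMARK table name come from the text above; the watermark columns (`table_name`, `last_id`), the `revenue` column, and `load_increment` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE room_booking (room_booking_id INTEGER PRIMARY KEY, revenue REAL);  -- stands in for the MySQL source
CREATE TABLE fact_booking (source_rb_id INTEGER, revenue REAL);                 -- stands in for the Oracle fact table
CREATE TABLE etl_watermark (table_name TEXT PRIMARY KEY, last_id INTEGER);
INSERT INTO etl_watermark VALUES ('fact_booking', 0);
""")


def fact_count(conn):
    return conn.execute("SELECT COUNT(*) FROM fact_booking").fetchone()[0]


def load_increment(conn):
    """Copy source rows above the watermark into the fact table; return rows inserted."""
    before = fact_count(conn)
    (last_id,) = conn.execute(
        "SELECT last_id FROM etl_watermark WHERE table_name = 'fact_booking'"
    ).fetchone()
    # Incremental: only rows above the watermark are read.
    # Idempotent: the NOT EXISTS guard skips rows that are already loaded.
    conn.execute("""
        INSERT INTO fact_booking (source_rb_id, revenue)
        SELECT rb.room_booking_id, rb.revenue
        FROM room_booking rb
        WHERE rb.room_booking_id > ?
          AND NOT EXISTS (
              SELECT 1 FROM fact_booking f
              WHERE f.source_rb_id = rb.room_booking_id)
    """, (last_id,))
    # Advance the watermark to the highest source ID now present in the fact table.
    conn.execute("""
        UPDATE etl_watermark
        SET last_id = (SELECT COALESCE(MAX(source_rb_id), 0) FROM fact_booking)
        WHERE table_name = 'fact_booking'
    """)
    conn.commit()
    return fact_count(conn) - before


conn.executemany("INSERT INTO room_booking VALUES (?, ?)",
                 [(1, 120.0), (2, 80.0), (3, 200.0)])
first_run = load_increment(conn)    # initial load picks up all three rows
rerun = load_increment(conn)        # nothing new above the watermark
conn.execute("INSERT INTO room_booking VALUES (4, 150.0)")
second_run = load_increment(conn)   # only the new booking is loaded
```

The guard matters mainly when the watermark is stale, for example if a run inserted rows but failed before updating the watermark: the next run re-reads those rows and the NOT EXISTS check filters them out.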

Integer date keys in DIM_DATE

date_key is stored as NUMBER(8) in YYYYMMDD format rather than as a DATE-typed key. This allows:

  • Fast range predicates: WHERE checkin_date_key BETWEEN 20240601 AND 20240831
  • No JOIN to get the date value when it's used directly in GROUP BY
  • Human-readable values in query results without formatting
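The YYYYMMDD encoding is plain integer arithmetic, as the short Python sketch below shows. The helper names (`to_date_key`, `from_date_key`, `month_bucket`) are illustrative and not part of the project.

```python
from datetime import date


def to_date_key(d: date) -> int:
    """Encode a date as an integer in YYYYMMDD form."""
    return d.year * 10000 + d.month * 100 + d.day


def from_date_key(key: int) -> date:
    """Decode a YYYYMMDD integer back into a date."""
    return date(key // 10000, key // 100 % 100, key % 100)


def month_bucket(key: int) -> int:
    """Drop the day digits to get a YYYYMM value usable for monthly grouping."""
    return key // 100


summer_start = to_date_key(date(2024, 6, 1))
summer_end = to_date_key(date(2024, 8, 31))

# Integer order matches chronological order, so BETWEEN works as a date range.
checkin_key = to_date_key(date(2024, 7, 15))
is_summer = summer_start <= checkin_key <= summer_end
```

One caveat with this design: the keys sort like dates but do not behave like dates under arithmetic (20240131 + 1 is not a valid key), so day offsets still need a conversion through a real date type or DIM_DATE.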

Analytical Capabilities

The data mart enables the following categories of OLAP queries:

Revenue analysis:

  • Total revenue by country, city, hotel chain, star category
  • Revenue trend over time (monthly, quarterly, yearly)
  • Revenue split by booking status and room type

Occupancy analysis:

  • Room-nights sold per hotel, per season
  • Average stay duration by guest country
  • Cancellation rates by period and hotel category

SCD2-specific analysis:

  • Compare revenue performance of hotels before and after a star rating upgrade
  • Identify which hotel version (chain affiliation) was more profitable
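A before/after comparison of this kind boils down to grouping facts by hotel version. The sketch below runs such a query against SQLite with a handful of made-up rows. The column names (`hotel_key`, `revenue`, `is_current`) are assumptions about the star schema, and the figures are illustrative data, not project results.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_hotel (hotel_key INTEGER PRIMARY KEY, hotel_id INTEGER,
                        star_rating INTEGER, is_current INTEGER);
CREATE TABLE fact_booking (source_rb_id INTEGER, hotel_key INTEGER,
                           checkin_date_key INTEGER, revenue REAL);

-- Hotel 10 has two SCD2 versions: 3 stars (closed) and 4 stars (current).
INSERT INTO dim_hotel VALUES (1, 10, 3, 0), (2, 10, 4, 1);

-- Each fact row points at the version that was valid at check-in.
INSERT INTO fact_booking VALUES
    (1, 1, 20240110, 100.0),
    (2, 1, 20240215, 120.0),
    (3, 2, 20240705, 180.0),
    (4, 2, 20240812, 200.0);
""")

# Joining on the surrogate key keeps each booking attached to the hotel
# version in effect at the time, so the versions can be compared directly.
rows = conn.execute("""
    SELECT d.hotel_id, d.star_rating,
           COUNT(*)       AS bookings,
           AVG(f.revenue) AS avg_revenue
    FROM fact_booking f
    JOIN dim_hotel d ON d.hotel_key = f.hotel_key
    GROUP BY d.hotel_id, d.star_rating
    ORDER BY d.hotel_id, d.star_rating
""").fetchall()

for hotel_id, stars, bookings, avg_revenue in rows:
    print(f"hotel {hotel_id} at {stars} stars: {bookings} bookings, avg revenue {avg_revenue:.2f}")
```

Grouping by `chain_name` instead of `star_rating` would give the chain-affiliation comparison in the same way. With SCD Type 1 this query would not be possible, because the 3-star history would have been overwritten.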

Guest origin analysis:

  • Which countries generate the most bookings and revenue
  • Cross-country booking patterns (guest country vs hotel country)

Limitations and Possible Extensions

| Limitation | Possible extension |
|---|---|
| Static OLTP data (no live updates) | Add a NiFi timer to simulate ongoing bookings |
| No SCD2 on DIM_ROOM | Add room type tracking for renovation analysis |
| Single fact table | Add a second fact table for daily hotel occupancy (snapshot fact) |
| No data quality checks in NiFi | Add RouteOnAttribute + dead-letter queue for failed records |
| Oracle target is university lab | Package with Oracle XE Docker container for self-contained demo |