Conclusion

What Was Built

This project delivers a complete, working Data Warehouse pipeline for the Hotel Reservations domain:

Layer	What was built	Scale
OLTP	MySQL 8.4, 13-table normalized schema	~635,000 rows
Data generation	.NET 10 C# script, realistic seasonal distribution	500K bookings in ~3 min
ETL	Apache NiFi, 5 process groups	full + incremental loads
Data Mart	Oracle star schema, SCD Type 2 on DIM_HOTEL	1 fact + 6 dims

Design Decisions

Synthetic data generation instead of a Kaggle dataset

The decision to generate data rather than use a pre-existing dataset was deliberate. Publicly available hotel datasets are either too small (thousands of rows) or lack the normalized relational structure needed to demonstrate a realistic OLTP-to-DW pipeline. The generator produces statistically realistic data:

Seasonal booking distribution (summer peak, winter trough)
Realistic stay-length distribution (30% one-night stays)
Varied status distribution (80% completed, 10% confirmed, 7% cancelled, 3% no-show)
Revenue rates tied to actual seasonal pricing periods
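
The distributions above can be illustrated with a short weighted-sampling sketch. It is written in Python rather than the project's C#, and it is not the generator's actual code: the status shares and the 30% one-night share come from the list above, while `MONTH_WEIGHTS`, the 2 to 14 night range for longer stays, and all function names are assumptions made for illustration.

```python
import random
from collections import Counter

# Status shares taken from the list above (percentages sum to 100).
STATUS_WEIGHTS = {"completed": 80, "confirmed": 10, "cancelled": 7, "no-show": 3}

# Hypothetical monthly weights (January..December) with a summer peak and a
# winter trough. These are illustrative values, not the generator's own.
MONTH_WEIGHTS = [4, 4, 6, 8, 10, 13, 15, 14, 10, 7, 5, 4]


def sample_booking(rng):
    """Draw one synthetic booking from the weighted distributions."""
    status = rng.choices(list(STATUS_WEIGHTS), weights=list(STATUS_WEIGHTS.values()))[0]
    month = rng.choices(list(range(1, 13)), weights=MONTH_WEIGHTS)[0]
    # 30% one-night stays (from the text); the uniform 2..14 range is an assumption.
    nights = 1 if rng.random() < 0.30 else rng.randint(2, 14)
    return {"status": status, "month": month, "nights": nights}


def sample_bookings(n, seed=42):
    """Draw n bookings with a fixed seed so runs are reproducible."""
    rng = random.Random(seed)
    return [sample_booking(rng) for _ in range(n)]


def status_shares(bookings):
    """Return the observed share of each status, for checking against the targets."""
    counts = Counter(b["status"] for b in bookings)
    return {status: counts[status] / len(bookings) for status in STATUS_WEIGHTS}
```

Calling `status_shares(sample_bookings(20000))` should return shares close to the configured 80/10/7/3 split, which is a quick way to sanity-check a generator of this kind.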

SCD Type 2 on DIM_HOTEL only

SCD Type 2 adds operational complexity: it requires staging tables, a two-phase SQL update, and SCD2-aware fact inserts. Applying it to every dimension would add ETL complexity out of proportion to the analytical benefit.

DIM_HOTEL is the right candidate because:

Star rating changes (3★→4★ after renovation) directly affect revenue benchmarks
Chain affiliation changes (hotel joins or leaves a franchise) affect chain-level reporting
Preserving that history, so each booking is reported against the hotel attributes in effect at the time, is a core benefit of dimensional modelling

Guests, countries, room types, and hotel chains change rarely, or in ways that don't affect historical analysis, so SCD Type 1 (overwrite) is appropriate for them.
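
A two-phase SCD Type 2 update of the kind described in this section can be sketched as follows. This is a minimal illustration, not the project's NiFi/Oracle implementation: it uses Python's built-in sqlite3 as a stand-in database, and the table `stg_hotel` and every column name (`hotel_key`, `valid_from`, `valid_to`, `is_current`, and so on) are assumptions.

```python
import sqlite3

OPEN_END = "9999-12-31"  # assumed sentinel for "still valid"


def apply_scd2(conn, load_date):
    """Apply staged hotel rows to dim_hotel as SCD Type 2 in two phases."""
    # Phase 1: close the current version of every hotel whose tracked
    # attributes differ from staging. (NULL attributes would need extra
    # handling; the demo schema declares them NOT NULL.)
    conn.execute(
        """
        UPDATE dim_hotel
           SET valid_to = ?, is_current = 0
         WHERE is_current = 1
           AND EXISTS (SELECT 1 FROM stg_hotel s
                        WHERE s.hotel_id = dim_hotel.hotel_id
                          AND (s.star_rating <> dim_hotel.star_rating
                               OR s.chain_name <> dim_hotel.chain_name))
        """,
        (load_date,),
    )
    # Phase 2: insert a new current version for every staged hotel that has
    # no current row, i.e. brand-new hotels and those closed in phase 1.
    conn.execute(
        """
        INSERT INTO dim_hotel
               (hotel_id, star_rating, chain_name, valid_from, valid_to, is_current)
        SELECT s.hotel_id, s.star_rating, s.chain_name, ?, ?, 1
          FROM stg_hotel s
         WHERE NOT EXISTS (SELECT 1 FROM dim_hotel d
                            WHERE d.hotel_id = s.hotel_id AND d.is_current = 1)
        """,
        (load_date, OPEN_END),
    )
    conn.commit()


def demo_scd2():
    """Load a 3-star hotel, upgrade it to 4 stars, then re-run with no changes."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE stg_hotel (
            hotel_id INTEGER PRIMARY KEY,
            star_rating INTEGER NOT NULL,
            chain_name TEXT NOT NULL);
        CREATE TABLE dim_hotel (
            hotel_key INTEGER PRIMARY KEY AUTOINCREMENT,
            hotel_id INTEGER NOT NULL,
            star_rating INTEGER NOT NULL,
            chain_name TEXT NOT NULL,
            valid_from TEXT NOT NULL,
            valid_to TEXT NOT NULL,
            is_current INTEGER NOT NULL);
        """
    )
    conn.execute("INSERT INTO stg_hotel VALUES (1, 3, 'Independent')")
    apply_scd2(conn, "2023-01-01")
    conn.execute("UPDATE stg_hotel SET star_rating = 4 WHERE hotel_id = 1")
    apply_scd2(conn, "2024-06-01")
    apply_scd2(conn, "2024-06-02")  # nothing changed, so no new version appears
    return conn.execute(
        "SELECT star_rating, valid_from, valid_to, is_current "
        "FROM dim_hotel ORDER BY hotel_key"
    ).fetchall()
```

In this sketch `demo_scd2()` ends with two versions of the same hotel: the closed 3-star row and the open 4-star row. An SCD2-aware fact insert would then pick the `hotel_key` of whichever version was valid for the booking.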

Watermark-based incremental fact loading

The fact table uses source_rb_id (the MySQL room_booking_id) as a natural key and applies a NOT EXISTS guard on every insert. Combined with the ETL_WATERMARK table, this makes PG-5 both incremental (it processes only new rows) and idempotent (it can be re-run without creating duplicates). Pairing a watermark with an existence guard is a common pattern in production ETL and should carry over to a live operational source.
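
The interaction between the watermark and the NOT EXISTS guard can be sketched as follows. This is an illustration under stated assumptions rather than the actual PG-5 flow: source and target live in one in-memory SQLite database instead of MySQL and Oracle, the watermark is assumed to hold the highest loaded room_booking_id, and the names `fact_booking`, `room_booking`, `price`, `revenue`, and `last_loaded_id` are hypothetical.

```python
import sqlite3

SCHEMA = """
CREATE TABLE room_booking (room_booking_id INTEGER PRIMARY KEY, price INTEGER NOT NULL);
CREATE TABLE fact_booking (source_rb_id INTEGER NOT NULL, revenue INTEGER NOT NULL);
CREATE TABLE etl_watermark (table_name TEXT PRIMARY KEY, last_loaded_id INTEGER NOT NULL);
INSERT INTO etl_watermark VALUES ('fact_booking', 0);
"""


def load_new_facts(conn):
    """Run one incremental load and return the number of fact rows added."""
    before = conn.execute("SELECT COUNT(*) FROM fact_booking").fetchone()[0]
    (last_id,) = conn.execute(
        "SELECT last_loaded_id FROM etl_watermark WHERE table_name = 'fact_booking'"
    ).fetchone()
    # Incremental: only source rows above the watermark are considered.
    # Idempotent: the NOT EXISTS guard skips rows that are already loaded,
    # even if the watermark is stale or has been reset.
    conn.execute(
        """
        INSERT INTO fact_booking (source_rb_id, revenue)
        SELECT rb.room_booking_id, rb.price
          FROM room_booking rb
         WHERE rb.room_booking_id > ?
           AND NOT EXISTS (SELECT 1 FROM fact_booking f
                            WHERE f.source_rb_id = rb.room_booking_id)
        """,
        (last_id,),
    )
    # Advance the watermark to the highest source id now in the fact table.
    conn.execute(
        """
        UPDATE etl_watermark
           SET last_loaded_id = (SELECT COALESCE(MAX(source_rb_id), 0) FROM fact_booking)
         WHERE table_name = 'fact_booking'
        """
    )
    conn.commit()
    after = conn.execute("SELECT COUNT(*) FROM fact_booking").fetchone()[0]
    return after - before


def demo_incremental_load():
    """Initial load, re-run, incremental load, then a re-run after losing the watermark."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO room_booking VALUES (?, ?)", [(1, 100), (2, 120), (3, 90)])
    results = [load_new_facts(conn), load_new_facts(conn)]
    conn.executemany("INSERT INTO room_booking VALUES (?, ?)", [(4, 200), (5, 75)])
    results.append(load_new_facts(conn))
    conn.execute("UPDATE etl_watermark SET last_loaded_id = 0")  # simulate a lost watermark
    results.append(load_new_facts(conn))
    return results
```

In this sketch the watermark keeps each run cheap, while the guard is what keeps a re-run safe: the final call in `demo_incremental_load()` rescans every source row after the watermark reset and still inserts nothing.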

Integer date keys in DIM_DATE

date_key is stored as a NUMBER(8) in YYYYMMDD format rather than as a DATE-typed key. This allows:

Fast range predicates: WHERE checkin_date_key BETWEEN 20240601 AND 20240831
No join to DIM_DATE when the key itself is used directly in GROUP BY
Human-readable values in query results without formatting
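
The YYYYMMDD encoding is simple arithmetic, sketched below in Python with hypothetical helper names. The sketch also shows one caveat of integer keys: they sort like dates, but subtracting two keys does not give a number of days, so durations should be computed from real dates.

```python
from datetime import date


def to_date_key(d):
    """Encode a date as an integer in YYYYMMDD form, e.g. 2024-06-01 -> 20240601."""
    return d.year * 10000 + d.month * 100 + d.day


def from_date_key(key):
    """Decode a YYYYMMDD integer back into a date."""
    return date(key // 10000, key // 100 % 100, key % 100)


def in_summer_2024(checkin_date_key):
    """Same range predicate as the BETWEEN example above, as a plain integer comparison."""
    return 20240601 <= checkin_date_key <= 20240831


def nights_between(checkin_key, checkout_key):
    """Compute stay length from real dates; key subtraction would be wrong."""
    return (from_date_key(checkout_key) - from_date_key(checkin_key)).days
```

For a stay from 30 June to 1 July 2024, `nights_between(20240630, 20240701)` gives 1, whereas the raw difference between the two keys is 71.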

Analytical Capabilities

The data mart enables the following categories of OLAP queries:

Revenue analysis:

Total revenue by country, city, hotel chain, star category
Revenue trend over time (monthly, quarterly, yearly)
Revenue split by booking status and room type

Occupancy analysis:

Room-nights sold per hotel, per season
Average stay duration by guest country
Cancellation rates by period and hotel category

SCD2-specific analysis:

Compare revenue performance of hotels before and after a star-rating upgrade
Identify which hotel version (chain affiliation) was more profitable
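
A before/after comparison of this kind amounts to grouping facts by hotel version. The sketch below runs such a query against a tiny in-memory SQLite database; the schema (`fact_booking`, `dim_hotel`, `hotel_key`, `valid_from`, `revenue`) is assumed for illustration, and the three fact rows are made-up demo values, not project results.

```python
import sqlite3


def revenue_by_hotel_version(conn):
    """Total revenue and booking count for each SCD2 version of each hotel."""
    return conn.execute(
        """
        SELECT d.hotel_id, d.star_rating, d.valid_from,
               SUM(f.revenue) AS total_revenue, COUNT(*) AS bookings
          FROM fact_booking f
          JOIN dim_hotel d ON d.hotel_key = f.hotel_key
         GROUP BY d.hotel_id, d.hotel_key, d.star_rating, d.valid_from
         ORDER BY d.hotel_id, d.valid_from
        """
    ).fetchall()


def demo_revenue_comparison():
    """One hotel with a 3-star version and a later 4-star version."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE dim_hotel (
            hotel_key INTEGER PRIMARY KEY,
            hotel_id INTEGER NOT NULL,
            star_rating INTEGER NOT NULL,
            valid_from TEXT NOT NULL);
        CREATE TABLE fact_booking (hotel_key INTEGER NOT NULL, revenue INTEGER NOT NULL);
        -- Made-up demo rows: each fact points at the hotel version valid at booking time.
        INSERT INTO dim_hotel VALUES (1, 100, 3, '2023-01-01'), (2, 100, 4, '2024-06-01');
        INSERT INTO fact_booking VALUES (1, 80), (1, 90), (2, 150);
        """
    )
    return revenue_by_hotel_version(conn)
```

Because each fact row carries the surrogate key of the version that was current when the booking was made, the join needs no date-range condition; the history is already resolved at load time.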

Guest origin analysis:

Which countries generate the most bookings and revenue
Cross-country booking patterns (guest country vs hotel country)

Limitations and Possible Extensions

Limitation	Possible extension
Static OLTP data (no live updates)	Add a NiFi timer to simulate ongoing bookings
No SCD2 on DIM_ROOM	Add room type tracking for renovation analysis
Single fact table	Add a second fact table for daily hotel occupancy (snapshot fact)
No data quality checks in NiFi	Add RouteOnAttribute + dead-letter queue for failed records
Oracle target is a university lab instance	Package with an Oracle XE Docker container for a self-contained demo
