Conclusion
What Was Built
This project delivers a complete, working Data Warehouse pipeline for the Hotel Reservations domain:
| Layer | What was built | Scale |
|---|---|---|
| OLTP | MySQL 8.4, 13-table normalized schema | ~635,000 rows |
| Data generation | .NET 10 C# script, realistic seasonal distribution | 500K bookings in ~3 min |
| ETL | Apache NiFi, 5 process groups | full + incremental loads |
| Data Mart | Oracle star schema, SCD Type 2 on DIM_HOTEL | 1 fact + 6 dims |
Design Decisions
Synthetic data generation instead of a Kaggle dataset
The decision to generate data rather than use a pre-existing dataset was deliberate. Publicly available hotel datasets are either too small (thousands of rows) or lack the normalized relational structure needed to demonstrate a realistic OLTP-to-DW pipeline. The generator produces statistically realistic data:
- Seasonal booking distribution (summer peak, winter trough)
- Realistic stay-length distribution (30% one-night stays)
- Varied status distribution (80% completed, 10% confirmed, 7% cancelled, 3% no-show)
- Revenue rates tied to actual seasonal pricing periods
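The distributions listed above can be illustrated with a short sampling sketch. This is not the project's .NET generator; it is a minimal Python illustration. The status percentages and the 30% one-night share come from the list above, while the monthly weights, the 2 to 7 night range for longer stays, and all names are assumptions made for the example.

```python
import random
from collections import Counter
from datetime import date, timedelta

# Status shares are taken from the text above.
STATUS_WEIGHTS = {"completed": 80, "confirmed": 10, "cancelled": 7, "no_show": 3}

# Hypothetical relative booking weight per month (January first):
# summer peak, winter trough. The real generator's weights are not shown here.
MONTH_WEIGHTS = [4, 4, 6, 8, 10, 13, 15, 15, 10, 7, 4, 4]


def sample_booking(rng: random.Random, year: int = 2024) -> dict:
    """Draw one synthetic booking from the assumed distributions."""
    month = rng.choices(range(1, 13), weights=MONTH_WEIGHTS)[0]
    day = rng.randint(1, 28)  # stay within days that exist in every month
    # 30% one-night stays (from the text); 2-7 nights otherwise (assumption).
    nights = 1 if rng.random() < 0.30 else rng.randint(2, 7)
    status = rng.choices(
        list(STATUS_WEIGHTS), weights=list(STATUS_WEIGHTS.values())
    )[0]
    checkin = date(year, month, day)
    return {
        "checkin": checkin,
        "checkout": checkin + timedelta(days=nights),
        "nights": nights,
        "status": status,
    }


rng = random.Random(42)  # fixed seed so the sample is reproducible
bookings = [sample_booking(rng) for _ in range(10_000)]

status_share = {
    status: count / len(bookings)
    for status, count in Counter(b["status"] for b in bookings).items()
}
one_night_share = sum(b["nights"] == 1 for b in bookings) / len(bookings)
by_month = Counter(b["checkin"].month for b in bookings)
```

With a sample of this size, `status_share` and `one_night_share` should land close to the target percentages, and `by_month` should show more July than January check-ins.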
SCD Type 2 on DIM_HOTEL only
SCD Type 2 adds operational complexity: it requires staging tables, a two-phase SQL update, and SCD2-aware fact inserts. Applying it to every dimension would add ETL complexity out of proportion to the analytical benefit gained.
DIM_HOTEL is the right candidate because:
- Star rating changes (3★→4★ after renovation) directly affect revenue benchmarks
- Chain affiliation changes (hotel joins or leaves a franchise) affect chain-level reporting
- Tracking such changes historically, so that each booking is reported against the hotel attributes in effect at the time, is one of the main benefits of dimensional modelling
Guests, countries, room types, and hotel chains change rarely, or in ways that don't affect historical analysis, so SCD Type 1 (overwrite) is appropriate for those dimensions.
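The two-phase update mentioned above can be sketched as follows. This is a minimal illustration that uses Python's built-in `sqlite3` as a stand-in for Oracle. The table layout, the column names (`valid_from`, `valid_to`, `is_current`), and the staging table `stg_hotel` are assumptions for the example, not the project's actual DDL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_hotel (
    hotel_key   INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
    hotel_id    INTEGER NOT NULL,                   -- source (OLTP) key
    star_rating INTEGER NOT NULL,
    chain_name  TEXT    NOT NULL,
    valid_from  TEXT    NOT NULL,
    valid_to    TEXT,                               -- NULL means open-ended
    is_current  INTEGER NOT NULL
);
CREATE TABLE stg_hotel (hotel_id INTEGER, star_rating INTEGER, chain_name TEXT);

-- Existing 3-star version, and a staged snapshot showing the upgrade to 4 stars.
INSERT INTO dim_hotel (hotel_id, star_rating, chain_name, valid_from, valid_to, is_current)
VALUES (1, 3, 'Alpha', '2020-01-01', NULL, 1);
INSERT INTO stg_hotel VALUES (1, 4, 'Alpha');
""")


def apply_scd2(conn: sqlite3.Connection, load_date: str) -> None:
    # Phase 1: close the current version of every hotel whose tracked
    # attributes differ from the staged snapshot.
    conn.execute(
        """
        UPDATE dim_hotel
           SET valid_to = ?, is_current = 0
         WHERE is_current = 1
           AND EXISTS (SELECT 1 FROM stg_hotel s
                        WHERE s.hotel_id = dim_hotel.hotel_id
                          AND (s.star_rating <> dim_hotel.star_rating
                               OR s.chain_name <> dim_hotel.chain_name))
        """,
        (load_date,),
    )
    # Phase 2: insert a new current version for every staged hotel that
    # no longer has one (changed hotels and brand-new hotels alike).
    conn.execute(
        """
        INSERT INTO dim_hotel (hotel_id, star_rating, chain_name,
                               valid_from, valid_to, is_current)
        SELECT s.hotel_id, s.star_rating, s.chain_name, ?, NULL, 1
          FROM stg_hotel s
         WHERE NOT EXISTS (SELECT 1 FROM dim_hotel d
                            WHERE d.hotel_id = s.hotel_id
                              AND d.is_current = 1)
        """,
        (load_date,),
    )
    conn.commit()


apply_scd2(conn, "2024-06-01")
apply_scd2(conn, "2024-06-01")  # second run finds nothing to change

versions = conn.execute(
    "SELECT star_rating, valid_from, valid_to, is_current "
    "FROM dim_hotel ORDER BY hotel_key"
).fetchall()
```

After the run, `versions` holds the closed 3-star row followed by the open 4-star row. An SCD2-aware fact insert would then look up the `hotel_key` whose validity interval covers the booking date rather than always taking the current row.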
Watermark-based incremental fact loading
The fact table stores source_rb_id (the MySQL room_booking_id) as its business key and applies a NOT EXISTS guard on every insert. Combined with the ETL_WATERMARK table, this makes PG-5 both incremental (it reads only rows beyond the last recorded watermark) and idempotent (it is safe to re-run without creating duplicates). The combination of a watermark and a duplicate guard is a common pattern in production ETL and would carry over to a live operational source, where new bookings arrive between runs.
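The interaction between the watermark and the NOT EXISTS guard can be sketched as below. This is a simplified illustration using Python's built-in `sqlite3`, with source and target in one database. In the real pipeline the rows travel from MySQL to Oracle through NiFi, and the table names `src_room_booking` and `fact_booking` and the watermark column `last_id` are assumptions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE src_room_booking (room_booking_id INTEGER PRIMARY KEY, revenue REAL);
CREATE TABLE fact_booking     (source_rb_id INTEGER, revenue REAL);
CREATE TABLE etl_watermark    (table_name TEXT PRIMARY KEY, last_id INTEGER NOT NULL);
INSERT INTO etl_watermark VALUES ('fact_booking', 0);
INSERT INTO src_room_booking VALUES (1, 100.0), (2, 250.0), (3, 80.0);
""")


def fact_count(conn: sqlite3.Connection) -> int:
    return conn.execute("SELECT COUNT(*) FROM fact_booking").fetchone()[0]


def load_facts(conn: sqlite3.Connection) -> int:
    """Load source rows beyond the watermark; return how many were inserted."""
    (last_id,) = conn.execute(
        "SELECT last_id FROM etl_watermark WHERE table_name = 'fact_booking'"
    ).fetchone()
    before = fact_count(conn)
    conn.execute(
        """
        INSERT INTO fact_booking (source_rb_id, revenue)
        SELECT s.room_booking_id, s.revenue
          FROM src_room_booking s
         WHERE s.room_booking_id > ?               -- incremental: new rows only
           AND NOT EXISTS (SELECT 1                -- idempotent: skip loaded rows
                             FROM fact_booking f
                            WHERE f.source_rb_id = s.room_booking_id)
        """,
        (last_id,),
    )
    # Advance the watermark to the highest source id now present in the fact table.
    conn.execute(
        """
        UPDATE etl_watermark
           SET last_id = (SELECT COALESCE(MAX(source_rb_id), ?) FROM fact_booking)
         WHERE table_name = 'fact_booking'
        """,
        (last_id,),
    )
    conn.commit()
    return fact_count(conn) - before


first_run = load_facts(conn)    # full load of the three existing rows
second_run = load_facts(conn)   # nothing new beyond the watermark

conn.execute("INSERT INTO src_room_booking VALUES (4, 120.0)")
third_run = load_facts(conn)    # picks up only the new booking

# Simulate a lost watermark: the guard still prevents duplicates.
conn.execute("UPDATE etl_watermark SET last_id = 0")
rerun_after_reset = load_facts(conn)
```

The watermark keeps each run cheap by limiting the scan to new rows, while the guard protects against duplicates even when the watermark is reset or a run is repeated.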
Integer date keys in DIM_DATE
date_key is stored as NUMBER(8) in YYYYMMDD format rather than a FK to a DATE column. This allows:
- Fast range predicates: WHERE checkin_date_key BETWEEN 20240601 AND 20240831
- No JOIN to get the date value when it's used directly in GROUP BY
- Human-readable values in query results without formatting
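The YYYYMMDD encoding is plain arithmetic, which is why integer comparison on the key matches calendar order. A small sketch, with helper names that are hypothetical and not part of the project:

```python
from datetime import date


def to_date_key(d: date) -> int:
    """Encode a date as an integer in YYYYMMDD form."""
    return d.year * 10000 + d.month * 100 + d.day


def from_date_key(key: int) -> date:
    """Decode a YYYYMMDD integer back into a date."""
    year, rest = divmod(key, 10000)
    month, day = divmod(rest, 100)
    return date(year, month, day)


summer_start = to_date_key(date(2024, 6, 1))
summer_end = to_date_key(date(2024, 8, 31))

# A check-in on 15 July falls inside the summer range using plain integer comparison.
in_summer = summer_start <= to_date_key(date(2024, 7, 15)) <= summer_end
```

One caveat of this design is that the keys are ordered but not evenly spaced, so subtracting two keys does not give a number of days. Date arithmetic still needs DIM_DATE or a conversion back to a date.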
Analytical Capabilities
The data mart enables the following categories of OLAP queries:
Revenue analysis:
- Total revenue by country, city, hotel chain, star category
- Revenue trend over time (monthly, quarterly, yearly)
- Revenue split by booking status and room type
Occupancy analysis:
- Room-nights sold per hotel, per season
- Average stay duration by guest country
- Cancellation rates by period and hotel category
SCD2-specific analysis:
- Compare revenue performance of hotels before and after star rating upgrade
- Identify which hotel version (chain affiliation) was more profitable
Guest origin analysis:
- Which countries generate the most bookings and revenue
- Cross-country booking patterns (guest country vs hotel country)
Limitations and Possible Extensions
| Limitation | Possible extension |
|---|---|
| Static OLTP data (no live updates) | Add a NiFi timer to simulate ongoing bookings |
| No SCD2 on DIM_ROOM | Add room type tracking for renovation analysis |
| Single fact table | Add a second fact table for daily hotel occupancy (snapshot fact) |
| No data quality checks in NiFi | Add RouteOnAttribute + dead-letter queue for failed records |
| Oracle target is a university lab instance | Package with an Oracle XE Docker container for a self-contained demo |