# Conclusion

## What Was Built

This project delivers a complete, working **Data Warehouse pipeline** for the Hotel Reservations domain:

| Layer | What was built | Scale |
|-------|---------------|-------|
| OLTP | MySQL 8.4, 13-table normalized schema | ~635,000 rows |
| Data generation | .NET 10 C# script, realistic seasonal distribution | 500K bookings in ~3 min |
| ETL | Apache NiFi, 5 process groups | full + incremental loads |
| Data Mart | Oracle star schema, SCD Type 2 on DIM_HOTEL | 1 fact + 6 dims |

---

## Design Decisions

### Synthetic data generation instead of a Kaggle dataset

The decision to generate data rather than use a pre-existing dataset was deliberate. Publicly available hotel datasets are either too small (thousands of rows) or lack the normalized relational structure needed to demonstrate a realistic OLTP-to-DW pipeline. The generator produces statistically realistic data:

- Seasonal booking distribution (summer peak, winter trough)
- Realistic stay-length distribution (30% one-night stays)
- Varied status distribution (80% completed, 10% confirmed, 7% cancelled, 3% no-show)
- Revenue rates tied to actual seasonal pricing periods

### SCD Type 2 on DIM_HOTEL only

SCD Type 2 adds operational complexity: it requires staging tables, a two-phase SQL update, and SCD2-aware fact inserts. Applying it to every dimension would make the ETL unnecessarily complex for the analytical benefit gained. DIM_HOTEL is the right candidate because:

- Star rating changes (3★→4★ after renovation) directly affect revenue benchmarks
- Chain affiliation changes (hotel joins or leaves a franchise) affect chain-level reporting
- Tracking these historically is the core value proposition of dimensional modelling

Guests, countries, room types, and hotel chains all change rarely or in ways that don't affect historical analysis, so SCD Type 1 (overwrite) is appropriate.

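To make the SCD Type 2 mechanics concrete, here is a minimal sketch of the two-phase update (expire the current row, then insert a new version) together with an SCD2-aware key lookup. It is not the project's actual NiFi/Oracle implementation: it uses Python's built-in `sqlite3` so that it runs anywhere, it skips the staging tables, and the column names (`hotel_key`, `hotel_id`, `chain_name`, `valid_from`, `valid_to`, `is_current`) are assumptions chosen for illustration.

```python
import sqlite3

# In-memory stand-in for the Oracle DIM_HOTEL table (column names are hypothetical).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE dim_hotel (
        hotel_key   INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
        hotel_id    INTEGER NOT NULL,                   -- key from the OLTP source
        star_rating INTEGER,
        chain_name  TEXT,
        valid_from  TEXT,                               -- ISO dates compare correctly as text
        valid_to    TEXT,
        is_current  INTEGER
    )
""")

def apply_scd2(conn, hotel_id, star_rating, chain_name, load_date):
    """Insert a new hotel version if a tracked attribute changed; return its surrogate key."""
    row = conn.execute(
        "SELECT hotel_key, star_rating, chain_name FROM dim_hotel "
        "WHERE hotel_id = ? AND is_current = 1",
        (hotel_id,),
    ).fetchone()
    if row is not None and (row[1], row[2]) == (star_rating, chain_name):
        return row[0]  # no tracked attribute changed, keep the current version
    if row is not None:
        # Phase 1: close the validity window of the old version.
        conn.execute(
            "UPDATE dim_hotel SET valid_to = ?, is_current = 0 WHERE hotel_key = ?",
            (load_date, row[0]),
        )
    # Phase 2: insert the new version with an open-ended validity window.
    cur = conn.execute(
        "INSERT INTO dim_hotel "
        "(hotel_id, star_rating, chain_name, valid_from, valid_to, is_current) "
        "VALUES (?, ?, ?, ?, '9999-12-31', 1)",
        (hotel_id, star_rating, chain_name, load_date),
    )
    return cur.lastrowid

def hotel_key_for(conn, hotel_id, booking_date):
    """SCD2-aware lookup: surrogate key of the version valid on booking_date."""
    row = conn.execute(
        "SELECT hotel_key FROM dim_hotel "
        "WHERE hotel_id = ? AND valid_from <= ? AND ? < valid_to",
        (hotel_id, booking_date, booking_date),
    ).fetchone()
    return row[0] if row else None

k1 = apply_scd2(conn, 42, 3, "Indie", "2023-01-01")      # first version, 3 stars
k2 = apply_scd2(conn, 42, 4, "Indie", "2024-03-01")      # upgrade creates a second version
k_same = apply_scd2(conn, 42, 4, "Indie", "2024-06-01")  # unchanged, no new row
```

A fact row for a stay booked on `2023-07-15` would resolve to `k1` through `hotel_key_for`, while one booked on `2024-07-15` would resolve to `k2`. That split is what allows revenue to be compared before and after the rating change.
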
### Watermark-based incremental fact loading

The fact table uses `source_rb_id` (the MySQL `room_booking_id`) as a natural key and applies a `NOT EXISTS` guard on every insert. Combined with the `ETL_WATERMARK` table, this makes PG-5 both **incremental** (only processes new rows) and **idempotent** (safe to re-run without creating duplicates). Watermark-plus-guard loading is a common pattern in production ETL, so the same design should carry over to a live operational source.

### Integer date keys in DIM_DATE

`date_key` is stored as `NUMBER(8)` in YYYYMMDD format rather than as a foreign key to a DATE column. This allows:

- Fast range predicates: `WHERE checkin_date_key BETWEEN 20240601 AND 20240831`
- No JOIN to get the date value when it's used directly in GROUP BY
- Human-readable values in query results without formatting

---

## Analytical Capabilities

The data mart enables the following categories of OLAP queries:

**Revenue analysis:**

- Total revenue by country, city, hotel chain, star category
- Revenue trend over time (monthly, quarterly, yearly)
- Revenue split by booking status and room type

**Occupancy analysis:**

- Room-nights sold per hotel, per season
- Average stay duration by guest country
- Cancellation rates by period and hotel category

**SCD2-specific analysis:**

- Compare revenue performance of hotels before and after star rating upgrade
- Identify which hotel version (chain affiliation) was more profitable

**Guest origin analysis:**

- Which countries generate the most bookings and revenue
- Cross-country booking patterns (guest country vs hotel country)

---

## Limitations and Possible Extensions

| Limitation | Possible extension |
|------------|-------------------|
| Static OLTP data (no live updates) | Add a NiFi timer to simulate ongoing bookings |
| No SCD2 on DIM_ROOM | Add room type tracking for renovation analysis |
| Single fact table | Add a second fact table for daily hotel occupancy (snapshot fact) |
| No data quality checks in NiFi | Add RouteOnAttribute + dead-letter queue for failed records |
| Oracle target runs in a university lab | Package with Oracle XE Docker container for self-contained demo |
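
If the first extension in the table (simulated ongoing bookings) were added, the fact load would rely on the watermark and `NOT EXISTS` guard described under Design Decisions. The following is a minimal sketch of that pattern, not the project's NiFi flow. It assumes an ID-based watermark column (`last_id`), hypothetical table names (`src_room_booking`, `fact_booking`), and a single SQLite database standing in for both the MySQL source and the Oracle target.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src_room_booking (room_booking_id INTEGER PRIMARY KEY, revenue REAL);
    CREATE TABLE fact_booking (source_rb_id INTEGER, revenue REAL);
    CREATE TABLE etl_watermark (table_name TEXT PRIMARY KEY, last_id INTEGER);
    INSERT INTO etl_watermark VALUES ('fact_booking', 0);
""")

def load_increment(conn):
    """Copy source rows above the watermark into the fact table; return rows inserted."""
    (last_id,) = conn.execute(
        "SELECT last_id FROM etl_watermark WHERE table_name = 'fact_booking'"
    ).fetchone()
    cur = conn.execute(
        """
        INSERT INTO fact_booking (source_rb_id, revenue)
        SELECT s.room_booking_id, s.revenue
        FROM src_room_booking s
        WHERE s.room_booking_id > ?              -- incremental: only rows past the watermark
          AND NOT EXISTS (                       -- idempotent: skip rows already loaded
              SELECT 1 FROM fact_booking f
              WHERE f.source_rb_id = s.room_booking_id
          )
        """,
        (last_id,),
    )
    inserted = cur.rowcount
    # Advance the watermark to the highest source ID now present in the fact table.
    conn.execute(
        "UPDATE etl_watermark "
        "SET last_id = (SELECT COALESCE(MAX(source_rb_id), ?) FROM fact_booking) "
        "WHERE table_name = 'fact_booking'",
        (last_id,),
    )
    conn.commit()
    return inserted

conn.executemany("INSERT INTO src_room_booking VALUES (?, ?)", [(1, 120.0), (2, 80.0)])
first = load_increment(conn)   # loads both existing bookings
rerun = load_increment(conn)   # nothing new above the watermark

conn.execute("INSERT INTO src_room_booking VALUES (3, 200.0)")
third = load_increment(conn)   # picks up only the new booking

# Simulate a lost watermark: the NOT EXISTS guard still prevents duplicates.
conn.execute("UPDATE etl_watermark SET last_id = 0")
replay = load_increment(conn)
```

The two mechanisms cover different failure modes: the watermark keeps each run cheap by limiting the scan to new rows, while the guard keeps a replay safe even if the watermark is reset or a run is repeated.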