2026-05-17 21:27:42 +02:00
parent 718407d709
commit 571c749e25
5 changed files with 875 additions and 0 deletions

docs/05-conclusion.md Normal file

@@ -0,0 +1,83 @@
# Conclusion
## What Was Built
This project delivers a complete, working **Data Warehouse pipeline** for the Hotel Reservations domain:
| Layer | What was built | Scale |
|-------|---------------|-------|
| OLTP | MySQL 8.4, 13-table normalized schema | ~635,000 rows |
| Data generation | .NET 10 C# script, realistic seasonal distribution | 500K bookings in ~3 min |
| ETL | Apache NiFi, 5 process groups | full + incremental loads |
| Data Mart | Oracle star schema, SCD Type 2 on DIM_HOTEL | 1 fact + 6 dims |
---
## Design Decisions
### Synthetic data generation instead of a Kaggle dataset
The decision to generate data rather than use a pre-existing dataset was deliberate. Publicly available hotel datasets are either too small (thousands of rows) or lack the normalized relational structure needed to demonstrate a realistic OLTP-to-DW pipeline. The generator produces statistically realistic data:
- Seasonal booking distribution (summer peak, winter trough)
- Realistic stay-length distribution (30% one-night stays)
- Varied status distribution (80% completed, 10% confirmed, 7% cancelled, 3% no-show)
- Revenue rates tied to actual seasonal pricing periods
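
The distributions listed above can be sketched with weighted sampling. The project's generator is a C# script; the Python below is only an illustration of the idea, not that script. The status shares and the 30% one-night share come from the list above, while the monthly weights, the 2 to 14 night range, and all names (`MONTH_WEIGHTS`, `sample_booking`) are assumptions made for this example.

```python
import random
from collections import Counter

# Status shares are taken from the text above.
STATUS_WEIGHTS = {"completed": 80, "confirmed": 10, "cancelled": 7, "no-show": 3}
# Hypothetical monthly weights (Jan..Dec): summer peak, winter trough.
MONTH_WEIGHTS = [5, 5, 7, 8, 9, 12, 14, 14, 9, 7, 5, 5]


def sample_booking(rng):
    """Draw one synthetic booking as (check-in month, nights, status)."""
    month = rng.choices(range(1, 13), weights=MONTH_WEIGHTS)[0]
    # 30% one-night stays (from the text); the 2-14 night range is an assumption.
    nights = 1 if rng.random() < 0.30 else rng.randint(2, 14)
    status = rng.choices(list(STATUS_WEIGHTS), weights=list(STATUS_WEIGHTS.values()))[0]
    return month, nights, status


rng = random.Random(42)  # fixed seed so the run is reproducible
bookings = [sample_booking(rng) for _ in range(100_000)]

status_share = {s: n / len(bookings) for s, n in Counter(b[2] for b in bookings).items()}
one_night_share = sum(1 for b in bookings if b[1] == 1) / len(bookings)
by_month = Counter(b[0] for b in bookings)
```

With a sample this large, `status_share` and `one_night_share` land close to the target percentages, and `by_month` shows more bookings in June through August than in December through February.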
### SCD Type 2 on DIM_HOTEL only
SCD Type 2 adds operational complexity: it requires staging tables, a two-phase SQL update, and SCD2-aware fact inserts. Applying it to every dimension would add ETL complexity out of proportion to the analytical benefit.
DIM_HOTEL is the right candidate because:
- Star rating changes (3★→4★ after renovation) directly affect revenue benchmarks
- Chain affiliation changes (hotel joins or leaves a franchise) affect chain-level reporting
- Tracking these historically is the core value proposition of dimensional modelling
Guests, countries, room types, and hotel chains all change rarely or in ways that don't affect historical analysis, so SCD Type 1 (overwrite) is appropriate for them.
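
As an illustration of the two-phase update mentioned above, here is a minimal sketch that uses SQLite in place of Oracle so it can run anywhere. It is not the project's actual ETL SQL: the staging table `stg_hotel`, the column names (`source_hotel_id`, `valid_from`, `valid_to`, `is_current`), and the function `apply_scd2` are all assumptions, and the tracked attributes are assumed to be non-null.

```python
import sqlite3

# All table and column names are hypothetical; SQLite stands in for Oracle.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE stg_hotel (
    hotel_id    INTEGER PRIMARY KEY,
    star_rating INTEGER NOT NULL,
    chain_id    INTEGER NOT NULL
);
CREATE TABLE dim_hotel (
    hotel_key       INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
    source_hotel_id INTEGER NOT NULL,
    star_rating     INTEGER NOT NULL,
    chain_id        INTEGER NOT NULL,
    valid_from      TEXT NOT NULL,
    valid_to        TEXT,
    is_current      INTEGER NOT NULL
);
""")

# Phase 1: close the current version of every hotel whose tracked attributes changed.
EXPIRE_CHANGED = """
UPDATE dim_hotel
SET valid_to = :load_date, is_current = 0
WHERE is_current = 1
  AND EXISTS (
      SELECT 1 FROM stg_hotel s
      WHERE s.hotel_id = dim_hotel.source_hotel_id
        AND (s.star_rating <> dim_hotel.star_rating OR s.chain_id <> dim_hotel.chain_id)
  )
"""

# Phase 2: insert a new current version for hotels that have none
# (brand-new hotels and the ones just expired in phase 1).
INSERT_NEW_VERSIONS = """
INSERT INTO dim_hotel (source_hotel_id, star_rating, chain_id, valid_from, valid_to, is_current)
SELECT s.hotel_id, s.star_rating, s.chain_id, :load_date, NULL, 1
FROM stg_hotel s
WHERE NOT EXISTS (
    SELECT 1 FROM dim_hotel d
    WHERE d.source_hotel_id = s.hotel_id AND d.is_current = 1
)
"""


def apply_scd2(conn, load_date):
    """Run both phases in one transaction."""
    with conn:
        conn.execute(EXPIRE_CHANGED, {"load_date": load_date})
        conn.execute(INSERT_NEW_VERSIONS, {"load_date": load_date})


conn.execute("INSERT INTO stg_hotel VALUES (1, 3, 10)")
apply_scd2(conn, "2023-01-01")                      # initial load: one 3-star version
conn.execute("UPDATE stg_hotel SET star_rating = 4 WHERE hotel_id = 1")
apply_scd2(conn, "2024-06-01")                      # upgrade: old row closed, new row added
apply_scd2(conn, "2024-06-01")                      # re-run: nothing changed, no new version

versions = conn.execute(
    "SELECT star_rating, valid_from, valid_to, is_current FROM dim_hotel ORDER BY hotel_key"
).fetchall()
```

After these calls `versions` holds two rows for the same hotel: the closed 3-star version and the open 4-star one. An SCD2-aware fact insert would then presumably resolve `hotel_key` by matching the source hotel id against the version that was valid for the booking.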
### Watermark-based incremental fact loading
The fact table uses `source_rb_id` (the MySQL `room_booking_id`) as a natural key and applies a `NOT EXISTS` guard on every insert. Combined with the `ETL_WATERMARK` table, this makes PG-5 both **incremental** (it only processes new rows) and **idempotent** (it is safe to re-run without creating duplicates). Watermark-plus-guard loading is a common pattern in production ETL and should carry over to a live operational system.
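
The combination of watermark and `NOT EXISTS` guard can be sketched as follows. SQLite stands in for both MySQL and Oracle here, and the layout of `etl_watermark` (`process_name`, `last_id`), the `revenue` column, and the function `load_facts` are assumptions for illustration, not the project's actual schema or NiFi flow.

```python
import sqlite3

# Hypothetical schema; src_room_booking stands in for the MySQL source table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE src_room_booking (room_booking_id INTEGER PRIMARY KEY, revenue REAL);
CREATE TABLE fact_booking (source_rb_id INTEGER NOT NULL, revenue REAL);
CREATE TABLE etl_watermark (process_name TEXT PRIMARY KEY, last_id INTEGER NOT NULL);
INSERT INTO etl_watermark VALUES ('PG-5', 0);
""")

INSERT_NEW_FACTS = """INSERT INTO fact_booking (source_rb_id, revenue)
SELECT s.room_booking_id, s.revenue
FROM src_room_booking s
WHERE s.room_booking_id > ?                    -- incremental: only rows past the watermark
  AND NOT EXISTS (                             -- idempotent: skip rows already loaded
      SELECT 1 FROM fact_booking f WHERE f.source_rb_id = s.room_booking_id
  )
"""


def load_facts(conn):
    """Load new source rows, advance the watermark, return the number inserted."""
    with conn:
        (last_id,) = conn.execute(
            "SELECT last_id FROM etl_watermark WHERE process_name = 'PG-5'"
        ).fetchone()
        inserted = conn.execute(INSERT_NEW_FACTS, (last_id,)).rowcount
        (new_last,) = conn.execute(
            "SELECT COALESCE(MAX(source_rb_id), ?) FROM fact_booking", (last_id,)
        ).fetchone()
        conn.execute(
            "UPDATE etl_watermark SET last_id = ? WHERE process_name = 'PG-5'", (new_last,)
        )
    return inserted


conn.executemany("INSERT INTO src_room_booking VALUES (?, ?)", [(1, 100.0), (2, 150.0), (3, 90.0)])
first_run = load_facts(conn)        # loads the three existing rows
second_run = load_facts(conn)       # nothing new past the watermark

conn.executemany("INSERT INTO src_room_booking VALUES (?, ?)", [(4, 200.0), (5, 75.0)])
third_run = load_facts(conn)        # loads only the two new rows

# Simulate a lost watermark: the NOT EXISTS guard still prevents duplicates.
conn.execute("UPDATE etl_watermark SET last_id = 0 WHERE process_name = 'PG-5'")
after_reset = load_facts(conn)
total_facts = conn.execute("SELECT COUNT(*) FROM fact_booking").fetchone()[0]
```

The watermark keeps each run cheap by skipping rows that were already seen, while the guard covers the case where the watermark is wrong or reset. Either mechanism alone would give only one of the two properties.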
### Integer date keys in DIM_DATE
`date_key` is stored as `NUMBER(8)` in YYYYMMDD format rather than as a foreign key to a DATE column. This allows:
- Fast range predicates: `WHERE checkin_date_key BETWEEN 20240601 AND 20240831`
- No JOIN to DIM_DATE just to get the date value when the key is used directly in GROUP BY
- Human-readable values in query results without formatting
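
A small sketch of how YYYYMMDD integer keys behave. The helper names `to_date_key` and `from_date_key` are made up for this example and are not part of the project.

```python
from datetime import date


def to_date_key(d: date) -> int:
    """Encode a date as a YYYYMMDD integer."""
    return d.year * 10000 + d.month * 100 + d.day


def from_date_key(key: int) -> date:
    """Decode a YYYYMMDD integer back into a date."""
    year, rest = divmod(key, 10000)
    month, day = divmod(rest, 100)
    return date(year, month, day)


checkins = [date(2024, 5, 31), date(2024, 6, 1), date(2024, 7, 15), date(2024, 9, 1)]
keys = [to_date_key(d) for d in checkins]

# Integer order matches calendar order, so a plain BETWEEN works as a date range filter.
summer_keys = [k for k in keys if 20240601 <= k <= 20240831]
```

Here `summer_keys` contains only the keys for 1 June and 15 July. One caveat of this encoding: arithmetic on the keys is not date arithmetic (`20240131 + 1` is not a valid date), so day offsets still need a real date type or a lookup in DIM_DATE.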
---
## Analytical Capabilities
The data mart enables the following categories of OLAP queries:
**Revenue analysis:**
- Total revenue by country, city, hotel chain, star category
- Revenue trend over time (monthly, quarterly, yearly)
- Revenue split by booking status and room type
**Occupancy analysis:**
- Room-nights sold per hotel, per season
- Average stay duration by guest country
- Cancellation rates by period and hotel category
**SCD2-specific analysis:**
- Compare revenue performance of hotels before and after star rating upgrade
- Identify which hotel version (chain affiliation) was more profitable
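
A before-and-after comparison of this kind could look like the sketch below. Because each fact row points at the hotel version that was current at the time, grouping by the versioned attribute splits revenue by period. SQLite stands in for Oracle, the table and column names are assumptions, and the figures are made-up sample data, not project results.

```python
import sqlite3

# Hypothetical, heavily simplified star schema with made-up sample rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_hotel (hotel_key INTEGER PRIMARY KEY, source_hotel_id INTEGER, star_rating INTEGER);
CREATE TABLE fact_booking (hotel_key INTEGER, revenue REAL);
-- Two versions of the same source hotel: 3-star before the upgrade, 4-star after.
INSERT INTO dim_hotel VALUES (1, 100, 3), (2, 100, 4);
INSERT INTO fact_booking VALUES (1, 200.0), (1, 300.0), (2, 450.0), (2, 550.0), (2, 500.0);
""")

revenue_by_version = conn.execute("""
SELECT d.source_hotel_id, d.star_rating, COUNT(*) AS bookings, SUM(f.revenue) AS revenue
FROM fact_booking f
JOIN dim_hotel d ON d.hotel_key = f.hotel_key
GROUP BY d.source_hotel_id, d.star_rating
ORDER BY d.source_hotel_id, d.star_rating
""").fetchall()
```

With the sample rows above, `revenue_by_version` has one row per hotel version, so the 3-star and 4-star periods of hotel 100 can be compared side by side.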
**Guest origin analysis:**
- Which countries generate the most bookings and revenue
- Cross-country booking patterns (guest country vs hotel country)
---
## Limitations and Possible Extensions
| Limitation | Possible extension |
|------------|-------------------|
| Static OLTP data (no live updates) | Add a NiFi timer to simulate ongoing bookings |
| No SCD2 on DIM_ROOM | Track room type changes historically (SCD2) for renovation analysis |
| Single fact table | Add a second fact table for daily hotel occupancy (snapshot fact) |
| No data quality checks in NiFi | Add RouteOnAttribute + dead-letter queue for failed records |
| Oracle target is a university lab instance | Package with an Oracle XE Docker container for a self-contained demo |