This commit is contained in:
2026-05-17 21:27:42 +02:00
parent 718407d709
commit 571c749e25
5 changed files with 875 additions and 0 deletions

181
docs/04-setup.md Normal file
View File

@@ -0,0 +1,181 @@
# Setup Guide
## Prerequisites
| Tool | Required for | Notes |
|------|-------------|-------|
| Docker or Podman | MySQL container | Use `--podman` flag on Linux |
| .NET 10 SDK | Data generator | `dotnet run file.cs` support |
| Apache NiFi | ETL | Running instance with Oracle + MySQL JDBC drivers |
| Oracle JDBC driver | NiFi | `ojdbc11.jar` in NiFi's lib directory |
| MySQL JDBC driver | NiFi | `mysql-connector-j-*.jar` in NiFi's lib directory |
| Oracle DB access | Data mart target | University lab credentials |
---
## Step 1 — Start MySQL Container
**Linux / macOS (Docker):**
```bash
bash docker/start.sh
```
**Linux / macOS (Podman):**
```bash
bash docker/start.sh --podman
```
**Windows (PowerShell):**
```powershell
.\docker\start.ps1
```
The script:
- Creates a named container `hotel-mysql` with a persistent data volume
- Mounts `sql/schema.sql` as an init script — all 13 tables are created automatically on first start
- Waits until MySQL is ready before exiting
**Connection details:**
```
Host: 127.0.0.1
Port: 13306
Database: hotel_reservations
User: root
Password: hotel2025root
```
---
## Step 2 — Generate OLTP Data
```bash
dotnet run generator/generate.cs
```
**Runtime:** ~3 minutes
**Output:** 635,000+ rows across 13 tables
The generator is deterministic (fixed seed `42`) — running it twice on an empty database produces the same data.
> **Important:** Run the generator only once on an empty database. If you need to restart, truncate all tables first (respecting FK order) or drop and recreate the container + volume.
### Quick table verification after generation:
```bash
# Docker
docker exec hotel-mysql mysql -uroot -photel2025root hotel_reservations \
-e "SELECT table_name, table_rows FROM information_schema.tables WHERE table_schema='hotel_reservations';"
# Podman
podman exec hotel-mysql mysql -uroot -photel2025root hotel_reservations \
-e "SELECT table_name, table_rows FROM information_schema.tables WHERE table_schema='hotel_reservations';"
```
---
## Step 3 — Prepare Oracle Data Mart
Connect to the Oracle schema (university lab) and execute `sql/datamart_schema.sql`.
The script creates:
- `ETL_WATERMARK` (with initial row for `FACT_ROOM_BOOKING`)
- `STG_HOTEL` (staging)
- All 7 dimension tables
- `FACT_ROOM_BOOKING`
```sql
-- Run in SQL*Plus or SQL Developer:
@datamart_schema.sql
```
---
## Step 4 — Configure NiFi
### 4.1 Add JDBC drivers to NiFi
Copy the following JARs to `$NIFI_HOME/lib/` (or the NiFi extensions directory):
- `mysql-connector-j-8.x.jar`
- `ojdbc11.jar`
Restart NiFi after adding drivers.
### 4.2 Create Controller Services
In NiFi UI → Controller Settings → Controller Services:
**MySQL connection:**
- Type: `DBCPConnectionPool`
- Database Driver Class Name: `com.mysql.cj.jdbc.Driver`
- Database Connection URL: `jdbc:mysql://127.0.0.1:13306/hotel_reservations`
- Database User: `root`
- Password: `hotel2025root`
**Oracle connection:**
- Type: `DBCPConnectionPool`
- Database Driver Class Name: `oracle.jdbc.OracleDriver`
- Database Connection URL: `jdbc:oracle:thin:@<host>:1521:<sid>`
- Database User: `<your_schema>`
- Password: `<your_password>`
Enable both services.
### 4.3 Build Process Groups
Follow the detailed processor configuration in `docs/nifi-flow.md`.
**Recommended build order:**
1. PG-1: Date Dimension (simplest, test first)
2. PG-2: Static Dimensions (verify MERGE logic)
3. PG-3: DIM_HOTEL SCD2 (most complex — check staging table after run)
4. PG-4: DIM_GUEST SCD1
5. PG-5: Fact Incremental Load
---
## Step 5 — Run ETL
### First full load
1. Run **PG-1** (Date Dimension) manually — run once
2. Start **PG-2, PG-3, PG-4** — these are idempotent, safe to re-run
3. Start **PG-5** — runs incrementally; first run loads all 531k room_bookings
### Verify load
```sql
-- Oracle
SELECT COUNT(*) FROM DIM_HOTEL; -- should be 200 (+ more after SCD2 changes)
SELECT COUNT(*) FROM DIM_GUEST; -- 100,000
SELECT COUNT(*) FROM FACT_ROOM_BOOKING; -- 531,382
SELECT last_key FROM ETL_WATERMARK WHERE entity_name = 'FACT_ROOM_BOOKING'; -- 531,382
```
### Verify SCD2 is working
```sql
-- Should show 1 current version per hotel on initial load
SELECT is_current, COUNT(*) FROM DIM_HOTEL GROUP BY is_current;
-- Expected: IS_CURRENT=1, COUNT=200
```
---
## Stop / Restart
**Stop MySQL (preserves data):**
```bash
bash docker/stop.sh [--podman]
```
**Restart MySQL:**
```bash
bash docker/start.sh [--podman]
```
**Full reset (delete all data):**
```bash
bash docker/stop.sh --podman
podman volume rm hotel-mysql-data
bash docker/start.sh --podman
dotnet run generator/generate.cs
```