intel-retail · wiwaszko-intel · Jun 17, 2026 · Jun 16, 2026 · Jun 17, 2026
diff --git a/docs/user-guide/order-accuracy/dine-in/get-started.md b/docs/user-guide/order-accuracy/dine-in/get-started.md
@@ -8,9 +8,16 @@ This guide walks you through installation, configuration, and first run of the D
 
 - Docker 24.0+ with Compose V2
 - Intel GPU with drivers installed
-- 16 GB RAM minimum (32 GB recommended)
+- 16 GB RAM minimum (64 GB recommended for production)
 - 50 GB free disk space
 
+> **Note (RAM and KV cache on iGPU / low-RAM systems):** 16 GB RAM is sufficient for
+> **inference**. For first-time model export, a higher-memory host (48–64 GB) is recommended.
+> On iGPU platforms, the KV cache is allocated from **system RAM**, so on those systems run
+> `export CACHE_SIZE=2` before `setup_models.sh` to reduce the KV cache from the default 4 GB
+> to 2 GB.
+> See [ovms-service/README.md — Tuning the KV Cache Size](https://github.com/intel-retail/order-accuracy/blob/main/ovms-service/README.md#tuning-the-kv-cache-size) for a full per-platform guide.
+
 ```bash
 docker --version
 docker compose version
@@ -154,7 +161,7 @@ Key variables:
 | `BENCHMARK_WORKERS`           | 1       | Concurrent workers     |
 | `BENCHMARK_DURATION`          | 180     | Duration (seconds)     |
 | `BENCHMARK_TARGET_LATENCY_MS` | 25000   | Latency threshold (ms) |
-| `TARGET_DEVICE`               | GPU     | Device: CPU, GPU, NPU  |
+| `TARGET_DEVICE`               | GPU     | Device: CPU, GPU       |
 
 ### Stream Density Test
 
@@ -178,7 +185,7 @@ make plot-metrics          # Generate visualisation plots
 
 ## Changing Inference Device
 
-To switch between GPU, CPU, or NPU, update `TARGET_DEVICE` in `.env` and re-run model setup:
+To switch between GPU and CPU, update `TARGET_DEVICE` in `.env` and re-run model setup:
 
 ```bash
 # In .env
@@ -231,6 +238,17 @@ make help                            # All commands
 
 ---
 
+## Pre-Deployment Checklist
+
+- [ ] Docker and Docker Compose installed and working
+- [ ] Intel GPU drivers installed and GPU visible to Docker
+- [ ] Required ports available (7861, 8083, 8002, 8081, 8084)
+- [ ] At least 50 GB free disk space
+- [ ] **16 GB+ RAM available** (sufficient for inference; 48–64 GB is recommended for first-time model export, so export on a high-RAM host and copy `ovms-service/models/` to the target system)
+- [ ] VLM model downloaded (`setup_models.sh` completed)
+- [ ] `.env` file created (`make init-env`)
+- [ ] Plate images placed in `images/` and `configs/orders.json` updated
+
 ## Next Steps
 
 - [System Requirements](./get-started/system-requirements.md) - Check the requirements

diff --git a/docs/user-guide/order-accuracy/dine-in/get-started/build-from-source.md b/docs/user-guide/order-accuracy/dine-in/get-started/build-from-source.md
@@ -35,7 +35,7 @@ dine-in/
 │   └── inventory.json               # Known menu items
 ├── images/                          # Test plate images (user-supplied)
 ├── results/                         # Benchmark output
-├── docker-compose.yml
+├── docker-compose.yaml
 ├── Dockerfile                       # python:3.13-slim base
 ├── Makefile
 └── requirements.txt

diff --git a/docs/user-guide/order-accuracy/dine-in/get-started/system-requirements.md b/docs/user-guide/order-accuracy/dine-in/get-started/system-requirements.md
@@ -8,23 +8,35 @@ Hardware, software, and network requirements for deploying Dine-In Order Accurac
 
 ### Development / Single Station
 
-| Component | Requirement                                    |
-| --------- | ---------------------------------------------- |
-| CPU       | 8+ cores                                       |
-| RAM       | 16 GB                                          |
-| GPU       | Intel Arc A770 (16 GB) or equivalent Intel GPU |
-| Storage   | 50 GB SSD                                      |
-
-### Production
-
-| Component | Requirement                                       |
-| --------- | ------------------------------------------------- |
-| CPU       | 16+ cores                                         |
-| RAM       | 32 GB                                             |
-| GPU       | Intel Data Center GPU (for concurrent validation) |
-| Storage   | 200 GB NVMe SSD                                   |
-
-**GPU VRAM guidance:** The Qwen2.5-VL-7B INT8 model requires ~6–8 GB of VRAM.
+| Component | Specification                                                              |
+| --------- | -------------------------------------------------------------------------- |
+| CPU       | 8+ cores                                                                   |
+| RAM       | 16 GB min; 64 GB recommended for production / heavy model export workloads |
+| GPU       | Intel® Arc™ A770 (16 GB) or equivalent Intel GPU                           |
+| Storage   | 50 GB SSD                                                                  |
+
+### Production / Multi-Station
+
+| Component | Specification                                      |
+| --------- | -------------------------------------------------- |
+| CPU       | 16+ cores                                          |
+| RAM       | 64 GB                                              |
+| GPU       | Intel® Data Center GPU (for concurrent validation) |
+| Storage   | 200 GB NVMe SSD                                    |
+
+**GPU VRAM guidance:** The Qwen2.5-VL-7B INT8 model requires ~8 GB of VRAM.
+The default `cache_size=4` reserves an additional 4 GB VRAM for the KV cache. Total VRAM needed
+is around 12 GB, which fits in an Intel® Arc™ A770 16 GB. On **integrated GPU** (iGPU)
+platforms such as Wildcat Lake and Meteor Lake, the KV cache is drawn from **system RAM**
+instead of dedicated VRAM; in such a case, use a smaller value (e.g. `CACHE_SIZE=2`) to avoid
+exhausting system RAM. Set `export CACHE_SIZE=<N>` before running `setup_models.sh`. For a
+full per-platform sizing table and step-by-step instructions see [ovms-service/README.md — Tuning the KV Cache Size](https://github.com/intel-retail/order-accuracy/blob/main/ovms-service/README.md#tuning-the-kv-cache-size).
+
+> **Model Export RAM Note:** 16 GB system RAM is sufficient for **inference-only**
+> deployments. For first-time model export (`setup_models.sh` INT8 quantization), a
+> higher-memory host (48–64 GB recommended) avoids potential out-of-memory (OOM) failures and corrupt IR files; export
+> once there and copy `ovms-service/models/` to the target system. If you must export on 16 GB,
+> set `export CACHE_SIZE=2` first. See [ovms-service/README.md — Tuning the KV Cache Size](https://github.com/intel-retail/order-accuracy/blob/main/ovms-service/README.md#tuning-the-kv-cache-size) for details.
 
 ## Software Requirements
 
@@ -41,7 +53,7 @@ Ubuntu 22.04 LTS is the validated platform (matches the `python:3.13-slim` base
 
 ### GPU Drivers
 
-Intel GPU drivers must be installed from <https://dgpu-docs.intel.com/driver/installation.html>. Verify the GPU is accessible to Docker:
+Intel GPU drivers must be installed from [packages.intel.com](https://packages.intel.com). Verify the GPU is accessible to Docker:
 
 ```bash
 ls /dev/dri/
@@ -64,3 +76,11 @@ Expected output includes `GPU`.
 | OVMS VLM          | 8002 | Model inference (external)   |
 | Semantic Service  | 8081 | Semantic matching (external) |
 | Metrics Collector | 8084 | System metrics               |
+
+---
+
+## Next Steps
+
+- [Get Started](../get-started.md) - Set up and run the application
+- [API Reference](../api-reference.md) - REST endpoint documentation
+- [How to Build](./build-from-source.md) - Build from source code
diff --git a/docs/user-guide/order-accuracy/dine-in/how-to-use.md b/docs/user-guide/order-accuracy/dine-in/how-to-use.md
@@ -2,7 +2,7 @@
 
 Guide to using the Dine-In Order Accuracy application features.
 
-> **Note — `TARGET_DEVICE`**: To change the inference device, set `TARGET_DEVICE` in `.env` to `GPU`, `CPU`, or `NPU`, then re-run setup:
+> **Note — `TARGET_DEVICE`**: To change the inference device, set `TARGET_DEVICE` in `.env` to `GPU` or `CPU`, then re-run setup:
 >
 > ```bash
 > cd ../ovms-service && ./setup_models.sh --app dine-in && cd ../dine-in
@@ -257,7 +257,7 @@ Configuration options:
 | `BENCHMARK_INIT_DURATION`     | 60      | Warmup time (seconds)            |
 | `BENCHMARK_MIN_REQUESTS`      | 3       | Min requests before measuring    |
 | `BENCHMARK_REQUEST_TIMEOUT`   | 300     | Request timeout (seconds)        |
-| `TARGET_DEVICE`               | GPU     | Target device: CPU, GPU, NPU     |
+| `TARGET_DEVICE`               | GPU     | Target device: CPU, GPU          |
 | `RESULTS_DIR`                 | results | Output directory                 |
 | `REGISTRY`                    | false   | Use registry images (true/false) |
 

diff --git a/docs/user-guide/order-accuracy/take-away/get-started.md b/docs/user-guide/order-accuracy/take-away/get-started.md
@@ -25,10 +25,19 @@ For detailed hardware and software requirements, see the [System Requirements](.
 | Component | Minimum              | Recommended          |
 | --------- | -------------------- | -------------------- |
 | CPU       | Intel Xeon 8 cores   | Intel Xeon 16+ cores |
-| RAM       | 16GB                 | 32GB+                |
+| RAM       | 16GB                 | 64GB+                |
 | GPU       | Intel Arc A770 (8GB) | Intel Arc            |
 | Storage   | 50GB SSD             | 200GB NVMe           |
 
+> **RAM note:** 16 GB system RAM is sufficient for **inference**. For first-time model
+> export (`setup_models.sh`), a higher-memory host (48–64 GB recommended) avoids potential
+> out-of-memory (OOM) failures; export there and copy `ovms-service/models/` to the target
+> system. 64 GB+ is recommended for production or multi-station deployments.
+
+> **KV Cache on iGPU / low-RAM systems:** On iGPU platforms the KV cache is allocated from
+> **system RAM**. Set `export CACHE_SIZE=2` before running `setup_models.sh` to reduce KV cache
+> to 2 GB (default is 4 GB). See [ovms-service/README.md — Tuning the KV Cache Size](https://github.com/intel-retail/order-accuracy/blob/main/ovms-service/README.md#tuning-the-kv-cache-size) for a full per-platform guide.
+
 ### Software Requirements
 
 | Software         | Version | Purpose               |
@@ -112,7 +121,7 @@ make up
 VLM_BACKEND=ovms
 OVMS_ENDPOINT=http://ovms-vlm:8000
 OVMS_MODEL_NAME=Qwen/Qwen2.5-VL-7B-Instruct
-TARGET_DEVICE=GPU            # 'GPU', 'CPU', or 'NPU' — also set OPENVINO_DEVICE to match
+TARGET_DEVICE=GPU            # 'GPU' or 'CPU' — also set OPENVINO_DEVICE to match
 
 # =============================================================================
 # Inference Device (must match TARGET_DEVICE)
@@ -135,7 +144,7 @@ MINIO_ROOT_PASSWORD=<your-minio-password>
 MINIO_ENDPOINT=minio:9000
 ```
 
-> **Changing the inference device:** Set both `TARGET_DEVICE` and `OPENVINO_DEVICE` to the same value (`GPU`, `CPU`, or `NPU`), then re-run `./setup_models.sh` to re-export the model for that device.
+> **Changing the inference device:** Set both `TARGET_DEVICE` and `OPENVINO_DEVICE` to the same value (`GPU` or `CPU`), then re-run `./setup_models.sh` to re-export the model for that device.
 
 ### Validate Configuration
 
@@ -312,6 +321,17 @@ make benchmark-stream-density  # Run stream density benchmark
 
 ---
 
+## Pre-Deployment Checklist
+
+- [ ] Docker and Docker Compose installed and working
+- [ ] Intel GPU drivers installed and GPU visible to Docker
+- [ ] Required ports available (8000, 7860, 8001, 9000, 9001, 8080)
+- [ ] At least 50 GB free disk space
+- [ ] **16 GB+ RAM available** (sufficient for inference; 48–64 GB is recommended for first-time model export, so export on a high-RAM host and copy `ovms-service/models/` to the target system)
+- [ ] VLM model downloaded (`setup_models.sh` completed)
+- [ ] `.env` file configured
+- [ ] Camera RTSP URLs accessible from host (parallel mode)
+
 ## Next Steps
 
 - [System Requirements](./get-started/system-requirements.md) - Check the detailed requirements

diff --git a/docs/user-guide/order-accuracy/take-away/get-started/system-requirements.md b/docs/user-guide/order-accuracy/take-away/get-started/system-requirements.md
@@ -6,25 +6,37 @@ Hardware, software, and network requirements for deploying Take-Away Order Accur
 
 ## Hardware Requirements
 
-### Development / Single-Station
+### Development / Single Station
 
-| Component   | Specification                                  |
-| ----------- | ---------------------------------------------- |
-| **CPU**     | 8+ cores                                       |
-| **RAM**     | 16 GB                                          |
-| **GPU**     | Intel Arc A770 (16 GB) or equivalent Intel GPU |
-| **Storage** | 50 GB SSD                                      |
+| Component   | Specification                                                              |
+| ----------- | -------------------------------------------------------------------------- |
+| **CPU**     | 8+ cores                                                                   |
+| **RAM**     | 16 GB min; 64 GB recommended for production / heavy model export workloads |
+| **GPU**     | Intel® Arc™ A770 (16 GB) or equivalent Intel GPU                           |
+| **Storage** | 50 GB SSD                                                                  |
 
 ### Production / Multi-Station
 
-| Component   | Specification                                                  |
-| ----------- | -------------------------------------------------------------- |
-| **CPU**     | 16+ cores                                                      |
-| **RAM**     | 32 GB                                                          |
-| **GPU**     | Intel Data Center GPU Max (48 GB) — for 4+ concurrent stations |
-| **Storage** | 200 GB NVMe SSD                                                |
-
-**GPU VRAM guidance:** The Qwen2.5-VL-7B INT8 model requires ~6–8 GB of VRAM. Reserve at least 8 GB for the VLM; additional VRAM headroom allows more concurrent requests.
+| Component   | Specification                                                   |
+| ----------- | --------------------------------------------------------------- |
+| **CPU**     | 16+ cores                                                       |
+| **RAM**     | 64 GB                                                           |
+| **GPU**     | Intel® Data Center GPU Max (48 GB) — for 4+ concurrent stations |
+| **Storage** | 200 GB NVMe SSD                                                 |
+
+**GPU VRAM guidance:** The Qwen2.5-VL-7B INT8 model requires ~8 GB of VRAM.
+The default `cache_size=4` reserves an additional 4 GB VRAM for the KV cache. Total VRAM needed
+is around 12 GB, which fits in an Intel® Arc™ A770 16 GB. On **integrated GPU** (iGPU)
+platforms such as Wildcat Lake and Meteor Lake, the KV cache is drawn from **system RAM**
+instead of dedicated VRAM; in such a case, use a smaller value (e.g. `CACHE_SIZE=2`) to avoid
+exhausting system RAM. Set `export CACHE_SIZE=<N>` before running `setup_models.sh`. For a
+full per-platform sizing table and step-by-step instructions see [ovms-service/README.md — Tuning the KV Cache Size](https://github.com/intel-retail/order-accuracy/blob/main/ovms-service/README.md#tuning-the-kv-cache-size).
+
+> **Model Export RAM Note:** 16 GB system RAM is sufficient for **inference-only**
+> deployments. For first-time model export (`setup_models.sh` INT8 quantization), a
+> higher-memory host (48–64 GB recommended) avoids potential out-of-memory (OOM) failures and corrupt IR files; export
+> once there and copy `ovms-service/models/` to the target system. If you must export on 16 GB,
+> set `export CACHE_SIZE=2` first. See [ovms-service/README.md — Tuning the KV Cache Size](https://github.com/intel-retail/order-accuracy/blob/main/ovms-service/README.md#tuning-the-kv-cache-size) for details.
 
 ---
 
@@ -77,3 +89,11 @@ Expected output includes `GPU`.
 | Codec       | H.264             |
 | Resolution  | 720p–1080p        |
 | Frame Rate  | 15–30 FPS         |
+
+---
+
+## Next Steps
+
+- [Get Started](../get-started.md) - Set up and run the application
+- [API Reference](../api-reference.md) - REST endpoint documentation
+- [How to Build](./build-from-source.md) - Build from source code
diff --git a/docs/user-guide/order-accuracy/take-away/ta-benchmarking.md b/docs/user-guide/order-accuracy/take-away/ta-benchmarking.md
@@ -2,7 +2,7 @@
 
 This guide covers performance testing, stream density benchmarking, and metrics collection for the Take-Away Order Accuracy system.
 
-> **Note — Inference Device**: The default device is `GPU`. To switch to a different device (`CPU` or `NPU`), you must do **both** steps below, otherwise the model will be exported for the wrong device:
+> **Note — Inference Device:** The default device is `GPU`. To switch to `CPU`, you must do **both** steps below, otherwise the model will be exported for the wrong device:
 >
 > 1. Set **both** variables in your `.env` file:
 >