Commit 5e424bf

Author: Brendan Gray (committed)
fix: remove CPU context cap - CPU fallback now gets same RAM-based context as GPU
CPU mode artificially capped context at 8192, causing models that fell back to CPU (e.g. a 4B Q8 model exceeding 4GB VRAM) to get crushed to ~1792 context via autoContextSizeShrink. With 32GB RAM, a 4B model should get far more.

Changes:
- Remove the 'if (mode === false) maxCtx = Math.min(maxCtx, 8192)' cap
- Equalize CPU contextMin to MIN_USABLE_GPU_CONTEXT (8192)
- Add diagnostic logging to _computeMaxContext()
1 parent eae9034 commit 5e424bf

1 file changed: main/llmEngine.js (5 additions, 4 deletions)
```diff
@@ -319,9 +319,8 @@ class LLMEngine extends EventEmitter {
     // Now try to create context on this model
     const ctxTimeout = mode === false ? CTX_CREATE_TIMEOUT_CPU : CTX_CREATE_TIMEOUT_GPU;
     let maxCtx = this._computeMaxContext(gpuConfig.modelSizeGB);
-    // CPU mode: cap context for responsive generation
-    if (mode === false) maxCtx = Math.min(maxCtx, 8192);
-    const contextMin = (mode === false) ? 512 : MIN_USABLE_GPU_CONTEXT;
+    // CPU mode uses same RAM-based context sizing as GPU — no artificial cap
+    const contextMin = MIN_USABLE_GPU_CONTEXT;
     console.log(`[LLM DIAG] Context creation: mode=${mode}, maxCtx=${maxCtx}, contextMin=${contextMin}, modelSizeGB=${gpuConfig.modelSizeGB.toFixed(2)}`);
     loadedContext = await this._withTimeout(
       loadedModel.createContext({
@@ -499,7 +498,9 @@ class LLMEngine extends EventEmitter {
     const kvPerToken = modelSizeGB < 2 ? 0.5 : modelSizeGB < 8 ? 1.0 : 2.0; // KB per token estimate
     const availableForKV = Math.max(0, freeRam - 2 * 1024 ** 3); // reserve 2GB RAM
     const maxFromRam = Math.floor(availableForKV / (kvPerToken * 1024));
-    return Math.min(CONTEXT_ABSOLUTE_CEILING, Math.max(2048, maxFromRam));
+    const result = Math.min(CONTEXT_ABSOLUTE_CEILING, Math.max(2048, maxFromRam));
+    console.log(`[LLM] _computeMaxContext: modelSize=${modelSizeGB.toFixed(2)}GB, freeRam=${(freeRam / 1024 ** 3).toFixed(1)}GB, kvPerToken=${kvPerToken}KB, availableForKV=${(availableForKV / 1024 ** 3).toFixed(1)}GB, maxFromRam=${maxFromRam}, result=${result}`);
+    return result;
   }

   // ─── Generation ───
```
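The RAM-based sizing that now applies to both CPU and GPU paths can be sketched as a standalone function. This is a sketch only: the heuristics (KB-per-token steps, 2GB reserve, 2048 floor) come from the diff, but the value of CONTEXT_ABSOLUTE_CEILING shown here (131072) is an assumption for illustration, not the actual constant from main/llmEngine.js.

```javascript
// Sketch of _computeMaxContext()'s RAM-based sizing.
// CONTEXT_ABSOLUTE_CEILING = 131072 is an ASSUMED value for illustration.
const CONTEXT_ABSOLUTE_CEILING = 131072;

function computeMaxContext(modelSizeGB, freeRamBytes) {
  // Rough KV-cache cost per token (KB), stepped by model size.
  const kvPerToken = modelSizeGB < 2 ? 0.5 : modelSizeGB < 8 ? 1.0 : 2.0;
  // Reserve 2GB of RAM for the rest of the process.
  const availableForKV = Math.max(0, freeRamBytes - 2 * 1024 ** 3);
  // Bytes available divided by bytes per token (kvPerToken KB * 1024).
  const maxFromRam = Math.floor(availableForKV / (kvPerToken * 1024));
  // Clamp: never below 2048, never above the absolute ceiling.
  return Math.min(CONTEXT_ABSOLUTE_CEILING, Math.max(2048, maxFromRam));
}

// With the 8192 CPU cap removed, a 4GB model on a machine with 32GB of
// free RAM is limited only by the ceiling, not crushed to a small context:
console.log(computeMaxContext(4, 32 * 1024 ** 3)); // 131072
```

This illustrates the commit's point: under the old code the same 4B model falling back to CPU was capped at 8192 before autoContextSizeShrink ran; with the cap removed, the RAM budget alone decides.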
