Overview
Add runtime detection and adaptive code generation to support all vector widths (128, 256, 512 bits) based on available hardware. Currently hardcoded to Vector256 only.
Parent issue: #545
Current State
```csharp
// ILKernelGenerator.cs - HARDCODED to Vector256
private static int GetVectorCount<T>() => Vector256<T>.Count;

// EmitVectorLoad()      → typeof(Vector256).GetMethod("Load", ...)
// EmitVectorStore()     → typeof(Vector256).GetMethod("Store", ...)
// EmitVectorOperation() → typeof(Vector256<T>).GetMethod("op_Addition", ...)
```
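One subtlety worth noting: `Vector256<T>.Count` is determined by the type, not the CPU, and .NET executes `Vector256` code via software fallback when AVX2 is absent. A small standalone check (not NumSharp code) illustrates why the hardcoded path silently loses the SIMD benefit on narrower hardware:

```csharp
// Standalone sketch: Vector256<int>.Count is fixed by the type, not the hardware.
// On a CPU without AVX2, Vector256 code still runs via software fallback, so the
// hardcoded generator emits "vector" code with no actual SIMD speedup.
using System;
using System.Runtime.Intrinsics;

int hardcodedCount = Vector256<int>.Count;          // always 8 (256 / 32), on any CPU
bool accelerated = Vector256.IsHardwareAccelerated; // false on SSE2-only x64 or ARM

Console.WriteLine($"Vector256<int>.Count = {hardcodedCount}, accelerated = {accelerated}");
```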
Problem
| Hardware | Vector Support | Current NumSharp | Issue |
|---|---|---|---|
| Intel Xeon, AMD Zen4 | V512 ✓ | Uses V256 | Missing 2× speedup |
| Most consumer CPUs | V256 ✓ | Uses V256 | OK |
| Older CPUs, ARM | V128 only | Crashes or falls back to scalar | No SIMD benefit |
| No SIMD | None | Falls back to scalar | OK |
Solution: Runtime Detection + Parameterized Emission
Step 1: Detect Hardware Once at Startup
```csharp
public static class ILKernelGenerator
{
    /// <summary>
    /// Detected vector width at startup. Checked once, used forever.
    /// 512, 256, 128, or 0 (no SIMD).
    /// </summary>
    public static readonly int VectorBits =
        Vector512.IsHardwareAccelerated ? 512 :
        Vector256.IsHardwareAccelerated ? 256 :
        Vector128.IsHardwareAccelerated ? 128 : 0;
}
```
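The cascade order matters: AVX-512 machines also report `Vector256` and `Vector128` as accelerated, so the widest check must come first. A standalone sketch (assuming .NET 8+, where `Vector512` exists) that factors the cascade into a testable helper:

```csharp
// Standalone sketch of the detection cascade: the widest accelerated width wins.
// The helper takes the three flags explicitly so the priority order can be
// exercised on any machine, independent of the host CPU.
using System;
using System.Runtime.Intrinsics;

static int SelectVectorBits(bool v512, bool v256, bool v128) =>
    v512 ? 512 :
    v256 ? 256 :
    v128 ? 128 : 0;

// On real hardware the flags come from the intrinsics API:
int vectorBits = SelectVectorBits(
    Vector512.IsHardwareAccelerated,
    Vector256.IsHardwareAccelerated,
    Vector128.IsHardwareAccelerated);

Console.WriteLine($"VectorBits = {vectorBits}");
```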
Step 2: Parameterize Helper Methods
```csharp
/// <summary>
/// Get the element count for the current hardware's vector width.
/// </summary>
private static int GetVectorCount(NPTypeCode type)
{
    int typeSize = GetTypeSize(type);
    return VectorBits / (typeSize * 8); // bits / bits-per-element
}
// VectorBits=512, Int32 → 512/32 = 16 elements
// VectorBits=256, Int32 → 256/32 = 8 elements
// VectorBits=128, Int32 → 128/32 = 4 elements

/// <summary>
/// Get the non-generic vector container type (Vector128, Vector256, or Vector512).
/// </summary>
private static Type GetVectorContainerType() => VectorBits switch
{
    512 => typeof(Vector512),
    256 => typeof(Vector256),
    128 => typeof(Vector128),
    _ => throw new NotSupportedException("No SIMD support")
};

/// <summary>
/// Get the closed generic vector type (Vector128<T>, Vector256<T>, or Vector512<T>)
/// for the current width.
/// </summary>
private static Type GetVectorType(Type elementType) => VectorBits switch
{
    512 => typeof(Vector512<>).MakeGenericType(elementType),
    256 => typeof(Vector256<>).MakeGenericType(elementType),
    128 => typeof(Vector128<>).MakeGenericType(elementType),
    _ => throw new NotSupportedException("No SIMD support")
};

/// <summary>
/// Check whether SIMD is available for this element type.
/// </summary>
private static bool CanUseSimd(NPTypeCode type)
{
    if (VectorBits == 0) return false; // no SIMD hardware
    return type switch
    {
        NPTypeCode.Byte => true,
        NPTypeCode.Int16 or NPTypeCode.UInt16 => true,
        NPTypeCode.Int32 or NPTypeCode.UInt32 => true,
        NPTypeCode.Int64 or NPTypeCode.UInt64 => true,
        NPTypeCode.Single or NPTypeCode.Double => true,
        _ => false // Boolean, Char, Decimal - no SIMD
    };
}
```
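The width-to-count arithmetic can be sanity-checked in isolation. This standalone sketch inlines the element size via `sizeof` (a stand-in for the real `GetTypeSize` helper, which maps `NPTypeCode` values to byte sizes):

```csharp
// Standalone sketch of the GetVectorCount arithmetic: elements per vector =
// vector width in bits / element size in bits.
using System;

static int ElementCount(int vectorBits, int typeSizeBytes) =>
    vectorBits / (typeSizeBytes * 8);

Console.WriteLine(ElementCount(512, sizeof(int)));    // 16
Console.WriteLine(ElementCount(256, sizeof(int)));    // 8
Console.WriteLine(ElementCount(128, sizeof(double))); // 2
```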
Step 3: Update Emit Methods
```csharp
private static void EmitVectorLoad(ILGenerator il, NPTypeCode type)
{
    var containerType = GetVectorContainerType(); // Vector128/256/512
    var elementType = GetClrType(type);
    var loadMethod = containerType
        .GetMethod("Load", BindingFlags.Public | BindingFlags.Static)
        .MakeGenericMethod(elementType);
    il.EmitCall(OpCodes.Call, loadMethod, null);
}

private static void EmitVectorStore(ILGenerator il, NPTypeCode type)
{
    var containerType = GetVectorContainerType();
    var elementType = GetClrType(type);
    var storeMethod = containerType
        .GetMethods(BindingFlags.Public | BindingFlags.Static)
        .First(m => m.Name == "Store" &&
                    m.GetParameters().Length == 2 &&
                    m.GetParameters()[0].ParameterType.IsGenericType)
        .MakeGenericMethod(elementType);
    il.EmitCall(OpCodes.Call, storeMethod, null);
}

private static void EmitVectorOperation(ILGenerator il, BinaryOp op, NPTypeCode type)
{
    var elementType = GetClrType(type);
    var vectorType = GetVectorType(elementType); // Vector128<T>/256<T>/512<T>
    string methodName = op switch
    {
        BinaryOp.Add => "op_Addition",
        BinaryOp.Subtract => "op_Subtraction",
        BinaryOp.Multiply => "op_Multiply",
        BinaryOp.Divide => "op_Division",
        _ => throw new NotSupportedException()
    };
    var opMethod = vectorType.GetMethod(methodName,
        BindingFlags.Public | BindingFlags.Static,
        null, new[] { vectorType, vectorType }, null);
    il.EmitCall(OpCodes.Call, opMethod, null);
}

private static void EmitVectorCreate(ILGenerator il, NPTypeCode type)
{
    var containerType = GetVectorContainerType();
    var elementType = GetClrType(type);
    var createMethod = containerType.GetMethod("Create", new[] { elementType });
    il.EmitCall(OpCodes.Call, createMethod, null);
}
```
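The reflection lookups above can be validated without any IL emission, since `Vector128<T>` operators also execute (via software fallback) on unaccelerated hardware. A standalone check (assuming .NET 7+) of the `op_Addition` lookup pattern:

```csharp
// Standalone check that the reflection lookup used by EmitVectorOperation
// resolves against the real intrinsics types, then invokes the operator directly.
using System;
using System.Reflection;
using System.Runtime.Intrinsics;

var vectorType = typeof(Vector128<int>);
var add = vectorType.GetMethod("op_Addition",
    BindingFlags.Public | BindingFlags.Static,
    null, new[] { vectorType, vectorType }, null);

var a = Vector128.Create(1); // broadcast: <1, 1, 1, 1>
var b = Vector128.Create(2); // broadcast: <2, 2, 2, 2>
var sum = (Vector128<int>)add!.Invoke(null, new object[] { a, b })!;

Console.WriteLine(sum); // <3, 3, 3, 3>
```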
Step 4: Loop Code Unchanged!
The SIMD loop structure stays exactly the same - only vectorCount changes:
```csharp
private static void EmitSimdLoop(ILGenerator il, ...)
{
    int vectorCount = GetVectorCount(resultType); // 4, 8, or 16

    // vectorEnd = totalSize - vectorCount
    il.Emit(OpCodes.Ldarg, totalSizeArg);
    il.Emit(OpCodes.Ldc_I4, vectorCount);
    il.Emit(OpCodes.Sub);
    il.Emit(OpCodes.Stloc, locVectorEnd);

    // SIMD loop - identical structure for V128/V256/V512
    il.MarkLabel(lblSimdLoop);
    EmitVectorLoad(il, lhsType);     // emits V128/V256/V512.Load
    EmitVectorLoad(il, rhsType);
    EmitVectorOperation(il, op, resultType);
    EmitVectorStore(il, resultType); // emits V128/V256/V512.Store
    // ... increment by vectorCount, loop
}
```
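For reference, the emitted loop shape corresponds to this plain-C# sketch, where the inner `j` loop stands in for a single SIMD instruction; only `vectorCount` differs between widths:

```csharp
// Plain-C# model of the generated loop: a vector body that strides by
// vectorCount, then a scalar tail for the leftover elements.
using System;

static void Add(int[] a, int[] b, int[] result, int vectorCount)
{
    int i = 0;
    int vectorEnd = result.Length - vectorCount; // last index with a full vector
    for (; i <= vectorEnd; i += vectorCount)
        for (int j = 0; j < vectorCount; j++)    // stands in for one SIMD op
            result[i + j] = a[i + j] + b[i + j];
    for (; i < result.Length; i++)               // scalar tail
        result[i] = a[i] + b[i];
}

var x = new int[10]; var y = new int[10]; var r = new int[10];
for (int k = 0; k < 10; k++) { x[k] = k; y[k] = 100; }
Add(x, y, r, 4); // vectorCount=4 → two "vector" iterations + 2-element tail
Console.WriteLine(string.Join(",", r)); // 100,101,...,109
```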
Task List
- [ ] Add `VectorBits` static readonly field with hardware detection
- [ ] Add `GetVectorContainerType()` helper
- [ ] Add `GetVectorType(Type elementType)` helper
- [ ] Update `GetVectorCount()` to use `VectorBits`
- [ ] Update `CanUseSimd()` to check `VectorBits > 0`
- [ ] Update `EmitVectorLoad()` to use parameterized types
- [ ] Update `EmitVectorStore()` to use parameterized types
- [ ] Update `EmitVectorOperation()` to use parameterized types
- [ ] Update `EmitVectorCreate()` to use parameterized types
- [ ] Adjust `SimdThresholds` for width-appropriate thresholds
Files to Modify
| File | Changes |
|---|---|
| ILKernelGenerator.cs | Add detection + parameterize ~10 methods |
| SimdThresholds.cs | Adjust thresholds per vector width |
| Tests | Add V128/V512 path verification |
Expected Results
| Hardware | VectorBits | Elements/Vector (int32) | Speedup vs Scalar |
|---|---|---|---|
| No SIMD | 0 | 1 | 1× (baseline) |
| SSE2/NEON | 128 | 4 | ~4× |
| AVX2 | 256 | 8 | ~8× |
| AVX-512 | 512 | 16 | ~16× |
V512 vs V256 Comparison (10M elements)
| Operation | V256 Time | V512 Time | Improvement |
|---|---|---|---|
| a + b | ~16 ms | ~8 ms | 2× |
| np.sum | ~5 ms | ~2.5 ms | 2× |
| a * b | ~16 ms | ~8 ms | 2× |
Hardware Coverage
| Vector Width | CPUs |
|---|---|
| V512 | Intel Xeon Scalable (Skylake-SP+), AMD EPYC/Ryzen 7000+ (Zen4) |
| V256 | Intel Core (Haswell+, 2013+), AMD (Excavator+, Zen+) |
| V128 | All x64 CPUs (SSE2), Apple Silicon (NEON), older AMD |
Implementation Complexity
| Aspect | Assessment |
|---|---|
| Lines of code | ~80 lines changed |
| Risk | Low - clean parameterization |
| Testing | Medium - need V128/V512 path coverage |
| Backwards compatible | Yes - V256 remains default on most hardware |
Success Criteria
- `VectorBits` correctly detects hardware at startup
- V512 path used automatically on AVX-512 hardware
- V128 path works on older/ARM hardware
- No performance regression on V256 hardware
- All existing tests pass
- Kernel cache works correctly (same key → same kernel)
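On the last criterion, one way to keep the cache honest is to make the vector width part of the cache key; the names and shapes below are illustrative, not NumSharp's actual cache API. Since `VectorBits` is fixed at startup, keys stay stable for the process lifetime, but encoding the width also guards test scenarios that force a narrower path:

```csharp
// Hypothetical sketch (illustrative names only): a value-typed cache key that
// includes the vector width, so "same key → same kernel" holds even if tests
// generate kernels for more than one width in the same process.
using System;
using System.Collections.Generic;

var cache = new Dictionary<(string Op, string TypeCode, int VectorBits), string>();

var key = ("Add", "Int32", 256);
cache[key] = "kernel-add-int32-v256";

// Tuples compare by value, so an equivalent key hits the cached kernel:
Console.WriteLine(cache[("Add", "Int32", 256)]);
```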