Add LuaJIT interpreter#1236

Draft
umanwizard wants to merge 2 commits into open-telemetry:main from parca-dev:lua-upstream

Conversation

@umanwizard
Contributor

This PR contains an interpreter for LuaJIT code, which is primarily useful in practice due to OpenResty. It has been used in Parca/Polar Signals since Nov. 2024.

A thorough explanation of the approach used as well as an introduction to Lua internals can be found at the accompanying blog post: https://www.polarsignals.com/blog/posts/2024/11/13/lua-unwinding .

@umanwizard umanwizard requested review from a team as code owners March 9, 2026 19:31
@umanwizard umanwizard marked this pull request as draft March 9, 2026 19:50
@umanwizard
Contributor Author

Marking as draft until I fix the CI failures and add coredump tests. No major structural changes are expected, so people can feel free to start reviewing if they wish to.

@umanwizard umanwizard force-pushed the lua-upstream branch 2 times, most recently from 89efb8a to faca60a Compare March 9, 2026 20:46
@christos68k
Member

Many thanks!

@christos68k
Member

christos68k commented Mar 9, 2026

I have some personal projects using OpenResty (and also some projects using just their fork of LuaJIT). I'll try to run some workloads and exercise this PR this week, but I may run out of time as I'm also flying back to the US, so expect a review from me within a week or two.

Contributor

@fabled fabled left a comment


Awesome stuff! I did a first pass on the high-level things that seem to stand out for questions/discussion.

This might take a while to get fully reviewed as it is huge. Would it make sense to submit some smaller chunks in parallel as separate PRs that we could merge quickly, to reduce rebase conflicts?

Perhaps as a first step, just allocate the frame types and add the strings for them, etc. A second step could add stub files for the interpreter plugin and eBPF portion; ideally those could just resolve all JIT frames as <jit> or similar without any Lua specifics (or fail the trace with an unsupported-JIT error). Then the rest of the PR probably won't need to change much during rebases.

Comment thread x86helpers/x86_helpers.go
@@ -0,0 +1,33 @@
// Copyright 2024 The Parca Authors
Contributor


Should go to asm/amd. @christos68k @florianl should this be also Copyright The OpenTelemetry Authors ?

Member


Yes - copyright should be Copyright The OpenTelemetry Authors like all other files.

if unseen {
	log.Infof("New LuaJIT instance detected: %v", g)
	if l.ebpf.CoredumpTest() {
		return interpreter.ErrLJRestart
Contributor


I don't see this error tested for anywhere. Is it something that just triggers the right thing in some error-handling path? Could the intent of this be clarified somewhere?

Contributor


The LuaJIT unwinder isn't a one-pass unwinder: it first needs to detect the "g" and use that in userland to extract the JIT trace information, and only then can it unwind over the JIT stack frames. This restart mechanism was an aborted attempt to add coredump tests for the LuaJIT unwinder: we would error out on the first pass after extracting the trace stack size information, and the second pass would succeed. It worked, but in the end we decided the integration tests gave us enough coverage. We left it in because we anticipated needing to add some coredump tests in order to land this patch, so thoughts are welcome on how we should proceed.

Comment on lines +254 to +260
// For Lua we need the caller and callee to process a frame.
// The callee_pt is a pointer to the GCproto of the function being called, the
// callee_pc is an index into its bytecode. The caller_pt is the
// GCproto of the calling function and the caller_pc is the index into its
// bytecode which we will walk backwards in userland to figure out a name for the
// callee. The callee_pc is for information purposes only, so the user can see where
// execution was.
Contributor


So glad to see the ugly hack of the interpreter plugin's Symbolize caching these values internally go away. I assume this now correctly makes the data cacheable, as frames are cached individually instead of per-trace?

Comment on lines +484 to +494
func (l *luajitInstance) GetAndResetMetrics() ([]metrics.Metric, error) {
return nil, nil
}

func (l *luajitInstance) ReleaseResources() error {
return nil
}

func (l *luajitInstance) UpdateLibcInfo(ebpf interpreter.EbpfHandler, pid libpf.PID, info libc.LibcInfo) error {
return nil
}
Contributor


These could be removed if you change luajitInstance to embed interpreter.InstanceStubs and inherit the default implementations.

#define FRAMES_PER_WALK_LUAJIT_STACK 15

#if defined(__x86_64__)
#define DISPATCH r14
Contributor


As a general note, this file has a lot of macro names without a LuaJIT-specific prefix such as LUAJIT_ or LJ_. While the code seems to compile and not clash with anything, this might be slightly problematic for long-term maintenance in the coredump test suite, because all the *.ebpf.c files are included in a single compilation unit. Minimizing macro/function name clashes would be appreciated.


///////// END code copied from luajit2 sources.

static inline __attribute__((__always_inline__)) TValue *frame_prevl(TValue *f, TValue frame_val)
Contributor


We now have EBPF_INLINE, which should be used here instead of the attribute.

return nil, err
}

luaInterp, err := extractInterpreterBounds(info.Deltas(), cframeSize)
Contributor


Instead of stashing all stack deltas, could this use the same construct as the v8 plugin and use the elfunwindinfo.NewEhFrameTable API? I think it would improve maintainability if this did not directly depend on stack deltas; see the locateSnapshotArea function in interpreters/nodev8/v8.go. The thing is that stack deltas go through additional manipulation and possible merging, so you could instead look directly at the FDE sizes and match the FDE using code and/or FDE data.

Member

@christos68k christos68k left a comment


Did a very quick first pass, mostly focused on structure.

)

// Run
func TestIntegration(t *testing.T) {
Member


Could we move these integration tests into a separate package (e.g. similar to golabels, interpreter/luajit/integrationtests) and build-qualify them with integration?
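A sketch of what the build-qualified test file header might look like; the directory, package name, and tag are assumptions based on the suggestion above, not the repo's actual layout.

```go
//go:build integration

// Hypothetical location: interpreter/luajit/integrationtests/integration_test.go
// Compiled and run only when the tag is set, e.g.:
//
//	go test -tags integration ./interpreter/luajit/integrationtests/...
package integrationtests

import "testing"

func TestIntegration(t *testing.T) {
	// ... start the OpenResty workload and profile it ...
}
```

Without the tag, `go test ./...` skips the package entirely, so the integration tests never slow down the ordinary unit-test run.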

@@ -0,0 +1,61 @@
worker_processes 1;
Member


(Also see my other comment regarding the integration test itself.)

We should probably move this (and all test .lua files) to a different directory, e.g. interpreter/luajit/integrationtests/testdata

"description": "BEAM: Ran out of iterations searching for the current code header"
},
{
"id": 7007,
Member


The convention here is to reserve a range of errors e.g. 5xxx for V8, 6xxx for .NET, 7xxx for BEAM:

Suggested change
"id": 7007,
"id": 8000,

Comment thread tools/dis2go.py
@@ -0,0 +1,66 @@
#!/usr/bin/env python3
Member


Nit: Do we need this checked into the repo? If so, maybe add a comment clarifying how it can help.

@github-actions

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@dalehamel
Contributor

Hey, a heads-up: I'm trying to get this working on our OpenResty-based load balancers, and maybe it is a scale issue, but I'm running into:

{"time":"2026-04-23T22:48:27.124399773Z","level":"ERROR","msg":"Error adding trace(5250): update: no space left on device"}
{"time":"2026-04-23T22:48:27.12440761Z","level":"ERROR","msg":"Error adding trace(6331): update: no space left on device"}

And if I have the LuaJIT tracer enabled, no native profiles show up at all (not even kernel idle frames), so I'm wondering if some shared resource is getting completely used up by the unwinder struggling to keep up?

I'm trying to tweak it in Shopify#42, if I can get to the bottom of it I may have some patches to propose. I do think there may be some legitimate map leakage going on as well.

Curious if you've hit this, and at what scale it's been tested at?

@umanwizard
Contributor Author

We have several unwinder fixes which I will cherry-pick to this PR either this weekend or on Monday. But I don't think any of them address map leakage; do you know more about when/why that's happening?

@dalehamel
Contributor

dalehamel commented Apr 24, 2026

Thanks @umanwizard! We've been running this on production ingress-nginx nodes (OpenResty with LuaJIT 2.1.1710398010) and wanted to share our findings. The good news: profiles come through fine, including Lua-level frames.

The initial issue we reported (zero samples) turned out to be a problem on our ingestion side: our profile ingester didn't have a match arm for the new "luajit" frame kind, so it was silently dropping entire profile batches (i.e., my bad, completely a bug on our end). Once we fixed that, the LuaJIT unwinder worked mostly as expected, but I think there was still a legitimate issue with map pressure at scale.

For context on our workload: our ingress-nginx tier uses OpenResty's Lua scripting extensively, including the "Sorting Hat" routing module, dynamic storefront traffic routing, and per-request plugin chains for scriptable load balancing. The nodes run multiple nginx worker processes, each with 600-800+ JIT traces, which creates significant map pressure.

1. pid_page_to_mapping_info map pressure (more critical, though likely a scale issue)

The main issue we hit at scale is pid_page_to_mapping_info map exhaustion. During SynchronizeMappings, processVMs inserts LPM trie prefixes for each trace's mcode region. With multiple workers × hundreds of traces each, the entries consumed by LuaJIT trace prefixes alone can use a large fraction of the default 1M-entry map. When it fills up, processNewMapping fails for other PIDs on the node, producing ERROR_4012: native_no_pid_page_mapping errors in the profile.

In logs, this manifests as a flood of ENOSPC errors during trace sync:

{"level":"ERROR","msg":"Error adding trace(6874): update: no space left on device"}
{"level":"ERROR","msg":"Error adding trace(5702): update: no space left on device"}
{"level":"ERROR","msg":"Error adding trace(5302): update: key already exists"}
{"level":"ERROR","msg":"Error adding trace(1578): update: no space left on device"}
{"level":"ERROR","msg":"Error adding trace(3719): update: no space left on device"}
{"level":"ERROR","msg":"Error adding trace(4504): update: no space left on device"}

The flamegraph output looks visually similar with and without the fix (profiles do flow either way), but the profiler logs show a clear difference: with the default 1M map, we see continuous ENOSPC errors and the resulting ERROR_4012 error frames in the profile. With the raised map size, the ENOSPC errors disappear entirely.

We fixed this by raising pidPageMappingInfoSize from 1 << 20 (1M) to 1 << 22 (4M):

Shopify@c1dc372

Note that tracer/tracer.go also runtime-clamps this map, so both the constant and the loader override need to be raised.

This may not affect smaller deployments, but any environment with many nginx workers and heavy JIT trace activity will likely hit the same cap. However, this is probably contentious, as it changes a shared map outside the scope of the LuaJIT interpreter, so it should probably be a separate discussion with the maintainers. The LuaJIT interpreter does seem able to trigger the problem at scale, though.

2. GCproto symbolization failures (non-critical, informational)

We also observed invalid GCproto object errors during Lua frame symbolization. The proto address captured by the eBPF unwinder (GCfuncL.pc - sizeof(GCproto)) can become stale by the time Go reads it via process_vm_readv, because LuaJIT's GC can free and recycle the memory in between. In logs this looks like:

{"level":"DEBUG","msg":"symbolization failed for PID 3381553, frame type 13: invalid GCproto object: pt=0x7f9a1354caf4 sizept=32666 end=0x7f9a13554a8e lineinfo=0x6700000066 uvinfo=0x6900000068 varinfo=0x6b0000006a sizebc=482572562 chunkname=0x6300000062"}
{"level":"DEBUG","msg":"symbolization failed for PID 3381394, frame type 13: invalid GCproto object: pt=0x7f9a1354b4f0 sizept=0 end=0x7f9a1354b4f0 lineinfo=0x0 uvinfo=0x0 varinfo=0x56463cfee770 sizebc=0 chunkname=0x0"}

The sequential byte values in the first example (lineinfo=0x67..., uvinfo=0x69..., chunkname=0x63...) confirm the memory was recycled: these are ASCII characters, not valid pointers. We confirmed via LuaJIT FFI on the production container that all struct offsets match (sizeof(GCproto)=104, sizeof(GCstr)=24), so this is not a layout or ABI issue, just a timing race.

We experimented with pre-caching GCproto objects during trace sync by reading GCtrace.startpt (at offset 0x40):

Shopify@8b57f24

However, in our A/B/C testing this didn't produce much of a visible difference in the flamegraph: the affected Lua frames show up as placeholder names either way, and the native C functions (lj_gc_step, hashkey, lua_pcall, ngx_*) are symbolized correctly from libluajit-5.1.so debug symbols regardless. So this fix is not critical; mentioning it here for awareness in case others hit the same debug output and wonder about it.

A (Shopify/opentelemetry-ebpf-profiler#43, which is just this branch rebased on latest main):

[flamegraph screenshot]

B: A above, + the map fix:

The logs are noisy, but the output looks fine. The difference in shape from A above makes me think that it probably includes traces that were dropped in A due to the map pressure, as B and C look nearly identical.

[flamegraph screenshot]

C: B + the GCproto / GCtrace fixes:

Gets rid of the log issues, but the end result otherwise looks identical to B:

[flamegraph screenshot]

Our fork with both fixes is at Shopify/opentelemetry-ebpf-profiler#42.

As your branch becomes ready for review, I'd be happy to test it against our production traffic, or after it lands maybe we can figure out a PR to tweak the map management.
