Conversation
|
Marking as draft until I fix the CI failures and add coredump tests. No major structural changes are expected, so people can feel free to start reviewing if they wish to. |
89efb8a to
faca60a
Compare
|
Many thanks! |
|
I have some personal projects using openresty (and also some projects using just their fork of luajit), I'll try to run some workloads and exercise this PR this week but I may run out of time as I'm also flying back to the US, so expect a review from me within a week or two. |
This PR contains an interpreter for LuaJIT code, which is primarily useful in practice due to OpenResty. It has been used in Parca/Polar Signals since Nov. 2024. A thorough explanation of the approach used as well as an introduction to Lua internals can be found at the accompanying blog post: https://www.polarsignals.com/blog/posts/2024/11/13/lua-unwinding .
e97a103 to
a3e40b3
Compare
fabled
left a comment
There was a problem hiding this comment.
Awesome stuff! I did first pass on the high level things that seems to stand out questions/discussion.
This might take a while to get fully reviewed as it is huge. Would it make sense to submit in parallel some smaller chunks as separate PRs that we could merge fast to reduce rebase conflicts.
Perhaps as first step just allocate the frame types and add the strings for them etc. Second step could add stub files for the interpreter plugin and ebpf portion - ideally those could just resolve all jit frames as <jit> or similar without any lua specifics (or fail the trace with an unsupported jit error). Then probably rest of the PR will not need to change much during rebases.
| @@ -0,0 +1,33 @@ | |||
| // Copyright 2024 The Parca Authors | |||
There was a problem hiding this comment.
Should go to asm/amd. @christos68k @florianl should this be also Copyright The OpenTelemetry Authors ?
There was a problem hiding this comment.
Yes - copyright should be Copyright The OpenTelemetry Authors like all other files.
| if unseen { | ||
| log.Infof("New LuaJIT instance detected: %v", g) | ||
| if l.ebpf.CoredumpTest() { | ||
| return interpreter.ErrLJRestart |
There was a problem hiding this comment.
I don't see that this error is tested for anywhere. Is this something that just triggers a right thing in some error handling path? Could it be clarified somewhere what the intent of this is?
There was a problem hiding this comment.
The luajit unwinder isn't a one pass unwinder, it first needs to detect the "g" and use that in userland to extract the JIT trace information, only then can it unwind over of JIT stack frames. This restart stuff was an aborted attempt to add coredump tests for the luajit unwinder. So we'd error out on the first pass after extracting the trace stack size information and the second pass would succeed. It worked but in the end we decided the integration tests gave us enough coverage. We left this in because we anticipated needing to add some coredump tests in order to land this patch so thoughts are welcome on how we should proceed.
| // For Lua we need the caller and callee to process a frame. | ||
| // The callee_pt is a pointer to the GCproto of the function being called, the | ||
| // callee_pc is an index into its bytecode. The caller_pt is the | ||
| // GCproto of the calling function and the caller_pc is the index into its | ||
| // bytecode which we will walk backwards in userland to figure out a name for the | ||
| // callee. The callee_pc is for information purposes only, so the user can see where | ||
| // execution was. |
There was a problem hiding this comment.
So glad the ugly hack of interpreter plugin's Symbolize caching these values internally. I assume this correctly now makes the data cacheable as frames are cached individually instead of per-trace?
| func (l *luajitInstance) GetAndResetMetrics() ([]metrics.Metric, error) { | ||
| return nil, nil | ||
| } | ||
|
|
||
| func (l *luajitInstance) ReleaseResources() error { | ||
| return nil | ||
| } | ||
|
|
||
| func (l *luajitInstance) UpdateLibcInfo(ebpf interpreter.EbpfHandler, pid libpf.PID, info libc.LibcInfo) error { | ||
| return nil | ||
| } |
There was a problem hiding this comment.
These could be removed if you contain the interpreter.InstanceStubs to inherit the default implementations of these.
| #define FRAMES_PER_WALK_LUAJIT_STACK 15 | ||
|
|
||
| #if defined(__x86_64__) | ||
| #define DISPATCH r14 |
There was a problem hiding this comment.
As a general note, this file has lot of macro names without luajit specific prefix such as LUAJIT_ or LJ_. While the code seems to compile and not clash with anything, this might be slightly problematic in the coredump test suite for long term maintenance because all the *.ebpf.c files are included in single compulation unit. Trying to minimize macro/function name clashes would be appreciated.
|
|
||
| ///////// END code copied from luajit2 sources. | ||
|
|
||
| static inline __attribute__((__always_inline__)) TValue *frame_prevl(TValue *f, TValue frame_val) |
There was a problem hiding this comment.
We now have EBPF_INLINE that should be used instead of the attribute here.
| return nil, err | ||
| } | ||
|
|
||
| luaInterp, err := extractInterpreterBounds(info.Deltas(), cframeSize) |
There was a problem hiding this comment.
Instead of stashing all stack deltas, could this is use the same construct as v8 plugin and use the elfunwindinfo.NewEhFrameTable API? I think it would improve maintainability if this would not directly depend on stack deltas. See interpreters/nodev8/v8.go function locateSnapshotArea. The thing is that stack deltas go through additional manipulation and possible merging. So you could directly look at the FDE sizes and match the FDE using code and/or FDE data.
christos68k
left a comment
There was a problem hiding this comment.
Did a very quick first pass, mostly focused on structure.
| ) | ||
|
|
||
| // Run | ||
| func TestIntegration(t *testing.T) { |
There was a problem hiding this comment.
Could we move these integration tests into a separate package (e.g. similar to golabels, interpreter/luajit/integrationtests) and build-qualify them with integration?
| @@ -0,0 +1,61 @@ | |||
| worker_processes 1; | |||
There was a problem hiding this comment.
(Also see my other comment regard the integration test itself)
We should probably move this (and all test .lua files] to a different directory, e.g. interpreter/luajit/integrationtests/testdata
| "description": "BEAM: Ran out of iterations searching for the current code header" | ||
| }, | ||
| { | ||
| "id": 7007, |
There was a problem hiding this comment.
The convention here is to reserve a range of errors e.g. 5xxx for V8, 6xxx for .NET, 7xxx for BEAM:
| "id": 7007, | |
| "id": 8000, |
| @@ -0,0 +1,66 @@ | |||
| #!/usr/bin/env python3 | |||
There was a problem hiding this comment.
Nit: Do we need this checked into the repo? Maybe add a comment to clarify how it can help if so.
|
This PR was marked stale due to lack of activity. It will be closed in 14 days. |
|
Hey a heads-up, I'm trying to get this working on our openresty based load balancers, and maybe it is a scale issue but i'm running into: And if I have the luajit tracer enabled, NO native profiles show up at all (not even kernel idle frames), so I'm wondering if some shared resources is getting completely used up by the unwinder struggling to keep up? I'm trying to tweak it in Shopify#42, if I can get to the bottom of it I may have some patches to propose. I do think there may be some legitimate map leakage going on as well. Curious if you've hit this, and at what scale it's been tested at? |
|
We have several unwinder fixes which I will cherry-pick to this PR either this weekend or on Monday. But I don't think any of them address map leakage; do you know more about when/why that's happening? |
|
Thanks @umanwizard! We've been running this on production ingress-nginx nodes (OpenResty with LuaJIT 2.1.1710398010) and wanted to share our findings. The good news: profiles come through fine, including Lua-level For context on our workload: our ingress-nginx tier uses OpenResty's Lua scripting extensively — the "Sorting Hat" routing module, dynamic 1.
|



This PR contains an interpreter for LuaJIT code, which is primarily useful in practice due to OpenResty. It has been used in Parca/Polar Signals since Nov. 2024.
A thorough explanation of the approach used as well as an introduction to Lua internals can be found at the accompanying blog post: https://www.polarsignals.com/blog/posts/2024/11/13/lua-unwinding .