Commit 097cb04

Explain the test that checks for file corruption
The test that reproduces our file corruption issue follows a very specific set of steps to actually corrupt the files, and without an explanation it would be practically impossible to figure out why it needs them.

Explaining those steps, however, requires explaining a lot about the internals of `FileMappedDict`, about the files we store, and about how corruption happens in the first place, before we can understand why we need those specific steps to reproduce it. I hope this explanation makes sense.

Signed-off-by: Daniel Magliola <dmagliola@crystalgears.com>
1 parent adf63e8 commit 097cb04

1 file changed

Lines changed: 95 additions & 3 deletions

File tree

spec/prometheus/client/counter_spec.rb

@@ -143,7 +143,99 @@
 end

 let(:expected_labels) { [:foo, :bar] }
-
+
+# Testing for file corruption: this is weird and complicated, so it needs explaining
+#
+# Files get corrupted when we have two different instances of `FileMappedDict`
+# reading and writing the same file. This corruption is expected; we should never have
+# two instances of `FileMappedDict` for the same file. If we do, it's a bug in our client.
+#
+# To clarify, the bug is that *we ended up with two instances for the same file*, not
+# that the instances are now corrupting the file.
+#
+# This is why we're testing this in `with_labels`. It's the only use case we've found
+# where we ended up with two instances (before we fixed that bug). `with_labels` is
+# incidental; if we find another way to get "duplicate" instances, we should add this
+# exact same test, except for the first line, where we need to instead reproduce
+# whatever bug gets us that second instance.
+#
+# The first thing we need to understand is why having two instances of `FileMappedDict`
+# corrupts the files:
+#
+# `FileMappedDict` keeps track, in an internal variable, of how many bytes in the file
+# have been used. When adding a new "entry" (observing a new labelset), it serializes
+# it and adds it at "the end" (according to its internal byte counter), and it also updates
+# the counter at the beginning of the file. However, it never re-reads that counter
+# from the file, because there shouldn't be any reason for it to have changed.
+#
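That bookkeeping can be sketched in a few lines. This is a minimal, hypothetical miniature: the names (`HEADER`, `append_entry`) and the exact layout are invented for illustration, and a plain String stands in for the mmap'ed file; the real `FileMappedDict` is more involved.

```ruby
# Hypothetical miniature of the bookkeeping described above. A String stands in
# for the mmap'ed file; `append_entry` and the layout are invented for illustration.
HEADER = 4 # the first 4 bytes hold the "used bytes" counter

def append_entry(file, used, labelset, value)
  entry = [labelset.bytesize].pack('l') + labelset.b + [value].pack('d')
  file[HEADER + used, entry.bytesize] = entry # write at OUR idea of the end
  used += entry.bytesize
  file[0, HEADER] = [used].pack('l')          # update the counter in the file...
  used                                        # ...but never re-read it from there
end

file = [0].pack('l')                 # a fresh "file": counter says 0 bytes used
used = file[0, HEADER].unpack1('l')  # the one and only read of the counter
used = append_entry(file, used, 'labelset1', 1.0)
used = append_entry(file, used, 'labelset2', 2.0)
```

As long as a single instance owns the file, the in-memory `used` and the on-disk counter stay in sync; the trouble starts as soon as two independent copies of `used` exist.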
+# If there are two instances pointing to the same file, initially they will both
+# share that internal counter, as they both do the first read of the file. But if
+# each of them then adds an entry, their internal "length" counters will disagree,
+# and they'll start overwriting each other's entries.
+#
+# Importantly, if all of the entries happen to have the same length, it will be "fine".
+# Some of the labelsets will effectively disappear, but there will be no corruption,
+# because all the important things will fall at the right offsets by pure chance. This
+# would be very rare in production, but in a test it's what normally happens, because
+# we set all labels to "foo", "bar", etc. This is the reason for "longervalue" below:
+# we need different labelset lengths to reproduce the corruption.
+#
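The "same length is accidentally fine" case is easy to see with a toy example (the `overwrite` helper is made up, and entries are simplified to bare 9-byte labelsets):

```ruby
# Two writers sharing one buffer, each trusting its own end-of-data offset.
# `overwrite` is a made-up helper; real entries also carry lengths and values.
def overwrite(buffer, offset, entry)
  buffer[offset, entry.bytesize] = entry
  offset + entry.bytesize
end

buf = "".b
a = b = 0
a = overwrite(buf, a, "labelset1") # writer A appends at offset 0
b = overwrite(buf, b, "labelset2") # writer B overwrites A's entry (same length!)
a = overwrite(buf, a, "labelset3") # writer A appends at offset 9: still aligned
# buf is now "labelset2labelset3": one entry silently vanished, but every
# 9-byte slot still holds one whole entry. With unequal lengths, the slots
# shear and a reader lands mid-entry.
```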
+# With this background about the internals, we can now get to why the specific sequence
+# of steps below ends up with corrupted files.
+#
+# For this to make sense, I'll need to describe the contents of the file at each step.
+# I'll represent it like this: `27|labelset1,value1|labelset2,value2|labelset3,value3|`
+#
+# These are not the bytes we store in the file, but conceptually it's equivalent,
+# with two caveats:
+# - The counter at the beginning (27 == 3 * 9) here shows the combined length of the
+#   labelsets. It'd normally also include the length of the values, but doing that
+#   makes this explanation much harder to follow.
+# - Each entry also starts with a 4-byte int specifying the length of its labelset,
+#   so we know how much to read. Again, I'm omitting that for readability.
+#
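Put together, a single entry as described by the two caveats might look like this. The `pack` format here is guessed from the description above, not lifted from the client:

```ruby
# Hypothetical on-disk shape of one entry: a 4-byte labelset length, the
# labelset bytes, then the value as an 8-byte double.
labelset = 'foo="bar"'
entry = [labelset.bytesize].pack('l') + labelset.b + [42.0].pack('d')

len   = entry[0, 4].unpack1('l') # how many labelset bytes to read
name  = entry[4, len]
value = entry[4 + len, 8].unpack1('d')
```

If that 4-byte length prefix ever holds garbage, every read after it goes wrong, which is exactly what the steps below set up.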
+# Steps to reproduce:
+# - We declare `counter` and `counter_with_labels` as a clone. Neither has read the file.
+# - We increment `counter`, which creates the file and adds the entry ("labelset1").
+#   - File: `9|labelset1,value1|`
+# - We increment `counter_with_labels`, which reads the file and adds the new entry
+#   to it ("muchlongerlabelset2").
+#   - File: `28|labelset1,value1|muchlongerlabelset2,value2|`
+#   - `counter` and `counter_with_labels` now disagree about the length of this file
+#     (`counter` doesn't know the file has grown).
+# - We now add a new entry to `counter` ("labelset3"), which thinks the file is shorter
+#   than it actually is.
+#   - File: `18|labelset1,value1|labelset3,value3|et2,value2|`
+#   - The initial counter reflects both labelsets for `counter`; then we have those
+#     labelsets; and finally some "garbage" after the "end" (the garbage is the
+#     last few bytes of the much longer entry added before by `counter_with_labels`).
+#   - So far, though, we're still good. If you read the file, all entries are "fine",
+#     because you're only reading up to the "18" length specified at the beginning.
+#   - For the problem to manifest itself, we need to increment that counter at the
+#     beginning, so we'll read the garbage. **BUT**, if we add a new labelset to
+#     `counter`, it'll overwrite the "garbage" with good data, and the file will
+#     continue to be fine.
+# - We add a new entry to `counter_with_labels`. This updates the length counter at
+#   the beginning of the file.
+#   - File: `47|labelset1,value1|labelset3,value3|et2,value2|muchlongerlabelset4,value4|`
+#
+# Now the file is properly corrupted. When reading it, `FileMappedDict` sees:
+# - labelset1,value1 (cool)
+# - labelset3,value3 (cool)
+# - et2,value2 (boom)
+#   |-> the beginning of this entry is garbage, because we're actually in the middle
+#       of an entry, not at the beginning of one.
+#
+# What actually breaks is that each of these entries is expected to have, at its
+# beginning, the length in bytes of its labelset, so we know how much to read.
+# Now we have garbage in that position, and `FileMappedDict` will either:
+# - Try to interpret those four bytes as a long, and get an invalid result.
+# - Try to read an invalid amount of data (maybe a negative amount).
+# - After reading the labelset, try to read the float and go past the end of the file.
+# - Actually read what it thinks is a float, try to `unpack` it, and fail because
+#   it's actually garbage.
+# - I'm sure there are other fun ways for it to fail.
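The whole sequence can be simulated in memory. `TinyDict` below is a made-up stand-in for `FileMappedDict` (a plain String plays the mmap'ed file, and the entry layout is the guessed length-prefix + labelset + double from above); it reads the length counter only once, at creation, which is all the bug needs:

```ruby
# Made-up stand-in for FileMappedDict, just enough to reproduce the corruption.
class TinyDict
  def initialize(buffer)
    @buffer = buffer
    @buffer << [0].pack('l') if @buffer.empty?
    @used = @buffer[0, 4].unpack1('l') # read the used-bytes counter ONCE
  end

  def add_entry(labelset, value)
    entry = [labelset.bytesize].pack('l') + labelset.b + [value].pack('d')
    @buffer[4 + @used, entry.bytesize] = entry # at OUR (possibly stale) idea of the end
    @used += entry.bytesize
    @buffer[0, 4] = [@used].pack('l')
  end

  def entries
    used, pos, out = @buffer[0, 4].unpack1('l'), 4, []
    while pos < 4 + used
      len = @buffer[pos, 4].unpack1('l')
      if len.nil? || len <= 0 || pos + 4 + len + 8 > @buffer.bytesize
        raise IOError, "corrupt entry length at byte #{pos}"
      end
      out << [@buffer[pos + 4, len], @buffer[pos + 4 + len, 8].unpack1('d')]
      pos += 4 + len + 8
    end
    out
  end
end

file = "".b
counter = TinyDict.new(file)
counter.add_entry('labelset1', 1.0)

counter_with_labels = TinyDict.new(file)                  # the buggy second instance
counter_with_labels.add_entry('muchlongerlabelset2', 2.0)
counter.add_entry('labelset3', 3.0)  # stale counter: partially overwrites entry 2
counter.entries # => [["labelset1", 1.0], ["labelset3", 3.0]] -- still "fine"

counter_with_labels.add_entry('muchlongerlabelset4', 4.0) # counter now spans the garbage
begin
  counter.entries # boom: garbage where a length prefix should be
rescue IOError
  # e.g. "corrupt entry length at byte 46"
end
```

Note that the counters here include the 4-byte prefixes and 8-byte values, so the numbers differ from the simplified `9|…` notation above, but the shape of the failure is the same.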
 it "doesn't corrupt the data files" do
   counter_with_labels = counter.with_labels({ foo: 'longervalue'})

@@ -152,8 +244,8 @@
   counter_with_labels.increment(by: 2, labels: {bar: 'zzz'})

   # After both MetricStores have their files, add a new entry to both
-  counter.increment(labels: { foo: 'value1', bar: 'aaa'})
-  counter_with_labels.increment(by: 2, labels: {bar: 'aaa'})
+  counter.increment(labels: { foo: 'value1', bar: 'aaa'}) # If there's a bug, we partially overwrite { foo: 'longervalue', bar: 'zzz'}
+  counter_with_labels.increment(by: 2, labels: {bar: 'aaa'}) # Extend the file so we read past that overwrite

   expect { counter.values }.not_to raise_error # Check it hasn't corrupted our files
   expect { counter_with_labels.values }.not_to raise_error # Check it hasn't corrupted our files