Lafleur Developer Documentation: 05. State and Data Formats

Introduction

During a fuzzing campaign, lafleur generates several files to store its persistent state, track statistics, and save its findings. This document provides a detailed reference for the structure and purpose of each of these files.

For the coverage pipeline that produces this data, see 03. Coverage and Feedback. For the learning engine that consumes mutator scores, see 04. The Mutation Engine. For the analysis tools that read these files, see TOOLING.md.


coverage/coverage_state.pkl

This is the most critical file in the fuzzer. It is a binary file serialized using Python's pickle module and acts as the fuzzer's complete, persistent memory. To save space and reduce memory usage, it uses integer IDs to represent coverage items (UOPs, edges, and rare events). The file is saved atomically to prevent corruption.
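
The text above says the file is saved atomically but does not say how. One common way to get that property is sketched below, as an assumption rather than a description of lafleur's code: write to a temporary file in the same directory, then rename it over the target. The names save_state_atomically and load_state are hypothetical.

```python
import os
import pickle
import tempfile
from pathlib import Path


def save_state_atomically(state: dict, path: Path) -> None:
    """Write `state` to `path` so a reader never sees a half-written file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # The temp file lives in the target directory so the rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            pickle.dump(state, tmp, protocol=pickle.HIGHEST_PROTOCOL)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_name, path)  # atomic replacement of the old file
    except BaseException:
        # Remove the temp file if anything failed before the rename.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


def load_state(path: Path) -> dict:
    """Load the pickled state dictionary. Only unpickle files you trust."""
    with path.open("rb") as f:
        return pickle.load(f)
```

With this pattern, a crash in the middle of a save leaves the previous coverage_state.pkl untouched, which matters because the file is the fuzzer's only persistent memory.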

It contains a single dictionary with the following top-level keys:

  • uop_map, edge_map, rare_event_map: Forward maps from string representations of coverage items (e.g., "('OPTIMIZED', '_LOAD_ATTR->_STORE_ATTR')") to unique integer IDs.
  • next_id_map: Tracks the next available integer ID for each coverage type ({"uop": N, "edge": N, "rare_event": N}).
  • global_coverage: The master "bitmap" of all unique coverage points ever seen. Maps integer IDs to total hit counts, organized by type ({"uops": {…}, "edges": {…}, "rare_events": {…}}).
  • per_file_coverage: A dictionary where each key is a filename in the corpus (e.g., "123.py") and the value is a rich metadata object. The metadata for each file contains:
    • parent_id: Filename of the parent test case. None for seed files.
    • lineage_depth: How many generations of successful mutations led to this file.
    • content_hash: SHA256 of the file's core code, used for content duplicate detection.
    • coverage_hash: SHA256 of the file's coverage profile, used to detect unique behaviors from non-deterministic code.
    • discovery_time: ISO 8601 timestamp of when the file was added to the corpus.
    • execution_time_ms: Time in milliseconds the test case took to run during discovery.
    • file_size_bytes: Size of the core code in bytes.
    • total_finds: "Fertility" score — number of interesting children this parent has produced.
    • mutations_since_last_find: Counter of consecutive sterile mutations. When this exceeds CORPUS_STERILITY_LIMIT (599), the file is marked permanently sterile.
    • total_mutations_against: Total number of mutation attempts using this file as a parent. Used together with total_finds to compute success rate (total_finds / total_mutations_against) in the lineage tool.
    • is_sterile: Boolean flag set to True once mutations_since_last_find exceeds CORPUS_STERILITY_LIMIT, that is, when the file has been mutated repeatedly without producing new discoveries. Sterile files receive a heavy penalty in the corpus scheduler.
    • is_pruned: Boolean tombstone marker. Set to True when a corpus file is removed by pruning. Pruned entries retain minimal metadata (parent_id, discovery_mutation, lineage_depth, discovery_time) to preserve lineage chain connectivity for lafleur-lineage visualization.
    • subsumed_children_count: Number of other corpus files subsumed (and pruned) by this one. Populated when the corpus is pruned with --prune-corpus.
    • mutation_seed: The deterministic seed used for the mutation that created this file.
    • baseline_coverage: The coverage this file generated when discovered. A dictionary keyed by harness ID, each containing uops, edges, rare_events (as Counter[int]), and structural metrics trace_length and side_exits.
    • lineage_coverage_profile: The union of all coverage in this file's entire ancestry. Used for relative coverage checks, which detect edges that are new to this lineage even though they are already known globally.
    • discovery_mutation: A dictionary describing the mutation that created this file:
      • strategy: The mutation strategy used (e.g., "havoc", "sniper").
      • transformers: List of transformer names applied (e.g., ["OperatorSwapMutator", "GuardInjector"]).
      • jit_stats: JIT vitals from the EKG system. Contains absolute metrics (executors, zombie_traces, valid_traces, warm_traces, max_exit_count, max_chain_depth, min_code_size, max_exit_density) and, in session mode, delta metrics (child_delta_max_exit_density, child_delta_total_exits, delta_new_executors, delta_new_zombies). Density values here are clamped and decayed — they represent the target for future generations, not the raw measurement. See 02. The Evolutionary Loop for the clamping/decay mechanism.
      • watched_dependencies: List of global/builtin names that the JIT's Bloom filter indicated were being watched during this execution. Used by the Sniper strategy in future generations.
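
To make the relationship between the forward maps, next_id_map, and global_coverage concrete, here is a minimal sketch of how string coverage items could be interned to integer IDs and counted. It is an illustration under assumptions, not lafleur's code: the helper names (new_state, intern_item, record_hit, success_rate) are hypothetical, and starting IDs at 0 is a guess. The success-rate formula is the one given for the lineage tool above.

```python
from collections import Counter


def new_state() -> dict:
    """Build an empty state with the top-level keys listed above."""
    return {
        "uop_map": {}, "edge_map": {}, "rare_event_map": {},
        "next_id_map": {"uop": 0, "edge": 0, "rare_event": 0},  # ASSUMPTION: IDs start at 0
        "global_coverage": {"uops": Counter(), "edges": Counter(), "rare_events": Counter()},
        "per_file_coverage": {},
    }


def intern_item(state: dict, kind: str, item: str) -> int:
    """Return the integer ID for `item`, allocating the next free ID if it is new."""
    forward_map = state[f"{kind}_map"]
    if item not in forward_map:
        forward_map[item] = state["next_id_map"][kind]
        state["next_id_map"][kind] += 1
    return forward_map[item]


def record_hit(state: dict, kind: str, item: str, hits: int = 1) -> int:
    """Intern `item` and add `hits` to its global hit count."""
    item_id = intern_item(state, kind, item)
    state["global_coverage"][f"{kind}s"][item_id] += hits
    return item_id


def success_rate(meta: dict) -> float:
    """total_finds / total_mutations_against, or 0.0 for a never-mutated file."""
    attempts = meta.get("total_mutations_against", 0)
    return meta.get("total_finds", 0) / attempts if attempts else 0.0


state = new_state()
record_hit(state, "edge", "('OPTIMIZED', '_LOAD_ATTR->_STORE_ATTR')")
record_hit(state, "edge", "('OPTIMIZED', '_LOAD_ATTR->_STORE_ATTR')", hits=2)
print(state["global_coverage"]["edges"])  # the same edge maps to one ID with a summed count
```

Storing small integers instead of long strings in every per-file coverage profile is what keeps the pickle compact as the corpus grows.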

coverage/mutator_scores.json

This file is the persistent memory for the adaptive mutation engine. It is a human-readable JSON file that stores the learned effectiveness of each mutator and strategy, allowing the fuzzer to resume a campaign with its learned knowledge intact.

  • scores: Maps the name of each mutator or strategy (e.g., "SideEffectInjector", "havoc") to its current floating-point score. Scores are incremented on success and periodically decayed.
  • attempts: Maps each name to an integer count of how many times it has been tried.

Key constants governing the learning model (defined in learning.py):

  • DECAY_INTERVAL = 50 — all scores are multiplied by a decay factor every 50 attempts across all candidates, not on each individual success.
  • WEIGHT_FLOOR = 0.05 — minimum weight to ensure every strategy retains a chance of being selected.
  • min_attempts = 10 — grace period before a candidate's score is used for weighted selection. New mutators get a fair trial before the learning engine judges them.
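
The three constants interact as sketched below. This is a simplified model, not the code in learning.py: the class name ScoreSketch, the DECAY_FACTOR value, the REWARD increment, and the neutral weight used during the grace period are all assumptions chosen for illustration. Only DECAY_INTERVAL, WEIGHT_FLOOR, and min_attempts come from the text above.

```python
import random

DECAY_INTERVAL = 50   # from the text above
WEIGHT_FLOOR = 0.05   # from the text above
MIN_ATTEMPTS = 10     # from the text above (min_attempts)
DECAY_FACTOR = 0.995  # ASSUMPTION: the actual factor is not stated here
REWARD = 1.0          # ASSUMPTION: score increment per success
NEUTRAL_WEIGHT = 1.0  # ASSUMPTION: weight used during the grace period


class ScoreSketch:
    """Toy model of score tracking with global decay, a floor, and a grace period."""

    def __init__(self) -> None:
        self.scores: dict[str, float] = {}
        self.attempts: dict[str, int] = {}
        self._total_attempts = 0

    def record_attempt(self, name: str) -> None:
        self.attempts[name] = self.attempts.get(name, 0) + 1
        self.scores.setdefault(name, 0.0)
        self._total_attempts += 1
        # Decay is driven by the global attempt count, not by successes.
        if self._total_attempts % DECAY_INTERVAL == 0:
            for key in self.scores:
                self.scores[key] *= DECAY_FACTOR

    def record_success(self, name: str) -> None:
        self.scores[name] = self.scores.get(name, 0.0) + REWARD

    def weight(self, name: str) -> float:
        # Candidates still in their grace period are not judged by score yet.
        if self.attempts.get(name, 0) < MIN_ATTEMPTS:
            return NEUTRAL_WEIGHT
        return max(self.scores.get(name, 0.0), WEIGHT_FLOOR)

    def choose(self, candidates: list[str], rng=random) -> str:
        weights = [self.weight(c) for c in candidates]
        return rng.choices(candidates, weights=weights, k=1)[0]
```

The floor guarantees that a mutator with a score of zero still has a nonzero selection probability, so a strategy that was unlucky early can recover later in the campaign.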

fuzz_run_stats.json

A human-readable JSON file that acts as a high-level dashboard for the entire fuzzing campaign. Updated after every session.

Key fields:

  • total_sessions: Total number of parents selected for mutation.
  • total_mutations: Total number of child test cases executed.
  • crashes_found: Total crashes discovered.
  • timeouts_found: Total timeouts discovered.
  • new_coverage_finds: Total times a new, unique test case was added to the corpus.
  • sum_of_mutations_per_find: Cumulative sum used to compute average mutations per find.
  • global_seed_counter: Persistent counter ensuring every mutation attempt across all runs has a unique, deterministic seed.
  • corpus_file_counter: Persistent counter for generating unique integer filenames for new corpus files.
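
As a small worked example, the average mutations per find can be derived from these fields. The sketch assumes the divisor is new_coverage_finds, which the text implies but does not state outright; the function name and the sample values are made up for illustration.

```python
import json
from typing import Optional


def average_mutations_per_find(stats: dict) -> Optional[float]:
    """Return sum_of_mutations_per_find / new_coverage_finds, or None before the first find."""
    finds = stats.get("new_coverage_finds", 0)
    if not finds:
        return None
    # ASSUMPTION: the cumulative sum is averaged over the number of corpus additions.
    return stats.get("sum_of_mutations_per_find", 0) / finds


# Hypothetical excerpt of fuzz_run_stats.json; the numbers are invented.
sample = json.loads("""
{
  "total_sessions": 40,
  "total_mutations": 4000,
  "new_coverage_finds": 8,
  "sum_of_mutations_per_find": 3600
}
""")
print(average_mutations_per_find(sample))  # 450.0
```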

logs/run_metadata.json

Generated by lafleur/metadata.py at startup. Captures the identity and environment of the fuzzing instance. Supports identity persistence: if the file already exists, the existing run_id and instance_name are preserved while dynamic fields are updated.

Key sections:

  • Identity: run_id (UUID), instance_name (hostname-based default or user-specified).
  • Hardware: cpu_count_physical, cpu_count_logical, total_ram_gb.
  • Environment: hostname, os (platform string), python_version, python_config_args (the ./configure flags used to build the target CPython — this is how lafleur-report detects JIT and ASAN status).
  • Configuration: Command-line arguments passed to lafleur.

This file is consumed by lafleur-report (header section) and lafleur-campaign (fleet aggregation).
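
The identity persistence rule can be sketched as follows. This is not the code in lafleur/metadata.py: the function name, the flat JSON layout, and the "lafleur-<hostname>" default are assumptions made for the example. What it illustrates is the rule from the text: run_id and instance_name survive a restart, while dynamic fields are refreshed.

```python
import json
import platform
import socket
import uuid
from pathlib import Path
from typing import Optional


def load_or_create_metadata(path: Path, instance_name: Optional[str] = None) -> dict:
    """Refresh dynamic fields while keeping any identity already stored on disk."""
    existing = {}
    if path.exists():
        existing = json.loads(path.read_text(encoding="utf-8"))

    hostname = socket.gethostname()
    metadata = {
        # Identity: reuse stored values if present, otherwise create them once.
        "run_id": existing.get("run_id") or str(uuid.uuid4()),
        "instance_name": existing.get("instance_name")
        or instance_name
        or f"lafleur-{hostname}",  # ASSUMPTION: format of the hostname-based default
        # Dynamic fields: always overwritten with current values.
        "hostname": hostname,
        "os": platform.platform(),
    }
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return metadata
```

Keeping run_id stable across restarts is what would let a fleet aggregator treat a resumed instance as the same campaign member rather than a new one.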


corpus_stats.json

Generated by the TelemetryManager using lafleur/corpus_analysis.py. Contains evolutionary statistics about the corpus, updated periodically during the run.

Key fields:

  • Global counts: total_files, sterile_count, sterile_rate, viable_count.
  • Distributions (each with min/max/mean/median): file_size_distribution, execution_time_distribution, lineage_depth_distribution, mutations_since_find_distribution.
  • Tree topology: root_count (seed files), leaf_count (files that never produced children), max_depth (deepest lineage).
  • Mutation intelligence: successful_strategies and successful_mutators — histograms of which strategies and individual transformers produced the most corpus additions, ranked by frequency.

This file is consumed by lafleur-report (Corpus Evolution section) and is useful for understanding how the fuzzer's strategy mix evolves over a campaign.
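
For orientation, the following sketch derives a few of these fields from per_file_coverage metadata. It is not the logic in lafleur/corpus_analysis.py. The function names are hypothetical, and several definitions are assumptions: pruned tombstones are excluded, viable_count is taken as non-sterile files, and a leaf is a file that no live file names as its parent.

```python
import statistics


def summarize(values: list) -> dict:
    """Return the min/max/mean/median summary used for each distribution."""
    if not values:
        return {"min": None, "max": None, "mean": None, "median": None}
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
    }


def corpus_stats_sketch(per_file_coverage: dict) -> dict:
    # ASSUMPTION: tombstoned (pruned) entries do not count as corpus files.
    live = {n: m for n, m in per_file_coverage.items() if not m.get("is_pruned")}
    total = len(live)
    sterile = sum(1 for m in live.values() if m.get("is_sterile"))
    parents = {m["parent_id"] for m in live.values() if m.get("parent_id")}
    depths = [m.get("lineage_depth", 0) for m in live.values()]
    return {
        "total_files": total,
        "sterile_count": sterile,
        "sterile_rate": sterile / total if total else 0.0,
        "viable_count": total - sterile,  # ASSUMPTION: viable means not sterile
        "root_count": sum(1 for m in live.values() if m.get("parent_id") is None),
        "leaf_count": sum(1 for name in live if name not in parents),
        "max_depth": max(depths, default=0),
        "lineage_depth_distribution": summarize(depths),
    }
```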


logs/timeseries_... .jsonl

A time-series log in JSON Lines (.jsonl) format — each line is a complete, independent JSON object.

Every 10 sessions, the TelemetryManager takes a snapshot of the current fuzz_run_stats.json contents (plus runtime telemetry like system load, RSS memory, disk usage) and appends it as a new line. This creates a detailed historical record of the fuzzer's performance over the course of a run, suitable for graphing or trend analysis.
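
The append-only JSON Lines pattern described above can be sketched like this. The function names, the "timestamp" key, and the flat merge of stats with runtime telemetry are assumptions for illustration; the TelemetryManager's real record layout may differ.

```python
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional

SNAPSHOT_INTERVAL = 10  # "every 10 sessions", from the text above


def maybe_append_snapshot(
    log_path: Path, stats: dict, session_number: int, extra: Optional[dict] = None
) -> bool:
    """Append one JSON object per line, but only on every 10th session."""
    if session_number % SNAPSHOT_INTERVAL != 0:
        return False
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),  # ASSUMPTION: key name
        **stats,
        **(extra or {}),  # e.g. system load, RSS memory, disk usage
    }
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return True


def read_snapshots(log_path: Path) -> list:
    """Parse the log back into a list of dicts, one per non-empty line."""
    with log_path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Because each line is an independent object, a truncated final line after a crash costs at most one snapshot, and the rest of the file stays parseable line by line.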


logs/mutator_effectiveness.jsonl

A time-series log, also in JSON Lines format, that provides telemetry for the adaptive learning engine.

At the same interval as the main timeseries log, the MutatorScoreTracker saves a snapshot containing:

  • timestamp: ISO 8601 timestamp.
  • scores: Complete dictionary of all mutator and strategy scores.
  • attempts: Complete dictionary of attempt counts.
  • success_rates: Calculated score / attempts ratio for each entry.

This file is essential for post-run analysis to understand how the fuzzer's strategy preferences evolved over time.
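
A snapshot with the four fields listed above could be assembled as in this sketch. The function name is hypothetical, and returning 0.0 for a candidate with zero attempts is an assumption made to avoid division by zero; the MutatorScoreTracker may handle that case differently.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def effectiveness_snapshot(
    scores: dict, attempts: dict, now: Optional[datetime] = None
) -> dict:
    """Build one record with timestamp, scores, attempts, and success_rates."""
    now = now or datetime.now(timezone.utc)
    success_rates = {
        # score / attempts, as described above; 0.0 when never tried (ASSUMPTION)
        name: (scores.get(name, 0.0) / count if count else 0.0)
        for name, count in attempts.items()
    }
    return {
        "timestamp": now.isoformat(),
        "scores": dict(scores),
        "attempts": dict(attempts),
        "success_rates": success_rates,
    }


snap = effectiveness_snapshot({"havoc": 3.0}, {"havoc": 12})
print(json.dumps(snap["success_rates"]))  # {"havoc": 0.25}
```

Note that because scores decay over time, this ratio is a decayed score per attempt rather than a literal fraction of successful attempts.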


Output Directories

The fuzzer saves its findings into several output directories relative to the current working directory.

  • corpus/: The main collection of "interesting" test cases that have discovered new JIT coverage. Each file is a standalone Python script. The per_file_coverage entries in coverage_state.pkl provide the metadata for each file.

  • crashes/: Test cases that caused a hard crash (SegFault, ASAN violation, assertion failure) or matched a fatal error keyword.

    • Standard Mode: Individual files like crash_123.py with accompanying .log files.
    • Session Mode: Session Bundles — directories (e.g., session_crash_20250106_120000_1234/) preserving the full context:
      • 00_warmup.py: The parent script that warmed the JIT.
      • 01_script.py (if present): Intermediate polluter scripts.
      • 02_attack.py: The mutated child that triggered the crash.
      • reproduce.sh: Generated script to replay the bundle via python -m lafleur.driver.
      • metadata.json: Crash fingerprint containing type (ASAN, ASSERTION, SIGNAL, etc.), crash_type (enum value), returncode, signal_name, fingerprint (e.g., "ASAN:heap-use-after-free:_PyFrame_Traverse"), and timestamp. This file is what lafleur-triage reads for deduplication and registry management.
  • timeouts/: Test cases that caused the child process to hang beyond the configured timeout. Large log files are automatically compressed with zstd.

  • coverage/: Contains coverage_state.pkl and mutator_scores.json as described above.

  • logs/: Contains run_metadata.json, timeseries JSONL files, mutator_effectiveness.jsonl, and (if --keep-tmp-logs is used) the run_logs/ subdirectory with retained per-mutation log files.
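
Since each session bundle carries a metadata.json with a fingerprint, duplicate crashes can be grouped with a few lines of Python. The sketch below is a standalone illustration and does not reproduce how lafleur-triage works; the function name and the "UNKNOWN" fallback for a missing fingerprint are assumptions.

```python
import json
from collections import defaultdict
from pathlib import Path


def group_bundles_by_fingerprint(crashes_dir: Path) -> dict:
    """Map each crash fingerprint to the session bundle directories that share it."""
    groups = defaultdict(list)
    for meta_path in sorted(crashes_dir.glob("*/metadata.json")):
        try:
            meta = json.loads(meta_path.read_text(encoding="utf-8"))
        except (OSError, json.JSONDecodeError):
            continue  # skip unreadable or partially written bundles
        # ASSUMPTION: bundles without a fingerprint are grouped under "UNKNOWN".
        groups[meta.get("fingerprint", "UNKNOWN")].append(meta_path.parent.name)
    return dict(groups)


if __name__ == "__main__":
    for fingerprint, bundles in group_bundles_by_fingerprint(Path("crashes")).items():
        print(f"{fingerprint}: {len(bundles)} bundle(s)")
```

Run from the campaign's working directory, this prints one line per distinct fingerprint, which gives a quick view of how many unique crashes a run has produced.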