Protobuf and Schemas in PBJ

This document describes how PBJ implements the Protocol Buffers specification. It covers which proto3 features are supported, how PBJ maps protobuf concepts to Java, and where PBJ intentionally deviates from standard protobuf behavior.

For details on the compiler pipeline, see code-generation.md. For codec internals, see codecs.md, codec-protobuf.md, and codec-json.md. For usage examples, see usage-guide.md.

Introduction

PBJ is a proto3-only protobuf implementation. It parses .proto schema files and generates Java source code — model classes, serialization codecs, and tests — that is wire-compatible with Google's protoc compiler.

PBJ was built for the Hiero Consensus Node, a distributed ledger system where serialized bytes are hashed and digitally signed as part of consensus. This use case drives several design decisions that differ from standard protobuf implementations:

Deterministic encoding — identical objects always produce identical bytes, so hashes and signatures are reliable
Explicit nullability — missing fields return null rather than default values, forcing developers to handle absence explicitly
Stable hashCode() and equals() — adding new fields with default values does not break existing hash maps
Performance and minimal garbage — lazy computation, cached sizes, immutable objects, and direct byte-array write paths

PBJ produces the same wire encoding as protoc. Any protobuf message serialized by PBJ can be deserialized by protoc-generated code, and vice versa. The differences are in the Java API surface and serialization guarantees, not in the wire format.

Proto3 Language Support

PBJ supports the following proto3 syntax elements:

Feature	Supported	Notes
Messages	Yes	Including nested messages
Enums	Yes	Including nested enums
Services and RPCs	Yes	All four gRPC call types
Oneof	Yes	Type-safe discriminated union
Map fields	Yes	Deterministic key ordering
Repeated fields	Yes	Packed encoding for numerics
Imports	Yes	Regular, weak, and public
Options	Partial	Standard options plus PBJ-specific extensions
Reserved fields/names	Yes	Parsed and enforced at schema level
Deprecation	Yes	Maps to `@Deprecated` annotation
Documentation comments	Yes	`/** */` comments become Javadoc
Packages	Yes	With custom PBJ package resolution
`optional` keyword	No	See Nullability and Field Presence
`proto2` syntax	No	Proto3 only
Groups	No	Deprecated proto2 feature
Extensions	No	Proto2 feature

Package Resolution

PBJ resolves Java packages for generated code using a priority chain:

PBJ comment option (highest priority): // <<<pbj.java_package = "com.example.package">>> in the proto file
Per-definition PBJ options: pbj.message_java_package, pbj.enum_java_package, or pbj.service_java_package
Standard java_package option + optional javaPackageSuffix from the Gradle plugin
Proto package statement + javaPackageSuffix (fallback)

The javaPackageSuffix is configured in build.gradle.kts:

pbj { javaPackageSuffix = ".pbj" }

This allows PBJ-generated and protoc-generated classes to coexist in the same project under different packages.
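
The priority chain above can be read as a first-match lookup. The following is a minimal sketch of that idea, not PBJ's compiler code: the class `PackageResolutionSketch`, the method `resolve`, and its parameter names are hypothetical, and `null` stands for an option that was not specified. It also assumes that the comment option and the per-definition options are used verbatim, without the suffix.

```java
final class PackageResolutionSketch {
    /**
     * Returns the Java package for a generated type, checking the sources
     * from highest to lowest priority. A null argument means "not specified".
     */
    static String resolve(String pbjCommentPackage,
                          String perDefinitionPackage,
                          String javaPackageOption,
                          String protoPackage,
                          String javaPackageSuffix) {
        String suffix = javaPackageSuffix == null ? "" : javaPackageSuffix;
        if (pbjCommentPackage != null) {
            return pbjCommentPackage;            // 1. PBJ comment option
        }
        if (perDefinitionPackage != null) {
            return perDefinitionPackage;         // 2. per-definition PBJ option
        }
        if (javaPackageOption != null) {
            return javaPackageOption + suffix;   // 3. java_package + suffix
        }
        return protoPackage + suffix;            // 4. proto package + suffix (fallback)
    }
}
```

With only `java_package = "e.f"` set and a suffix of `".pbj"`, this sketch yields `e.f.pbj`, which keeps the PBJ classes apart from protoc classes generated into `e.f`.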

Scalar Types

PBJ supports all 15 proto3 scalar types. Each maps to a Java primitive (or String/Bytes) and uses the standard protobuf wire encoding:

Proto type	Java type	Boxed type	Wire type	Encoding
`double`	`double`	`Double`	Fixed 64-bit (1)	IEEE 754 double
`float`	`float`	`Float`	Fixed 32-bit (5)	IEEE 754 float
`int32`	`int`	`Integer`	Varint (0)	Signed varint (negative values use 10 bytes)
`int64`	`long`	`Long`	Varint (0)	Signed varint (negative values use 10 bytes)
`uint32`	`int`	`Integer`	Varint (0)	Unsigned varint
`uint64`	`long`	`Long`	Varint (0)	Unsigned varint
`sint32`	`int`	`Integer`	Varint (0)	ZigZag-encoded varint (efficient for negative values)
`sint64`	`long`	`Long`	Varint (0)	ZigZag-encoded varint (efficient for negative values)
`fixed32`	`int`	`Integer`	Fixed 32-bit (5)	Little-endian 4 bytes
`fixed64`	`long`	`Long`	Fixed 64-bit (1)	Little-endian 8 bytes
`sfixed32`	`int`	`Integer`	Fixed 32-bit (5)	Little-endian 4 bytes, signed
`sfixed64`	`long`	`Long`	Fixed 64-bit (1)	Little-endian 8 bytes, signed
`bool`	`boolean`	`Boolean`	Varint (0)	0 = false, 1 = true
`string`	`String`	`String`	Length-delimited (2)	UTF-8 encoded
`bytes`	`Bytes`	`Bytes`	Length-delimited (2)	Raw bytes

Note that PBJ uses its own Bytes type (an immutable byte-sequence wrapper) rather than byte[] or ByteString. Bytes prevents accidental mutation and supports efficient operations like writeTo(MessageDigest) without copying.

Varint Encoding

Varints are the standard protobuf variable-width integer encoding. Each byte carries 7 bits of data plus 1 continuation bit, so small non-negative values use fewer bytes (1 byte for 0 to 127). Negative int32 and int64 values are sign-extended to 64 bits and therefore always occupy 10 bytes; prefer sint32/sint64 with ZigZag encoding if negative values are common.
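
To make the byte counts concrete, here is a minimal, self-contained varint encoder. It is a sketch of the standard protobuf algorithm, not PBJ's write path; `VarintSketch` and `encodeVarint` are illustrative names.

```java
import java.io.ByteArrayOutputStream;

final class VarintSketch {
    /** Encodes a 64-bit value as a protobuf varint, least-significant 7-bit group first. */
    static byte[] encodeVarint(long value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(10);
        // While more than 7 significant bits remain, emit 7 bits with the continuation bit set.
        while ((value & ~0x7FL) != 0) {
            out.write((int) ((value & 0x7F) | 0x80));
            value >>>= 7; // unsigned shift, so negative inputs stop after 10 bytes
        }
        out.write((int) value); // last byte: continuation bit clear
        return out.toByteArray();
    }
}
```

`encodeVarint(300)` produces the two bytes `0xAC 0x02`, while `encodeVarint(-1)` produces ten bytes, which is the cost the paragraph above warns about.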

ZigZag Encoding

ZigZag encoding maps signed integers to unsigned integers so that numbers of small magnitude, positive or negative, get short varint encodings: 0 → 0, -1 → 1, 1 → 2, -2 → 3, etc. Used by sint32 and sint64.
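
The mapping is a pair of shift-and-xor operations. The sketch below shows the standard formulas; the class and method names are illustrative and are not taken from PBJ.

```java
final class ZigZagSketch {
    /** Maps a signed 32-bit value to its ZigZag form: 0, -1, 1, -2 become 0, 1, 2, 3. */
    static int encode32(int n) {
        return (n << 1) ^ (n >> 31); // arithmetic shift spreads the sign bit
    }

    /** Inverse of encode32. */
    static int decode32(int n) {
        return (n >>> 1) ^ -(n & 1);
    }

    /** 64-bit variant, as used by sint64. */
    static long encode64(long n) {
        return (n << 1) ^ (n >> 63);
    }

    /** Inverse of encode64. */
    static long decode64(long n) {
        return (n >>> 1) ^ -(n & 1);
    }
}
```

The encoded result is then written as an ordinary varint, so `-1` costs one byte instead of ten.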

Messages and Fields

For each protobuf message, PBJ generates an immutable Java class (not a Java record — see below) with:

Private final fields for each proto field
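Getter methods: foo() returns the value (null if absent for message and wrapper-typed fields, the default value for primitives, String, and Bytes); fooOrElse(defaultValue) returns a caller-supplied fallback when the field is absent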
A Builder inner class with fluent setter methods for construction
Static codec instances: PROTOBUF (binary) and JSON (JSON format)
A DEFAULT singleton with all fields at their default values

Example

Given this proto definition:

message HelloRequest {
  string name = 1;
  int32 count = 2;
}

PBJ generates:

public final class HelloRequest {
    public static final Codec<HelloRequest> PROTOBUF = new HelloRequestProtoCodec();
    public static final JsonCodec<HelloRequest> JSON = new HelloRequestJsonCodec();
    public static final HelloRequest DEFAULT = new HelloRequest("", 0);

    private final String name;
    private final int count;

    // Getters
    public String name() { ... }            // never null; returns "" if not present
    public String nameOrElse(String def) { ... }  // returns def if not present
    public int count() { ... }
    public int countOrElse(int def) { ... }

    // Builder
    public static final class Builder {
        public Builder name(String name) { ... }
        public Builder count(int count) { ... }
        public HelloRequest build() { ... }
    }

    public static Builder newBuilder() { ... }
    public Builder copyBuilder() { ... }
}

Why Not Java Records?

PBJ generates regular immutable classes rather than Java records because:

Records cannot hold lazily computed state. Every instance field of a record is a final component, whereas PBJ caches hashCode() and protobufEncodedSize() in non-final fields that are filled in on first access.
Records have limited constructor flexibility. PBJ needs multiple constructor overloads for unknown field handling and internal precomputation.

This decision will be revisited when JEP 526: Lazy Constants is finalized. Lazy constants would allow records to have lazily-computed cached fields like hashCode() and protobufEncodedSize(), removing the primary blocker for record-based generation while preserving the performance optimization.

PBJ also tracks JEP 401: Value Classes and Objects, which could allow generated model objects to be value objects without identity, giving the JVM freedom to flatten or scalarize them instead of allocating each one on the heap.

Default Values and Wire Semantics

Proto3 defines default values for each type: 0 for numeric types, false for bools, "" for strings, empty bytes for bytes. These defaults have two important consequences in PBJ:

Fields with default values are not serialized on the wire — this is standard proto3 behavior. A message with count = 0 serializes identically to a message where count was never set.
Fields with default values are excluded from hashCode() and equals() — this is a PBJ-specific design choice. It means adding a new field to a message definition does not change the hash of existing objects that don't set the new field. This stability is critical for long-lived hash maps in the Hiero consensus node.

Nullability and Field Presence

This is PBJ's most significant deviation from standard protobuf Java code generation.

Standard Proto3 Behavior

In proto3, there is no distinction between "field was not set" and "field was set to its default value." Google's protoc-generated Java code returns the default value in both cases:

// protoc-generated code
msg.getCount()  // returns 0 whether count was set to 0 or never set
msg.getName()   // returns "" whether name was set to "" or never set

PBJ Behavior

PBJ returns null for absent fields of reference types, such as nested messages and wrapper types. Scalar fields that map to Java primitives, String, or Bytes fall back to their default value instead, and the fooOrElse() variant lets the caller supply a different fallback:

// PBJ-generated code
msg.name()              // returns "" if name was not on the wire (String is never null)
msg.nameOrElse("")      // returns "" if name was not on the wire
msg.count()             // returns 0 (Java primitive, cannot be null)
msg.countOrElse(42)     // returns 42 if count was not on the wire

For primitive fields (int, long, float, double, boolean), the getter returns the Java default (0, 0L, 0.0f, 0.0, false) when absent — Java primitives cannot be null. The fooOrElse() method provides a way to distinguish "not set" from "set to default" for these types.

For reference types (message types and boxed primitive types), the getter returns null when absent. Note that String, Bytes, and similar "primitive-like" types are never null — they follow the same convention as Java primitives, returning empty/default values when absent.

Rationale

This design forces developers to explicitly handle the case where a field is missing, rather than silently receiving a default value. In a consensus system, confusing "not set" with "set to zero" could lead to incorrect state transitions or agreement failures. PBJ's approach, analogous to Java's checked exceptions, makes these edge cases visible at the call site.

Wrapper Types (Optional Value Types)

Proto3 supports wrapper types from google/protobuf/wrappers.proto as a convention for optional scalar values:

Wrapper type	Unwrapped Java type
`google.protobuf.StringValue`	`String` (nullable)
`google.protobuf.Int32Value`	`Integer` (nullable)
`google.protobuf.UInt32Value`	`Integer` (nullable)
`google.protobuf.Int64Value`	`Long` (nullable)
`google.protobuf.UInt64Value`	`Long` (nullable)
`google.protobuf.FloatValue`	`Float` (nullable)
`google.protobuf.DoubleValue`	`Double` (nullable)
`google.protobuf.BoolValue`	`Boolean` (nullable)
`google.protobuf.BytesValue`	`Bytes` (nullable)

PBJ recognizes these wrapper types and generates nullable boxed Java types instead of nested message objects. On the wire, they are still encoded as nested messages (matching protoc), but the generated API presents them as simple nullable values.

Proto3 `optional` Keyword

Proto3 later introduced the optional keyword for explicit field presence (tracking whether a field was explicitly set). PBJ does not yet support it. Use wrapper types or oneof fields when you need to distinguish "not set" from "set to default."

Enums

PBJ generates a Java enum for each protobuf enum, implementing EnumWithProtoMetadata:

enum Suit {
  SUIT_UNSPECIFIED = 0;
  SUIT_HEARTS = 1;
  SUIT_DIAMONDS = 2;
}

Generates:

public enum Suit implements EnumWithProtoMetadata {
    SUIT_UNSPECIFIED(0),
    SUIT_HEARTS(1),
    SUIT_DIAMONDS(2);

    public int protoOrdinal() { ... }    // wire value
    public String protoName() { ... }    // original proto name

    public static Suit fromProtobufOrdinal(int ordinal) { ... }
    public static Suit fromString(String name) { ... }
}

Key behaviors

First value must be 0 — this is a proto3 requirement and serves as the default value.
Unknown enum values are preserved as raw Integer values rather than mapped to a sentinel constant. This supports forward compatibility: older code can read enum values added in newer schema versions without losing information.
@Deprecated annotations are applied when deprecated = true is set on an enum value in the proto file.
Enum values are encoded as varints on the wire, using their numeric value (not their ordinal position in the Java enum).

Oneof Fields

Protobuf oneof declares a set of fields where at most one can be set at a time. PBJ represents oneofs using a type-safe discriminated union:

message Account {
  oneof staked_id {
    int64 staked_account_id = 1;
    int64 staked_node_id = 2;
  }
}

PBJ generates:

An inner enum for the oneof variants:

public enum StakedIdOneOfType implements EnumWithProtoMetadata {
    UNSET(-1),
    STAKED_ACCOUNT_ID(1),
    STAKED_NODE_ID(2);
}

A OneOf<StakedIdOneOfType> field on the model class:

public OneOf<StakedIdOneOfType> stakedId() { ... }

Typed convenience accessors:

public long stakedAccountId()                   // returns value or default
public boolean hasStakedAccountId()              // presence check
public long stakedAccountIdOrElse(long def)      // with fallback
public long stakedAccountIdOrThrow()             // throws if not set

`OneOf<E>` and `ComparableOneOf<E>`

The OneOf<E> type is a record with two components:

kind() — the discriminator enum value (e.g., STAKED_ACCOUNT_ID, STAKED_NODE_ID, or UNSET)
as() — the value, cast to the appropriate type

ComparableOneOf<E> extends this with Comparable support for fields marked with pbj.comparable.
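
To illustrate how a discriminated union of this shape is consumed, here is a self-contained sketch. It does not use PBJ's runtime classes: `OneOfSketch`, `StakedIdKind`, and `describe` are hypothetical stand-ins, and storing the payload in a component named `value` with `as()` as a casting method is an assumption of the sketch.

```java
final class OneOfDemo {
    /** Hypothetical discriminator, modeled on the StakedIdOneOfType enum above. */
    enum StakedIdKind { UNSET, STAKED_ACCOUNT_ID, STAKED_NODE_ID }

    /** A minimal discriminated union: a kind tag plus an untyped payload. */
    record OneOfSketch<E extends Enum<E>>(E kind, Object value) {
        /** Returns the payload cast to the type the caller expects. */
        @SuppressWarnings("unchecked")
        <T> T as() {
            return (T) value;
        }
    }

    /** Checks the discriminator first, then reads the payload with the matching type. */
    static String describe(OneOfSketch<StakedIdKind> stakedId) {
        return switch (stakedId.kind()) {
            case STAKED_ACCOUNT_ID -> "account " + stakedId.<Long>as();
            case STAKED_NODE_ID -> "node " + stakedId.<Long>as();
            case UNSET -> "unset";
        };
    }
}
```

Switching on the enum keeps the handling exhaustive: adding a new alternative to the oneof adds an enum constant, and the compiler flags every switch expression that does not cover it.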

Wire Format

On the wire, oneof fields are encoded as regular fields — the oneof constraint is not visible in the encoding. If multiple alternatives appear in the wire data, the last one wins (standard protobuf behavior).

Repeated Fields

Repeated fields are the protobuf equivalent of lists/arrays:

message Block {
  repeated Transaction transactions = 1;
  repeated int32 numbers = 2;
}

PBJ generates immutable List<T> fields:

public List<Transaction> transactions() { ... }  // unmodifiable list
public List<Integer> numbers() { ... }            // boxed for generics

Key behaviors

Immutable after construction — lists are backed by UnmodifiableArrayList, a PBJ runtime type that is marked read-only after parsing completes.
Packed encoding for numeric repeated fields — multiple values are concatenated into a single length-delimited field on the wire, reducing overhead. PBJ writes packed encoding and accepts both packed and unpacked on read (per the proto3 spec).
Empty list default — a repeated field that is not present on the wire returns Collections.emptyList(), not null.
Size limits — the parse method enforces maxSize on the number of elements to prevent denial-of-service via extremely large repeated fields.
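
The packed layout can be sketched in a few lines. This is an illustration of the standard wire format for a repeated int32 field, not PBJ's codec; `PackedRepeatedSketch`, `writeVarint`, and `writePackedInt32` are illustrative names.

```java
import java.io.ByteArrayOutputStream;
import java.util.List;

final class PackedRepeatedSketch {
    private static final int WIRE_TYPE_LENGTH_DELIMITED = 2;

    /** Writes a 64-bit value as a protobuf varint. */
    static void writeVarint(ByteArrayOutputStream out, long value) {
        while ((value & ~0x7FL) != 0) {
            out.write((int) ((value & 0x7F) | 0x80));
            value >>>= 7;
        }
        out.write((int) value);
    }

    /** Encodes a repeated int32 field in packed form: one tag, the payload length, then the varints. */
    static byte[] writePackedInt32(int fieldNumber, List<Integer> values) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        if (values.isEmpty()) {
            return out.toByteArray(); // an empty list writes nothing
        }
        ByteArrayOutputStream payload = new ByteArrayOutputStream();
        for (int v : values) {
            writeVarint(payload, v); // int widens to long with sign extension
        }
        writeVarint(out, ((long) fieldNumber << 3) | WIRE_TYPE_LENGTH_DELIMITED);
        writeVarint(out, payload.size());
        out.write(payload.toByteArray(), 0, payload.size());
        return out.toByteArray();
    }
}
```

For field 2 holding `[1, 2, 3]` the result is `12 03 01 02 03`: a single tag and length followed by three one-byte varints, where the unpacked form would repeat the tag before every element.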

Maps

Map fields declare key-value associations:

message Config {
  map<string, int32> settings = 1;
}

PBJ generates a PbjMap<K, V> field — an immutable map implementation that supports deterministic key ordering:

public PbjMap<String, Integer> settings() { ... }

Wire format

On the wire, maps are encoded as repeated length-delimited entries, each containing a key (field 1) and value (field 2) sub-field. This is the standard protobuf map encoding.

Deterministic ordering

This is a PBJ-specific guarantee. Standard protobuf does not define map iteration order. PBJ's PbjMap provides a getSortedKeys() method that returns keys in their natural sort order. During serialization, map entries are always written in sorted key order. This ensures:

Identical maps produce identical bytes
Hash computation over serialized maps is deterministic
Digital signatures over messages containing maps are reproducible
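
The effect of sorted-key serialization can be demonstrated without PBJ. The sketch below encodes a `map<string, int32>` field using the standard entry layout and iterates a `TreeMap` copy to fix the order. It is not PbjMap or PBJ's codec: the class and method names are illustrative, and Java's natural `String` ordering is assumed as the sort order.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;

final class SortedMapEncodingSketch {
    /** Writes a 64-bit value as a protobuf varint. */
    static void writeVarint(ByteArrayOutputStream out, long value) {
        while ((value & ~0x7FL) != 0) {
            out.write((int) ((value & 0x7F) | 0x80));
            value >>>= 7;
        }
        out.write((int) value);
    }

    /** Encodes map<string, int32> entries for the given field number in sorted key order. */
    static byte[] writeStringIntMap(int fieldNumber, Map<String, Integer> map) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // A TreeMap iterates in natural key order, whatever order the input map used.
        for (Map.Entry<String, Integer> e : new TreeMap<>(map).entrySet()) {
            byte[] key = e.getKey().getBytes(StandardCharsets.UTF_8);
            ByteArrayOutputStream entry = new ByteArrayOutputStream();
            writeVarint(entry, (1 << 3) | 2);   // key: field 1, length-delimited
            writeVarint(entry, key.length);
            entry.write(key, 0, key.length);
            writeVarint(entry, (2 << 3) | 0);   // value: field 2, varint
            writeVarint(entry, e.getValue());
            writeVarint(out, ((long) fieldNumber << 3) | 2); // each entry is a nested message
            writeVarint(out, entry.size());
            out.write(entry.toByteArray(), 0, entry.size());
        }
        return out.toByteArray();
    }
}
```

Two maps built with different insertion orders produce byte-for-byte identical output here, which is the property that hashing and signing depend on.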

Allowed key types

Per the protobuf specification, map keys can be any scalar type except float, double, and bytes. Map values can be any type (scalar, enum, or message) except another map.

Deterministic Encoding

PBJ guarantees deterministic binary encoding: the same logical message always serializes to exactly the same bytes. This is achieved through three ordering rules:

Fields are always written in ascending field number order — the proto3 spec allows fields in any order, but PBJ always writes them sorted.
Map entries are sorted by key — using the natural sort order of the key type (lexicographic for strings, numeric for integers, etc.).
Unknown fields are sorted by field number — unknown fields collected during parsing are written in field-number order during re-serialization.

Why Deterministic Encoding Matters

Standard protobuf explicitly does not guarantee deterministic encoding. The protobuf documentation warns that serialization output may change between library versions and should not be relied upon for hashing or comparison.

PBJ takes the opposite stance because it is designed for a consensus network. In the Hiero consensus node:

Nodes must agree on the hash of a serialized transaction — non-deterministic encoding would cause consensus failures
Digital signatures over serialized messages must be verifiable by any node — different byte orderings would invalidate signatures
State snapshots are hashed for integrity verification — deterministic encoding ensures all nodes compute the same hash

Wire Compatibility

Despite the deterministic ordering, PBJ's wire encoding is fully compatible with protoc. The bytes PBJ produces are valid protobuf and can be parsed by any compliant protobuf implementation. The difference is that PBJ's output is a specific canonical form of the many valid encodings that protobuf allows.

Stable `hashCode()` and `equals()`

PBJ generates hashCode() and equals() implementations with a specific stability guarantee: fields with default values are excluded from the computation.

How It Works

For a message like:

message Account {
  int64 account_id = 1;
  string memo = 2;
  int64 balance = 3;   // added in a later schema version
}

If balance has its default value (0), it is not included in hashCode() or equals(). This means:

An Account object parsed from bytes that predate the balance field has the same hash as an Account object with balance = 0
Existing hash maps that use Account as a key continue to work correctly after the field is added
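
The following sketch shows why skipping default-valued fields keeps hashes stable across a schema change. It is an illustration only: the mixing function (multiply by 31 and add) and the names `hashV1` and `hashV2` are assumptions, since this document does not describe the hash algorithm PBJ actually generates.

```java
final class StableHashSketch {
    /** Hash for the original schema, which has only account_id and memo. */
    static int hashV1(long accountId, String memo) {
        int result = 1;
        if (accountId != 0) {
            result = 31 * result + Long.hashCode(accountId);
        }
        if (!memo.isEmpty()) {
            result = 31 * result + memo.hashCode();
        }
        return result;
    }

    /** Hash after balance was added: a default balance contributes nothing. */
    static int hashV2(long accountId, String memo, long balance) {
        int result = hashV1(accountId, memo);
        if (balance != 0) {
            result = 31 * result + Long.hashCode(balance);
        }
        return result;
    }
}
```

An object hashed under the old schema and the same object hashed under the new schema with `balance = 0` land in the same bucket; only a non-default balance changes the result.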

Rationale

This mirrors the wire format semantics: fields with default values are not encoded on the wire, so they have no presence in the serialized form. PBJ extends this logic to hashCode() and equals() for consistency and to support long-lived data structures across schema evolution.

Lazy Computation

Both hashCode() and the protobuf-encoded size are computed lazily on first access and cached in internal fields ($hashCode and $protobufEncodedSize). This avoids paying the computation cost in constructors, which is important when objects are created in performance-critical paths but may never be hashed or serialized.
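
A minimal sketch of this caching pattern follows. It is not PBJ's generated code: `LazyHashSketch`, the `cachedHash` field (standing in for `$hashCode`), and the use of 0 as the "not computed yet" marker are assumptions, and the `computations` counter exists only to make the behavior observable.

```java
import java.util.Objects;

final class LazyHashSketch {
    private final long accountId;
    private final String memo;
    private int cachedHash; // 0 means "not computed yet" in this sketch
    int computations;       // instrumentation for the example only

    LazyHashSketch(long accountId, String memo) {
        this.accountId = accountId;
        this.memo = memo;
    }

    @Override
    public int hashCode() {
        int h = cachedHash;
        if (h == 0) { // first access (or the hash genuinely is 0, which just recomputes)
            computations++;
            h = Objects.hash(accountId, memo);
            cachedHash = h;
        }
        return h;
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof LazyHashSketch other
                && accountId == other.accountId
                && memo.equals(other.memo);
    }
}
```

The constructor does no hashing work at all. Because the fields are immutable and the computation is deterministic, two threads racing on the first call would simply store the same value, which is the same benign-race idiom familiar from `java.lang.String`.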

Wire Format

PBJ follows the standard Protocol Buffers encoding. Every field on the wire is preceded by a tag encoding both the field number and wire type: tag = (field_number << 3) | wire_type. For detailed wire format internals, tag constants, and codec implementation, see codec-protobuf.md.
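
As a worked example of the tag formula, the sketch below builds and splits tags. The class and method names are illustrative and are not PBJ constants or APIs.

```java
final class TagSketch {
    /** Combines a field number and a wire type into a tag value. */
    static int makeTag(int fieldNumber, int wireType) {
        return (fieldNumber << 3) | wireType;
    }

    /** Extracts the field number from a tag (the upper bits). */
    static int fieldNumber(int tag) {
        return tag >>> 3;
    }

    /** Extracts the wire type from a tag (the low three bits). */
    static int wireType(int tag) {
        return tag & 0x7;
    }
}
```

A `string name = 1` field is length-delimited (wire type 2), so its tag is `(1 << 3) | 2 = 0x0A`; an `int32 count = 2` field is a varint (wire type 0), so its tag is `0x10`. The tag itself is written as a varint.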

Unknown fields can be skipped (default), rejected in strict mode, or collected for round-trip fidelity. Safety limits (maxSize default 2 MB, maxDepth default 512) prevent denial-of-service attacks. See codecs.md for details.

JSON Mapping

PBJ implements the standard proto3 JSON mapping:

Field names are converted from proto snake_case to JSON camelCase (e.g., account_id becomes "accountId")
Only non-default fields are included in JSON output
Enums are serialized as their string name
bytes fields are serialized as base64-encoded strings
64-bit integers (int64, uint64, etc.) are serialized as strings in JSON to avoid JavaScript precision loss
Nested messages are serialized as JSON objects
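
Two of these rules are easy to show in isolation. The sketch below converts a snake_case field name to its JSON name and renders a 64-bit integer as a JSON string. It illustrates the proto3 JSON conventions listed above rather than PBJ's JSON codec, and the names `JsonNameSketch`, `toJsonName`, and `int64ToJson` are hypothetical.

```java
final class JsonNameSketch {
    /** Converts a proto snake_case field name to lowerCamelCase. */
    static String toJsonName(String protoName) {
        StringBuilder sb = new StringBuilder(protoName.length());
        boolean upperNext = false;
        for (int i = 0; i < protoName.length(); i++) {
            char c = protoName.charAt(i);
            if (c == '_') {
                upperNext = true; // drop the underscore, capitalize what follows
            } else if (upperNext) {
                sb.append(Character.toUpperCase(c));
                upperNext = false;
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Renders a 64-bit integer as a quoted JSON string so no precision is lost. */
    static String int64ToJson(long value) {
        return "\"" + value + "\"";
    }
}
```

The quoting matters for values above 2^53: `9007199254740993` cannot be represented exactly as a JavaScript number, but it survives unchanged as the string `"9007199254740993"`.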

JSON parsing uses an ANTLR-based parser (JSONParser) that builds a parse tree before walking it. This two-phase approach is simpler than streaming JSON parsing, but it holds the whole parse tree in memory and is therefore less suitable for very large payloads.

Both strict and non-strict modes are supported: strict mode throws on unrecognized JSON fields, non-strict mode silently ignores them.

Services and gRPC

PBJ generates Java interfaces for protobuf services:

service Greeter {
  rpc SayHello (HelloRequest) returns (HelloReply);
  rpc SayHelloStream (stream HelloRequest) returns (stream HelloReply);
}

Generates a GreeterServiceInterface with:

Method declarations for each RPC
A Method inner enum listing all RPCs
An open() routing method for dispatching by method enum
Default implementations that throw UnsupportedOperationException

All four gRPC call types are supported:

Call type	Client sends	Server returns
Unary	Single request	Single response
Client-streaming	Stream of requests	Single response
Server-streaming	Single request	Stream of responses
Bidirectional	Stream of requests	Stream of responses

PBJ's gRPC implementation runs on Helidon SE (HTTP/2 web client/server) rather than the standard io.grpc library. This eliminates the io.grpc dependency tree and provides low-level access to bytes and HTTP/2 frames.

PBJ-Specific Extensions

PBJ adds several custom options and conventions beyond the standard protobuf specification:

Custom Package Override

// <<<pbj.java_package = "com.example.custom.package">>>

This special comment syntax (not a standard protobuf option) overrides the Java package for all generated classes in the file. It takes highest priority in package resolution.

Per-Definition Package Options

option (pbj.message_java_package) = "com.example.messages";
option (pbj.enum_java_package) = "com.example.enums";
option (pbj.service_java_package) = "com.example.services";

These options override the Java package for specific definition types within a file.

Comparable Fields

option (pbj.comparable) = true;

When set on a message or field, PBJ generates a compareTo() method on the model class, implementing Comparable<T>. Fields are compared in their definition order.
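
Comparing fields in definition order amounts to a chained comparator. The sketch below shows the idea on a hypothetical two-field model; it uses a record for brevity, whereas PBJ generates regular classes, and the names `ComparableSketch`, `Account`, and `DEFINITION_ORDER` are not from PBJ.

```java
import java.util.Comparator;

final class ComparableSketch {
    /** Hypothetical model whose fields are declared in the order accountId, memo. */
    record Account(long accountId, String memo) implements Comparable<Account> {
        // Earlier fields take precedence; later fields only break ties.
        private static final Comparator<Account> DEFINITION_ORDER =
                Comparator.comparingLong(Account::accountId)
                        .thenComparing(Account::memo);

        @Override
        public int compareTo(Account other) {
            return DEFINITION_ORDER.compare(this, other);
        }
    }
}
```

Because the first field dominates, reordering fields in the proto definition would change the sort order of existing data, so field order is part of the contract for comparable messages.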

Gradle Plugin Configuration

pbj {
    javaPackageSuffix = ".pbj"          // suffix appended to derived package names
    generateTestClasses = true           // whether to generate unit tests (default: true)
}

Limitations and Unsupported Features

Not Supported

Feature	Reason
Proto2 syntax	PBJ is proto3-only. The ANTLR grammar requires `syntax = "proto3"`
`optional` keyword	Proto3 explicit field presence is not implemented. Use wrapper types or oneof as alternatives
Groups	Deprecated proto2 feature, not part of proto3
Extensions	Proto2 feature, not part of proto3
Custom protobuf options	Only standard options and PBJ-specific `pbj.*` options are recognized
`google.protobuf.Any`	The well-known `Any` type (runtime type embedding) is not handled specially
`google.protobuf.Duration`	Not handled as a special type — parsed as a regular message
`google.protobuf.Timestamp`	Not handled as a special type — parsed as a regular message
`google.protobuf.Struct`	Not handled as a special type — parsed as a regular message

Well-known wrapper types (StringValue, Int32Value, etc.) are supported and receive special treatment as nullable scalar fields.

Long-Term Goals

PBJ is an active project with ongoing development. Some planned areas include:

Support for all protobuf features, including the optional keyword
Auto-mapping gRPC APIs to JSON REST APIs (gRPC transcoding)
Performance optimizations (SIMD-based varint processing)
Support for additional serialization formats
