Skip to content

Add eval for explaining Collectors.toCollection choices #46

Description

@martinfrancois

Problem

The Java Streams skill should treat collect(Collectors.toCollection(...)) as a collector choice that needs justification. It is not always wrong, but it is more specific than ordinary collection, so the agent should verify that the concrete implementation matters and either simplify the collector or explain the reason in code.

This is eval-worthy because the code can be functionally correct while still being unclear: a reader cannot tell whether HashSet, ArrayList, or LinkedHashSet was chosen for mutability, ordering, membership checks, or by habit.

Code before the prompt was executed

The code had several toCollection(...) uses without comments. Some were justified but unexplained:

List<Card> normalized = boardCards.stream()
        .map(card -> normalize(card, context, config))
        .flatMap(Optional::stream)
        .filter(card -> isTerminal(card, config))
        .collect(Collectors.toCollection(ArrayList::new));

Set<String> archivedListIds = context.lists().values().stream()
        .filter(BoardList::closed)
        .map(BoardList::id)
        .collect(Collectors.toCollection(HashSet::new));

Another site required encounter-order-preserving de-duplication, but that reason was not visible:

SequencedSet<Path> unconnectedWorkflowPaths = separateUnconnectedWorkflows
        ? reportWorkflowPaths(selectedManifest, selection.workflow(), paths.configDir(), true).stream()
                .filter(workflow -> selectedWorkflowPaths.stream()
                        .noneMatch(connected -> PathsEqual.samePath(connected, workflow)))
                .collect(Collectors.toCollection(LinkedHashSet::new))
        : new LinkedHashSet<>();

The stream-formatting regression test also used toCollection(...) in sample text even though the sample only needed any collector-style terminal operation:

String source = "var labels = card.labels().stream().map(StateNames::normalize)"
        + ".collect(Collectors.toCollection(HashSet::new));";

Prompt that caused the implementation

The user asked:

every time you have collect(Collectors.toCollection anywhere, please also think about why it's necessary and add a comment explaining why it was done that way. for example to me it's not obvious why we need collect(Collectors.toCollection(LinkedHashSet::new)) vs just a regular set (i assume to keep the order? i would make it more obvious by adding a comment anyway) also make this durable

Prompt-produced code before maintainer correction

There was no new weak implementation in response to this prompt; the issue was that existing code and one earlier formatter-test sample used toCollection(...) without making the collector choice self-explanatory.

A first attempt simplified one collector to Collectors.toSet(), but that violated the repository's CollectorMutability rule because JDK Collectors.toSet() does not make mutability explicit:

Set<String> archivedListIds = context.lists().values().stream()
        .filter(BoardList::closed)
        .map(BoardList::id)
        .collect(Collectors.toSet());

Why the prompt-produced code is bad

The original toCollection(...) uses were not necessarily incorrect. The problem is that they required readers to infer intent:

  • ArrayList::new was needed because the list is appended to later.
  • HashSet::new was needed at some sites because the code later mutates the set or performs membership checks.
  • LinkedHashSet::new was needed because diagnostics de-duplicate while preserving encounter order.
  • Other HashSet::new uses were not actually necessary and could be simplified.

The intermediate Collectors.toSet() simplification was also not suitable in a codebase that enforces explicit collector mutability.

Maintainer-preferred code

Keep toCollection(...) only where the concrete collection type matters, and add a short rationale comment nearby:

List<Card> normalized = boardCards.stream()
        .map(card -> normalize(card, context, config))
        .flatMap(Optional::stream)
        .filter(card -> isTerminal(card, config))
        // Keep this mutable: archived terminal cards are appended below before de-duplication.
        .collect(Collectors.toCollection(ArrayList::new));
Set<Integer> reserved = manifest.boards().stream()
        .filter(board -> !board.boardId().equals(ignoredBoard.boardId()))
        .map(ConnectedBoard::serverPort)
        // Use a mutable HashSet so file-discovered reservations can be merged below and
        // port scanning can use constant-time membership checks.
        .collect(Collectors.toCollection(HashSet::new));
SequencedSet<Path> unconnectedWorkflowPaths = separateUnconnectedWorkflows
        ? reportWorkflowPaths(selectedManifest, selection.workflow(), paths.configDir(), true).stream()
                .filter(workflow -> selectedWorkflowPaths.stream()
                        .noneMatch(connected -> PathsEqual.samePath(connected, workflow)))
                // Diagnostics should de-duplicate paths while preserving encounter order.
                .collect(Collectors.toCollection(LinkedHashSet::new))
        : new LinkedHashSet<>();

When the concrete implementation does not matter, use an explicit simpler collector or terminal operation instead. In this repository, use Guava immutable collectors rather than ambiguous JDK Collectors.toSet():

Set<String> archivedListIds = context.lists().values().stream()
        .filter(BoardList::closed)
        .map(BoardList::id)
        .collect(toImmutableSet());

For a formatter-test sample that only needs a collector-style terminal operation, avoid unnecessary toCollection(...) entirely:

String source = "var labels = card.labels().stream().map(StateNames::normalize).collect(Collectors.toSet());";

Why the replacement is better

The replacement distinguishes semantic collector requirements from habit:

  • Remaining toCollection(...) sites document mutability, membership, or order requirements.
  • Unnecessary concrete collection choices are removed.
  • The code respects Error Prone/Picnic CollectorMutability by avoiding ambiguous JDK Collectors.toSet() in compiled code.
  • Formatter tests no longer normalize unnecessary product-code collector habits.

Desired eval behavior

  • Reward noticing Collectors.toCollection(...) and asking whether the concrete collection implementation is needed.
  • Reward simplifying unnecessary toCollection(...) uses.
  • Reward preserving toCollection(...) with a concise rationale comment when mutability, encounter order, sorted order, or membership behavior matters.
  • Reward respecting repository rules that prefer explicit mutability, such as Guava immutable collectors over JDK Collectors.toSet().
  • Reward triggering the Streams skill even when the user asks for comments and durability rather than a direct stream refactor.

Anti-patterns the eval should reject

  • Leaving toCollection(...) without explaining why that concrete type matters.
  • Replacing every toCollection(...) mechanically with toSet() or toList() without checking mutability or order requirements.
  • Adding vague comments such as // collect to set that restate the code instead of explaining the contract.
  • Keeping LinkedHashSet::new without saying that encounter order is required.
  • Introducing Collectors.toSet() in a repository that enforces explicit collector mutability.

Suggested eval name

collectors-to-collection-rationale

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions