[FedScale Core] Error handling when network package dropped #208

@continue-revolution

Description

What happened + What you expected to happen

I've noticed some common problems that occur when network packets are dropped in a real deployment, and I have some proposals for addressing them. I've discussed them with @fanlai0990, and I would like to hear from more contributors to figure out the best plan. @mosharaf @AmberLJC @ewenw @IKACE

  1. problem: the server->client UPDATE_MODEL packet is dropped, so the following server->client MODEL_TEST runs in error (the client has a stale model or no model)
    solution: ignore UPDATE_MODEL and send the model inside the MODEL_TEST packet
  2. problem: the server->client CLIENT_TRAIN packet is dropped, so the server answers the client with DUMMY_EVENT forever
    solution: keep the event in the queue until the client confirms that the event has completed
    pitfalls:
    • a multi-threaded executor may ping for the same event more than once
    • UPDATE_MODEL has no confirmation, so there is no way to tell whether UPDATE_MODEL finished
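
To make proposal 2 concrete, here is a minimal sketch of a per-client queue that only drops an event once the client confirms it. This is not FedScale's actual API: `AckedEventQueue`, `Event`, `event_id`, `peek`, and `ack` are all hypothetical names chosen for illustration. The idea is that every ping returns the head event without removing it, so a dropped reply is simply re-sent on the next ping, and the `event_id` lets a multi-threaded executor recognize a duplicate delivery.

```python
import itertools
import threading
from collections import deque
from dataclasses import dataclass
from typing import Any, Deque, Dict


@dataclass
class Event:
    event_id: int          # hypothetical field: unique, increasing id used for acks
    name: str              # e.g. "CLIENT_TRAIN", "MODEL_TEST", "DUMMY_EVENT"
    payload: Any = None    # e.g. model weights carried inside MODEL_TEST


class AckedEventQueue:
    """Hypothetical per-client queue: an event stays queued until acknowledged."""

    def __init__(self) -> None:
        self._queues: Dict[str, Deque[Event]] = {}
        self._ids = itertools.count(1)
        self._lock = threading.Lock()

    def push(self, client_id: str, name: str, payload: Any = None) -> int:
        """Enqueue an event for a client and return its id."""
        with self._lock:
            event = Event(next(self._ids), name, payload)
            self._queues.setdefault(client_id, deque()).append(event)
            return event.event_id

    def peek(self, client_id: str) -> Event:
        """Answer a ping: return the head event WITHOUT removing it."""
        with self._lock:
            queue = self._queues.get(client_id)
            if queue:
                return queue[0]
            return Event(0, "DUMMY_EVENT")

    def ack(self, client_id: str, event_id: int) -> bool:
        """Remove the head event only if the client confirms that exact id."""
        with self._lock:
            queue = self._queues.get(client_id)
            if queue and queue[0].event_id == event_id:
                queue.popleft()
                return True
            return False  # stale or duplicate ack: nothing to remove


class DedupClient:
    """Hypothetical client side: runs each event id at most once."""

    def __init__(self) -> None:
        self.last_done = 0
        self.handled = []

    def handle(self, event: Event) -> int:
        if event.name != "DUMMY_EVENT" and event.event_id > self.last_done:
            self.handled.append(event.name)   # the real work would happen here
            self.last_done = event.event_id
        return event.event_id                 # always (re-)acknowledge
```

With this shape, if the reply to a ping is lost, the next `peek` returns the same `CLIENT_TRAIN` event again, and `DedupClient.handle` skips the work if it already ran but still re-sends the ack. Proposal 1 fits the same structure by putting the model in the `payload` of the `MODEL_TEST` event. The second pitfall would only be covered if UPDATE_MODEL were also acknowledged this way, which is a design decision still open in this issue.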

Versions / Dependencies

fedscale-0.5
server: ubuntu 16
client: android 23

Reproduction script

Issue Severity

High: It blocks me from completing my task.

Labels

bug