[Reproducing baseline performance] - 1x RTX4090, Gradient Accumulation #121

@sonjt00

Hello,

I'm currently trying to reproduce your baseline performance (ResNet-50) using a single RTX 4090 GPU.

To match the effective batch size of 48 (6 * 8) that you used,
I'm using a batch size of 6 with gradient accumulation over 8 steps.
Accordingly, I've also increased the number of iterations eightfold.
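
For reference, here is a minimal sketch of the accumulation scheme I mean. It is pure Python on a toy one-parameter model, not the repository's training code, and all names (`grad_mse`, `MICRO_BATCH`, `ACCUM_STEPS`) are made up for illustration. It assumes the loss is a mean over samples, in which case each micro-batch gradient has to be scaled by 1/8 for the accumulated gradient to match the gradient of a true batch of 48. Without that scaling the accumulated gradient comes out 8 times larger.

```python
import random

def grad_mse(w, batch):
    """Gradient of the mean squared error for the toy model y_hat = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

random.seed(0)
MICRO_BATCH = 6   # per-step batch size that fits on one GPU
ACCUM_STEPS = 8   # micro-batches accumulated per optimizer step
data = [(random.uniform(-1, 1), random.uniform(-1, 1))
        for _ in range(MICRO_BATCH * ACCUM_STEPS)]
w = 0.5

# Reference: gradient over the full effective batch of 48 samples.
full_grad = grad_mse(w, data)

accum_grad = 0.0     # micro-batch gradients scaled by 1 / ACCUM_STEPS
unscaled_grad = 0.0  # what accumulates if the scaling is forgotten
for i in range(ACCUM_STEPS):
    micro = data[i * MICRO_BATCH:(i + 1) * MICRO_BATCH]
    g = grad_mse(w, micro)
    accum_grad += g / ACCUM_STEPS
    unscaled_grad += g

# Scaled accumulation reproduces the large-batch gradient.
print(abs(accum_grad - full_grad) < 1e-9)
# Unscaled accumulation is ACCUM_STEPS times the large-batch gradient.
print(abs(unscaled_grad - ACCUM_STEPS * full_grad) < 1e-9)
```

With this setup, 8 times as many micro-batch iterations correspond to the same number of optimizer steps as the original schedule, provided the optimizer and the learning-rate scheduler advance once per 8 micro-batches rather than once per micro-batch.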

While gradient accumulation has helped me avoid "NaN" and "Infinity" values in grad_norm and the loss,
I'm still seeing gradient explosion around epoch 10 of 100.

I've tried tuning hyperparameters such as
grad_norm, learning rate, and weight_decay,
but the gradient explosion persists.
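
To be concrete about the clipping part: assuming grad_norm here means the max-norm threshold for gradient clipping (configs may name it differently), the sketch below shows how I understand clipping to combine with accumulation. The global norm is computed once on the accumulated gradient, right before the optimizer step, rather than on each micro-batch. `clip_by_global_norm` is a hypothetical helper written in pure Python for illustration, not a function from any library.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient values so their L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    # Small epsilon avoids division by zero when all gradients are zero.
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [g * scale for g in grads], total_norm

# Accumulated gradient after all micro-batches of one optimizer step.
accumulated = [3.0, 4.0]
clipped, norm_before = clip_by_global_norm(accumulated, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for g in clipped))

# A gradient already inside the threshold is left unchanged.
small, _ = clip_by_global_norm([0.3, 0.4], max_norm=1.0)
```

Logging the pre-clipping norm per optimizer step would show whether it grows steadily toward epoch 10 or jumps on individual batches.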

Has anyone else encountered this problem?

I would greatly appreciate any advice on suitable hyperparameter settings for a single RTX 4090 with gradient accumulation.

Thank you.
