Fix Bug in Workflows #18

Open
cabutlermit wants to merge 4 commits into main from timx-633-fix-workflows

@cabutlermit cabutlermit commented May 20, 2026

Purpose and background context

There are two problems that need solutions:

  1. Avoid concurrent runs of a single GitHub Actions workflow.
  2. Address the inherent latencies in AWS around Lambda deployment, publishing, and aliases.

Concurrency

By default, `on.*`-triggered workflows run in parallel, but this is not the desired behavior for our deployment workflows. When more than one PR is merged to `main` at or near the same time, we want the workflows to run sequentially, not in parallel. See GitHub Workflow Syntax: concurrency for more details on how we are attempting to solve this.

For consistency, the concurrency block was added to all three workflows.

AWS Lambda Latencies

The workflows already had a mechanism to wait for Lambda deployment and Lambda publishing to complete (using the aws lambda wait command). However, this wasn't enough. There is also a latency in assigning an alias to a newly published version that wasn't handled, and there is no aws lambda wait command for alias management.

So, we add some bash code to the workflow that runs a while-loop poller to monitor the output of the aws lambda get-alias command. There is a RoutingConfig key in the JSON output from that command that only exists while the alias is moving to the new published version. Once the new published version is the only version linked to the alias, the RoutingConfig key disappears from the output. At that point, the while loop exits cleanly and the workflow continues.
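
As an illustration of the poller described above, here is a minimal sketch, not the exact code in the workflow. The function name `wait_for_alias_stable`, the attempt cap, and the sleep interval are assumptions made for this example; the only AWS call used is `aws lambda get-alias` with the standard `--query` and `--output` options. The attempt cap is an addition so that a stuck alias fails the step instead of looping forever.

```shell
#!/usr/bin/env bash
# Sketch: wait until `aws lambda get-alias` no longer reports a RoutingConfig.
# wait_for_alias_stable, max_attempts and sleep_seconds are hypothetical names
# chosen for this example; they do not come from the workflow itself.

wait_for_alias_stable() {
  local function_name="$1"
  local alias_name="$2"
  local max_attempts="${3:-30}"   # assumed cap on polling attempts
  local sleep_seconds="${4:-2}"   # assumed delay between attempts
  local attempt=1
  local routing

  while [ "$attempt" -le "$max_attempts" ]; do
    # --query narrows the JSON to the RoutingConfig key only;
    # a missing key is printed as the JSON literal null.
    routing="$(aws lambda get-alias \
      --function-name "$function_name" \
      --name "$alias_name" \
      --query 'RoutingConfig' \
      --output json)" || return 2

    if [ "$routing" = "null" ]; then
      echo "Alias ${alias_name} is stable after ${attempt} attempt(s)."
      return 0
    fi

    echo "Alias ${alias_name} still has a RoutingConfig; waiting..."
    sleep "$sleep_seconds"
    attempt=$((attempt + 1))
  done

  echo "Timed out waiting for alias ${alias_name} to stabilize." >&2
  return 1
}
```

In a workflow step this could be called as `wait_for_alias_stable "$FUNCTION_NAME" "$ALIAS_NAME"`, where both variables are placeholders for values the workflow already knows.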

How can a reviewer manually see the effects of these changes?

The reviewer can quickly launch the same workflow twice via workflow_dispatch in the UI and see that one job gets queued while the other is still running. I did this for the Dev Build and Deploy workflow and saw this in the UI:
image
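
The same check could probably be scripted instead of clicked through. The sketch below assumes the GitHub CLI (`gh`) is installed and authenticated, and that the Dev Build and Deploy workflow lives in `dev-build.yml` (the file name shown in the review threads). The helper name `dispatch_twice` and the `DISPATCH_LIST_DELAY` variable are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: dispatch the same workflow twice, then list its recent runs so the
# queued and in-progress states can be compared.
# dispatch_twice and DISPATCH_LIST_DELAY are hypothetical names for this example.

dispatch_twice() {
  local workflow="$1"        # workflow file name, e.g. dev-build.yml
  local ref="${2:-main}"     # branch or tag to run against

  # Two back-to-back workflow_dispatch events for the same workflow.
  gh workflow run "$workflow" --ref "$ref" || return 1
  gh workflow run "$workflow" --ref "$ref" || return 1

  # Give GitHub a moment to register the runs before listing them.
  sleep "${DISPATCH_LIST_DELAY:-5}"

  # Show the most recent runs of this workflow with their status.
  gh run list --workflow "$workflow" --limit 5
}
```

For example, `dispatch_twice dev-build.yml timx-633-fix-workflows` would dispatch the workflow twice against the PR branch, assuming the workflow declares a `workflow_dispatch` trigger.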

Includes new or updated dependencies?

NO

Changes expectations for external applications?

NO

What are the relevant tickets?

Code review

  • Code review best practices are documented here and you are encouraged to have a constructive dialogue with your reviewers about their preferences and expectations.

qltysh Bot commented May 20, 2026

❌ 3 blocking issues (3 total)

| Tool | Category | Rule | Count |
| --- | --- | --- | --- |
| actionlint | Lint | unexpected key "queue" for "concurrency" section. expected one of "cancel-in-progress", "group" | 3 |

concurrency:
  group: dev-build-deploy
  cancel-in-progress: false
  queue: single

unexpected key "queue" for "concurrency" section. expected one of "cancel-in-progress", "group" [actionlint:syntax-check]

Contributor Author

Per the official GitHub documentation for concurrency, queue is an acceptable key:

To allow more than one pending job or workflow run to wait in the same concurrency group, use the optional queue property. The queue property accepts the following values:

  • single (default): At most one job or workflow run can be pending in the concurrency group. When a new job or workflow run is queued, any existing pending job or workflow run in the same group is canceled and replaced.
  • max: Up to 100 jobs or workflow runs can be pending in the concurrency group. When the queue is full, any additional jobs or workflow runs are canceled.

concurrency:
  group: prod-deploy
  cancel-in-progress: false
  queue: single

unexpected key "queue" for "concurrency" section. expected one of "cancel-in-progress", "group" [actionlint:syntax-check]

Contributor Author

Per the official GitHub documentation for concurrency, queue is an acceptable key:

To allow more than one pending job or workflow run to wait in the same concurrency group, use the optional queue property. The queue property accepts the following values:

  • single (default): At most one job or workflow run can be pending in the concurrency group. When a new job or workflow run is queued, any existing pending job or workflow run in the same group is canceled and replaced.
  • max: Up to 100 jobs or workflow runs can be pending in the concurrency group. When the queue is full, any additional jobs or workflow runs are canceled.

concurrency:
  group: stage-build-deploy
  cancel-in-progress: false
  queue: single

unexpected key "queue" for "concurrency" section. expected one of "cancel-in-progress", "group" [actionlint:syntax-check]

Contributor Author

Per the official GitHub documentation for concurrency, queue is an acceptable key:

To allow more than one pending job or workflow run to wait in the same concurrency group, use the optional queue property. The queue property accepts the following values:

  • single (default): At most one job or workflow run can be pending in the concurrency group. When a new job or workflow run is queued, any existing pending job or workflow run in the same group is canceled and replaced.
  • max: Up to 100 jobs or workflow runs can be pending in the concurrency group. When the queue is full, any additional jobs or workflow runs are canceled.

Comment thread .github/workflows/dev-build.yml Outdated
Comment thread .github/workflows/dev-build.yml Outdated
Why these changes are being introduced:
Recently, the stage-build workflow has had more than one failure. After
some investigation, the failure seems to be related to a quick sequence
of commits to the main branch from dependabot updates. By default,
`on.*` triggered workflows will run in parallel, but this is not the
desired behavior for our deployment workflows. Instead, in a situation
where there is more than one PR merge to `main` at the same or
close-to-same time, we want to ensure that the workflows run
sequentially, not in parallel. See
https://docs.github.com/en/actions/reference/workflows-and-actions/workflow-syntax#concurrency
for more details on how we are attempting to solve this.

How this addresses that need:
* Use the concurrency option in the workflows to force the actions to
run sequentially if they are triggered too close together

Side effects of this change:
None.

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-633
Why these changes are being introduced:
Turns out that it was more than just workflow concurrency! There are
latencies within AWS related to deploying, publishing, and assigning
aliases for Lambda functions. For each of these actions, the command
completes quickly, but there are behind-the-scenes actions by AWS that
take some time to complete. If we do not wait for the backend processes
to finish, we run the risk of some of the Lambda-related deployment
commands failing. So, in addition to ensuring that the workflows do
not run concurrently, we also add in some additional polling to avoid
failed commands.

How this addresses that need:
* Add a while loop poller to the Update Lambda Alias step in the dev
workflow to prevent the step from completing until the alias has
stabilized on the single new published version

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-633
@cabutlermit cabutlermit force-pushed the timx-633-fix-workflows branch from ccf27e4 to 1c89b1b Compare May 21, 2026 15:36
Comment thread .github/workflows/dev-build.yml Outdated
Why these changes are being introduced:
All the changes have been tested in the dev workflow and passed the
tests, so we can extend the same changes to the stage and prod
workflows.

How this addresses that need:
* Implement the same changes that were made in the dev workflow in the
stage and prod workflows.

Side effects of this change:
None.

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-633
@cabutlermit cabutlermit marked this pull request as ready for review May 21, 2026 16:41
@cabutlermit cabutlermit requested a review from JPrevost May 21, 2026 16:41