
fix(feedgenerator): percent-encode non-ASCII characters in get_tag_uri() #49

Open
balooii wants to merge 1 commit into getpelican:main from balooii:fix_tag_uri_compliant

Conversation

@balooii

@balooii balooii commented Jun 2, 2026


Hello 👋 ,

I noticed that Pelican generated a feed that the W3C Feed Validation Service flags as "does not validate". I traced this to feedgenerator's get_tag_uri(), which does not percent-encode non-ASCII characters in the URL path or fragment. This PR fixes that.


get_tag_uri() now percent-encodes non-ASCII characters in the URL path and fragment so that the resulting tag URI is valid per RFC 4151.

The encoding is idempotent, so callers that already pass an encoded URL are unaffected.

This issue came up while validating the feed from gimp.org (which uses Pelican to generate the feed) with the W3C Feed Validation Service (https://validator.w3.org/feed/check.cgi?url=https%3A%2F%2Fwww.gimp.org%2Ffeeds%2Fatom.xml).

The validator complains that "id is not a valid TAG" for this item:

tag:www.gimp.org,2026-02-22:/news/2026/02/22/øyvind-kolås-interview-ww2017/

which uses the non-ASCII slug of that page.

Interestingly, feedgenerator itself URL-encodes the link before passing it to get_tag_uri(), but Pelican does not and passes the raw IRI through its writers. So I think it's best to ensure that get_tag_uri() returns a valid tag URI regardless of whether the input is already encoded.
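
To illustrate the idea (this is a rough sketch, not the patch itself), the encoding step could look something like the following. The helper name `encode_path_and_fragment` and the `SAFE` character set are assumptions made up for this example; only `urlsplit` and `quote` from the standard library are real. Keeping `%` in the safe set is what makes a second pass a no-op, so already-encoded input comes back unchanged.

```python
from urllib.parse import quote, urlsplit

# Assumed safe set for illustration: ASCII characters that quote() should
# leave alone. "%" is included so existing escapes are not encoded twice.
SAFE = "/%:@!$&'()*+,;=~-._"


def encode_path_and_fragment(url):
    """Hypothetical helper: percent-encode non-ASCII in path and fragment."""
    parts = urlsplit(url)
    # quote() turns each non-ASCII character into its UTF-8 percent-escapes.
    path = quote(parts.path, safe=SAFE)
    fragment = quote(parts.fragment, safe=SAFE + "?")
    return path, fragment


raw = "https://www.gimp.org/news/2026/02/22/øyvind-kolås-interview-ww2017/"
path, fragment = encode_path_and_fragment(raw)
print(path)
# /news/2026/02/22/%C3%B8yvind-kol%C3%A5s-interview-ww2017/

# Feeding the encoded URL back in yields the same path (idempotent).
again, _ = encode_path_and_fragment("https://www.gimp.org" + path)
print(again == path)
# True
```

One caveat of this approach: a literal `%` that is not part of an escape sequence is passed through untouched, which is the price of not double-encoding input that is already encoded.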

@brunvonlope


@justinmayer @uda ping

