Skip to content

New failover implementation#8566

Open
pbrezina wants to merge 11 commits into
SSSD:failoverfrom
pbrezina:failover
Open

New failover implementation#8566
pbrezina wants to merge 11 commits into
SSSD:failoverfrom
pbrezina:failover

Conversation

@pbrezina
Copy link
Copy Markdown
Member

This pull request is intended to be a start of a "failover" feature branch where other developers will be able to contribute.

The main failover logic works, compiles and can be tested using a "minimal" provider that is included as an example. The purpose of the "minimal" provider is only to test the failover without the need to port full provider code and itwill be removed prior pushing the contents to the master branch. See how to set it up in minimal-provider-notes.txt and see the switch to new failover in commit minimal: switch to new failover for service lookup and user authentication - this is the minimal set of changes to get it working, but the real port should get and will require more refactoring.

The work is still not finished and there is missing functionality. This functionality, however, can be implemented in small areas of code and should not require larger changes or glues in the whole code base, so this is ready for review. Remaining work is tracked at [1]. Feel free to take any of these tickets and open new tickets when you find something missing.

When reviewing, you can start with src/providers/failover/readme.md that provides high level documentation of the code. And of course do not forget the design page [2].

Thanks, Pavel

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements a new failover mechanism for SSSD, introducing prioritized server groups, parallelized candidate server discovery, and a transaction-based API for automated retries. It also provides a minimal provider implementation to demonstrate the new architecture. Critical logic bugs were identified in the server group resolution logic, where duplicate detection causes premature loop exit, and in the address change detection function, which currently returns inverted results.

Comment thread src/providers/failover/failover_group.c Outdated
Comment thread src/providers/failover/failover_server_resolve.c Outdated
Comment thread src/providers/failover/failover_group.c Dismissed
Comment thread src/providers/minimal/minimal_id.c Dismissed
Comment thread src/providers/minimal/minimal_ldap_auth.c Dismissed
Comment thread src/providers/minimal/minimal_id.c Dismissed
Comment thread src/providers/minimal/minimal_id.c Dismissed
Comment thread src/providers/minimal/minimal_id_services.c Dismissed
Comment thread src/providers/minimal/minimal_id_services.c Dismissed
Comment thread src/providers/minimal/minimal_id_services.c Dismissed
@alexey-tikhonov alexey-tikhonov self-assigned this Apr 1, 2026
@alexey-tikhonov alexey-tikhonov added Waiting for review no-backport This should go to target branch only. labels Apr 1, 2026
@alexey-tikhonov
Copy link
Copy Markdown
Member

@pbrezina, is it expected CI fails to build?

src/providers/minimal/minimal_id.c:28:10: fatal error: providers/minimal/minimal.h: No such file or directory

@pbrezina pbrezina force-pushed the failover branch 2 times, most recently from 0570a63 to 2a2c475 Compare April 13, 2026 09:09
@alexey-tikhonov
Copy link
Copy Markdown
Member

alexey-tikhonov commented Apr 13, 2026

@pbrezina,
re: "oidc_child: parameterize entra_idp url" (and other) commits being included in this PR: imo, it's better to rebase base branch (https://github.com/SSSD/sssd/tree/failover) and not current PR in review (https://github.com/pbrezina/sssd/tree/failover)

@alexey-tikhonov
Copy link
Copy Markdown
Member

FreeBSD CI doesn't have required headers installed:


  src/providers/minimal/minimal_ldap_auth.c:32:10: fatal error: 'shadow.h' file not found
     32 | #include <shadow.h>
        |          ^~~~~~~~~~

While 'minimal' isn't going to be merged in main repo branches, this 'fail to build' can hide other issues.

@pbrezina pbrezina force-pushed the failover branch 6 times, most recently from e75bb95 to f84c987 Compare April 13, 2026 12:04
@pbrezina
Copy link
Copy Markdown
Member Author

Now it is fixed. There were missing headers in noinst_HEADERS and some other problems. I reordered the commits and every commit for testing only is clearly marked to not go to master.

Reviewer needs to pay attention only to the "failover" commit, other commits are just for testing and a demostration.

@alexey-tikhonov
Copy link
Copy Markdown
Member

alexey-tikhonov commented Apr 16, 2026

@pbrezina, would it be difficult to include a 'system' test using "minimal" provider and covering any failover scenario?

If it's difficult then disregard as test would be discarded eventually.

Comment thread src/providers/failover/readme.md Outdated
@alexey-tikhonov
Copy link
Copy Markdown
Member

What is missing at this state:

  • server configuration and discovery
    (failover_server_group/batch/vtable_op)
  • server selection mechanism (sss_failover_vtable_op_server_next)
  • kerberos authentication
  • sharing servers between IPA/AD LDAP and KDC
  • online/offline callbacks (resolve callback should not be needed)

Periodic refreshes are also not yet implemented, right?

Comment thread src/providers/failover/failover_transaction.c
Comment thread src/providers/failover/failover_transaction.c
Comment thread src/providers/failover/failover_server.c
Comment thread src/providers/failover/failover_refresh_candidates.c
Comment thread src/providers/failover/failover_vtable.h
@alexey-tikhonov
Copy link
Copy Markdown
Member

I did only overview / preliminary round and not sure at all if any of my comments are valid.
But setting 'changes requested' nonetheless to get a response for my comments to check my understanding so far.

@pbrezina
Copy link
Copy Markdown
Member Author

pbrezina commented May 5, 2026

Hi Alexey, thank you for the first look. I fixed your findings.

Periodic refreshes are also not yet implemented, right?

Right. This will require the callbacks.

@pbrezina, would it be difficult to include a 'system' test using "minimal" provider and covering any failover scenario?

Added test_failover_new__getent_services. Note that the server list is currently hardcoded in the code and not read from configuration.

Comment thread src/providers/minimal/minimal_id_services.c Dismissed
mechanism. The code abstracts automatic server selection, connection management
a retry logic from the backend code. The backend should not touch failover
internals. The main entry port for an operation that needs to contact a remote
server is `sss_failover_transaction_send()`.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my porting changes thus far I removed sdap_id_conn_cache but it is used in several places. Will the new framework add support for this instead of starting the entire failover transaction again with sss_failover_transaction_send ?

justin@justin-fedora:~/github/sssd$ grep -irn 'conn_cache' src/
src/providers/ipa/ipa_auth.c:75:                                       state->sdap_id_ctx->conn->conn_cache);
src/providers/ipa/ipa_dyndns.c:131:    state->sdap_op = sdap_id_op_create(state, sdap_ctx->conn->conn_cache);
src/providers/ipa/ipa_id.c:528:    state->op = sdap_id_op_create(state, state->ctx->conn->conn_cache);
src/providers/ipa/ipa_id.c:1184:    state->op = sdap_id_op_create(state, ctx->conn->conn_cache);
src/providers/ipa/ipa_selinux.c:773:                        selinux_ctx->id_ctx->sdap_id_ctx->conn->conn_cache);
src/providers/ipa/ipa_session.c:123:                                       state->sdap_ctx->conn->conn_cache);
src/providers/ipa/ipa_subdomains.c:2912:                                       sd_ctx->sdap_id_ctx->conn->conn_cache);
src/providers/ipa/ipa_subdomains_ext_groups.c:579:                                       state->sdap_id_ctx->conn->conn_cache);
src/providers/ipa/ipa_subdomains_ext_groups.c:936:                                       state->sdap_id_ctx->conn->conn_cache);
src/providers/ipa/ipa_sudo_async.c:892:                                       sudo_ctx->id_ctx->conn->conn_cache);
src/providers/ipa/ipa_views.c:450:                                       state->sdap_id_ctx->conn->conn_cache);
src/providers/ipa/ipa_subdomains_id.c:95:    state->op = sdap_id_op_create(state, state->ctx->conn->conn_cache);
src/providers/ipa/ipa_subdomains_id.c:451:    state->op = sdap_id_op_create(state, state->ctx->conn->conn_cache);

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIK sdap_id_conn_cache was a logic to reuse active ldap connection. This is already implemented inside the failover transaction. Each transaction is essentially a replacement for sdap_id_op_create and other sdap_id_op code logic.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I replaced sdap_id_op* calls in https://github.com/SSSD/sssd/blob/master/src/providers/ipa/ipa_auth.c#L74-L88 with sss_failover_transaction_send() but this causes the entire failover transaction to start over from scratch when this code is reached (find a working server, kinit, connect to LDAP...).

IIUC the old code re-uses the LDAP connection and does not need to teardown/re-start the connection logic from the beginning.

Unless I missed it, bool reuse_connection in struct sss_failover_transaction_state is not used at all in current failover code.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bool reuse_connection in struct sss_failover_transaction_state is not used at all in current failover code.

Looks so.

subreq = sdap_cli_connect_send(state, ev, opts, server->uri,
server->addr->sockaddr,
server->addr->sockaddr_len, !read_rootdse,
tls, !authenticate_connection,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I got this compiler warning:

../src/providers/failover/ldap/failover_ldap_connect.c: In function 'sss_failover_ldap_connect_send':                                                                    
../src/providers/failover/ldap/failover_ldap_connect.c:93:14: warning: 'tls' may be used uninitialized [-Wmaybe-uninitialized]                                           
   93 |     subreq = sdap_cli_connect_send(state, ev, opts, server->uri,                                                                                                 
      |              ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~                                                                                                 
   94 |                                    server->addr->sockaddr,                                                                                                       
      |                                    ~~~~~~~~~~~~~~~~~~~~~~~                                                                                                       
   95 |                                    server->addr->sockaddr_len, !read_rootdse,                                                                                    
      |                                    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~                                                                                    
   96 |                                    tls, !authenticate_connection,                                                                                                
      |                                    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~                                                                                                
   97 |                                    kinit_expiration_time);                                                                                                       
      |                                    ~~~~~~~~~~~~~~~~~~~~~~                                                                                                        
../src/providers/failover/ldap/failover_ldap_connect.c:53:22: note: 'tls' was declared here                                                                              
   53 |     enum connect_tls tls;                                                                                                                                        
      |                      ^~~    

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addition of 'default:' to switch (force_tls) above would fix this.

return false;
}
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Depending on offline implementation it could be nice to add something like to use in place of if (be_is_offline()

bool
sss_failover_active_server_is_offline(struct sss_failover_ctx *fctx)
{
   return fctx->active_server->state == SSS_FAILOVER_SERVER_STATE_OFFLINE;
}

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think sss_failover_is_offline would be a better name. We do not need to bind it to "active server". We might still need to keep be_is_offline that will return true if any of the failover context is offline, which I think is the current behavior: if any failover service fails, the whole backend goes offline. But maybe we can improve it -> we can still provide id lookups (ldap) when authentication (kdc) is down.

justin-stephenson and others added 11 commits May 15, 2026 09:54
…pec file

Add the sssd-minimal provider package to the spec file following the
same pattern as other providers (ldap, ipa, ad, etc.). This packages
the libsss_minimal.so library that was added in recent commits.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
And also disable codeql for the minimal provider. The
provider is for testing only, it does not make sense to
fix any issue there.
This crafts and implements the new failover interface,
it does not provide complete implementation of the failover
mechanism yet. It brings the code to a state were the public
and private interfaces are stable, working and testable so
the following tasks can be split and work on in parallel.

What is missing at this state:
- server configuration and discovery
  (failover_server_group/batch/vtable_op)
- server selection mechanism (sss_failover_vtable_op_server_next)
- kerberos authentication
- sharing servers between IPA/AD LDAP and KDC
- online/offline callbacks (resolve callback should not be needed)

But especially it is possible to start refactoring SSSD code to start
using the new failover implementation.
@alexey-tikhonov
Copy link
Copy Markdown
Member

tests/test_failover_new.py:40: error: Argument "port" to "add" of "LDAPServices" has incompatible type "str"; expected "int"  [arg-type]

@alexey-tikhonov
Copy link
Copy Markdown
Member

'CI' is stuck in 'Queued' state for unclear reason, even 'Cancel' doesn't work...

@alexey-tikhonov
Copy link
Copy Markdown
Member

'CI' is stuck in 'Queued' state for unclear reason, even 'Cancel' doesn't work...

I'll try close/open to see if this helps.

@alexey-tikhonov
Copy link
Copy Markdown
Member

ValueError: Required field 'requirement' is missing for 'tests/test_failover_new.py::test_failover_new__getent_services (minimal)'

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Changes requested no-backport This should go to target branch only.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants