New failover implementation#8566
Conversation
There was a problem hiding this comment.
Code Review
This pull request implements a new failover mechanism for SSSD, introducing prioritized server groups, parallelized candidate server discovery, and a transaction-based API for automated retries. It also provides a minimal provider implementation to demonstrate the new architecture. Critical logic bugs were identified in the server group resolution logic, where duplicate detection causes premature loop exit, and in the address change detection function, which currently returns inverted results.
|
@pbrezina, is it expected CI fails to build? |
0570a63 to
2a2c475
Compare
|
@pbrezina, |
|
FreeBSD CI doesn't have required headers installed: While 'minimal' isn't going to be merged in main repo branches, this 'fail to build' can hide other issues. |
e75bb95 to
f84c987
Compare
|
Now it is fixed. There were missing headers in noinst_HEADERS and some other problems. I reordered the commits and every commit for testing only is clearly marked to not go to master. Reviewer needs to pay attention only to the "failover" commit, other commits are just for testing and a demostration. |
|
@pbrezina, would it be difficult to include a 'system' test using "minimal" provider and covering any failover scenario? If it's difficult then disregard as test would be discarded eventually. |
Periodic refreshes are also not yet implemented, right? |
|
I did only overview / preliminary round and not sure at all if any of my comments are valid. |
|
Hi Alexey, thank you for the first look. I fixed your findings.
Right. This will require the callbacks.
Added |
| mechanism. The code abstracts automatic server selection, connection management | ||
| a retry logic from the backend code. The backend should not touch failover | ||
| internals. The main entry port for an operation that needs to contact a remote | ||
| server is `sss_failover_transaction_send()`. |
There was a problem hiding this comment.
In my porting changes thus far I removed sdap_id_conn_cache but it is used in several places. Will the new framework add support for this instead of starting the entire failover transaction again with sss_failover_transaction_send ?
justin@justin-fedora:~/github/sssd$ grep -irn 'conn_cache' src/
src/providers/ipa/ipa_auth.c:75: state->sdap_id_ctx->conn->conn_cache);
src/providers/ipa/ipa_dyndns.c:131: state->sdap_op = sdap_id_op_create(state, sdap_ctx->conn->conn_cache);
src/providers/ipa/ipa_id.c:528: state->op = sdap_id_op_create(state, state->ctx->conn->conn_cache);
src/providers/ipa/ipa_id.c:1184: state->op = sdap_id_op_create(state, ctx->conn->conn_cache);
src/providers/ipa/ipa_selinux.c:773: selinux_ctx->id_ctx->sdap_id_ctx->conn->conn_cache);
src/providers/ipa/ipa_session.c:123: state->sdap_ctx->conn->conn_cache);
src/providers/ipa/ipa_subdomains.c:2912: sd_ctx->sdap_id_ctx->conn->conn_cache);
src/providers/ipa/ipa_subdomains_ext_groups.c:579: state->sdap_id_ctx->conn->conn_cache);
src/providers/ipa/ipa_subdomains_ext_groups.c:936: state->sdap_id_ctx->conn->conn_cache);
src/providers/ipa/ipa_sudo_async.c:892: sudo_ctx->id_ctx->conn->conn_cache);
src/providers/ipa/ipa_views.c:450: state->sdap_id_ctx->conn->conn_cache);
src/providers/ipa/ipa_subdomains_id.c:95: state->op = sdap_id_op_create(state, state->ctx->conn->conn_cache);
src/providers/ipa/ipa_subdomains_id.c:451: state->op = sdap_id_op_create(state, state->ctx->conn->conn_cache);
There was a problem hiding this comment.
AFAIK sdap_id_conn_cache was a logic to reuse active ldap connection. This is already implemented inside the failover transaction. Each transaction is essentially a replacement for sdap_id_op_create and other sdap_id_op code logic.
There was a problem hiding this comment.
I replaced sdap_id_op* calls in https://github.com/SSSD/sssd/blob/master/src/providers/ipa/ipa_auth.c#L74-L88 with sss_failover_transaction_send() but this causes the entire failover transaction to start over from scratch when this code is reached (find a working server, kinit, connect to LDAP...).
IIUC the old code re-uses the LDAP connection and does not need to teardown/re-start the connection logic from the beginning.
Unless I missed it, bool reuse_connection in struct sss_failover_transaction_state is not used at all in current failover code.
There was a problem hiding this comment.
bool reuse_connectioninstruct sss_failover_transaction_stateis not used at all in current failover code.
Looks so.
| subreq = sdap_cli_connect_send(state, ev, opts, server->uri, | ||
| server->addr->sockaddr, | ||
| server->addr->sockaddr_len, !read_rootdse, | ||
| tls, !authenticate_connection, |
There was a problem hiding this comment.
I got this compiler warning:
../src/providers/failover/ldap/failover_ldap_connect.c: In function 'sss_failover_ldap_connect_send':
../src/providers/failover/ldap/failover_ldap_connect.c:93:14: warning: 'tls' may be used uninitialized [-Wmaybe-uninitialized]
93 | subreq = sdap_cli_connect_send(state, ev, opts, server->uri,
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
94 | server->addr->sockaddr,
| ~~~~~~~~~~~~~~~~~~~~~~~
95 | server->addr->sockaddr_len, !read_rootdse,
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
96 | tls, !authenticate_connection,
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
97 | kinit_expiration_time);
| ~~~~~~~~~~~~~~~~~~~~~~
../src/providers/failover/ldap/failover_ldap_connect.c:53:22: note: 'tls' was declared here
53 | enum connect_tls tls;
| ^~~
There was a problem hiding this comment.
Addition of 'default:' to switch (force_tls) above would fix this.
| return false; | ||
| } | ||
| } | ||
|
|
There was a problem hiding this comment.
Depending on offline implementation it could be nice to add something like to use in place of if (be_is_offline()
bool
sss_failover_active_server_is_offline(struct sss_failover_ctx *fctx)
{
return fctx->active_server->state == SSS_FAILOVER_SERVER_STATE_OFFLINE;
}
There was a problem hiding this comment.
Yes, I think sss_failover_is_offline would be a better name. We do not need to bind it to "active server". We might still need to keep be_is_offline that will return true if any of the failover context is offline, which I think is the current behavior: if any failover service fails, the whole backend goes offline. But maybe we can improve it -> we can still provide id lookups (ldap) when authentication (kdc) is down.
…ider (cherry picked from commit 0f5f3b6)
so it can be directly modified
…can easily pass new fctx
So it can be modified later.
…kup and user authentication
…pec file Add the sssd-minimal provider package to the spec file following the same pattern as other providers (ldap, ipa, ad, etc.). This packages the libsss_minimal.so library that was added in recent commits. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
And also disable codeql for the minimal provider. The provider is for testing only, it does not make sense to fix any issue there.
This crafts and implements the new failover interface, it does not provide complete implementation of the failover mechanism yet. It brings the code to a state were the public and private interfaces are stable, working and testable so the following tasks can be split and work on in parallel. What is missing at this state: - server configuration and discovery (failover_server_group/batch/vtable_op) - server selection mechanism (sss_failover_vtable_op_server_next) - kerberos authentication - sharing servers between IPA/AD LDAP and KDC - online/offline callbacks (resolve callback should not be needed) But especially it is possible to start refactoring SSSD code to start using the new failover implementation.
|
|
'CI' is stuck in 'Queued' state for unclear reason, even 'Cancel' doesn't work... |
I'll try close/open to see if this helps. |
|
This pull request is intended to be a start of a "failover" feature branch where other developers will be able to contribute.
The main failover logic works, compiles and can be tested using a "minimal" provider that is included as an example. The purpose of the "minimal" provider is only to test the failover without the need to port full provider code and itwill be removed prior pushing the contents to the master branch. See how to set it up in
minimal-provider-notes.txtand see the switch to new failover in commitminimal: switch to new failover for service lookup and user authentication- this is the minimal set of changes to get it working, but the real port should get and will require more refactoring.The work is still not finished and there is missing functionality. This functionality, however, can be implemented in small areas of code and should not require larger changes or glues in the whole code base, so this is ready for review. Remaining work is tracked at [1]. Feel free to take any of these tickets and open new tickets when you find something missing.
When reviewing, you can start with
src/providers/failover/readme.mdthat provides high level documentation of the code. And of course do not forget the design page [2].Thanks, Pavel