Skip to content

util: update AdvancedTlsX509KeyManager to support key alias for reloaded cert#12686

Merged
ejona86 merged 3 commits into
grpc:masterfrom
zhangweikop:tls-key-manager-update
Mar 27, 2026
Merged

util: update AdvancedTlsX509KeyManager to support key alias for reloaded cert#12686
ejona86 merged 3 commits into
grpc:masterfrom
zhangweikop:tls-key-manager-update

Conversation

@zhangweikop
Copy link
Copy Markdown
Contributor

Overview

Make the alias in AdvancedTlsX509KeyManager dynamic so it can be used with Netty's
OpenSslCachingX509KeyManagerFactory to update key material after reload.

Fixes #12670

Problem

When using SslProvider.OPENSSL, each TLS handshake must encode Java key material into a native
buffer consumed by OpenSSL, which can account for ~8% of server CPU. Netty's
OpenSslCachingX509KeyManagerFactory avoids this by caching the encoded buffer keyed by alias —
but the previous implementation always returned "default", so the factory could never detect
credential rotations and create a new cache entry on cert reload.

Details

  • The alias is now set to key-<N> (e.g. key-1, key-2, ...) and incremented on every
    updateIdentityCredentials call, ensuring the same alias always maps to the same key material.
  • A new constructor AdvancedTlsX509KeyManager(int revisionWarningThreshold) allows customizing
    the soft warning threshold (default: 1024, matching OpenSslCachingX509KeyManagerFactory's
    default maxCachedEntries). A warning is logged when the counter reaches the threshold, since
    beyond that point new aliases won't be cached and per-handshake encoding overhead resumes. The
    key manager remains fully functional past this threshold.
  • All alias methods return null before any credentials are loaded.
  • Recommended usage with OpenSSL:
new OpenSslCachingX509KeyManagerFactory(
    new KeyManagerFactoryWrapper(advancedTlsKeyManager))

@zhangweikop zhangweikop force-pushed the tls-key-manager-update branch from a54cf0b to a18d1a6 Compare March 10, 2026 21:38
@ejona86
Copy link
Copy Markdown
Member

ejona86 commented Mar 16, 2026

I don't think using AdvancedTlsX509KeyManager with OpenSslCachingKeyMaterialProvider is all that good an approach with this, because it essentially leaks all prior keys in memory. If the life of the process after a year has 5 versions, that's not that big of a deal. But if it changes ever hour, that's not good. I don't think we're going to put random detection code in AdvancedTlsX509KeyManager to just warn you that things have gone wrong either.

@zhangweikop
Copy link
Copy Markdown
Contributor Author

I don't think using AdvancedTlsX509KeyManager with OpenSslCachingKeyMaterialProvider is all that good an approach with this, because it essentially leaks all prior keys in memory. If the life of the process after a year has 5 versions, that's not that big of a deal. But if it changes ever hour, that's not good. I don't think we're going to put random detection code in AdvancedTlsX509KeyManager to just warn you that things have gone wrong either.

Yes, I think it should be used with caution.
In our case, the PKI-issued certificates have a relatively long lifetime, so over the application process’s three-month lifecycle, reloads occur only a few times.
To support users who require more frequent key rotation, Netty's OpenSslCachingKeyMaterialProvider would need to remove old keys.

As for AdvancedTlsX509KeyManager itself, I think we still need to handle alias changes?

@ejona86
Copy link
Copy Markdown
Member

ejona86 commented Mar 23, 2026

As for AdvancedTlsX509KeyManager itself, I think we still need to handle alias changes?

Yes, that's tracked by #12485 . This PR doesn't actually address that, as it ignores the alias being used. I imagined keeping one older version around to handle in-progress handshakes. Since you're talking about every configuration having its own alias name, if the key manager mutates faster than handshakes complete (such that the alias is really old) then we could trigger an error in some way.

In our case, the PKI-issued certificates have a relatively long lifetime, so over the application process’s three-month lifecycle, reloads occur only a few times.

Okay. I'm fine with using a new alias each mutation in this key manager, but that should be consistent; for example, getPrivateKey() shouldn't return the new key when passed the old alias. That applies equally to all methods that receive the alias.

And we won't have the "this sure seems like a lot of mutations" warning.

At that point, if someone so desires they can improve OpenSslCachingKeyMaterialProvider to become an LRU cache (or introduce a new class). Then you could use a cache size of two.

@zhangweikop
Copy link
Copy Markdown
Contributor Author

zhangweikop commented Mar 24, 2026

As for AdvancedTlsX509KeyManager itself, I think we still need to handle alias changes?

Yes, that's tracked by #12485 . This PR doesn't actually address that, as it ignores the alias being used. I imagined keeping one older version around to handle in-progress handshakes. Since you're talking about every configuration having its own alias name, if the key manager mutates faster than handshakes complete (such that the alias is really old) then we could trigger an error in some way.

In our case, the PKI-issued certificates have a relatively long lifetime, so over the application process’s three-month lifecycle, reloads occur only a few times.

Okay. I'm fine with using a new alias each mutation in this key manager, but that should be consistent; for example, getPrivateKey() shouldn't return the new key when passed the old alias. That applies equally to all methods that receive the alias.

And we won't have the "this sure seems like a lot of mutations" warning.

At that point, if someone so desires they can improve OpenSslCachingKeyMaterialProvider to become an LRU cache (or introduce a new class). Then you could use a cache size of two.

I think the idea of "keeping one older version around" makes sense.
Updated the change to keep the previous one. Previous keyinfo can be accessed using the previous alias.
While the method chooseClientAlias will always return "current".

Also removed the warning part.
Will directly rely on the OpenSslCachingKeyMaterialProvider PR for the cache eviction support.

Copy link
Copy Markdown
Member

@ejona86 ejona86 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Only minor comments.

Comment thread util/src/main/java/io/grpc/util/AdvancedTlsX509KeyManager.java
Comment thread util/src/main/java/io/grpc/util/AdvancedTlsX509KeyManager.java Outdated
@ejona86 ejona86 added the kokoro:run Add this label to a PR to tell Kokoro the code is safe and tests can be run label Mar 27, 2026
@grpc-kokoro grpc-kokoro removed the kokoro:run Add this label to a PR to tell Kokoro the code is safe and tests can be run label Mar 27, 2026
@ejona86 ejona86 changed the title update AdvancedTlsX509KeyManager to support key alias for reloaded cert util: update AdvancedTlsX509KeyManager to support key alias for reloaded cert Mar 27, 2026
@ejona86 ejona86 merged commit b16da37 into grpc:master Mar 27, 2026
15 of 17 checks passed
@ejona86
Copy link
Copy Markdown
Member

ejona86 commented Mar 27, 2026

Thank you!

kannanjgithub pushed a commit to kannanjgithub/grpc-java that referenced this pull request May 4, 2026
…ded cert (grpc#12686)

## Overview

Make the alias in `AdvancedTlsX509KeyManager` dynamic so it can be used
with Netty's `OpenSslCachingX509KeyManagerFactory` to update key
material after reload.

Fixes grpc#12670
Fixes grpc#12485

## Problem

When using `SslProvider.OPENSSL`, each TLS handshake must encode Java
key material into a native buffer consumed by OpenSSL, which can account
for ~8% of server CPU. Netty's `OpenSslCachingX509KeyManagerFactory`
avoids this by caching the encoded buffer keyed by alias — but the
previous implementation always returned `"default"`, so the factory
could never detect credential rotations and create a new cache entry on
cert reload.

## Details

- The alias is now set to `key-<N>` (e.g. `key-1`, `key-2`, ...) and
  incremented on every `updateIdentityCredentials` call, ensuring the
  same alias always maps to the same key material.
- One prior key value is kept to allow consistent handshaking during key
  changes.
chrisvest added a commit to netty/netty that referenced this pull request May 12, 2026
… cert rotation (#16523)

### Motivation: 

The current `OpenSslCachingKeyMaterialProvider` does not evict stale
entries after a cert rotation.
This is related to a performance concern when using grpc-java
(grpc/grpc-java#12670)

###  Modification:
Added `evictStaleEntries()`, which removes cached entries whose alias is
no longer recognized by the `X509KeyManager`. It is called on a cache
miss when new material is successfully loaded, so stale entries from
rotated credentials are pruned before inserting the new one.

 ### Result:
Better support for cert rotation. 
  Related discussion:
  grpc/grpc-java#12686
  grpc/grpc-java#12670

---------

Co-authored-by: Chris Vest <[email protected]>
chrisvest added a commit to netty/netty that referenced this pull request May 12, 2026
…e entries after cert rotation (#16802)

Auto-port of #16523 to 5.0
Cherry-picked commit: 40b824b

---
### Motivation: 

The current `OpenSslCachingKeyMaterialProvider` does not evict stale
entries after a cert rotation.
This is related to a performance concern when using grpc-java
(grpc/grpc-java#12670)

###  Modification:
Added `evictStaleEntries()`, which removes cached entries whose alias is
no longer recognized by the `X509KeyManager`. It is called on a cache
miss when new material is successfully loaded, so stale entries from
rotated credentials are pruned before inserting the new one.

 ### Result:
Better support for cert rotation. 
  Related discussion:
  grpc/grpc-java#12686
  grpc/grpc-java#12670

Co-authored-by: wzhang <[email protected]>
Co-authored-by: Chris Vest <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Performance overhead on handshake when using AdvancedTlsX509KeyManager with OPENSSL provider

3 participants