NSX Expiring Transport Node Certificates


Introduction

In NSX versions 4.1.x and 4.2.0, edge transport nodes and host transport nodes are deployed with certificates that are valid for only 825 days. This is obviously not desirable behavior and has been fixed in newer versions of NSX, although I haven’t seen anything about it in the changelog. In NSX versions 3.x and in 4.2.1 and higher, these certificates are valid for 10 years. But what exactly does that mean?
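To put the numbers into perspective: 825 days is roughly two years and three months. The following is a minimal sketch, assuming GNU date is available on the machine you run it on, that turns a deployment date into an estimated expiry date. The function name and the sample date are my own and purely illustrative. The real validity period starts when the certificate was generated, which may differ slightly from the deployment date.

```shell
#!/bin/sh
# Sketch: estimate when a transport node certificate expires.
# Assumes GNU date (-d option); tn_cert_expiry is a hypothetical helper name.

# usage: tn_cert_expiry <deploy-date YYYY-MM-DD> [validity-days]
tn_cert_expiry() {
    deploy_date="$1"
    validity_days="${2:-825}"   # 825 days on affected versions, 3650 otherwise
    date -u -d "${deploy_date} +${validity_days} days" +%F
}

# Example: a node deployed on 2023-01-01
tn_cert_expiry 2023-01-01        # prints 2025-04-05
tn_cert_expiry 2023-01-01 3650   # prints 2032-12-29
```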

The Problem

First of all, don’t panic. The affected certificates are not visible in the GUI, but NSX Manager will issue a warning and display an error 30 days in advance. If you don’t see these messages in your NSX, you have more than 30 days to respond. Furthermore, the issue only affects transport nodes that were deployed with NSX 4.1.x or 4.2.0. If you have upgraded from NSX 3.x to one of the affected versions or have just deployed VCF version 5.2.x, you are safe.

Why It Matters

As the 825-day period approaches its end, you may encounter certificate expiration issues, potentially affecting NSX component communication and overall platform stability.

If you are affected, there are two scenarios.

  • Scenario A) The certificates will expire at some point in the future. Either you have received a warning in NSX Manager and therefore found my blog, or you are not yet aware of the problem because the certificates will expire sometime between 31 and 825 days from now.
  • Scenario B) The certificates have already expired and transport nodes are disconnected from NSX.

There is a solution for both scenarios. But let’s start with the simple scenario A first.
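The distinction between the two scenarios boils down to how many days the certificate has left. A minimal sketch of that decision, using a hypothetical helper name and the 30-day alarm window mentioned above:

```shell
#!/bin/sh
# Sketch: map "days until expiry" to the scenarios described above.
# classify_scenario is a hypothetical helper; 30 days is the alarm lead time
# mentioned earlier in this post.

classify_scenario() {
    days_left="$1"
    if [ "$days_left" -lt 0 ]; then
        echo "B: expired, node is likely disconnected"
    elif [ "$days_left" -le 30 ]; then
        echo "A: expiring soon, NSX Manager alarm expected"
    else
        echo "A: not yet in the alarm window"
    fi
}

classify_scenario 680
classify_scenario 12
classify_scenario -3
```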

Find the problem

If you don’t know which NSX version you installed and want to check whether you are affected by this issue, Broadcom has a relatively simple solution. There is a script called CARR (Certificate Analyzer, Results and Recovery) that can be run directly on NSX Manager (NSX 4.1.x to 4.2) or on an external client (other NSX versions). Download CARR from the official KB article.

In all cases, the script requires the following ports to be open between the client machine and the three NSX Managers:

  • ssh port 22
  • https port 443
  • corfu port 9000

If running on the NSX Manager directly, ports 443 and 9000 will already be open between the 3 Managers.
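If you run CARR from an external client, it can save some time to check the three ports beforehand. The following is only a sketch: it assumes that nc (netcat) with the -z and -w options is installed on the client, and the function names and manager hostnames are placeholders of my own.

```shell
#!/bin/sh
# Sketch: check that the ports CARR needs are reachable from the client.
# Assumes nc (netcat) with -z/-w support; hostnames below are placeholders.

CARR_PORTS="22 443 9000"   # ssh, https, corfu

# Print one "host port" pair per line for every manager passed in.
port_matrix() {
    for host in "$@"; do
        for port in $CARR_PORTS; do
            echo "$host $port"
        done
    done
}

# Try each pair with a 3 second timeout and report the result.
check_carr_ports() {
    port_matrix "$@" | while read -r host port; do
        if nc -z -w 3 "$host" "$port" </dev/null 2>/dev/null; then
            echo "OK    $host:$port"
        else
            echo "FAIL  $host:$port"
        fi
    done
}

# Usage (hypothetical hostnames):
# check_carr_ports nsx-mgr-01 nsx-mgr-02 nsx-mgr-03
```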

If you want to run the script on the NSX Manager, it must be executed directly from the /root directory. To do this, copy the archive to your NSX Manager or Global NSX Manager using sftp, extract it with tar, and run the script.

tar -xvf carr-1.15.tar.gz
cd carr-1.15
./start.sh -d

Script options:

  • -o = this flag is used to force online mode

  • -t = specify lead time for expiring certificates, between 31 and 825 days.

  • -d = dry run
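Since -t only accepts values between 31 and 825, a small guard can catch typos before the script is started. This is a sketch with a hypothetical helper name. It only prints the command instead of running it, and it assumes that -d and -t can be combined in one call, which I have not verified.

```shell
#!/bin/sh
# Sketch: validate a lead time before handing it to start.sh -t.
# valid_lead_time is a hypothetical helper; the 31..825 range is the one
# listed for the -t option above.

valid_lead_time() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;   # reject empty or non-numeric input
    esac
    [ "$1" -ge 31 ] && [ "$1" -le 825 ]
}

lead_time=90
if valid_lead_time "$lead_time"; then
    # Assumption: -d and -t can be combined in a single call.
    echo "./start.sh -d -t $lead_time"
else
    echo "lead time must be between 31 and 825 days" >&2
fi
```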

I ran the script with the default settings (a lead time of 825 days for expiring certificates) and the output was as follows:

CARR Script Validation Report

Certificate Checks  | Validation Results                           | Probable Fix
API                 | WARNING: Certificate is expiring in 680 days | Certificate is CA signed. Customer should import the new CA signed certificate.
                    | SUCCESS: 10.28.0.3                           |
VIP                 | WARNING: Certificate is expiring in 680 days | Certificate is CA signed. Customer should import the new CA signed certificate.
                    | SUCCESS: 10.28.0.3                           |
STALE-CERTIFICATES  | SUCCESS: No stale certificates found.
APH_AR              | SUCCESS: 10.28.0.3
COMPUTE_MANAGER     | VC(CM): vcf02-vcsa.lab.home: SUCCESS: No issue with certificates found.
LOCAL-MANAGER-PI    | The NSX-manager is not federated. Skipping Local Manager cert validations
SITE-TO-SITE        | The NSX-manager is not federated. Skipping APH_AR cert validations
HOST                | No Host certificate is expiring or has expired
EDGE                | No EDGE node certificate is expiring or has expired
CCP                 | SUCCESS: 10.28.0.3
APH_TN              | SUCCESS: 10.28.0.3
CBM_CLUSTER_MANAGER | SUCCESS: 10.28.0.3
CBM_CORFU           | SUCCESS: 10.28.0.3

All validations done.

As you can see from the output for HOST and EDGE, my installation is not affected. However, it shows that my API and VIP certificates will expire in 680 days. CARR is also smart enough to recognize that these are certificates signed by a CA and not certificates issued by NSX itself.
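If you save the report to a text file, the two rows that matter for this issue can be picked out with grep. This is a sketch under the assumption that the saved report has one check per line and that the "nothing to do" message reads as in my output above. The helper names and the file name are hypothetical.

```shell
#!/bin/sh
# Sketch: pick the HOST and EDGE rows out of a saved CARR report.
# Assumes one check per line, as in the output above; tn_rows and
# tn_affected are hypothetical helper names.

# Print only the HOST and EDGE rows of the report given on stdin.
tn_rows() {
    grep -E '^(HOST|EDGE)[[:space:]]'
}

# Succeed (exit 0) if any HOST/EDGE row is NOT the "nothing expiring" message.
tn_affected() {
    tn_rows | grep -qv '[[:space:]]No .*certificate is expiring or has expired'
}

# Usage (report.txt is a placeholder file name):
# if tn_affected < report.txt; then echo "transport nodes need attention"; fi
```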

To trigger the transport node (TN) certificate replacement, the environment details must be entered in the existing file validation_config.yaml, which is located in the same folder as start.sh. Below is my example YAML, in which I have disabled validation of all certificates except HOST and EDGE.

# user interface to provide the validation config
# user can specify if any certificate validation needs to be skipped.
# by default all certificate types will be validated.
# For Hosts , the vCenter cluster names for host must be specified. Script will validate hosts in those clusters only.
# For Edge node, Edge cluster name must be specified. Script will validate edge nodes in those clusters only.



HOST:
  validate: True
  clusters:
    - vcenter_name: vcf02-vcsa.lab.home
      vcenter_cluster_name: sfo-m01-cluster-001
EDGE:
  validate: True
  clusters: 
    - name: cl01
API:
  validate: False
VIP:
  validate: False
CBM-FILE-PERMISSIONS:
  validate: False
CBM_CORFU:
  validate: False
CBM_CLUSTER_MANAGER:
  validate: False
CORFU_SERVER:
  validate: False
CORFU_CLIENTS:
  validate: False
LOCAL-MANAGER-PI:
  validate: False
GLOBAL-MANAGER-PI:
  validate: False
STALE-CERTIFICATES:
  validate: False
APH_TN:
  validate: False
APH_AR:
  validate: False
CCP:
  validate: False
SITE-TO-SITE:
  validate: False
COMPUTE_MANAGER:
  validate: False
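With this many sections, it is easy to leave one enabled by accident. The following sketch lists every section that has validate set to True. It relies on the simple layout shown above (a top-level key followed by an indented validate line) and is not a general YAML parser. The function name is my own.

```shell
#!/bin/sh
# Sketch: list which certificate types are enabled in validation_config.yaml.
# enabled_checks is a hypothetical helper; it relies on the simple layout
# shown above and is not a general YAML parser.

enabled_checks() {
    awk '
        # Remember the most recent top-level key, e.g. "HOST:" -> HOST
        /^[A-Za-z][A-Za-z0-9_-]*:/ { section = $1; sub(/:.*/, "", section) }
        # Print that key when its validate flag is True
        /^[[:space:]]+validate:[[:space:]]*True[[:space:]]*$/ { print section }
    ' "$1"
}

# Usage, from the folder that contains start.sh:
# enabled_checks validation_config.yaml
```

With the example file above, this should list only HOST and EDGE.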

When I now run the script without the dry run option, I see in the output that all certificates except HOST and EDGE are skipped. This effectively prevents certificates that you do not want to replace from being replaced.

CARR Script Validation Report

Certificate Checks  | Validation Results | Probable Fix
API                 | Validation for the ‘API’ is disabled in the input config file
VIP                 | Validation for the ‘VIP’ is disabled in the input config file
STALE-CERTIFICATES  | Validation for the ‘STALE-CERTIFICATES’ is disabled in the input config file
APH_AR              | Validation for the ‘APH_AR’ is disabled in the input config file
COMPUTE_MANAGER     | Validation for the ‘COMPUTE_MANAGER’ is disabled in the input config file
LOCAL-MANAGER-PI    | Validation for the ‘LOCAL-MANAGER-PI’ is disabled in the input config file
SITE-TO-SITE        | Validation for the ‘SITE-TO-SITE’ is disabled in the input config file
HOST                | No Host certificate is expiring or has expired
EDGE                | No EDGE node certificate is expiring or has expired
CCP                 | Validation for the ‘CCP’ is disabled in the input config file
APH_TN              | Validation for the ‘APH_TN’ is disabled in the input config file
CBM_CLUSTER_MANAGER | Validation for the ‘CBM_CLUSTER_MANAGER’ is disabled in the input config file
CBM_CORFU           | Validation for the ‘CBM_CORFU’ is disabled in the input config file

What to do if the transport nodes are already disconnected?

There is a solution for this as well. Unfortunately, CARR can no longer be used for transport nodes that are already disconnected. The solution to this problem is to manually generate new certificates on the transport node and then push them from the transport node to the NSX Manager.

  • SSH to the Transport Node as root user

  • Empty the transport node certificate and private key files

cat /dev/null > /etc/vmware/nsx/host-cert.pem
cat /dev/null > /etc/vmware/nsx/host-privkey.pem
  • Generate a new self-signed TN certificate and key.

For NSX 4.1.x versions prior to 4.1.2.5:

  • Create a temporary openssl config file from the existing openssl config
cat /etc/vmware/nsx/openssl-proxy.cnf > /tmp/tmp-openssl-proxy.cnf
  • Extract the UUID and add it to the temporary openssl config
echo "UID = $(grep -o '<uuid>[^<]*' /etc/vmware/nsx/host-cfg.xml | sed 's/<uuid>//')" >> /tmp/tmp-openssl-proxy.cnf
  • Add the extensions to the temporary openssl config
echo -e "[ req_ext ]\nbasicConstraints     = CA:FALSE\nextendedKeyUsage     = clientAuth\nsubjectKeyIdentifier = hash\nauthorityKeyIdentifier = keyid,issuer" >> /tmp/tmp-openssl-proxy.cnf
  • Replace the certificate. The -days parameter below sets a validity period of 3650 days (10 years)
openssl req -new -newkey rsa:2048 -days 3650 -nodes -x509 -keyout /etc/vmware/nsx/host-privkey.pem -out /etc/vmware/nsx/host-cert.pem -config /tmp/tmp-openssl-proxy.cnf -extensions req_ext

For NSX 4.1.2.5 and higher:

  • Restarting nsx-proxy creates the new cert-key pair:
/etc/init.d/nsx-proxy restart

  • Identify the NSX Manager thumbprint (step 4): SSH to the NSX Manager as the admin user and run
get certificate api thumbprint

To push the new cert-key pair to the Manager, run the following as root on the host or edge (any NSX Manager name or IP can be used):

Edge

su admin -c push host-certificate <Manager hostname-or-IP> username admin thumbprint <thumbprint from step 4>
Password for API user: <enter admin password>

Host

nsxcli -c push host-certificate <Manager hostname-or-IP> username admin thumbprint <thumbprint from step 4>
Password for API user: <enter admin password>
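Before or after these steps, the validity of the node certificate can be inspected with openssl directly on the transport node. The sketch below uses the certificate path from the steps above. The helper names are my own, and I have only assumed that the openssl build on your node supports the -checkend option.

```shell
#!/bin/sh
# Sketch: inspect the validity of a transport node certificate with openssl.
# The default path is the one used in the steps above; cert_end_date and
# cert_expires_within are hypothetical helper names.

TN_CERT="${TN_CERT:-/etc/vmware/nsx/host-cert.pem}"

# Print the notAfter date of the certificate.
cert_end_date() {
    openssl x509 -noout -enddate -in "${1:-$TN_CERT}"
}

# Succeed (exit 0) if the certificate expires within the given number of days.
# Returns 2 if the certificate file cannot be read.
cert_expires_within() {
    cert="$1"
    days="$2"
    [ -r "$cert" ] || return 2
    # -checkend exits 0 when the certificate is still valid after N seconds,
    # so the result is inverted here.
    ! openssl x509 -noout -checkend "$((days * 86400))" -in "$cert" >/dev/null
}

# Usage on the transport node:
# cert_end_date
# cert_expires_within "$TN_CERT" 30 && echo "expires within 30 days"
```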

The official Broadcom KB article can be found here: Broadcom: Alarm For Transport Node Certificate is About to Expire.

Conclusion

At first glance, this problem sounds quite drastic, but if you have monitoring in place and regularly check the status of your NSX installation, the outage can be avoided relatively easily. The most important thing to know, however, is that upgrading your NSX version does not extend existing certificate lifetimes!