VCF9 - Extend Management Domain Cluster


Introduction

Even though it’s a financially unwise decision right now, I decided to buy a third Minisforum MS-A2 because of the upcoming VCF 9.1 release and because I wanted to test other services like SSP in my lab. Luckily, I still had some RAM in my drawer from the good old days, when you didn’t have to sacrifice your firstborn or donate a kidney to get 128 GB of RAM. Yes, friends of sophisticated over-engineering, I am a blessed man.

But joking aside, this blog post is actually about how I can expand my existing management domain, since I don’t want to reinstall everything. To be honest, I’m not sure yet if it will all work, because my management domain currently consists of two clusters. The first cluster runs on two AMD MS-A2 servers with memory tiering and hosts my fleet, and the second cluster is a nested cluster on MS-01 servers (yes, the ones with the Intel CPUs). On top of that, I’m also running a nested instance of VCF 9 that’s onboarded into my fleet. Of course, not everything can run at the same time, since I don’t have enough RAM and processing power. So it’ll be interesting to see if the whole thing works out somehow.

So if you’re able to read this blog, it must have worked out somehow; if not, you’ll never know.

Let’s Get Ready to Rumble

It all starts off quite innocently: first, unpack the new server, install the NVMe drives, you know the drill. Then I install a fresh ESX 9.0.2 image and configure the network, DNS, NTP, memory tiering, and the Ryzen workarounds. If you want to read exactly what needs to be done, you can check out this article. I’ve written everything down in detail there.

Figure: ESX installer - JetKVM

The goal should be to have a basic ESX9 host that can resolve DNS, has SSH enabled, has NTP working, has over 377 GB of RAM (memory tiering), and has only one network adapter connected. In other words, it should be exactly the same as if I were installing VCF9 from scratch.
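The SSH part of that list can be probed from any machine on the management network before commissioning. This is a minimal Python sketch, not an official VCF tool. It only checks that a TCP port accepts connections, not that a login works.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False
```

Calling `tcp_port_open("esx03.lab.example", 22)` would test SSH on the new host. The FQDN here is a made-up placeholder, so substitute your own.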

Now that these preparations have been made, I can get started on the actual cluster expansion. To do this, I first need to check my network pool settings and, if necessary, adjust the IP ranges for the NFS and vMotion networks.

Network Pools

The network pools are a bit hidden. In VCF 5.x, this was something you configured in the SDDC Manager. In VCF 9, this has now moved to vCenter under Global Inventory -> Hosts -> Network Pools.

Figure: Network Pools

Fortunately, I was a bit more generous here and still have some IP addresses available.
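If you want to double-check how much room is left in a pool range, the arithmetic is easy to script. The sketch below uses only Python's standard `ipaddress` module. The range and the addresses in use are invented for illustration and would have to be copied by hand from the Network Pools view.

```python
import ipaddress


def free_addresses(start: str, end: str, used: set[str]) -> list[str]:
    """List addresses in the inclusive range start..end that are not in use."""
    first = int(ipaddress.IPv4Address(start))
    last = int(ipaddress.IPv4Address(end))
    if last < first:
        raise ValueError("end of range lies before its start")
    in_use = {ipaddress.IPv4Address(a) for a in used}
    return [
        str(ipaddress.IPv4Address(n))
        for n in range(first, last + 1)
        if ipaddress.IPv4Address(n) not in in_use
    ]


# Hypothetical vMotion range with two addresses already taken.
vmotion_free = free_addresses("10.0.20.10", "10.0.20.13", {"10.0.20.10", "10.0.20.11"})
print(vmotion_free)  # ['10.0.20.12', '10.0.20.13']
```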

The TEP IP pool is managed in NSX and, of course, must have two free IP addresses per host. In my case, it just barely works. The pool can be adjusted here as well.
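The two-addresses-per-host rule boils down to a one-line check. Here is a small Python sketch of it. The function name and parameters are my own, and the number of free addresses has to be read from the NSX IP pool view.

```python
def tep_pool_has_room(free_ips: int, new_hosts: int, teps_per_host: int = 2) -> bool:
    """Check whether the TEP pool can serve the hosts that are about to be added."""
    if free_ips < 0 or new_hosts < 0 or teps_per_host < 1:
        raise ValueError("counts must be non-negative and teps_per_host at least 1")
    return free_ips >= new_hosts * teps_per_host


print(tep_pool_has_room(free_ips=2, new_hosts=1))  # True: exactly enough
print(tep_pool_has_room(free_ips=3, new_hosts=2))  # False: 4 addresses needed
```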

Figure: NSX TEP Pool

NSX is significantly more flexible than the VCF network pools in vCenter when it comes to adjusting IP pools. Once everything has been checked, I can add the host.

Host onboarding

Onboarding is no longer done via the SDDC Manager as it used to be, but via the global inventory in vCenter. To do this, go to Global Inventory List -> Hosts -> Unassigned Hosts -> COMMISSION HOSTS.

After that, you’re immediately greeted with a friendly checklist of everything that needs to be done:

- Host for vSAN/vSAN ESA/vSAN Storage workload domain should be vSAN/vSAN ESA/vSAN Storage compliant and certified per the VMware Hardware Compatibility Guide. BIOS, HBA, SSD, HDD, etc. must match the VMware Hardware Compatibility Guide.
- Host has the drivers and firmware versions specified in the VMware Compatibility Guide.
- Host has ESXi installed on it. The host must be preinstalled with supported versions (9.0.2.0.25148076)
- Host is configured with DNS server for forward and reverse lookup and FQDN.
- Hostname should be same as the FQDN.
- Management IP is configured to first NIC port.
- Ensure that the host has a standard switch and the default uplinks with 10Gb speed are configured starting with traditional numbering (e.g., vmnic0) and increasing sequentially.
- Host hardware health status is healthy without any errors.
- All disk partitions on HDD / SSD are deleted.
- Ensure required network pool is created and available before host commissioning.
- Ensure hosts to be used for VSAN workload domain are associated with VSAN enabled network pool.
- Ensure hosts to be used for NFS workload domain are associated with NFS enabled network pool.
- Ensure hosts to be used for VMFS on FC workload domain are associated with NFS or VMOTION only enabled network pool.
- Ensure hosts to be used for vVol FC workload domain are associated with NFS or VMOTION only enabled network pool.
- Ensure hosts to be used for vVol NFS workload domain are associated with NFS and VMOTION only enabled network pool.
- Ensure hosts to be used for vVol iSCSI workload domain are associated with iSCSI and VMOTION only enabled network pool.
- For hosts with a DPU device, enable SR-IOV in the BIOS and in the vSphere Client (if required by your DPU vendor).

Of course, I’ve carefully checked all of this and can confirm it.
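The DNS item on the checklist (forward and reverse lookup for the FQDN) is one that can be verified with a few lines of Python. This is only a sketch using the standard `socket` module. The `forward` and `reverse` parameters are my own addition so the logic can be tried without a real DNS server. The sketch does not check the hostname configured on the ESX host itself.

```python
import socket
from typing import Callable


def dns_matches(
    fqdn: str,
    forward: Callable[[str], str] = socket.gethostbyname,
    reverse: Callable[[str], tuple] = socket.gethostbyaddr,
) -> bool:
    """Return True if fqdn resolves to an IP whose PTR record points back to fqdn."""
    try:
        ip = forward(fqdn)
        ptr_name = reverse(ip)[0]
    except OSError:
        # Lookup failed in either direction.
        return False
    # DNS names are case-insensitive; a trailing dot is only notation.
    return ptr_name.rstrip(".").lower() == fqdn.rstrip(".").lower()
```

With the defaults, `dns_matches("esx03.lab.example")` uses the resolver of the machine it runs on. The FQDN is a placeholder.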

After specifying the correct network pool and entering the correct storage type, FQDN, username, and password, the validation process failed with a certificate error. But why? After all, I created a new self-signed certificate during installation.

Figure: Cert Error

The problem, and this is something the pre-check doesn’t tell you, is that once you’ve rolled out your own certificates in your domain, the host certificate must be issued by the same CA. So I have to create a certificate signing request (CSR) via the ESX GUI, submit it to my Microsoft CA, and import the signed certificate on my host. I love certificates. Not.

If you’ve never done this before, you can do it in the ESX GUI under Host -> Manage -> Security & Users -> Certificates. After the certificate exchange, the validation process completes successfully. The ESX server does not need to be restarted. Once the commissioning is complete, the host should now appear under “Unassigned Hosts.” That means half the work is already done.
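To see in advance whether a host certificate chains up to your own CA, you can try a TLS handshake that trusts nothing but that CA. The Python sketch below uses only the standard `ssl` module. It assumes the CA certificate has been exported to a PEM file. I can't confirm that the commissioning validation performs exactly the same check, so treat it as a rough approximation.

```python
import socket
import ssl
from typing import Optional


def ca_only_context(ca_file: Optional[str]) -> ssl.SSLContext:
    """Build a client context that trusts only the CA certificates in ca_file."""
    # PROTOCOL_TLS_CLIENT enables certificate and host name verification
    # and does not load the system trust store.
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    if ca_file is not None:
        ctx.load_verify_locations(cafile=ca_file)
    return ctx


def host_cert_trusted(fqdn: str, ca_file: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Return True if the certificate served by fqdn verifies against ca_file."""
    ctx = ca_only_context(ca_file)
    try:
        with socket.create_connection((fqdn, port), timeout=timeout) as raw:
            with ctx.wrap_socket(raw, server_hostname=fqdn):
                return True
    except ssl.SSLCertVerificationError:
        # Wrong issuer, self-signed certificate, or host name mismatch.
        return False
```

A call such as `host_cert_trusted("esx03.lab.example", "lab-root-ca.pem")` should return False for a freshly installed host with its self-signed certificate and True after the certificate exchange. Both names are placeholders. Connection problems are not caught and surface as an `OSError`.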

Figure: Unassigned Hosts

Extend Cluster

Now comes the fun part: actually expanding the cluster. To do this, right-click on the existing cluster -> Add Host -> Add Unassigned Hosts. Here, too, the process differs from VCF 5.x, as this task is not performed in the SDDC Manager.

The process itself is pretty straightforward, and the most important thing is that the uplink assignment on the distributed switch is correct. That’s actually the only potential source of error at this stage.

Figure: dVSwitch

Here, too, there is a brief validation step after confirmation, and if no errors occur, the process should run fully automatically. The process can be monitored in both SDDC Manager and vCenter. Personally, I find the view in SDDC Manager a bit clearer than the one in the vCenter Recent Tasks view. You can also check the progress of the host’s NSX configuration in NSX Manager. After about 10 minutes, the whole thing was done and dusted, and my management domain had been successfully expanded to three nodes.

Figure: Finished cluster

To perform a manual validation, I booted up a test VM connected to an NSX network and ran a quick connectivity check to make sure my TEP network was working properly. Unfortunately, with VCF 9.0.2 and the MS-A2, I can no longer see the TEP tunnel status displayed correctly in NSX. The status simply remains gray—unknown. However, this is purely a visual issue, and other 9.0.2 users with the same hardware are experiencing the same problem.

Conclusion

What can I say? I expected the certificate error because I’ve fallen for that one before. But to make this article a bit more useful, I decided to highlight it again. Generally speaking, though, I expected more problems, since one SDDC Manager (the one from my other nested domain) is currently unreachable. Things like an inventory scan don’t run error-free unless all vCenters and SDDC Managers are accessible, and I had already envisioned having to get those components online somehow to complete the expansion. So I was pleasantly surprised.

The most important thing when expanding is careful preparation and ensuring that all pre-checks are carried out. There’s a good reason why the checklist is included in the onboarding dialog. If you’ve followed all the steps, the expansion will indeed go smoothly.
