VCF9 BUG: After deleting a VCF instance on VCF Operations, no VCF instance can be imported anymore


Introduction

While preparing an article that has not yet been published, I ran into an interesting VCF bug. For the article, I had to delete a VCF instance from my VCF Operations. As background: I currently have three VCF instances onboarded in my fleet. Since I had to reinstall one of them, I removed it according to the official instructions, and that’s when the problems started.

Error pattern and impact

At first glance, the VCF instance can be deleted normally and everything appears to work. The error only surfaces later, when the Fleet Manager (the fleet management appliance) is needed after the old VCF instance has been deleted. Typical errors:

  • Inventory Sync fails with the message that SDDC from instance X is not reachable
  • Deployment of a new VCF instance to the existing instance of VCF Operations fails

Error in the VCF Installer (screenshot)

The error shown in the screenshot occurs during the “Join the existing fleet management appliance” step of the VCF deployment. A closer look at the error message shows an attempt to log in to an SDDC Manager:

"exceptionMessage":"Could not get API token from SDDC Manager vcf09-e02-sddc.lab.vcf.

Coincidentally, that SDDC Manager belongs to the very VCF instance that was previously removed. A search showed that configuration remnants of it were still present in Operations. Even after I deleted these remnants, the error persisted. You don’t want to know how many times I deployed and deleted a VCF instance in the last two days.
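To see at a glance which SDDC Manager the installer is still trying to reach, the host name can be pulled out of the error text. The following is only a minimal sketch: it assumes the message keeps the "from SDDC Manager <fqdn>" wording quoted above, and the function name extract_sddc_host is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper (not part of any VCF tooling): read error text on
# stdin and print the host name that follows "from SDDC Manager ".
# Assumption: the message format matches the one quoted in this post.
extract_sddc_host() {
    # First sed: keep only the host-like token after the marker phrase
    # (-n plus the p flag prints matching lines only).
    # Second sed: drop a trailing dot left over from the sentence end.
    sed -n 's/.*from SDDC Manager \([A-Za-z0-9.-]*\).*/\1/p' | sed 's/\.$//'
}
```

Piping the installer's error text into extract_sddc_host prints just the FQDN, which can then be compared with the SDDC Managers that should still exist in the fleet.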

Troubleshooting

After searching forever, I found an entry in the Fleet Manager database that explains the behavior, and that’s also the point where we end up in the danger zone. So buckle up, take a snapshot of the Fleet Manager, and jump into the SSH session.

SDDC Manager entries in the Fleet Manager DB (screenshot)

I must apologize for the screenshots, but I’m glad I took them at all. The screenshot shows that the old SDDC Manager is still listed in the vm_lcops_sddc_manager table. In total, four SDDC Managers were listed here, including their certificates. Together with the remnants I found in Operations, this is why the VCF Installer was blocked and my inventory sync failed. When a VCF instance is deleted, its SDDC Manager entry apparently is not removed from the fleet database. I had exactly the same problem when switching from the VCF 9 public beta to the final release. At the time, I attributed it to the beta installation and did not spend much time troubleshooting.

Problem solving

Since there is no official solution yet, here is my unofficial solution to the problem. I have tried this several times and was able to verify that it works. Would I do this in a production environment? Definitely not. You have been warned.

First, the integration must be deleted, just as described in the official guide.

  • Administration -> Integrations - Delete Accounts

After that, the cloud proxy must be deleted, as this step appears to have been overlooked in the official guide.

  • Administration -> Cloud Proxies - Delete Cloud Proxy

The next step is to delete the deployment target.

  • Fleet Management -> Lifecycle -> Settings -> Deployment Target - Delete vCenter
  • If the vCenter cannot be deleted, trigger an inventory sync (and double-check that the integration is still deleted).

At that point, I wiped my removed VCF instance.

To delete the SDDC entry in the Fleet Manager DB, connect to the appliance via SSH as the root user.

Connect to the DB:

su - postgres
/opt/vmware/vpostgres/current/bin/psql -U postgres vrlcm

Find your SDDC Manager entries:

SELECT sddcmanagerhostname,vmid FROM vm_lcops_sddc_manager;

You will receive a list of host names and their vmid values.

vrlcm=# SELECT sddcmanagerhostname,vmid FROM vm_lcops_sddc_manager;

  sddcmanagerhostname   |                 vmid                 
------------------------+--------------------------------------
 vcf09-e03-sddc.lab.vcf | 0838e38b-ba5c-40e3-b50e-2c86f0b2cf3c
 vcf09-e01-sddc.lab.vcf | 06c98f38-d855-4b4a-b62c-f81fae9158e0
 vcf09-sddc.lab.vcf     | 52b7a8cc-7d41-4c30-ad5c-960eb6db515d
(3 rows)
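With more than a handful of entries, spotting the stale row by eye gets error-prone. As a hedged sketch, the list can be filtered against the SDDC Managers that are supposed to exist. The function name stale_entries is hypothetical, and the input is assumed to be "hostname|vmid" rows, which is what psql prints in unaligned, tuples-only mode (-A -t).

```shell
#!/bin/sh
# Hypothetical helper: read "hostname|vmid" rows on stdin and print every row
# whose host name is NOT among the expected SDDC Managers passed as arguments.
stale_entries() {
    while IFS='|' read -r host vmid; do
        # Skip blank lines in the input.
        if [ -z "$host" ]; then
            continue
        fi
        keep=no
        for expected in "$@"; do
            if [ "$host" = "$expected" ]; then
                keep=yes
            fi
        done
        if [ "$keep" = no ]; then
            printf '%s|%s\n' "$host" "$vmid"
        fi
    done
    return 0
}

# Example call on the appliance (as the postgres user), listing the
# SDDC Managers that should remain:
#   /opt/vmware/vpostgres/current/bin/psql -U postgres -A -t \
#     -c "SELECT sddcmanagerhostname,vmid FROM vm_lcops_sddc_manager;" vrlcm \
#     | stale_entries vcf09-e03-sddc.lab.vcf vcf09-e01-sddc.lab.vcf vcf09-sddc.lab.vcf
```

Any row this prints belongs to an SDDC Manager that is no longer expected in the fleet and is a candidate for cleanup. It is worth verifying each one manually before deleting anything.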

Delete the stale SDDC entry, replacing xxxx with the vmid of the removed instance:

DELETE FROM vm_lcops_sddc_manager WHERE vmid = 'xxxx';
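Because a mistyped WHERE clause on this table could remove the wrong SDDC Manager, it may be worth generating the statement instead of typing it. The sketch below is an illustration only: build_delete_sql is a hypothetical name, the check assumes the vmid is always a UUID as in the listing above, and the RETURNING clause (standard PostgreSQL) makes psql echo the row that was actually deleted.

```shell
#!/bin/sh
# Hypothetical helper: print a DELETE statement for exactly one vmid.
# Fails without printing SQL if the argument does not look like a UUID,
# so a typo or stray quote cannot widen the delete.
build_delete_sql() {
    vmid=$1
    if ! printf '%s\n' "$vmid" | grep -Eq '^[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}$'; then
        echo "not a UUID: $vmid" >&2
        return 1
    fi
    # RETURNING shows which row was removed, as a confirmation.
    printf "DELETE FROM vm_lcops_sddc_manager WHERE vmid = '%s' RETURNING sddcmanagerhostname, vmid;\n" "$vmid"
}

# Example use on the appliance (as the postgres user); <vmid> is a placeholder:
#   build_delete_sql <vmid> | /opt/vmware/vpostgres/current/bin/psql -U postgres vrlcm
```

The function only prints SQL and never touches the database itself, so the statement can be reviewed before it is piped into psql. The snapshot of the Fleet Manager taken earlier remains the real safety net.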

Reboot the Fleet Manager.

Summary

Congratulations, we have hopefully survived the open-heart surgery. After that, I was able to simply press Resume in the VCF Installer, and the VCF instance installation ran smoothly.

Installation complete (screenshot)

Of course, I can’t say for sure whether all hidden entries from the old SDDC have been removed from the environment. The only thing to do here is wait until the bug is officially fixed. As I already mentioned, I was able to reproduce the behavior several times. Now that this problem has been solved, I can get back to working on the article I had originally planned.
