March 2022 ~ Ahmed AbouZaid!

In Q1 2022, my friend Islam Wazery and I were working on an interesting enhancement for the open-source Vault Kubernetes Operator, Bank-Vaults.

It's one of my biggest open-source contributions recently. In this meta post, I'd like to share some details about the problem we were trying to solve, the goal, the available solutions, the implementation, and the challenges we faced while working on the new feature.

1. Intro

Anyone who has used Kubernetes knows that Secret resources are encoded (base64), not encrypted, so you probably need another solution to manage your secrets and sensitive data. HashiCorp Vault is one of the best tools for that purpose.

In case you haven't used Vault before, here is a short intro from its docs:

Vault is an identity-based secrets and encryption management system. A secret is anything that you want to tightly control access to, such as API encryption keys, passwords, or certificates. Vault provides encryption services that are gated by authentication and authorization methods. Using Vault's UI, CLI, or HTTP API, access to secrets and other sensitive data can be securely stored and managed, tightly controlled (restricted), and auditable.

HashiCorp already provides resources to install Vault on Kubernetes as well as a Helm chart for Vault. However, there is no official solution from HashiCorp to manage Vault itself on Kubernetes. And here comes Bank-Vaults, the Vault Swiss Army Knife!

Bank-Vaults by Banzai Cloud is an open-source umbrella project that provides various tools (Operator, Configurer, Vault Env injector, and more) for ease of use and operation of HashiCorp Vault.

The most exciting part here is the Vault Operator by Bank-Vaults, which allows you to manage Vault on Kubernetes. And based on my research, it's the only operator on the market for Vault, so I started a PoC to use it in production.

2. Problem

After the initial setup, it seemed that the Bank-Vaults operator was mature and production-ready. It had many features, like bootstrapping, sealing, unsealing, cloud backends, and nearly everything else we needed, but it was missing an important one: it didn't support full Vault management! It handled the creation of Vault's config (policies, secrets engines, auth methods, etc.), but it didn't handle the removal of that config! That was confirmed by issue #605, which had been unresolved for more than 2 years (since Aug 2019)!

Next, I checked the operator code, and it turned out that the operator only handles create/update; it has no mechanism for config removal. No one had fixed that because it's a full feature that needs a lot of work (well, that's why the issue stayed open for more than 2 years).

That means the operator doesn't fully manage Vault! Unfortunately, this is a deal-breaker for using the operator in production. There are several ways to fix that and remove the unmanaged config. In the following sections, I dive deeper into the available mechanisms for handling config removal, but first, let's set the goal.

3. Goal

I already have experience managing Vault with Terraform, but doing that on Kubernetes would be a snowflake, since it needs an extra stateful tool! I like to use Terraform for infrastructure like Kubernetes clusters, but not for apps. Hence, I didn't want to go that way.

So the ultimate goal is that the operator should be able to manage Vault completely. It should add and remove Vault config like policies, secrets engines, auth methods, etc. And that should be done using the Kubernetes ecosystem in a cloud-native approach.

4. Available solutions

Based on my previous experience with code, infrastructure as code, and configuration management tools, there are several ways to achieve config removal, and each one of them has pros and cons.

4.1 Purge anything not in the config

The first mechanism is the purge approach, where the operator removes anything not in the configuration. It compares the Bank-Vaults config with what is actually configured in Vault and removes whatever exists only in Vault.

This approach is somewhat radical: it doesn't allow any manual changes, and any change made outside the configuration will be removed. But the good side is that, well, it doesn't allow any manual changes! So the configuration is the source of truth. However, there is a mitigation that allows some manual changes by excluding some sections of the config. I will discuss it in the implementation section.
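At its core, the purge approach is a set difference. Here is a minimal sketch in Go, assuming each config item can be identified by name; the function and policy names are made up for illustration and are not taken from the Bank-Vaults code.

```go
package main

import (
	"fmt"
	"sort"
)

// unmanaged returns the names that exist in Vault (actual) but are missing
// from the operator config (desired). These are the purge candidates.
func unmanaged(desired, actual []string) []string {
	managed := make(map[string]struct{}, len(desired))
	for _, name := range desired {
		managed[name] = struct{}{}
	}
	var extra []string
	for _, name := range actual {
		if _, ok := managed[name]; !ok {
			extra = append(extra, name)
		}
	}
	sort.Strings(extra) // stable order for logging and testing
	return extra
}

func main() {
	desired := []string{"app-read", "app-write"}
	actual := []string{"app-read", "app-write", "manual-debug"}
	fmt.Println(unmanaged(desired, actual)) // only "manual-debug" is left over
}
```

Note that no previous state is needed: the comparison only uses the current config and the current content of Vault.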

4.2 Compare differences between the old and new config

The second mechanism is the last-diff approach, where the operator compares the old and the new config and removes anything that is no longer in the new config. This approach is "semi-stateful": you need both the old config and the new config to compare them. It allows manual changes outside the operator, but the operator is only aware of the most recent change.
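A minimal sketch of the last-diff idea in Go, again with made-up names. The only inputs are the previous and the new config, so Vault itself is never listed.

```go
package main

import "fmt"

// removedSinceLastApply returns names present in the previous config but
// absent from the new one. Only these would be deleted from Vault.
func removedSinceLastApply(previous, next []string) []string {
	keep := make(map[string]bool, len(next))
	for _, name := range next {
		keep[name] = true
	}
	var removed []string
	for _, name := range previous {
		if !keep[name] {
			removed = append(removed, name)
		}
	}
	return removed
}

func main() {
	previous := []string{"app-read", "app-write"}
	next := []string{"app-read"}
	// A policy created by hand in Vault never appears in either list,
	// so it can never become a deletion candidate.
	fmt.Println(removedSinceLastApply(previous, next))
}
```

The weak spot is visible in the signature: if the previous config is lost or an apply is skipped, the operator no longer knows what it should clean up.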

4.3 Manage changes statefully

The third mechanism is the diff approach, where the operator maintains a state of all its operations and, on any new change, compares the desired config with that state (this is the Terraform way). This way is fully stateful, which allows tracking the changes done by the operator and also allows manual changes outside the operator.

4.4 Handle config individually

Finally, the fourth mechanism is the flag approach, where the operator manages each config item according to a flag. For example, each policy in the operator config could have a field called "state" whose value is "present" or "absent" (this is the style of config management tools like Ansible). With this solution, it's possible to have both managed and unmanaged config, but the biggest downside is that you need to deal with the config at the individual level.
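A rough sketch of the stateful idea in Go. The State type and its methods are hypothetical and only illustrate the bookkeeping; a real tool would also have to persist the state somewhere.

```go
package main

import (
	"fmt"
	"sort"
)

// State records every object the operator itself created (hypothetical type).
type State struct {
	Owned map[string]bool
}

// Plan compares the desired config with the recorded state and returns what
// to create and what to delete. Objects that were never recorded in the
// state (for example, manual changes) are not touched.
func (s *State) Plan(desired []string) (create, remove []string) {
	want := make(map[string]bool, len(desired))
	for _, name := range desired {
		want[name] = true
		if !s.Owned[name] {
			create = append(create, name)
		}
	}
	for name := range s.Owned {
		if !want[name] {
			remove = append(remove, name)
		}
	}
	sort.Strings(remove) // map iteration order is random, so sort the result
	return create, remove
}

// Apply updates the state after the planned changes have been executed.
func (s *State) Apply(create, remove []string) {
	for _, name := range create {
		s.Owned[name] = true
	}
	for _, name := range remove {
		delete(s.Owned, name)
	}
}

func main() {
	state := &State{Owned: map[string]bool{"app-read": true, "app-write": true}}
	create, remove := state.Plan([]string{"app-write", "app-admin"})
	fmt.Println("create:", create, "remove:", remove)
	state.Apply(create, remove)
}
```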
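To make the flag style concrete, here is a small Go sketch. The Policy type, the field names, and the choice to treat an empty state as "present" are all assumptions for illustration; Bank-Vaults does not have such a field.

```go
package main

import "fmt"

// Policy is a hypothetical config entry with an Ansible-style state flag.
type Policy struct {
	Name  string
	State string // "present" or "absent"; empty is treated as "present" (an assumption)
}

// splitByState separates policies to create or update from policies to delete.
func splitByState(policies []Policy) (ensure, remove []string, err error) {
	for _, p := range policies {
		switch p.State {
		case "", "present":
			ensure = append(ensure, p.Name)
		case "absent":
			remove = append(remove, p.Name)
		default:
			return nil, nil, fmt.Errorf("policy %q: unknown state %q", p.Name, p.State)
		}
	}
	return ensure, remove, nil
}

func main() {
	ensure, remove, err := splitByState([]Policy{
		{Name: "app-read", State: "present"},
		{Name: "legacy", State: "absent"},
	})
	if err != nil {
		panic(err)
	}
	fmt.Println("ensure:", ensure, "remove:", remove)
}
```

The downside mentioned above shows up directly: an item only gets removed if someone keeps its entry in the config and flips the flag.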

5. Implementation

In the cloud-native era, the first approach looks the most suitable, since full management is assumed: anything that is not in the config gets removed. To mitigate that behavior, it's possible to exclude some sections, like policies or auth methods, so they can have manual changes if needed.

Vault has 7 main configuration sections:

Audit
Auth
Groups
GroupAliases
Plugins
Policies
Secrets

Each section already has the "add" mechanism, which creates the config in Vault. The goal is to add the "remove" mechanism to get full CRUD (create, read, update, and delete). However, the "adding" code didn't follow Golang style and needed to be refactored. So for each section, the code was refactored first, then the "removing" code was added.

Let's take policies as an example (the same applies to all 7 sections mentioned above); the "removing" part works as follows:

Bank-Vaults operator reads its config file with managed policies.
Then, it calls Vault to get all already configured policies.
Then, it compares what's in the config (the desired state) with Vault (the actual state).
Finally, if there are differences, then the Bank-Vaults operator calls Vault to delete the unmanaged policies.
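The four steps above can be sketched in Go as follows. This is not the actual Bank-Vaults code: the PolicyClient interface and the in-memory fakeVault are hypothetical stand-ins, so the example runs without a Vault server.

```go
package main

import (
	"fmt"
	"sort"
)

// PolicyClient is a hypothetical, minimal view of the Vault calls needed here.
type PolicyClient interface {
	ListPolicies() ([]string, error)
	DeletePolicy(name string) error
}

// fakeVault is an in-memory stand-in for Vault, for illustration only.
type fakeVault struct {
	policies map[string]bool
}

func (f *fakeVault) ListPolicies() ([]string, error) {
	names := make([]string, 0, len(f.policies))
	for name := range f.policies {
		names = append(names, name)
	}
	sort.Strings(names)
	return names, nil
}

func (f *fakeVault) DeletePolicy(name string) error {
	delete(f.policies, name)
	return nil
}

// purgeUnmanagedPolicies deletes every policy found in Vault that is not
// listed in managed, and returns the names it deleted.
func purgeUnmanagedPolicies(c PolicyClient, managed []string) ([]string, error) {
	// Step 1: the managed policies come from the operator config.
	desired := make(map[string]bool, len(managed))
	for _, name := range managed {
		desired[name] = true
	}
	// Step 2: ask Vault for everything that is currently configured.
	actual, err := c.ListPolicies()
	if err != nil {
		return nil, fmt.Errorf("listing policies: %w", err)
	}
	// Steps 3 and 4: compare desired with actual and delete the leftovers.
	var deleted []string
	for _, name := range actual {
		if desired[name] {
			continue
		}
		if err := c.DeletePolicy(name); err != nil {
			return deleted, fmt.Errorf("deleting policy %q: %w", name, err)
		}
		deleted = append(deleted, name)
	}
	return deleted, nil
}

func main() {
	vault := &fakeVault{policies: map[string]bool{"app-read": true, "manual-debug": true}}
	deleted, err := purgeUnmanagedPolicies(vault, []string{"app-read"})
	if err != nil {
		panic(err)
	}
	fmt.Println("deleted:", deleted)
}
```

A real implementation would also have to skip Vault's built-in policies, such as root and default, which Vault does not allow deleting.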

The final step was creating E2E tests to run in the CI (GitHub Actions). The tests check different cases, like removing a config while the purge option is disabled, fully enabled, or partially enabled. Now let's take a look at the challenges I had while working on this feature.

6. Challenges

In the following sections, I'd like to share the top challenges while introducing full Vault management in the Bank-Vaults operator.

6.1 Project complexity

Bank-Vaults is not just the operator; it's an umbrella project to work with Vault. It's a mono repo with many shared parts. For example, the operator relies on a CLI tool with the same name.

Hence, the first challenge was to understand the project structure, where exactly to make the changes, and how they could affect the rest of the project.

6.2 Refactoring the write path

After a deep dive into the project, it was clear what I should change, and where, so that Bank-Vaults could fully manage Vault (so it can add and remove config in Vault). However, the write path code in the operator (which is responsible for creating and updating managed config) didn't follow the Golang style. It was more like Python written in Golang. It reminded me of when I wrote Golang for the first time, coming from a Python background.

Leaving the write path code as it was would have made the new code awkward and redundant. So the first step was refactoring the "write path" code, then adding the "remove path" code (which is responsible for removing any unmanaged config). And this was the second challenge to solve before the actual implementation.

6.3 Only generic acceptance tests

Another challenge was that the part I wanted to change didn't have any unit tests; only generic acceptance tests were available, which made things harder to change. I needed to pay extra attention to ensure I didn't break anything while refactoring the existing code and introducing the new feature. That also meant I should write some E2E tests to avoid this situation in the future.

6.4 Coordination

As I mentioned before, this feature is fairly big, and it was implemented by 2 people (my friend Wazery and me). At the same time, it was a project neither of us had worked on before, and we hadn't worked together before. So we needed to make sure that everything was clear and that both of us were aligned to deliver this feature with high quality.

7. Result

With PR #1538, Bank-Vaults v1.15.1 is able to fully or partially purge unmanaged configuration in Vault.

The user has the option to fully or partially purge unmanaged config as shown here:

purgeUnmanagedConfig:

  # This will purge any unmanaged config in Vault.
  enabled: true

  # This will prevent purging unmanaged config for secret engines in Vault.
  exclude:
    secrets: true

To avoid a behavior change, and since this feature is destructive, it is disabled by default. The user needs to enable it explicitly in the Bank-Vaults config. And as usual, it's recommended to test it in a non-production environment first.
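The decision implied by the config above can be sketched in Go like this. The PurgeConfig type and its field names are hypothetical and only mirror the YAML keys; they are not the actual Bank-Vaults types.

```go
package main

import "fmt"

// PurgeConfig mirrors the shape of the purgeUnmanagedConfig block above.
// The Go type and field names are hypothetical, for illustration only.
type PurgeConfig struct {
	Enabled bool
	Exclude map[string]bool // section name -> excluded from purging
}

// ShouldPurge reports whether unmanaged entries in the given section
// may be removed.
func (c PurgeConfig) ShouldPurge(section string) bool {
	return c.Enabled && !c.Exclude[section]
}

func main() {
	cfg := PurgeConfig{Enabled: true, Exclude: map[string]bool{"secrets": true}}
	for _, section := range []string{"policies", "secrets"} {
		fmt.Printf("%s: purge=%v\n", section, cfg.ShouldPurge(section))
	}
}
```

With this shape, the zero value of PurgeConfig never purges anything, which lines up with keeping the feature disabled unless the user turns it on.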