Imagine. Your important website/customer portal/service tools/VPN/… goes down unexpectedly. Everyone is running around stressed to the max. After trying out several things, someone discovers that a certificate has expired.
Sounds familiar? You’re not alone. Effective Certificate Lifecycle Management is a process many organizations struggle with. According to the 2023 State of Machine Identity Management Report, 77% of the responding organizations said they had at least two significant outages caused by expired certificates in the past 24 months, and 62% say they don’t know precisely how many certificates they have. Even when they do know, renewals are often overlooked, or the people who knew how to perform them have left the organization or moved on to other assignments, so certificates expire and cause outages. Important contingency plans, such as revoking and replacing certificates, possibly en masse, are also often missing. In addition, responsibilities are often unclear or assigned to teams that have no real means to exercise them, which adds futility and frustration to the outages.
To reduce these problems, we propose a maturity model of incremental steps, each building on the previous one. It first reduces outages, then makes the process more efficient and convenient, and finally introduces the ability to replace certificates quickly and en masse if need be. The maturity model consists of five steps:
Step 1: Policy
Having the right policy is the foundation of everything. It provides the management backing, discipline, and theoretical basis.
One of the most important factors in the policy is to correctly assign responsibilities. The worst possible form is to assign the responsibility for a certificate to a team that does not have any way to either take action themselves or enforce it on the teams that can. This takes away the incentive of the teams that can take action, because when things hit the fan, another team takes the blame anyway. On that team, the term “scapegoat” will quickly become commonplace, leading to all kinds of HR issues.
There are two ways to assign responsibility effectively:
- Assign the responsibility to the owner of the certificate, in other words the owner of the asset(s) that actually consume the certificate. This owner is also responsible for the correct working of the asset, and thus has a vested interest. She generally has access to the asset or controls the teams that do. Given an SLA, she can also draw on experts inside or outside the organization for advice.
- Assign responsibility to a central team and equip them with sufficient authority and management backing to be able to enforce action. This team can then co-ordinate and manage the teams that own the certificates.
Besides responsibilities, a policy could prescribe the principles for selecting CAs; handling new certificate types and templates; revocation and validation; deploying trust certificates; handling major and security incidents; dealing with noncompliance; and changing the policy itself.
Step 2: Establish process on known certificates
Once the policy is in place, the next step is to establish a process to deal with the certificates you already know about. While admittedly this doesn’t do anything for certificates you don’t know about, it does give a partial reduction in outages and so shortens the time to value compared to waiting until all certificates have been discovered. It also lays more groundwork for the next steps.
A good way to do this is to add an agenda point to an already existing meeting where all the asset owners are present, for example a Service Delivery Meeting. If such a meeting does not exist, create a new one. During this meeting, the certificates that require action are discussed, and the owners can come forward and take action on them. If they do not, or if the owner is unknown, then depending on the choice of policy either action can be enforced, or the owner can be held responsible after an incident flushes them out. Either way, after a few of these meetings, responsiveness increases dramatically, and with it the number of incidents starts to decrease.
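As an illustration of how the list of certificates requiring action could be prepared for such a meeting, here is a minimal Python sketch. It assumes the known certificates are kept in a simple inventory with a name, an owner, and an expiry date. The `KnownCert` record, the `build_agenda` function, and the 60-day window are hypothetical choices for this example, not part of any particular tool.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class KnownCert:
    """One entry in a hypothetical inventory of known certificates."""
    common_name: str
    owner: str       # empty string when the owner is unknown
    not_after: date  # expiry date


def build_agenda(certs, today, window_days=60):
    """Group certificates that expire within the window (or have already
    expired) by owner, soonest expiry first."""
    cutoff = today + timedelta(days=window_days)
    agenda = defaultdict(list)
    for cert in sorted(certs, key=lambda c: c.not_after):
        if cert.not_after <= cutoff:
            agenda[cert.owner or "UNKNOWN OWNER"].append(cert)
    return dict(agenda)


# Example inventory with made-up names and dates.
inventory = [
    KnownCert("portal.example.com", "web-team", date(2024, 1, 20)),
    KnownCert("vpn.example.com", "", date(2024, 2, 10)),
    KnownCert("api.example.com", "web-team", date(2025, 1, 1)),
]

for owner, items in build_agenda(inventory, today=date(2024, 1, 1)).items():
    print(owner)
    for cert in items:
        print(f"  {cert.common_name} expires {cert.not_after.isoformat()}")
```

Certificates without an owner end up under their own heading, so the meeting can either assign one or escalate according to the policy.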
Step 3: Discovery of unknown certificates
Once you’ve got the process underway for the known certificates, you want to know as much as possible about the ones that are unknown to you. This is the first step where powerful CLM tooling comes into play. Given access, and sometimes an installed agent, it can scan the various assets in your landscape, discover the certificates, and show all of them on a single pane of glass in near-real time. It also adds metadata, like the machine(s) and certificate store(s) where the certificate was discovered. With all this information, you can feed the newly discovered certificates into your already established process and manage them according to your already defined policy.
An important factor in choosing your tooling is the scope of assets it can cover for discovery. Not all tooling is created equal: some products can perform discovery on a far bigger set of assets than others. Careful research into which ones cover your landscape is necessary.
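To give a feel for what discovery involves at its simplest, the following Python sketch reads the certificate a TLS endpoint presents and reduces it to an inventory record. It uses only the standard `ssl` and `socket` modules. The function names and the record layout are assumptions made for this example. It is far narrower than real CLM tooling: it only sees network endpoints, not certificate stores, and because it validates the connection, an endpoint with an expired or untrusted certificate raises an error instead of returning data.

```python
import socket
import ssl
from datetime import datetime, timezone


def fetch_peer_cert(host, port=443, timeout=5.0):
    """Connect to host:port and return the validated peer certificate
    as the dict produced by SSLSocket.getpeercert()."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()


def to_record(host, port, cert):
    """Reduce a getpeercert() dict to the fields an inventory needs."""
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(cert["notAfter"]), tz=timezone.utc
    )
    # "subject" is a tuple of RDNs, each a tuple of (name, value) pairs.
    subject = dict(pair for rdn in cert.get("subject", ()) for pair in rdn)
    return {
        "endpoint": f"{host}:{port}",
        "common_name": subject.get("commonName", ""),
        "not_after": expires,
    }
```

A scan over a list of endpoints would call `to_record(host, 443, fetch_peer_cert(host))` for each one and catch connection and verification errors, which are findings in their own right.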
Step 4: Introduce self-service
Nobody likes having to go through formalities to get what they need. Once your new CLM tooling is set up, it will typically offer self-service enrollment for certificates from your selected PKIs. In most cases, these workflows can also leverage the metadata you already have or that your consumers supply. By this point, self-service becomes the easier and more popular option. It is also faster, and it frees up the expert teams to offer consultancy where their knowledge really counts.
Step 5: Automate everything you can
Once you have gotten this far into the process, repetitive tasks will become obvious. These are prime candidates for automation. Most CLM workflows allow for automation of the certificate processes at the touch of a button (if that is even needed). Automation also becomes increasingly important as certificate lifetimes become shorter, whether through compliance requirements or by choice.
Automation also prepares you for the future. Certificate lifetimes have been steadily decreasing over the past few years, and are expected to be reduced still further. Public TLS certificates are currently limited to roughly one year (398 days at most), and Google announced their intention to further reduce this to 90 days. This means the lifetimes are approaching the threshold where manual enrollment will no longer be a viable option. While this limit applies only to public certificates, it points to a best practice, and in the future shorter lifetimes may be enforced more widely, either by software or by auditors. Moreover, in the past there have already been incidents where mass replacement of certificates was needed without prior warning. Only an automated, powerful CLM sitting in the landscape has any hope of successfully replacing large portions of, or all, the certificates in the landscape in a matter of days or even hours.
In practice, you may find scenarios where tooling brings its own automation, or scenarios where automation is not a feasible option. This is OK. Tooling-provided automation still works as long as the certificates and their metadata are captured on your single pane of glass. If automation is not possible for technical or commercial reasons (for example, with business partners), other methods will have to be used. You can still automate everything you can, which brings down the costs and the risk of error.
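The effect of shorter lifetimes on workload can be made concrete with a little arithmetic. The Python sketch below assumes a renewal is started once a fixed fraction of the certificate lifetime has elapsed. The two-thirds fraction and both function names are assumptions chosen for illustration, not a rule taken from any standard or product.

```python
from datetime import datetime, timezone


def renew_at(not_before, not_after, fraction=2 / 3):
    """Point in time at which renewal should start, given that
    `fraction` of the certificate lifetime has elapsed (assumed rule)."""
    return not_before + (not_after - not_before) * fraction


def renewals_per_year(lifetime_days, fraction=2 / 3):
    """Approximate number of renewals per certificate per year."""
    return 365 / (lifetime_days * fraction)


issued = datetime(2024, 1, 1, tzinfo=timezone.utc)
expires = datetime(2024, 3, 31, tzinfo=timezone.utc)  # 90-day lifetime
print("start renewal at:", renew_at(issued, expires).isoformat())

for days in (398, 90):
    print(f"{days}-day lifetime: {renewals_per_year(days):.1f} renewals per year")
```

Under this assumption, moving from 398-day to 90-day certificates multiplies the number of renewals per certificate by more than four, which is the point where doing them by hand stops scaling.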
Using this maturity model, you can reduce your problems with certificate lifecycle management step by step. The first two steps require no software or license costs and deliver partial value quickly. The next steps build on that foundation, first discovering the unknown certificates, then refining the process with self-service and automation. With all five steps complete, you’ll have a solid Certificate Lifecycle Management process, ready for current and future challenges.