Trust and Reliability in Software Delivery
There are many facets to a service’s reliability, and most of them are measured in minutes and milliseconds. These indicators are closely tied to a user’s experience and are what we build our Service Level Objectives around. But how reliable is a service if it exposes your business to a high level of risk? How do you trust that what you’re releasing into production isn’t vulnerable or exploitable? As the SRE team at iHeartMedia, we are responsible for the reliability of the platforms that support our production services. To consider our services completely reliable, we’ve started adding metrics around “trust” in our software delivery pipelines. We want to ensure that when we say a service is reliable, we really mean it.
Building Trust: Shifting Security Left
In partnership with SRE, Infosec started a program called “Trusted Software Delivery” (TSD). The first objective of TSD is to test all artifacts that could make it to production. We test these artifacts on the first commit of code, before releasing to any stage in the environment. This provides a short feedback loop and allows developers to fix vulnerabilities immediately.
The artifacts we scan before releasing into our environments include:
Code: Code that will go into production must pass security tests as part of the delivery pipeline. All code goes through static (SAST) and interactive (IAST) analysis, and the pipeline is gated if a vulnerability threshold is exceeded. This ensures we are not releasing known vulnerabilities into our production registry.
Container Registry: Images are scanned for vulnerabilities before they can be uploaded to our container registries and made available to our clusters.
Images: We baseline our images against CIS Level 1 standards and do not allow non-compliant images into the registry. Our clusters are only able to pull images that are in our trusted container registry.
Hosts: As with our images, our cluster hosts are built to meet CIS baselines and are scanned for vulnerabilities regularly.
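The gating and trusted-registry checks described above can be sketched roughly as follows. This is a minimal illustration, not our actual pipeline code: the per-severity limits in `THRESHOLDS`, the `TRUSTED_REGISTRY` host name, the shape of a scanner finding, and both function names are assumptions made for the example.

```python
from collections import Counter

# Hypothetical limits: the maximum number of findings tolerated per severity.
THRESHOLDS = {"critical": 0, "high": 0, "medium": 5}

# Hypothetical registry host; clusters would only be allowed to pull from it.
TRUSTED_REGISTRY = "registry.example.internal"


def gate(findings, thresholds=THRESHOLDS):
    """Return (passed, violations) for a list of scanner findings.

    Each finding is assumed to be a dict with at least a "severity" key.
    """
    counts = Counter(f["severity"].lower() for f in findings)
    # Keep only the severities whose count exceeds the configured limit.
    violations = {
        sev: counts[sev]
        for sev, limit in thresholds.items()
        if counts[sev] > limit
    }
    return (not violations, violations)


def is_trusted_image(image_ref, registry=TRUSTED_REGISTRY):
    """True when the image reference's host part is the trusted registry."""
    host = image_ref.split("/", 1)[0]
    return host == registry
```

A pipeline step could call `gate` on the scanner output and fail the build when `passed` is false, while an admission check could use `is_trusted_image` to reject pulls from anywhere other than the trusted registry.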
Preventing vulnerable artifacts from making it to production is the first of many steps in the journey to building trust in the safety of your production services. It’s also important that you build a platform that can protect production.
Reliability and Putting It All Together
To ensure the reliability of iHeartMedia services, we have built a platform and a golden path for delivery teams to ship reliable code on resilient infrastructure.
Our aim is to provide a seamless onboarding process for our product teams and give them turn-key functionality that enables them to write secure code and deliver it quickly. One principle we focus on to encourage our shift left is to meet product teams where they do their work. Using GitOps for our CD process moves us closer to development while letting our delivery teams do what they do best: deliver features.
Once in production, we proactively protect our assets. To achieve this, we standardize on a set of core platform services and spend our time gluing them together into a cohesive experience for the delivery teams. Integrated into these core services are security and delivery tools that work together to continuously inspect our running applications and core platform services for threats at each point of the software delivery lifecycle. As with any availability threshold, our observability services alert on any risk that is introduced into the platform, and we act on each alert immediately based on its severity (preferably through automation, but sometimes manually).
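To make the severity-based response concrete, here is a minimal sketch of how an alert might be routed to automation or to a person. The severity levels, alert kinds, action names, and the rule that only high and critical alerts trigger an immediate response are all assumptions for illustration, not a description of our tooling.

```python
from enum import IntEnum


class Severity(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


# Hypothetical automated remediations, keyed by alert kind.
AUTOMATED_ACTIONS = {
    "vulnerable_image": "quarantine_workload",
    "untrusted_registry": "block_deployment",
}


def route_alert(kind, severity):
    """Decide how to respond to an alert: automate, page, or ticket."""
    action = AUTOMATED_ACTIONS.get(kind)
    # Prefer automation for urgent alerts when a remediation is known.
    if severity >= Severity.HIGH and action is not None:
        return ("automate", action)
    # Urgent alerts without a known remediation go to a person.
    if severity >= Severity.HIGH:
        return ("page", "on_call")
    # Everything else is tracked and handled as normal work.
    return ("ticket", "backlog")
```

The ordering reflects the preference stated above: automation first, with manual handling as the fallback when no automated remediation exists.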
Building a platform is time-consuming but valuable work. Communicating your team’s objectives and tracking Key Performance Indicators (KPIs) is the best way to translate your efforts into business value. Here are some example KPIs you can use to measure your platform’s performance and show the value your team is delivering.
| Objective | Build a Reliable Platform |
| KPI | 99.9% platform availability |
| KPI | Change failure rate < 15% |
| KPI | Time to restore service < 1 hour |
| Objective | Repeatable, Predictable, Scalable Delivery Pipeline |
| KPI | Full pipeline runs (including all tests) take less than 30 minutes |
| KPI | New team onboards to platform in less than 1 day |
| Objective | Trust in Software Delivery |
| KPI | 0 vulnerabilities released into production via pipeline |
| KPI | 100% CIS Level 1 compliance for platform |
| KPI | Critical vulnerabilities resolved in < 24 hours |
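As an illustration of how the reliability KPIs above could be computed from raw records, here is a small sketch. The record shapes (a `failed` flag per deployment, `started` and `restored` timestamps per incident) and the function names are assumptions for the example; real data would come from your own deployment and incident tooling.

```python
from datetime import datetime, timedelta


def change_failure_rate(deployments):
    """Fraction of deployments that caused a failure in production."""
    if not deployments:
        return 0.0
    failed = sum(1 for d in deployments if d["failed"])
    return failed / len(deployments)


def mean_time_to_restore(incidents):
    """Average of (restored - started) across incidents, as a timedelta."""
    if not incidents:
        return timedelta(0)
    total = sum((i["restored"] - i["started"] for i in incidents), timedelta(0))
    return total / len(incidents)


def availability(window, downtime):
    """Share of the measurement window during which the platform was up."""
    return 1 - downtime / window


# Example data (made up): ten deployments with one failure, two incidents.
deployments = [{"failed": False}] * 9 + [{"failed": True}]
t0 = datetime(2024, 1, 1, 12, 0)
incidents = [
    {"started": t0, "restored": t0 + timedelta(minutes=30)},
    {"started": t0, "restored": t0 + timedelta(minutes=60)},
]

meets_cfr = change_failure_rate(deployments) < 0.15
meets_ttr = mean_time_to_restore(incidents) < timedelta(hours=1)
meets_availability = availability(timedelta(days=30), timedelta(minutes=43)) >= 0.999
```

For scale, a 99.9% availability target over a 30-day window leaves a budget of about 43 minutes of downtime.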
If building trust and reliability in software delivery is something that you are passionate about, iHeartMedia is looking for you!