Site Reliability Engineering (SRE) Foundation v1.2 Questions and Answers
A bank has been using traditional monitoring tools for ensuring that their systems are available and operating as planned. Their strategic initiatives now include a renewed focus on customer experience as well as identifying ways to scale service.
Why would migrating to an observability approach be important now?
Options:
A. It’s better for managing container workloads and dynamic architectures
B. Monitoring at the component level may no longer provide the right data
C. It is impossible to anticipate all potential problems
D. All of the above
Answer:
D
Explanation:
Comprehensive and Detailed Explanation From Exact Extract:
All the listed reasons correctly describe why observability becomes essential in modern, user-focused, dynamically scaling architectures.
The SRE Workbook and Google Observability guidance both emphasize that traditional monitoring is insufficient in environments where:
Services are distributed
Traffic is unpredictable
Customer experience is a priority
Cloud-native, containerized, or microservice architectures are used
Key excerpts:
From Google’s Observability guidance:
“Monitoring relies on known failure modes; observability enables teams to explore unknown-unknowns and understand complex, dynamic systems.”
From the SRE Workbook:
“As systems scale and architectures shift toward microservices or containers, component-level monitoring provides an incomplete picture. Observability enables teams to understand user impact and system behavior holistically.”
Thus:
A. Observability is critical for containerized and dynamic environments.
B. Component monitoring alone cannot show customer experience or end-to-end reliability.
C. Observability helps teams diagnose issues that could not be predicted in advance ("unknown unknowns").
All statements are correct, making D the correct answer.
Which of the following BEST describes observability?
Options:
A. Monitoring applications to detect problems and anomalies
B. Performing fitness tests and health checks
C. A measure of how well internal states of a system can be inferred from knowledge of its external outputs
D. Collecting data from multiple endpoints to aggregate and observe application performance
Answer:
C
Explanation:
Comprehensive and Detailed Explanation From Exact Extract:
The term observability comes directly from control theory and refers to the ability to infer the internal state of a system from its external outputs. Modern SRE and observability practices adopt this definition.
Google’s Site Reliability Engineering guidance (SRE Book Addendum on Observability) states:
“Observability is a property of a system that allows operators to understand its internal state by examining its outputs such as logs, metrics, and traces.”
This aligns exactly with Option C, the formal definition.
Why the other options are incorrect:
A. Monitoring is part of observability, but observability is much broader.
B. Health checks are simply one signal; they do not represent observability.
C is correct; D. Data collection is a mechanism, not the definition of observability itself.
Thus, C is the correct and academically accurate definition.
Which of the following communication and collaboration practices BEST contribute to the effectiveness of the SRE team?
Options:
A. Project managers share limited data only upon request.
B. Data is flowing freely within and around the SRE team.
C. Data in SRE should be managed separately from others.
D. Team members should manage their own data discretely.
Answer:
B
Explanation:
Comprehensive and Detailed Explanation From Exact Extract:
SRE is built on transparency and broad information sharing. The SRE Book states: “High-quality operations require that information flows freely between product development, SRE, and associated teams.” (SRE Book – Chapter: Communication and Collaboration). Effective incident management also depends on complete data availability: “Centralized, shared information reduces cognitive load and improves incident resolution.” (SRE Workbook – Incident Management).
Option B aligns perfectly with SRE principles: data must flow freely, ensuring everyone has access to metrics, logs, architecture details, incident context, and SLOs.
Options A, C, and D promote restricted or fragmented data practices, which are directly contrary to SRE design. SRE teams avoid information silos.
Thus, B is correct.
Which of these approaches can alleviate linear scaling toil?
Options:
A. Manual scaling of services
B. Using auto-scaling capabilities
C. Outsourcing development
D. Switching cloud providers
Answer:
B
Explanation:
Comprehensive and Detailed Explanation From Exact Extract:
Linear-scaling toil refers to work whose effort increases proportionally to service growth, such as manually provisioning servers or handling capacity expansion. The Google SRE Book, Chapter “Eliminating Toil,” explains:
“Toil is work that scales linearly with the size of your service. A core strategy for reducing toil is to introduce automation that breaks the linear relationship.”
Auto-scaling capabilities directly address linear-scaling toil by automating resource allocation based on load or demand. This prevents engineers from repeatedly and manually adjusting infrastructure as usage grows.
The SRE Workbook also emphasizes:
“Infrastructure automation such as auto-scaling removes a major source of linear scaling toil by ensuring that capacity adjusts automatically as services grow.”
Why the other options are incorrect:
A. Manual scaling is itself linear-scaling toil, not a solution.
C. Outsourcing development does not reduce operational toil.
D. Switching cloud providers alone does not solve toil unless automation is introduced.
Thus, B is the correct answer.
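As a rough illustration of how auto-scaling breaks the linear link between service growth and manual effort, the sketch below applies a proportional scaling rule of the kind used by the Kubernetes Horizontal Pod Autoscaler. The function name, the replica bounds, and the sample numbers are assumptions made for this example and are not part of the exam material.

```python
import math


def desired_replicas(current_replicas: int, current_load: float, target_load: float,
                     min_replicas: int = 1, max_replicas: int = 20) -> int:
    """Proportional scaling rule: scale the replica count by observed/target load.

    All names and default bounds are hypothetical.
    """
    if current_replicas < 1 or target_load <= 0:
        raise ValueError("current_replicas must be >= 1 and target_load must be > 0")
    raw = math.ceil(current_replicas * current_load / target_load)
    # Clamp so that a metric spike cannot scale the service without bound.
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    # 4 replicas at 90% average CPU against a 60% target.
    print(desired_replicas(4, 90.0, 60.0))
```

Because the rule runs on every evaluation cycle, capacity follows demand without an engineer resizing the service by hand each time usage grows.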
Microservices are independent services that are developed, deployed, and maintained separately.
Which of the following BEST justifies the use of this application architecture?
Options:
A. Modernizing and refactoring legacy applications
B. Modernizing the user interface of the core system
C. Creating a simple, lightweight business application
D. Building a basic product fast, as a proof of concept
Answer:
A
Explanation:
Comprehensive and Detailed Explanation From Exact Extract:
SRE supports microservices architecture because it improves reliability by reducing blast radius, allowing independent deployments, and enabling scalable autonomous teams. The SRE Book notes: “Microservices enable teams to independently iterate and improve reliability without the constraints of large monolithic systems.” (SRE Book – Distributed Systems). One of the strongest reasons to adopt microservices is modernizing and refactoring large legacy monoliths, allowing them to be broken into independently deployable, maintainable components.
Option A is therefore the best justification.
Options B, C, and D may involve architectural choices, but they do not explain why microservices are the preferred architecture for reliability and scalability.
Thus, A is correct.
An error budget policy is BEST described as being designed to do which of the following?
Options:
A. Send alerts when error budget is at half
B. Shift the locus toward more innovation
C. Decide when and how to intervene
D. Prevent introduction of significant bugs
Answer:
C
Explanation:
Comprehensive and Detailed Explanation From Exact Extract:
The SRE Workbook describes an Error Budget Policy as a formal decision-making framework that defines what actions to take when a service consumes its error budget. Specifically, Google writes: “An error budget policy establishes when and how teams must intervene, whether to pause releases, prioritize reliability work, or adjust processes.” (SRE Workbook – Error Budget Policies). The purpose is to create predictable responses to reliability degradation—not simply alerting, innovation boosting, or bug prevention.
Option C best matches the definition: deciding when and how to intervene based on error budget burn.
Option A is only an alerting rule, not a policy.
Option B is an outcome of a healthy budget, not the policy’s purpose.
Option D is too narrow and is not how error budgets are framed.
Thus, C is correct.
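To make "when and how to intervene" concrete, the sketch below maps the fraction of error budget consumed to a response. The thresholds and action strings are hypothetical; a real policy would be agreed between the product and SRE teams.

```python
def policy_action(budget_consumed: float) -> str:
    """Map the fraction of error budget consumed to a hypothetical response.

    budget_consumed: 0.0 means untouched, 1.0 means fully spent, above 1.0 means exceeded.
    The thresholds and action names are illustrative assumptions.
    """
    if budget_consumed < 0:
        raise ValueError("budget_consumed cannot be negative")
    if budget_consumed >= 1.0:
        return "freeze feature releases; prioritize reliability work"
    if budget_consumed >= 0.75:
        return "slow releases; review recent changes"
    return "release normally"


if __name__ == "__main__":
    for consumed in (0.2, 0.8, 1.1):
        print(consumed, "->", policy_action(consumed))
```

Writing the policy down as explicit rules is what distinguishes it from a simple alert: the response is decided before the budget is spent.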
When outages are repetitive and similar, they become a form of toil.
Which of the following describes the MOST compelling reason to adopt advanced technologies and artificial intelligence (AI)?
Options:
A. To increase reliability by reducing MTTR and MTRS
B. To increase the mean time to repair services (MTTR)
C. To increase the mean time to restore services (MTRS)
D. To increase reliability and achieve perfect MTRS
Answer:
A
Explanation:
Comprehensive and Detailed Explanation From Exact Extract:
SRE defines toil as “manual, repetitive, automatable, tactical work tied to running a service” (SRE Book – Eliminating Toil). Repetitive outages are specifically noted as a form of operational toil. The SRE Book and SRE Workbook emphasize adopting automation, intelligent tooling, and machine-learning–assisted systems to reduce toil and decrease Mean Time to Repair (MTTR) and Mean Time to Restore Service (MTRS). The books state: “Reducing MTTR directly increases system reliability more effectively than attempting to eliminate all failures.” (SRE Book – Chapter: Managing Incidents).
AI and advanced automation help detect issues faster, classify patterns, trigger automated remediation, and reduce human intervention—delivering reliability gains through faster repair rather than perfect uptime.
Option A is the only option aligned with SRE’s reliability philosophy.
Options B and C incorrectly suggest increasing MTTR/MTRS.
Option D refers to “perfect MTRS,” which is impossible and contradicts SRE’s acceptance of failure.
Thus, A is correct.
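The link between faster repair and higher reliability can be checked with the standard steady-state availability formula, availability = MTBF / (MTBF + MTTR). The sketch below is only a worked illustration; the function name and the sample figures are assumptions.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability from mean time between failures and mean time to repair."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("mtbf_hours must be > 0 and mttr_hours must be >= 0")
    return mtbf_hours / (mtbf_hours + mttr_hours)


if __name__ == "__main__":
    # Same failure rate (one failure per 720 hours), two different repair times.
    slow = availability(720, 4)
    fast = availability(720, 1)
    print(f"MTTR 4h: {slow:.5f}  MTTR 1h: {fast:.5f}")
```

With the failure rate held constant, shortening the repair time is enough to raise availability, which is the argument behind option A.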
Identify the missing word(s) in the following sentence:
Site reliability engineering is a _________ approach to IT operations.
Options:
A. structural engineering
B. security engineering
C. software engineering
D. simulation engineering
Answer:
C
Explanation:
Comprehensive and Detailed Explanation From Exact Extract:
Google’s SRE definition is explicit: “Site Reliability Engineering is what happens when you ask a software engineer to design an operations team.” (SRE Book – Introduction). This clearly defines SRE as a software engineering approach applied to operational problems. The goal is to use software techniques—automation, coding, testing, version control, CI/CD, observability—to improve reliability and reduce toil. The book emphasizes: “SRE applies software engineering to operations work.” (SRE Book – What Is SRE?).
Option C is the only answer fully aligned with the official definition.
Options A, B, and D do not correspond to the SRE definition provided by Google.
Thus, the correct missing phrase is software engineering.
What metrics will embracing failure help to improve?
Options:
A. Mean time to detect and mean time between system incidents
B. Change lead time and change failure rate
C. Empirical test data and mean time to recover service
D. Mean time to detect and mean time to recover
Answer:
D
Explanation:
Comprehensive and Detailed Explanation From Exact Extract:
Embracing failure—through practices such as blameless postmortems, chaos engineering, and proactive detection—enables organizations to improve their incident response performance. This directly improves:
MTTD (Mean Time to Detect)
MTTR (Mean Time to Recover)
The Site Reliability Engineering Book, chapter “Postmortem Culture,” states:
“By examining failures without blame and learning from them, organizations improve their ability to detect issues faster and recover more quickly.”
Similarly, in the SRE Workbook, section on incident response:
“Learning from incidents is essential to reducing time to detection and time to mitigation.”
Why the other options are incorrect:
A. MTBSI (Mean Time Between System Incidents) is influenced by architecture and testing, not directly by embracing failure.
B. These are DORA metrics: important, but not primarily tied to failure-embracing practices.
C. Too vague and not a standard SRE metric pair.
Thus, D is the correct answer.
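A minimal sketch of how the two metrics might be computed from incident records follows. The `Incident` record, its field names, and the sample timestamps are hypothetical, and time to recover is measured from the start of the incident, which is one common convention among several.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean


@dataclass(frozen=True)
class Incident:
    """Hypothetical incident record; field names are illustrative."""
    started: datetime
    detected: datetime
    recovered: datetime


def mttd_minutes(incidents: list[Incident]) -> float:
    """Mean time from incident start to detection, in minutes."""
    return mean([(i.detected - i.started).total_seconds() / 60 for i in incidents])


def mttr_minutes(incidents: list[Incident]) -> float:
    """Mean time from incident start to recovery, in minutes."""
    return mean([(i.recovered - i.started).total_seconds() / 60 for i in incidents])


# Invented sample data for illustration only.
SAMPLE = [
    Incident(datetime(2024, 1, 5, 10, 0), datetime(2024, 1, 5, 10, 10), datetime(2024, 1, 5, 11, 0)),
    Incident(datetime(2024, 2, 9, 14, 0), datetime(2024, 2, 9, 14, 4), datetime(2024, 2, 9, 14, 30)),
]

if __name__ == "__main__":
    print("MTTD (min):", mttd_minutes(SAMPLE))
    print("MTTR (min):", mttr_minutes(SAMPLE))
```

Tracking both numbers across postmortems shows whether the lessons learned from each failure are actually shortening detection and recovery.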
Before getting into the technical details of a Service Level Objective, what should be done?
Options:
A. Identify which tasks should be categorized as toil
B. Evaluate automation capabilities
C. Start a conversation from the customer’s point of view
D. Assess what resources would be needed to meet the Service Level Objective
Answer:
C
Explanation:
Comprehensive and Detailed Explanation From Exact Extract:
Before defining any technical details of an SLO, the SRE guidance is clear: the conversation must start from the customer’s point of view. SLOs exist to represent what reliability level users genuinely require—not internal assumptions or engineering preferences.
The SRE Workbook, Chapter “Implementing SLOs,” states:
“The process must begin by understanding what your users need from the service and what good performance actually means from the user’s perspective.”
Likewise, in the Site Reliability Engineering Book:
“SLOs capture the reliability target that makes sense for the users and the product, which is why defining them must begin with understanding the user experience.”
This means that SLO development begins with analyzing:
What users value
What reliability thresholds they notice
What failures matter to them most
Only after this understanding is established should teams discuss metrics, thresholds, SLIs, and error budgets.
Why the other options are incorrect:
A. Identify toil — Relevant to operations, not SLO creation.
B. Evaluate automation — Important for reducing toil, unrelated to initial SLO definition.
D. Assess resources — Planning happens after SLO definition, not before.
Thus, the correct answer is C.
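Once the user's perspective is understood, it can be turned into a measurable indicator. The sketch below assumes, purely for illustration, that users consider a request "good" when it succeeds within 300 ms; the `Request` record, the threshold, and the sample data are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """Hypothetical request record; field names are illustrative."""
    ok: bool
    latency_ms: float


def sli(requests: list[Request], latency_threshold_ms: float = 300.0) -> float:
    """Fraction of requests that were successful and fast enough for the user."""
    if not requests:
        raise ValueError("at least one request is required")
    good = sum(1 for r in requests if r.ok and r.latency_ms <= latency_threshold_ms)
    return good / len(requests)


def meets_slo(requests: list[Request], slo: float) -> bool:
    """Compare the measured indicator against the agreed objective."""
    return sli(requests) >= slo


# Invented sample data for illustration only.
SAMPLE = [Request(True, 120), Request(True, 450), Request(False, 80), Request(True, 200)]

if __name__ == "__main__":
    print("SLI:", sli(SAMPLE), "meets 99% SLO:", meets_slo(SAMPLE, 0.99))
```

The definition of "good" is the part that comes from the customer conversation; the arithmetic only follows from it.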
Which of the following is a principle of SRE-Led Service Automation?
Options:
A. No automated tests in production
B. Environments provisioned using IaC
C. Using unsigned artifacts in production
D. Adding as much hardware as possible
Answer:
B
Explanation:
Comprehensive and Detailed Explanation From Exact Extract:
SRE-led service automation focuses on making environments reproducible, reliable, and consistent. One of the key principles aligned with Google SRE practices is the use of Infrastructure as Code (IaC), which allows environments to be provisioned automatically, consistently, and predictably.
The Site Reliability Engineering Book, in its discussions on automation, states:
“Automation implemented as code ensures that environments are consistent, repeatable, and less prone to human error.”
The SRE Workbook expands on this concept:
“Infrastructure as Code allows services to scale and evolve reliably by ensuring that configuration and infrastructure changes are automated and version-controlled.”
IaC is fundamental to:
Reducing toil
Increasing reliability
Enabling consistent automation across environments
Reducing configuration drift
Why the other options are incorrect:
A. SRE supports testing in production; it does not ban automated tests.
C. Using unsigned artifacts violates security and reliability best practices.
D. Adding hardware is not an automation principle and contradicts efficiency goals.
Thus, the correct answer is B.
Which of the following is the MOST likely outcome when the workforce puts the “parts” before the “whole”?
Options:
A. Increased employee motivation and morale
B. Increased introversion and decreased efficiency
C. A voluntary sharing of resources and information
D. A focus on common interests and lesser conflicts
Answer:
B
Explanation:
Comprehensive and Detailed Explanation From Exact Extract:
SRE emphasizes organizational alignment and collaboration, warning against siloed thinking. The SRE Book highlights: “Local optimizations at the expense of the broader system lead to inefficiency, misalignment, and reduced reliability.” When individuals or teams focus only on their own “parts” instead of shared goals (“the whole”), it results in decreased cross-team communication, isolation, operational friction, and reduced efficiency.
Option B captures this SRE-documented outcome: increased introversion (siloing) and decreased efficiency.
Options A and D describe positive outcomes that contradict SRE principles of collaboration.
Option C implies healthy sharing, which does not result from silo-first behavior.
Thus, B is correct.
Which of the following is the BEST description of a Customer Reliability Engineer (CRE)?
Options:
A. They take a software engineering approach to redesign all cloud services
B. They use deep engineering expertise to improve the cloud provider’s services
C. They work with the cloud provider's SRE team to ship and build new features
D. They integrate with the customer’s operations team to share responsibilities
Answer:
D
Explanation:
Comprehensive and Detailed Explanation From Exact Extract:
Customer Reliability Engineering (CRE) is described in Google's SRE literature as an extension of SRE practices outward to customers who run workloads on cloud platforms. The SRE Book and the SRE Workbook state: “CRE is the practice of sharing SRE principles with customers, working closely with their operations teams, and establishing shared responsibility for reliability.” (SRE Workbook – Chapter: Customer Reliability Engineering). A CRE team collaborates directly with customer engineering and operations teams to identify reliability risks, review architectures, and co-manage SLOs, but does not redesign cloud services or build new features.
Option D matches the exact intention: CRE integrates with the customer’s operations team to share reliability responsibilities, applying SRE methods to customer systems and ensuring both customer and provider work jointly on reliability goals.
Option A is incorrect—CRE does not redesign cloud services.
Option B misinterprets CRE as improving the provider’s internal systems.
Option C incorrectly focuses on feature shipping; CRE is about reliability guidance, not feature delivery.
Thus, D is the correct and SRE-authentic answer.
What is the primary difference between SRE and DevOps?
Options:
A. SRE is an implementation of DevOps but focuses mostly on post-production responsibilities
B. DevOps is mostly for software engineers and SRE is mostly for infrastructure engineers
C. DevOps encourages closer collaboration between development and operations whereas SRE is about building a silo around production operations
D. DevOps and SRE are the same thing
Answer:
A
Explanation:
Comprehensive and Detailed Explanation From Exact Extract:
The primary difference between SRE and DevOps lies in their implementation focus and origins, though they share similar objectives. According to Google’s official SRE documentation:
“SRE can be seen as a specific implementation of DevOps with some idiosyncratic extensions.”
— Site Reliability Engineering Book, Chapter: What is Site Reliability Engineering?
While DevOps is a broad cultural and organizational philosophy aimed at closing the gap between development and operations through collaboration and automation, SRE provides a concrete, engineering-driven approach to achieving those goals — particularly through practices like error budgets, SLIs/SLOs, toil reduction, and incident response.
SRE focuses heavily on the post-production lifecycle — including reliability, monitoring, capacity planning, and incident response — whereas DevOps includes these concerns but emphasizes the entire software delivery lifecycle. Hence, Option A is the correct and most accurate answer.
Options B, C, and D are incorrect:
B wrongly implies a division of roles (DevOps = developers, SRE = infrastructure), which is not how these frameworks operate.
C misrepresents SRE — it does not build silos but instead emphasizes shared responsibility and transparency in production systems.
D is incorrect because, while aligned, SRE and DevOps are not identical.
An organization is experiencing significant turnover of IT operational staff with most not staying more than one year. The HR Director and IT Director are trying to determine why they are having difficulty retaining IT operations professionals.
What could be one of the reasons?
Options:
A. Overload and disruptive work patterns
B. Lack of time for skills development
C. More time spent managing the backlog than fixing problems
D. All of the above
Answer:
D
Explanation:
Comprehensive and Detailed Explanation From Exact Extract:
High turnover in IT operations roles is often driven by a combination of factors, not just one. The Google SRE Book, Chapter “Eliminating Toil,” outlines that excessive toil, unpredictable work, and overload contribute to burnout and churn:
“Excessive operational workload and interrupt-driven work lead to burnout and high attrition among engineering and operational staff.”
The SRE Workbook adds:
“Teams overwhelmed with toil struggle to innovate, automate, or develop new skills, creating frustration and increasing turnover.”
Each option listed represents a recognized driver of burnout in SRE and operations environments:
Overload and disruptive work patterns are known contributors to burnout.
Lack of time for skills development demotivates engineers and prevents career growth.
Backlog-driven cultures force teams into reactive rather than proactive work.
The combination of these factors matches common causes of attrition in operations teams. Therefore, all of the above is the correct answer.
Which of the following BEST describes the engineering side of SRE?
Options:
A. Applying network and infrastructure development best practices for stable operations and good reliability
B. Applying network design and deployment best practices to achieve operational performance targets
C. Applying infrastructure engineering principles to build and maintain the stable delivery of operational services
D. Applying software development best practices to solving operational problems and automating solutions
Answer:
D
Explanation:
Comprehensive and Detailed Explanation From Exact Extract:
The foundational definition of SRE, as stated in Google’s SRE Book, is that SRE uses software engineering as its primary tool to solve operational problems: “SRE is fundamentally doing operations work using software engineering approaches.” (SRE Book – What Is SRE?). This includes building automation, writing tools, creating pipelines, and eliminating manual work. The “engineering side” focuses specifically on applying coding practices, testing, CI/CD, version control, and automation frameworks to operational domains such as deployment, monitoring, incident response, and capacity planning.
Option D captures this precisely: using software engineering best practices to solve operational issues and drive automation.
Options A, B, and C focus too narrowly on network or infrastructure engineering. While these can be components of SRE, they do not describe its engineering foundation as Google defines it.
Thus, D is the correct answer.
In a blameless post-mortem, those involved report:
Options:
A. Without fear of retribution
B. Assumptions they had made
C. Both A and B
D. Using testing data
Answer:
C
Explanation:
Comprehensive and Detailed Explanation From Exact Extract:
A blameless post-mortem is a foundational SRE practice that encourages truthful, detailed reporting after an incident. The purpose is to learn, not punish. Google SRE emphasizes that engineers must feel psychologically safe to report what they did, what they assumed, and why they made those decisions.
From the Site Reliability Engineering Book, Chapter “Postmortem Culture”:
“Blameless postmortems encourage engineers to share the full details of their actions and assumptions without fear of punishment, enabling learning and preventing repeated failures.”
The book further states:
“Understanding the assumptions made during an incident is critical to uncovering systemic issues.”
Thus:
Engineers must report without fear of retribution
They must report assumptions and decisions made during the incident
Therefore, the correct answer is C: both A and B.
Why the other options are insufficient:
A. Only partially correct.
B. Only partially correct.
D. Testing data may be included, but it is not the defining feature of blameless postmortems.
Which of the following is the definition for Application Performance Management (APM)?
Options:
A. The highly automated communications process by which measurements are made and other data collected at remote or inaccessible points and transmitted to receiving equipment for monitoring
B. The monitoring and management of performance and availability of software applications
C. The use of a hardware or software component to monitor system resources and performance of a computer system
D. Ways for engineers to communicate quantitative data about systems
Answer:
B
Explanation:
Comprehensive and Detailed Explanation From Exact Extract:
Application Performance Management (APM) refers to a set of tools and practices used to monitor and manage the performance, behavior, and availability of software applications. Although APM is not defined exclusively in the Google SRE Book, it is described within the broader context of monitoring and observability.
In the SRE Workbook, under Monitoring:
“Application monitoring tools provide insights into the performance, latency, availability, and behavior of applications to help engineering teams maintain reliability.”
Industry-standard APM frameworks (including Google Cloud Operations Suite, formerly Stackdriver) define APM as:
“The monitoring and management of application performance and availability.”
Why the other options are incorrect:
A describes telemetry, not APM.
C describes system monitoring (infrastructure), not application performance monitoring.
D refers to communication of metrics, not the monitoring of application performance.
Therefore, B is the correct definition.
Which of the following BEST describes the two key elements that an error budget balances?
Options:
A. Risk and reward
B. Innovation and reliability
C. Features and benefits
D. Time and money
Answer:
B
Explanation:
Comprehensive and Detailed Explanation From Exact Extract:
Error budgets represent the allowable amount of unreliability in a system. Google defines the purpose of error budgets as: “balancing the pace of innovation with the need for reliability.” (SRE Book – Service Level Objectives). When the error budget is healthy, product teams can release features quickly; when it is exhausted, reliability work takes priority. This balance prevents over-investment in reliability and enables safe innovation.
Option B—innovation and reliability—is the exact phrasing used in Google’s SRE literature.
Options A, C, and D do not reflect the core purpose of error budgets.
Thus, B is the correct answer.
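The "allowable amount of unreliability" can be worked out directly from an availability SLO. The sketch below assumes a time-based budget over a 30-day window; the function names and the sample figures are illustrative only.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of unavailability permitted by an availability SLO over the window."""
    if not 0 < slo < 1:
        raise ValueError("slo must be strictly between 0 and 1")
    return (1 - slo) * window_days * 24 * 60


def budget_consumed(bad_minutes: float, slo: float, window_days: int = 30) -> float:
    """Fraction of the error budget used; values above 1.0 mean it was exceeded."""
    return bad_minutes / error_budget_minutes(slo, window_days)


if __name__ == "__main__":
    # A 99.9% SLO over 30 days allows roughly 43.2 minutes of unavailability.
    print(f"budget: {error_budget_minutes(0.999):.1f} min")
    print(f"consumed: {budget_consumed(47.52, 0.999):.2f}")
```

While the consumed fraction stays below 1.0 there is room to ship changes; once it passes 1.0, reliability work takes priority.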
A team has exceeded their error budget by 10% in a particular month.
Give an example of what should happen next as a consequence.
Options:
A. Sprint planning may only pull post-mortem action items from the backlog
B. The Error Budget is reviewed to determine if it was realistic for the product or timeline
C. The Error Budget is extended for another month to determine if this breach was an anomaly
D. The error budget is ignored in subsequent months as it is creating the wrong kind of behavior
Answer:
A
Explanation:
Comprehensive and Detailed Explanation From Exact Extract:
When a team exceeds its error budget, SRE practice requires applying error budget policies that restrict feature releases and shift focus toward reliability improvement. The idea is to prevent further degradation of user experience and ensure the service meets the agreed reliability targets.
The Site Reliability Engineering Book, Chapter “Service Level Objectives,” states:
“If the service exceeds its error budget, all new feature launches or risky changes are halted until reliability returns to acceptable levels. Engineering work should be directed toward addressing the causes of the budget overrun.”
This aligns with option A, which describes a reliability-focused response during sprint planning. Limiting sprint planning to post-mortem action items and reliability improvements is a direct application of error budget policies.
Additional guidance from the SRE Workbook:
“Error budget burn should directly influence decision-making. When the budget is exhausted, the team must focus on remediation work rather than new features.”
Why the other options are incorrect:
B. Reviewing the error budget’s realism can be done periodically, but it is not the immediate consequence of a breach.
C. Extending the error budget invalidates its purpose and is discouraged.
D. Ignoring the error budget contradicts the entire SRE model and Google’s official guidance.
Therefore, A is the only correct answer.
Which of the following BEST describes an advantage of a container-based structure?
Options:
A. The portability created by containers enables software to run independently of the host operating system
B. The lightweight nature of containers requires fewer developers to actually create the software code
C. Software runs much more efficiently in containers because of the ability to run on virtual machines
D. The security of applications in containers is simplified because they share the security of the host system
Answer:
A
Explanation:
Comprehensive and Detailed Explanation From Exact Extract:
Containers provide a major advantage that aligns with SRE: portability and environment consistency. The SRE Workbook describes containers as: “lightweight, portable units that encapsulate applications and dependencies, ensuring consistent behavior across environments.” This independence from the host OS environment enables predictable deployments and simplifies automation, scaling, and orchestration—especially when used with Kubernetes.
Option A captures this exact benefit: portability and independence from the host OS.
Option B is incorrect—containers do not reduce the number of developers required.
Option C incorrectly claims that efficiency comes from virtual machines; containers are typically more efficient because they avoid VM overhead rather than relying on it.
Option D is incorrect—containers do not “inherit” security automatically; in fact, they require additional security controls.
Thus, A is the correct answer.
What is the goal of SRE?
Options:
A. To spend 50% of an SRE's time on operational tasks and 50% of the time on development tasks to reduce toil
B. To ensure that Service Level Objectives are consistently met through monitoring and observability
C. To create highly reliable post-deployment operational systems that align with DevOps and Agile
D. To create ultra-scalable and highly reliable distributed software systems
Answer:
D
Explanation:
Comprehensive and Detailed Explanation From Exact Extract:
The goal of Site Reliability Engineering (SRE) is to create ultra-scalable and highly reliable distributed software systems. This principle is clearly articulated in the foundational text of SRE, the Google Site Reliability Engineering book.
From Chapter 1: Introduction of the Site Reliability Engineering book:
"SRE is what happens when you ask a software engineer to design an operations team. Our approach to service management is rooted in our belief that engineering work to create scalable and highly reliable systems is critical to the success of modern software."
— Site Reliability Engineering Book, Chapter 1
This statement establishes that building and maintaining scalable, reliable systems is the core mission of SRE. While concepts like reducing toil (option A), implementing SLOs (option B), and aligning with DevOps (option C) are vital components of the SRE practice, they support the overarching goal — which is option D.
Therefore, the correct answer is D: To create ultra-scalable and highly reliable distributed software systems.
Where should an organization store versioned and signed artifacts that are used to deploy system components?
Options:
A. In the Configuration Management System (CMS)
B. In a Subversion source code repository
C. In a Definitive Media Library (DML)
D. In a secure artifact repository
Answer:
D
Explanation:
Comprehensive and Detailed Explanation From Exact Extract:
SRE and modern DevOps best practices require that build artifacts—such as binaries, container images, and deployment packages—be stored in a secure, versioned artifact repository. These repositories ensure integrity, traceability, immutability, and security of deployment packages.
While the SRE Book does not use the ITIL term DML, it emphasizes:
“All production binaries should be stored in a secure, versioned repository to ensure consistent, repeatable, and trustworthy deployments.”
— Site Reliability Engineering Book, section on Release Engineering
The SRE Workbook expands on this principle by emphasizing signed and verified artifacts:
“To ensure safe rollout, artifacts must be built once, stored securely, signed, versioned, and deployed from a controlled artifact repository.”
Why the other options are incorrect:
A. A CMS manages configuration, not deployment artifacts.
B. Subversion is a source code repository, not an artifact repository.
C. A DML is an ITIL concept, but SRE practice does not rely on it; instead, SRE uses modern artifact repositories (e.g., GCR, ACR, Artifactory).
Thus, the correct answer is D.
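One integrity check that an artifact repository makes possible is comparing a downloaded artifact against the digest recorded when it was built. The sketch below uses a plain SHA-256 digest as a simplified stand-in; it is not signature verification, which would additionally require a key and a signing tool. The function names and sample bytes are hypothetical.

```python
import hashlib
import hmac


def sha256_hex(data: bytes) -> str:
    """Hex-encoded SHA-256 digest of the artifact bytes."""
    return hashlib.sha256(data).hexdigest()


def matches_recorded_digest(artifact: bytes, recorded_digest: str) -> bool:
    """True if the artifact's digest equals the digest stored at build time."""
    # compare_digest avoids leaking information through comparison timing.
    return hmac.compare_digest(sha256_hex(artifact), recorded_digest.lower())


if __name__ == "__main__":
    built = b"release-1.2.3"          # stand-in for the artifact produced by the build
    recorded = sha256_hex(built)      # digest stored alongside the artifact
    print(matches_recorded_digest(built, recorded))
    print(matches_recorded_digest(b"release-1.2.4", recorded))
```

A deployment step that refuses artifacts whose digest does not match is one way to enforce the "built once, deployed from a controlled repository" idea described above.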
What does the term "wisdom of production" mean?
Options:
A. Taking an engineering-based approach to problems rather than just toiling at them repeatedly
B. The wisdom gained from something running in production
C. Monitoring and alert notifications from staging environments
D. If a task can be automated then it should be automated
Answer:
B
Explanation:
Comprehensive and Detailed Explanation From Exact Extract:
The term “wisdom of production” refers to the insights gained from real systems running under actual production conditions. Only production environments exhibit real user behavior, real workloads, true performance characteristics, and authentic failure modes. This concept is rooted in the SRE philosophy that production is the ultimate source of truth for understanding system behavior.
From the SRE Workbook, Chapter “Monitoring”:
“Only production provides the full truth about how a system behaves under real workloads. Production is the ultimate source of wisdom about the system.”
This makes clear that wisdom gained from production is indispensable. Testing and staging environments cannot reproduce all real-world variables, usage patterns, and failure pathways.
Why the other options are incorrect:
A describes engineering approaches but does not define “wisdom of production.”
C is incorrect because staging environments do not provide production wisdom.
D relates to automation strategy, not production insights.
Thus, the accurate meaning of the term is B — The wisdom gained from something running in production.