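Your company is migrating its production systems to Google Cloud. To minimize the customer impact of any future incidents, you need to implement Site Reliability Engineering (SRE) practices during the migration. Which two SRE practices should you implement?
Choose two answers.
Correct answer: B, E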
Explanation (based on general SRE principles and Google Cloud knowledge):
Site Reliability Engineering (SRE) emphasizes reliability, automation, and a data-driven approach to operations. For incident handling, the goal is to minimize both the time to detect (TTD) and the time to resolve (TTR).
Option A (Ensure that full autonomy and permissions are only granted to the on-call team): The on-call team does need enough permissions to act decisively during an incident. However, granting full autonomy, and granting it only to that team, creates a bottleneck and conflicts with the principle of least privilege unless it is carefully scoped. Other teams may need specific, controlled access for their own responsibilities. SRE encourages empowering teams, but within a structured framework.
Option B (Automate common tasks to analyze key impact information and intelligently suggest mitigating actions for the on-call team): This is a core SRE practice. Automation reduces toil, speeds up response, and ensures consistency. Analyzing impact and suggesting mitigations helps the on-call team resolve issues faster and more effectively.
Option C (Ensure that all teams can modify the production environment to resolve issues): This is generally a bad practice and against SRE principles of controlled changes and reducing the blast radius of errors. Production changes should be managed, audited, and ideally automated, not open to modification by all teams, as this increases the risk of unintended incidents.
Option D (Create an alerting mechanism for your SRE team based on your system's internal behavior): Alerting is crucial, but SRE emphasizes alerting on symptoms that affect users, measured against Service Level Objectives (SLOs), rather than on internal behavior or causes alone. Alerting solely on internal behavior can lead to alert fatigue and may not correlate with user impact. Good alerting puts user-facing impact first.
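To make the contrast in Option D concrete, the following is a minimal sketch of a symptom-based paging decision driven by an SLO rather than by internal metrics. The names `Window`, `burn_rate`, and `should_page`, the 99.9% target, and the threshold of 10 are all hypothetical values chosen for illustration; they are not taken from the question or from any Google Cloud API.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Request counts observed at the user-facing edge over one time window."""
    total_requests: int
    failed_requests: int


def burn_rate(window: Window, slo_target: float) -> float:
    """Return how fast the error budget is being consumed.

    A value of 1.0 means errors arrive at exactly the rate the SLO allows;
    higher values mean the budget will be exhausted early.
    """
    if window.total_requests == 0:
        return 0.0  # no traffic, so no user-visible errors to alert on
    error_ratio = window.failed_requests / window.total_requests
    error_budget = 1.0 - slo_target
    return error_ratio / error_budget


def should_page(window: Window, slo_target: float = 0.999,
                threshold: float = 10.0) -> bool:
    """Page only when user-facing errors burn the budget faster than the
    (hypothetical) threshold, regardless of internal signals such as CPU."""
    return burn_rate(window, slo_target) >= threshold
```

With these assumed numbers, 50 failures in 10,000 requests burns the budget at roughly 5x and does not page, while 200 failures burns it at roughly 20x and does.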
Option E (Create up-to-date playbooks with instructions for debugging and mitigating issues): Playbooks (or runbooks) are essential in SRE. They document known issues, troubleshooting steps, and mitigation procedures. Keeping them up-to-date ensures that on-call engineers can respond to incidents quickly and consistently, even for less common issues, thereby minimizing customer impact.
Therefore, automating incident response tasks (B) and maintaining clear, actionable playbooks (E) are two key SRE practices to implement for minimizing customer impact.
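As a rough illustration of how options B and E can work together, the sketch below looks up a playbook for an observed symptom and reorders its mitigation steps using one piece of impact information. Everything here is an assumption made for the example: the `Playbook` type, the `PLAYBOOKS` entries, the symptom keys, and the `suggest_mitigations` helper are hypothetical and do not correspond to any real tool.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Playbook:
    """A documented response for one known symptom (hypothetical structure)."""
    title: str
    mitigations: List[str] = field(default_factory=list)


# Hypothetical playbook catalogue; a real one would live in version control
# and be reviewed after every incident so it stays up to date.
PLAYBOOKS: Dict[str, Playbook] = {
    "high_error_rate": Playbook(
        title="Elevated 5xx rate",
        mitigations=[
            "Shift traffic to a healthy region",
            "Roll back the most recent release",
        ],
    ),
    "high_latency": Playbook(
        title="Latency above SLO",
        mitigations=["Scale out the serving tier"],
    ),
}


def suggest_mitigations(symptom: str, recent_deploy: bool) -> List[str]:
    """Suggest ordered mitigation steps for the on-call engineer."""
    playbook = PLAYBOOKS.get(symptom)
    if playbook is None:
        return ["No playbook found: escalate, then write one after the incident"]
    steps = list(playbook.mitigations)
    if recent_deploy:
        # A recent deploy makes rollback the most likely quick fix, so move
        # rollback steps to the front (stable sort keeps the rest in order).
        steps.sort(key=lambda step: "roll back" not in step.lower())
    return steps
```

The automation only suggests actions; the on-call engineer still decides, which keeps the human in control while cutting the time spent searching for the right procedure.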
References (based on SRE principles):
The SRE books by Google (e.g., "Site Reliability Engineering: How Google Runs Production Systems") heavily emphasize automation to reduce toil and the importance of playbooks for incident management.
Google Cloud SRE solutions: https://cloud.google.com/sre
Specifically, regarding playbooks and automation: "Playbooks should be living documents, updated regularly as systems change and new incidents provide new lessons."
"SREs aim to automate repetitive tasks (toil) to free up time for engineering projects that improve reliability."