CN310: Advanced Docker Enterprise Troubleshooting

Take control of day-2 operations and troubleshooting for the Docker Enterprise platform.

Description

In this support- and SRE-focused course, you’ll learn broadly applicable techniques for diagnosing platform and application failures in Docker Enterprise. We’ll cover first-response strategies for Swarm and Kubernetes applications, look at identifying and avoiding common cluster failure modes, and practice troubleshooting and disaster recovery actions for UCP and DTR. This course is intended to help experienced Docker Enterprise operators self-serve a wide range of support needs, reducing time to resolution and expediting the handling of support service requests.

Who Should Attend

This course is intended for students with the following motivations and roles:

  • Motivations: Provide support and day-2 operations for production-grade Docker Enterprise clusters hosting mission-critical applications.
  • Roles: SREs, support teams, or operators managing Docker Enterprise.

Lab Requirements

  • Laptop with WiFi connectivity
  • The latest version of Chrome or Firefox installed, and a free account at strigo.io

Course Objectives

  • Containerized application diagnostic strategies
    • Containerization tooling audits and tracing
    • Workload tracing and troubleshooting
    • Network tracing
    • Severity triaging & identifying genuine problems
  • Logging & Monitoring Strategies
    • Sources of platform and application data
    • Manipulating and ingesting container logging data
  • Docker Enterprise Documentation
    • Navigating the docs
    • Finding usage, troubleshooting and best practice documentation
  • UCP Support Dumps
    • Generating support dumps automatically and manually
    • Interpreting the contents of support dumps
  • Troubleshooting Resource Problems
    • Detecting memory, CPU and I/O constraints
    • Mitigating resource overconsumption
  • Troubleshooting Networking Problems
    • Review of Swarm networking implementation
    • Common Swarm networking problems and mitigations
    • UCP networking requirements, failures and mitigations
    • Swarm and Kubernetes DNS troubleshooting
  • Troubleshooting UCP
    • Correlating UCP errors with UCP components and logs
    • Investigating state reconciliation failures with etcd and RethinkDB
  • Troubleshooting DTR
    • Correlating DTR errors with DTR components and logs
    • DTR resourcing and sizing for mitigating poor performance
    • Auditing DTR job logs and activity monitors
    • Automated DTR disaster recovery
  • Disaster Recovery
    • Backing up Swarm, UCP and DTR
    • Restoring from backups
