Author: Shea Stewart


The deployment planning is complete, and the installation is done. Your private OpenShift (Origin or Enterprise) cluster is up and running! Now, what about day 2 operations?

For ongoing management of the cluster, we often focus on:

  • Compliance and Planning
    • Reporting, capacity planning, security vulnerability scanning & workflow/request management
  • Performance Monitoring and Alerting
    • Real-time low-level host monitoring, alerting, and troubleshooting tools

For anyone who has deployed ManageIQ or CloudForms, the compliance and planning piece is well handled (see this demo). If you are running your OpenShift cluster on top of a stack that supports SmartState Analysis (i.e. RHEV, Hyper-V, OpenStack, VMware, Azure), then you can tie in some of the performance monitoring elements of each VM as well. This solution is agentless and leverages API queries and scheduled scans to perform its tasks.

When allowing production workloads to run on your cluster, you will likely need a separate management platform to handle performance monitoring and alerting. In this blog, we will explore Sysdig Cloud, a SaaS monitoring solution that focuses on container infrastructure. It is simple to deploy using the Sysdig OpenShift instructions, and within minutes you can have deep insight into the performance of your entire cluster. Here is a list of Sysdig Cloud features.

The Setup

The setup was simple when following the instructions mentioned above. Once the application has been deployed, you will see one container running on each of your OpenShift nodes, ready to collect and send data to Sysdig Cloud.
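One quick sanity check after deployment is to confirm that every node really does host exactly one agent container. The Python sketch below assumes you have already gathered the node names and the agent placements by some other means; the function name, the input shapes, and the sample host names are hypothetical and are not part of Sysdig or OpenShift.

```python
from collections import Counter


def agent_coverage(node_names, agent_pod_nodes):
    """Compare cluster nodes against the nodes hosting an agent container.

    node_names: iterable of node names in the cluster (hypothetical input).
    agent_pod_nodes: iterable with one entry per running agent container,
        naming the node it is scheduled on (hypothetical input).
    Returns (missing, duplicated) as sorted lists of node names.
    """
    per_node = Counter(agent_pod_nodes)
    missing = sorted(n for n in node_names if per_node[n] == 0)
    duplicated = sorted(n for n in node_names if per_node[n] > 1)
    return missing, duplicated


if __name__ == "__main__":
    # Illustrative host names only.
    nodes = ["master-1", "node-1", "node-2"]
    agents = ["master-1", "node-1"]
    missing, duplicated = agent_coverage(nodes, agents)
    print("nodes without an agent:", missing)
    print("nodes with more than one agent:", duplicated)
```

An empty result for both lists matches the "one container per node" state described above.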

Logging into Sysdig Cloud

Once logged into Sysdig Cloud (using the 14-day free trial), we can see the beta Overview dashboard. This page provides a high-level status that allows us to jump to the Explore page for any element that looks out of place.

Explore

Looking under Explore, we can dig much deeper into the collected elements and metrics.

Here we can select the individual process and review the detailed utilization in tabular or visual format.

We also have immediate access to remote SSH and sysdig capture tools (seen in the top right of the screen), and we can create a specific alert for this element from the same screen.

Another view shows the statistics specific to a container, with a detailed per-container breakdown for the entire host.

We can explore the file system breakdown of each host or container.

We can also see the specific file I/O details for each process. Below, you can see that my Elasticsearch container is performing the majority of the writes on this host (325 write IOPS).

The metrics tab also allows us to lay out time series charts with overlays of events. Hovering over the event, we can see the specific details which may assist in troubleshooting an issue.

Hovering over a specific element will reveal additional details.

The Sysdig team has also added detailed explanations of each collected metric.

Navigating further down the left-hand side of this page, a Topology view is available to illustrate how each component is connected. In this example, we can view the response times of network requests. In addition to response times, we can view the CPU utilization and network traffic from a given element to its connected peers.

Dashboards

Moving along the top of the page, we can select and configure specialized dashboards. In this instance, we have simply used the standard Kubernetes dashboard. Dashboards can be customized, saved, and shared.

The layout is also very flexible, allowing us to view more charts and change the layout of the page if desired.

Events

As we continue to the right, we end up on the Events page, where we find all collected events. From here, you can select an event and quickly jump to additional details with all related elements.

Alerts

What good would all this data be if we had to stare at the dashboard all day? So, we move to the robust alerting page where we can create very granular alerts. For example, we could create an alert that monitors container restarts in our default project, or another critical application.
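To make the container-restart example concrete, here is a minimal Python sketch of the kind of rule such an alert expresses. It is not Sysdig Cloud's alert engine or API; the `RestartSample` shape, the `restart_alerts` function, and the window and threshold parameters are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class RestartSample:
    """One observation of a container's cumulative restart count (hypothetical shape)."""
    timestamp: float    # seconds since epoch
    project: str
    container: str
    restart_count: int  # cumulative restarts reported at this time


def restart_alerts(samples, project, window_seconds, threshold):
    """Return containers in `project` whose restart count grew by at least
    `threshold` within the last `window_seconds` of the sampled data."""
    in_project = [s for s in samples if s.project == project]
    if not in_project:
        return []
    newest = max(s.timestamp for s in in_project)
    cutoff = newest - window_seconds

    first_seen = {}
    last_seen = {}
    for s in sorted(in_project, key=lambda s: s.timestamp):
        if s.timestamp < cutoff:
            continue  # sample is older than the evaluation window
        first_seen.setdefault(s.container, s.restart_count)
        last_seen[s.container] = s.restart_count

    return sorted(
        name for name in last_seen
        if last_seen[name] - first_seen[name] >= threshold
    )
```

Scoping the rule to a single project, as in the `default` project example above, keeps noisy workloads elsewhere from triggering it.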

Another very helpful feature is that we can trigger a sysdig capture when an alert fires, shipping it to Sysdig Cloud (or our own S3 bucket) for further analysis.

Storage for Sysdig Capture Files

You can also point the sysdig capture destination at your own S3 bucket instead of the Sysdig cloud service, enabling you to perform additional analysis against the raw data if required.

Conclusion

Sysdig Cloud provides a rapid and cost-efficient way to gain deep insight into the performance of an OpenShift cluster. With access to this data, OpenShift operators can easily understand the health of the cluster, enable customized alerting, and have instant access to troubleshooting tools to help diagnose underlying problems. Day 2 operations of an OpenShift cluster are difficult without this level of visibility, and become much easier when the cluster is combined with Sysdig Cloud and ManageIQ/CloudForms. It’s also worth mentioning that this solution can be purchased as a SaaS offering or as an on-prem private deployment.
