Infrastructure Monitoring Tutorial
Overview
Infrastructure monitoring is essential for maintaining system reliability, performance, and availability. This tutorial covers modern monitoring strategies, tools, and best practices for comprehensive infrastructure observability.
Core Monitoring Concepts
Metrics, Logs, and Traces
Metrics are numerical measurements collected over time intervals:
- System metrics: CPU usage, memory consumption, disk I/O
- Application metrics: request rates, response times, error counts
- Business metrics: user signups, revenue, conversion rates
Logs provide detailed event records:
- Application logs: debug, info, warn, error messages
- System logs: kernel messages, security events
- Access logs: web server requests, API calls
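Log aggregation is far easier when applications emit structured (JSON) logs rather than free-form text. As a minimal sketch using only Python's standard `logging` module (the `JsonFormatter` class and field names here are illustrative, not a standard API):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""
    def format(self, record):
        return json.dumps({
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

def make_logger(name="app"):
    # Attach the JSON formatter to a stream handler
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

One JSON object per line is the shape most log shippers (Filebeat, Fluentd) expect, so downstream parsing needs no grok patterns.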
Traces track requests across distributed systems:
- Request flow through microservices
- Performance bottleneck identification
- Error propagation analysis
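The core idea behind tracing, a shared trace ID that follows a request through every service, can be sketched in plain Python. The span fields and service names below are illustrative only; production systems use a standard such as OpenTelemetry:

```python
import time
import uuid

def new_trace_id():
    """Generate a unique ID shared by all spans in one request."""
    return uuid.uuid4().hex

def record_span(spans, trace_id, name, parent=None):
    """Record one span; a real tracer would export this to a backend."""
    span = {"trace_id": trace_id, "name": name,
            "parent": parent, "start": time.time()}
    spans.append(span)
    return span

def handle_request(spans):
    """Simulate a request flowing through a gateway and two services."""
    trace_id = new_trace_id()
    record_span(spans, trace_id, "gateway")
    record_span(spans, trace_id, "auth-service", parent="gateway")
    record_span(spans, trace_id, "order-service", parent="gateway")
    return trace_id
```

Because every span carries the same `trace_id`, a backend can reassemble the full request tree and show where time was spent or where an error originated.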
The Four Golden Signals
- Latency: Response time for requests
- Traffic: Demand on your system (requests per second)
- Errors: Rate of failed requests
- Saturation: Resource utilization levels
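Latency, traffic, and errors can all be derived from per-request samples; saturation comes from separate resource metrics. As a hedged sketch, assuming each sample is a `(latency_seconds, status_code)` pair collected over one time window:

```python
def golden_signals(requests, window_seconds):
    """Compute latency (p99), traffic, and error rate from raw
    (latency_seconds, status_code) samples in one window."""
    latencies = sorted(r[0] for r in requests)
    errors = sum(1 for r in requests if r[1] >= 500)
    # Nearest-rank 99th percentile (clamped for tiny samples)
    p99_index = max(0, int(len(latencies) * 0.99) - 1)
    return {
        "latency_p99_s": latencies[p99_index],
        "traffic_rps": len(requests) / window_seconds,
        "error_rate": errors / len(requests),
    }
```

In practice these values come pre-aggregated from a metrics system rather than raw samples, but the definitions are the same.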
Monitoring Tools and Platforms
Prometheus and Grafana
Prometheus is a time-series database and monitoring system:
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'node_exporter'
    static_configs:
      - targets: ['localhost:9100']

  - job_name: 'application'
    static_configs:
      - targets: ['localhost:8080']
    metrics_path: /metrics
Grafana provides visualization and dashboards:
{
  "dashboard": {
    "title": "System Overview",
    "panels": [
      {
        "title": "CPU Usage",
        "type": "graph",
        "targets": [
          {
            "expr": "100 - (avg(irate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)"
          }
        ]
      }
    ]
  }
}
Application Performance Monitoring (APM)
New Relic Agent Setup:
// newrelic.js
'use strict';

exports.config = {
  app_name: ['My Application'],
  license_key: 'your_license_key',
  logging: {
    level: 'info'
  }
};
Datadog Integration:
from datadog import initialize, statsd
initialize(api_key='your_api_key', app_key='your_app_key')
# Custom metrics
statsd.increment('web.requests', tags=['status:200'])
statsd.histogram('web.response_time', 0.25, tags=['endpoint:api'])
Infrastructure as Code Monitoring
Terraform State Monitoring
# monitoring.tf
resource "aws_cloudwatch_metric_alarm" "high_cpu" {
  alarm_name          = "high-cpu-usage"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  metric_name         = "CPUUtilization"
  namespace           = "AWS/EC2"
  period              = 120
  statistic           = "Average"
  threshold           = 80
  alarm_description   = "This metric monitors EC2 CPU utilization"

  dimensions = {
    InstanceId = aws_instance.web.id
  }
}
Kubernetes Monitoring
# kube-state-metrics.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kube-state-metrics
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kube-state-metrics
  template:
    metadata:
      labels:
        app: kube-state-metrics
    spec:
      containers:
        - name: kube-state-metrics
          # registry.k8s.io replaced the now-frozen k8s.gcr.io registry
          image: registry.k8s.io/kube-state-metrics/kube-state-metrics:v2.6.0
          ports:
            - containerPort: 8080
              name: http-metrics
            - containerPort: 8081
              name: telemetry
Alerting Strategies
Alert Rules in Prometheus
# alerts.yml
groups:
  - name: system_alerts
    rules:
      - alert: HighMemoryUsage
        expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) < 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage detected"
          description: "Memory usage is above 90% for more than 5 minutes"

      - alert: DiskSpaceRunningLow
        expr: node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} < 0.2
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Disk space running low"
          description: "Less than 20% disk space remaining"
PagerDuty Integration
import pypd

pypd.api_key = "your_api_key"

# Create an incident via the PagerDuty REST API; the API requires a
# "From" header identifying a valid PagerDuty user email
service = pypd.Service.find_one(name='Web Service')
incident = pypd.Incident.create(
    data={
        'type': 'incident',
        'title': 'High CPU Usage Alert',
        'service': {'id': service['id'], 'type': 'service_reference'},
        'urgency': 'high',
        'body': {
            'type': 'incident_body',
            'details': 'CPU usage exceeded 90% threshold',
        },
    },
    add_headers={'from': 'oncall@example.com'},
)
Log Aggregation and Analysis
ELK Stack Setup
Elasticsearch Configuration:
# elasticsearch.yml
cluster.name: logging-cluster
node.name: node-1
path.data: /var/lib/elasticsearch
path.logs: /var/log/elasticsearch
network.host: 0.0.0.0
http.port: 9200
discovery.type: single-node
Logstash Pipeline:
# logstash.conf
input {
  beats {
    port => 5044
  }
}

filter {
  if [fields][service] == "web" {
    grok {
      match => { "message" => "%{COMBINEDAPACHELOG}" }
    }
    date {
      match => [ "timestamp", "dd/MMM/yyyy:HH:mm:ss Z" ]
    }
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "logs-%{+YYYY.MM.dd}"
  }
}
Kibana Dashboard:
{
  "version": "7.15.0",
  "objects": [
    {
      "type": "dashboard",
      "attributes": {
        "title": "Application Logs",
        "panelsJSON": "[{\"type\":\"search\",\"query\":{\"match\":{\"level\":\"ERROR\"}}}]"
      }
    }
  ]
}
Fluentd for Log Collection
# fluent.conf
<source>
  @type tail
  path /var/log/app/*.log
  pos_file /var/log/fluentd/app.log.pos
  tag app.logs
  format json
</source>

<match app.logs>
  @type elasticsearch
  host elasticsearch.local
  port 9200
  index_name app-logs
  type_name logs
</match>
Synthetic Monitoring
Uptime Monitoring
// synthetic-monitor.js
const puppeteer = require('puppeteer');

async function checkWebsite() {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  try {
    const response = await page.goto('https://example.com', {
      waitUntil: 'networkidle2',
      timeout: 30000
    });
    const loadTime = await page.evaluate(() => {
      return performance.timing.loadEventEnd - performance.timing.navigationStart;
    });
    console.log(`Status: ${response.status()}, Load Time: ${loadTime}ms`);
  } catch (error) {
    console.error('Website check failed:', error);
  } finally {
    await browser.close();
  }
}

setInterval(checkWebsite, 60000); // Check every minute
API Health Checks
import requests
import time
import json

def health_check():
    endpoints = [
        'https://api.example.com/health',
        'https://api.example.com/status',
        'https://api.example.com/metrics'
    ]

    for endpoint in endpoints:
        try:
            start_time = time.time()
            response = requests.get(endpoint, timeout=10)
            response_time = (time.time() - start_time) * 1000

            metrics = {
                'endpoint': endpoint,
                'status_code': response.status_code,
                'response_time_ms': response_time,
                'timestamp': time.time()
            }
            print(json.dumps(metrics))
        except requests.exceptions.RequestException as e:
            print(f"Error checking {endpoint}: {e}")

if __name__ == "__main__":
    while True:
        health_check()
        time.sleep(30)
Monitoring Best Practices
Baseline Establishment
- Collect baseline metrics for normal operation periods
- Identify seasonal patterns in your application usage
- Set appropriate thresholds based on historical data
- Review and adjust baselines regularly
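The first three practices above reduce to a simple rule of thumb: derive thresholds from the historical distribution rather than picking round numbers. A minimal sketch, assuming roughly stationary and normally distributed samples (for seasonal data you would baseline each period separately):

```python
import statistics

def baseline_threshold(samples, n_sigma=3):
    """Derive an alert threshold from historical samples as
    mean + n_sigma * standard deviation."""
    mean = statistics.fmean(samples)
    stdev = statistics.stdev(samples)
    return mean + n_sigma * stdev
```

With hourly CPU readings of `[50, 52, 48, 51, 49]` percent, the three-sigma threshold lands just under 55%, so an alert at 55% fires only on genuinely unusual load.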
Alert Fatigue Prevention
# Alert grouping and suppression
groups:
  - name: grouped_alerts
    rules:
      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "Service {{ $labels.instance }} is down"

      - alert: HighLatency
        expr: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) > 1
        for: 5m
        labels:
          severity: warning

# Prometheus rule files cannot express alert dependencies; suppressing
# dependent alerts is configured in Alertmanager instead:
# alertmanager.yml
inhibit_rules:
  - source_matchers:
      - alertname = ServiceDown
    target_matchers:
      - alertname = HighLatency
    equal: ['instance']
Monitoring as Code
# monitoring_config.py
import yaml

def generate_prometheus_config():
    config = {
        'global': {
            'scrape_interval': '15s',
            'evaluation_interval': '15s'
        },
        'scrape_configs': []
    }

    services = [
        {'name': 'web-app', 'port': 8080, 'path': '/metrics'},
        {'name': 'database', 'port': 9090, 'path': '/db_metrics'},
        {'name': 'cache', 'port': 9091, 'path': '/cache_metrics'}
    ]

    for service in services:
        scrape_config = {
            'job_name': service['name'],
            'static_configs': [{'targets': [f"localhost:{service['port']}"]}],
            'metrics_path': service['path']
        }
        config['scrape_configs'].append(scrape_config)

    return yaml.dump(config, default_flow_style=False)
Troubleshooting and Optimization
Performance Bottleneck Identification
- Monitor resource utilization trends
- Analyze request patterns and traffic spikes
- Review error logs for common failure modes
- Use distributed tracing for complex request flows
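As a concrete example of mining request data for bottlenecks, one simple approach is ranking endpoints by a high latency percentile. The `(endpoint, latency_seconds)` record format below is an assumption; in practice these records would come from your access logs or APM tool:

```python
def slowest_endpoints(records, percentile=0.95, top_n=3):
    """Rank endpoints by a latency percentile computed from
    (endpoint, latency_seconds) access-log records."""
    by_endpoint = {}
    for endpoint, latency in records:
        by_endpoint.setdefault(endpoint, []).append(latency)

    ranked = []
    for endpoint, latencies in by_endpoint.items():
        latencies.sort()
        # Index of the requested percentile, clamped to the last sample
        idx = min(len(latencies) - 1, int(len(latencies) * percentile))
        ranked.append((endpoint, latencies[idx]))

    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked[:top_n]
```

Ranking by a high percentile rather than the mean surfaces endpoints whose tail latency hurts users even when their average looks healthy.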
Capacity Planning
# capacity_planning.py
import numpy as np
from sklearn.linear_model import LinearRegression

def predict_capacity_needs(historical_data):
    # Prepare time series data
    X = np.array(range(len(historical_data))).reshape(-1, 1)
    y = np.array(historical_data)

    # Train linear regression model
    model = LinearRegression()
    model.fit(X, y)

    # Predict next 30 days
    future_days = np.array(range(len(historical_data), len(historical_data) + 30)).reshape(-1, 1)
    predictions = model.predict(future_days)
    return predictions

# Usage example
cpu_utilization = [45, 47, 52, 48, 51, 55, 58, 62, 59, 61]  # Historical data
future_usage = predict_capacity_needs(cpu_utilization)
print(f"Predicted peak usage in 30 days: {max(future_usage):.1f}%")
Infrastructure monitoring is crucial for maintaining reliable systems. Start with basic metrics collection, gradually implement advanced monitoring patterns, and always focus on actionable alerts that help you respond to issues quickly and effectively.
Content Review
The content in this repository has been reviewed by chevp. Chevp is dedicated to ensuring that the information provided is accurate, relevant, and up-to-date, helping users to learn and implement programming skills effectively.
About the Reviewer
For more insights and contributions, visit chevp's GitHub profile: chevp's GitHub Profile.