16: Logging & Observability (Serilog, Loki, Grafana, OpenTelemetry)¶
Section 16 explains how LocalCloudLab achieves full-stack observability using:
• Serilog (structured logging in .NET)
• Loki (log aggregation)
• Grafana (dashboards & visualization)
• OpenTelemetry (traces & metrics)
• Log/trace correlation (TraceId, SpanId)
• Best practices for performance, cost, and clarity
This section is critical for debugging production issues, performance tuning, and proving correctness across the entire system.
16.1 Observability Architecture Overview¶
Your cluster uses a three-layer observability stack:
1. Logs → Loki¶
• Structured JSON logs
• Easy querying
• Correlated with traces
2. Traces → OpenTelemetry → Tempo/Jaeger¶
• Distributed tracing
• Understand request flows
• Profile system performance
3. Metrics → Prometheus → Grafana¶
• CPU, memory, latency distributions
• Error rates
• Custom app metrics
These three pillars provide:
✓ Root cause analysis
✓ Performance tuning
✓ Debugging across microservices
✓ Infrastructure monitoring
16.2 Logging Strategy in LocalCloudLab¶
We use Serilog with:
• JSON formatting
• TraceId & SpanId enrichment
• Minimal string templates
• Structured log events
You already added:
"outputTemplate": "[{Timestamp:HH:mm:ss} {Level:u3}] {Message:lj} | TraceId={TraceId} SpanId={SpanId}{NewLine}"
But Loki works best with JSON logs, not plain text.
Recommended Serilog JSON configuration¶
In appsettings.json:
"Serilog": {
"Using": [ "Serilog.Sinks.Console", "Serilog.Sinks.Seq" ],
"MinimumLevel": {
"Default": "Information",
"Override": {
"Microsoft": "Warning",
"System": "Warning"
}
},
"WriteTo": [
{
"Name": "Console",
"Args": {
"formatter": "Serilog.Formatting.Json.JsonFormatter, Serilog"
}
},
{
"Name": "Seq",
"Args": {
"ServerUrl": "http://seq.infra.svc.cluster.local"
}
}
],
"Enrich": [ "FromLogContext", "WithSpan", "WithTraceId" ]
}
JSON logs ensure:
• Loki can parse fields
• TraceId and SpanId attach correctly
• Querying by field becomes trivial
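For the JSON configuration above to take effect, Serilog has to be wired into the host. A minimal sketch, assuming the Serilog.AspNetCore and Serilog.Settings.Configuration packages:

```csharp
using Serilog;

var builder = WebApplication.CreateBuilder(args);

// Read the "Serilog" section from appsettings.json (sinks, levels, enrichers)
builder.Host.UseSerilog((context, loggerConfig) =>
    loggerConfig.ReadFrom.Configuration(context.Configuration));

var app = builder.Build();

// One structured event per HTTP request instead of framework noise
app.UseSerilogRequestLogging();

app.Run();
```

UseSerilogRequestLogging replaces the several framework-level info messages per request with a single summary event, which keeps Loki ingest volume down.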
16.3 Adding TraceId and SpanId to All Logs¶
Inside Program.cs:
builder.Services.AddOpenTelemetry()
.WithTracing(t =>
{
t.AddAspNetCoreInstrumentation();
t.AddHttpClientInstrumentation();
t.AddEntityFrameworkCoreInstrumentation();
t.AddRedisInstrumentation();
t.AddOtlpExporter();
});
Then push the identifiers into the log context. PushProperty returns an IDisposable, so scope it (e.g. per request) rather than leaking properties:
using (LogContext.PushProperty("TraceId", Activity.Current?.TraceId.ToHexString()))
using (LogContext.PushProperty("SpanId", Activity.Current?.SpanId.ToHexString()))
{
    // events logged inside this scope carry both properties
}
Or use Serilog enrichment package:
<PackageReference Include="Serilog.Enrichers.Span" Version="3.0.0" />
This makes logs and traces automatically link.
16.4 What Should Be Logged?¶
Log these:¶
• Errors (with stack traces)
• Warnings
• Business events ("SearchCompleted", "CheckinCompleted")
• Integration failures (database, Redis, RabbitMQ)
• Performance outliers
• External API call failures
Do NOT log:¶
✗ Credentials
✗ Tokens
✗ Raw exceptions from third-party frameworks
✗ Full HTTP bodies for large requests
Goal: Logs must be helpful, not expensive.
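As a sketch of the difference — SearchCompleted is one of the business events above; the property names here are illustrative:

```csharp
// Good: a structured business event with small, queryable fields
Log.Information(
    "SearchCompleted for {Query} returned {ResultCount} results in {ElapsedMs} ms",
    query, results.Count, stopwatch.ElapsedMilliseconds);

// Bad: serializing entire request/response objects into the log stream
Log.Information("Search finished: {@Request} {@Response}", request, response);
```

The first line produces three indexed fields you can filter on in Loki; the second produces a large blob that inflates storage and is hard to query.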
16.5 Loki Configuration in Kubernetes¶
Loki receives logs from:
• Promtail (daemonset)
• Container stdout (Serilog JSON)
In k3s, Promtail automatically collects logs from:
/var/log/containers/*.log
Make sure Promtail is installed (from earlier sections).
Verify Promtail is running¶
kubectl get pods -n monitoring
You should see:
promtail-xxxxx
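The Helm chart defaults are usually enough, but if you want the log level or TraceId available for filtering, a pipeline stage can parse the Serilog JSON. A sketch of the relevant scrape-config fragment — key names follow Promtail's pipeline_stages schema; adapt it to your chart values:

```yaml
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    pipeline_stages:
      # Parse the Serilog JSON written to stdout
      - json:
          expressions:
            level: Level
            trace_id: TraceId
      # Promote severity to a label; keep label cardinality low —
      # do NOT turn trace_id into a label
      - labels:
          level:
```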
16.6 Grafana LogQL Basics¶
Query logs:
{app="search-api"}
Filter by severity (substring match):
{app="search-api"} |= "error"
Or parse the JSON and filter on the level field:
{app="search-api"} | json | Level = "Error"
Find logs associated with a trace (TraceId is a log field, not a stream label, so parse first):
{app="search-api"} | json | TraceId = "e4b9f47f2e2c4db0"
Find slow queries:
{app="search-api"} |= "db" |= "duration"
Find Redis issues:
{app="search-api"} |= "redis"
Find rabbit failures:
{app="search-api"} |= "rabbit"
16.7 Correlating Logs & Traces in Grafana¶
Grafana links logs to traces through derived fields configured on the Loki data source, which extract identifiers from each log line:
• trace_id / TraceId
• span_id / SpanId
You should structure logs like:
{
"Timestamp": "...",
"Level": "Error",
"Message": "Search failed...",
"TraceId": "82df10e6c98192af",
"SpanId": "51ec81f3d0ab1c0d"
}
When viewing logs:
→ Click “View Trace”
→ Grafana opens Jaeger/Tempo trace view
→ You see upstream/downstream spans
This is essential for debugging multi-service requests.
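The "View Trace" link comes from a derived field on the Loki data source. A provisioning sketch — the Tempo datasourceUid, the Loki URL, and the regex are assumptions you should match to your own setup:

```yaml
datasources:
  - name: Loki
    type: loki
    url: http://loki.monitoring.svc.cluster.local:3100
    jsonData:
      derivedFields:
        - name: TraceID
          # Pull the trace id out of the Serilog JSON line
          matcherRegex: '"TraceId":\s*"(\w+)"'
          url: '$${__value.raw}'
          datasourceUid: tempo
```

The doubled `$$` escapes Grafana's environment-variable interpolation in provisioning files.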
16.8 Building Grafana Dashboards¶
Create dashboards with panels:
Panel: Request Latency (p50, p90, p95, p99)¶
histogram_quantile(0.95, sum(rate(http_server_request_duration_seconds_bucket[5m])) by (le))
Panel: Error Rate¶
sum(rate(http_server_request_duration_seconds_count{status!~"2.."}[5m]))
(Metric and label names vary by exporter version — newer OpenTelemetry semantic conventions use http_response_status_code for the status label.)
Panel: Database Query Latency¶
sum by(operation) (rate(sqlclient_command_duration_seconds_sum[1m]))
/ sum by(operation) (rate(sqlclient_command_duration_seconds_count[1m]))
(Dividing summed durations by call count gives average latency; the _count rate on its own is throughput, not latency.)
Panel: Redis Operation Time¶
sum(rate(redis_client_commands_duration_seconds_sum[5m]))
/ sum(rate(redis_client_commands_duration_seconds_count[5m]))
Panel: RabbitMQ Throughput¶
rate(rabbitmq_messages_published_total[1m])
Panel: CPU, Memory, Pods¶
Built into node exporter + kube-state-metrics.
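Dashboards pair naturally with alerts. A sketch of a Prometheus alerting rule built on the error-rate query above — the threshold, group name, and severity label are placeholders:

```yaml
groups:
  - name: search-api-alerts
    rules:
      - alert: HighErrorRate
        # Fires when non-2xx responses exceed 1 req/s for 5 minutes
        expr: sum(rate(http_server_request_duration_seconds_count{status!~"2.."}[5m])) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "search-api non-2xx rate above 1 req/s for 5 minutes"
```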
16.9 Cost Optimization for Logs¶
Do:¶
✓ JSON logging
✓ Avoid logging every request
✓ Keep retention short (7–14 days)
✓ Sample noisy logs (debug-level)
Don't:¶
✗ Log entire objects
✗ Log huge payloads
✗ Use text logs (inflate size)
Goal: Beautiful observability, tiny storage cost.
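Retention is enforced by Loki's compactor. A minimal sketch of the config keys involved, assuming a Loki 2.x single-binary deployment (14 days = 336h):

```yaml
compactor:
  retention_enabled: true
limits_config:
  retention_period: 336h
```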
16.10 Summary of Section 16¶
You now have:
✔ Fully structured logs (Serilog → Loki)
✔ Full OpenTelemetry tracing
✔ Automatic trace-log correlation
✔ Real-time Grafana dashboards
✔ Alerting ideas for future work
✔ A production-grade observability stack
Next section:
Section 17 – Scaling & Resilience (Horizontal Pod Autoscaling, Resource Limits, Readiness Probes, Retry Strategy)