06: Observability Stack (Prometheus, Grafana, Loki, Promtail, Jaeger)¶
6.1 Introduction¶
Observability is a critical foundation of any modern microservices platform. In LocalCloudLab, your system consists of:
• Kubernetes (k3s)
• Envoy Gateway
• Multiple .NET APIs (Search, Checkin, etc.)
• PostgreSQL, Redis, RabbitMQ, and other components
To understand what is happening inside such a system, you need metrics, logs, and traces. Observability gives you the capability to answer questions like:
• Is the cluster healthy?
• Are APIs responding quickly?
• Why did a request fail?
• What is the database latency?
• Which service caused a spike in errors?
• How does a request flow through all components?
This section describes how to install and configure a complete observability suite:
• **Prometheus** — metrics collector
• **Grafana** — dashboards and visualization
• **Loki** — log aggregation
• **Promtail** — log ingestion into Loki
• **Jaeger** — distributed tracing
• **OpenTelemetry** — unifying traces, metrics, and logs
• **Correlating logs + traces** inside Grafana UI
This is one of the most powerful sections in the entire LocalCloudLab book.
6.2 Observability Architecture in LocalCloudLab¶
The architecture follows modern cloud-native principles:
                 +------------------------+
                 |       Grafana UI       |
                 |  Dashboards + Explore  |
                 +-----------+------------+
                             |
         +-------------------+-------------------+
         |                   |                   |
   +-----+------+      +-----+-----+       +-----+-----+
   | Prometheus |      |   Loki    |       |  Jaeger   |
   | (metrics)  |      |  (logs)   |       | (traces)  |
   +-----+------+      +-----+-----+       +-----+-----+
         |                   |                   |
   +-----+------+     +------+--------+  +------+--------+
   | kube-state |     |   Promtail    |  | OpenTelemetry |
   |  metrics   |     |  (log agent)  |  |     (SDK)     |
   +------------+     +---------------+  +---------------+
Signal origins:
• Kubernetes components → Prometheus + Loki
• Envoy Proxy → Prometheus metrics + access logs
• .NET APIs → Loki logs + Jaeger traces + Prometheus metrics (OTEL exporter)
• Node metrics (CPU/mem/disk) → Prometheus node exporter
By the end of this chapter, you can open Grafana and examine:
• CPU, memory, network of all nodes
• Pod restarts, failures, readiness issues
• API request latency (p50/p90/p99)
• Database latency from your .NET apps
• Full request traces across Search API → Database → Return
6.3 Installing Prometheus and Grafana (kube-prometheus-stack)¶
We will install the modern, actively maintained, all-in-one chart:
kube-prometheus-stack
This bundle includes:
✔ Prometheus
✔ Grafana
✔ Node exporter
✔ Kube-state-metrics
✔ Alertmanager
This is the most widely used approach for Kubernetes monitoring.
6.3.1 Add Prometheus Helm repo¶
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
6.3.2 Create monitoring namespace¶
kubectl create namespace monitoring
6.3.3 Install kube-prometheus-stack¶
helm install monitoring prometheus-community/kube-prometheus-stack \
  -n monitoring \
  --set grafana.service.type=LoadBalancer
MetalLB will assign Grafana an external IP from its address pool.
If you want Grafana internal-only, set grafana.service.type to ClusterIP instead.
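If you prefer to keep configuration out of the command line, the same options fit in a values file. A minimal sketch (the file name values-monitoring.yaml and the adminPassword override are our choices, not chart defaults):
grafana:
  service:
    type: LoadBalancer        # or ClusterIP for internal-only access
  adminPassword: change-me    # hypothetical value; replaces the default password
Then install with:
helm install monitoring prometheus-community/kube-prometheus-stack -n monitoring -f values-monitoring.yaml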
6.3.4 Verify installation¶
kubectl get pods -n monitoring
You should see:
alertmanager-monitoring-kube-prom-alertmanager-0 Running
grafana-xxxxxx Running
prometheus-monitoring-kube-prom-prometheus-0 Running
kube-state-metrics-xxxxxx Running
prometheus-node-exporter-xxxxx Running
6.3.5 Accessing Grafana¶
Check service:
kubectl get svc -n monitoring | grep grafana
Example output:
monitoring-grafana LoadBalancer 172.18.255.201 80:31848/TCP
Visit in browser:
http://172.18.255.201
Default credentials:
username: admin
password: prom-operator
You should change the password immediately from the Grafana UI.
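If you set the password via Helm or need to recover it later, the chart stores the admin credentials in a Kubernetes Secret (the name monitoring-grafana follows from the release name used above):
kubectl get secret monitoring-grafana -n monitoring -o jsonpath="{.data.admin-password}" | base64 -d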
6.3.6 Importing Dashboards¶
Grafana ships with built-in dashboards:
• Kubernetes / Compute Resources / Node
• Kubernetes / Compute Resources / Pod
• API Server
• Node Exporter
• etc.
You can also import community dashboards, such as:
• Envoy Proxy metrics
• OpenTelemetry Collector metrics
• .NET runtime metrics
• PostgreSQL dashboards
We will configure these later.
6.4 Installing Loki (Log Aggregation)¶
Loki is a horizontally scalable, cost-efficient log aggregation system created by Grafana Labs. It is designed to work like Prometheus does, but for logs:
• Prometheus → metrics
• Loki → logs
But unlike Elasticsearch or Splunk, Loki does not index log content, only labels. This makes Loki extremely fast and lightweight — perfect for LocalCloudLab.
6.4.1 Why Loki for LocalCloudLab?¶
✔ Works seamlessly with Grafana
✔ Very low storage requirements
✔ Ideal for microservices
✔ High performance
✔ Perfect integration with Promtail
In LocalCloudLab:
• .NET APIs send logs to stdout → containerd → Promtail → Loki
• Loki stores the logs
• Grafana Explore provides powerful search capabilities
• Logs include TraceIds → used for correlation with Jaeger traces
Loki is also extremely easy to operate inside k3s.
6.4.2 Install Loki with Helm (non-deprecated chart)¶
We install the official Loki Helm chart in single-binary mode, which is the simplest deployment and fits LocalCloudLab perfectly.
Add the repo:
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
Install (the values below follow the current chart layout; these settings have changed between chart releases, so confirm them with helm show values grafana/loki):
helm install loki grafana/loki -n monitoring --create-namespace \
  --set deploymentMode=SingleBinary \
  --set singleBinary.replicas=1 \
  --set loki.commonConfig.replication_factor=1 \
  --set loki.storage.type=filesystem
Verify:
kubectl get pods -n monitoring | grep loki
6.4.3 Loki Storage Considerations¶
Loki stores logs in:
/var/loki/chunks
/var/loki/index
By default, Loki uses emptyDir (ephemeral). For production-like environments, you may switch to:
• persistentVolumeClaim
• filesystem storage
• object storage (S3/MinIO)
For LocalCloudLab, ephemeral storage is fine because:
• Logs are not mission-critical
• Traces + metrics cover long-term observability
• You can always export logs when needed
Later in Section 12 (Disaster Recovery) we describe how to make Loki persistent.
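For reference, a minimal sketch of what persistence could look like in a values file, assuming the chart's singleBinary.persistence block (verify against helm show values grafana/loki for your chart version):
loki:
  storage:
    type: filesystem          # keep chunks and index on local disk
singleBinary:
  persistence:
    enabled: true
    size: 10Gi
    # storageClass: local-path   # k3s ships this StorageClass by default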
6.5 Installing Promtail (Log Ingestion)¶
Promtail is Loki’s log collector. It runs as a DaemonSet on every Kubernetes node and scrapes:
• /var/log/containers
• /var/log/pods
• /var/log/syslog
• containerd logs
Promtail then enriches logs with Kubernetes labels and sends them to Loki.
6.5.1 Install Promtail via Helm¶
Use the official Grafana chart (current chart versions take the push URL through config.clients; older versions used config.lokiAddress instead):
helm install promtail grafana/promtail -n monitoring \
  --set "config.clients[0].url=http://loki.monitoring.svc.cluster.local:3100/loki/api/v1/push"
Verify:
kubectl get pods -n monitoring | grep promtail
6.5.2 How Promtail Reads Logs¶
Promtail scrapes logs from:
/var/log/pods/<namespace>_<pod>_<uid>/<container>/*.log
Since containerd writes logs there (with symlinks under /var/log/containers), Promtail captures everything.
It automatically attaches or parses:
• CRI timestamps
• the log message
• Kubernetes labels (namespace, pod, container)
• the log level, if you add a pipeline stage for it (see the sketch after this list)
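A sketch of such a stage using the chart's config.snippets.pipelineStages hook (the regex assumes Serilog-style [INF]/[WRN]/[ERR] level tokens; adapt it to your actual log format):
config:
  snippets:
    pipelineStages:
      - cri: {}                                      # parse the containerd log envelope
      - regex:
          expression: '.*\[(?P<level>[A-Z]{3})\].*'  # assumption: Serilog "[INF]" style levels
      - labels:
          level:                                     # promote the capture to a Loki label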
6.5.3 Adding TraceId to .NET Logs (Already Done in Previous Sections)¶
Your .NET APIs already include:
• middleware injecting TraceId into logs
• Serilog output with TraceId and SpanId
• OpenTelemetry tracing
Promtail will pick up these logs.
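For reference, one possible shape of that middleware (a sketch; your implementation from the earlier sections may differ, and it relies on Enrich.FromLogContext() in the Serilog configuration):
// Program.cs: push the current trace context into Serilog's LogContext per request
app.Use(async (context, next) =>
{
    var activity = System.Diagnostics.Activity.Current;
    using (Serilog.Context.LogContext.PushProperty("TraceId", activity?.TraceId.ToString()))
    using (Serilog.Context.LogContext.PushProperty("SpanId", activity?.SpanId.ToString()))
    {
        await next();
    }
});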
Later in this document, you will see how Grafana can correlate:
Logs → Traces → Metrics
All in a single UI screen.
6.6 Configuring Grafana to Use Loki¶
Once Loki and Promtail are running, you must add Loki as a data source in Grafana.
6.6.1 Login to Grafana¶
Visit:
http://<grafana-external-ip>
Login with:
admin / prom-operator
(unless you have already changed the password)
6.6.2 Add Loki Data Source¶
In Grafana UI:
→ Configuration (gear icon)
→ Data Sources
→ Add data source
→ Select “Loki”
Set URL:
http://loki.monitoring.svc.cluster.local:3100
Click Save & Test.
You should see:
Data source is working
6.6.3 Exploring Logs in Grafana¶
Go to:
Explore → Loki
Try filtering logs by namespace:
{namespace="search"}
Or by TraceId (assuming your log format includes “TraceId=”):
{app="search-api"} |= "TraceId="
This allows you to jump directly from logs to traces.
6.6.4 Useful LogQL Examples¶
All logs for Search API:¶
{app="search-api"}
Only warnings and errors (match whatever level token your log template actually emits):¶
{app="search-api"} |= "warn"
{app="search-api"} |= "error"
Filter by path:¶
{app="search-api"} |= "/api/search"
Filter by TraceId:¶
{namespace="search"} |= "TraceId=12345"
Count logs per level:¶
sum by (level) (count_over_time({app="search-api"}[5m]))
You will get log volume per severity over time (this requires the level label from the Promtail pipeline stage shown in 6.5.2).
6.7 Connecting Logs to Traces (Cross-Signal Correlation)¶
This is one of the most powerful capabilities in LocalCloudLab.
Because:
• Logs include TraceId
• Traces include TraceId
• Grafana can read both Loki and Jaeger
Grafana allows you to:
1. View a log line
2. Click “View Trace”
3. Jump directly into the full request trace in Jaeger
Or the reverse:
1. Open a trace in Jaeger
2. Click a span
3. See logs emitted during that span
This gives a production-grade debugging workflow comparable to:
• AWS X-Ray
• Datadog
• New Relic
• Elastic APM
Later in Section 6.9 we configure Jaeger fully.
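Under the hood, the "View Trace" button is driven by derived fields on the Loki data source. A minimal provisioning sketch (assumptions: the TraceId=... log format from earlier sections, and jaeger as the Jaeger data source UID; check yours in the data source settings):
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    url: http://loki.monitoring.svc.cluster.local:3100
    jsonData:
      derivedFields:
        - name: TraceID
          matcherRegex: "TraceId=(\\w+)"   # the capture group becomes the link value
          datasourceUid: jaeger            # assumption: UID of your Jaeger data source
          url: "$${__value.raw}"           # internal link; $$ escapes provisioning expansion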
6.8 Installing Jaeger (Distributed Tracing)¶
Jaeger is a CNCF graduated tracing system originally developed by Uber. It provides:
• End-to-end request tracing
• Span visualization
• Latency analysis
• Service dependency graphs
• Integration with OpenTelemetry
In LocalCloudLab, Jaeger receives traces from:
• Your .NET APIs via OTLP exporter
• Envoy Gateway (optional)
• Any future microservices
We will deploy the Jaeger All-In-One Helm chart, which is perfectly suitable for a single-node environment and lightweight enough for k3s.
6.8.1 Add Jaeger Helm Repository¶
helm repo add jaegertracing https://jaegertracing.github.io/helm-charts
helm repo update
6.8.2 Install Jaeger All-In-One¶
helm install jaeger jaegertracing/jaeger -n monitoring \
  --set provisionDataStore.cassandra=false \
  --set allInOne.enabled=true \
  --set storage.type=memory \
  --set agent.enabled=false \
  --set collector.enabled=false \
  --set query.enabled=false
(As with the other charts, values change between releases; confirm with helm show values jaegertracing/jaeger.)
This installs a single all-in-one pod bundling:
• Jaeger collector (with OTLP ingestion)
• Jaeger query UI
• In-memory storage (sufficient for LocalCloudLab)
Check pods:
kubectl get pods -n monitoring | grep jaeger
6.8.3 Expose Jaeger UI¶
By default, the Jaeger query service is ClusterIP. To access from your browser, patch it:
kubectl patch svc jaeger-query -n monitoring -p '{"spec": {"type": "LoadBalancer"}}'
MetalLB will assign an IP, e.g.:
172.18.255.202
6.8.4 Accessing Jaeger¶
Open:
http://172.18.255.202
You will see:
• Search traces
• Filter by service
• View spans
• Timeline waterfall UI
• Trace logs/events
Later, we will enable:
• Trace → Log correlation
• Service dependency graph
6.9 Configuring .NET APIs for OpenTelemetry Tracing¶
Your APIs already have OpenTelemetry installed, but this section describes the best practices for:
• Automatic instrumentation
• Database tracing
• Custom spans
• Correct OTLP configuration
OpenTelemetry sends traces to Jaeger via:
OTLP → Jaeger Collector → Jaeger UI
6.9.1 Required NuGet Packages¶
OpenTelemetry
OpenTelemetry.Extensions.Hosting
OpenTelemetry.Instrumentation.AspNetCore
OpenTelemetry.Instrumentation.Http
OpenTelemetry.Instrumentation.SqlClient
OpenTelemetry.Exporter.OpenTelemetryProtocol
Install each package, for example:
dotnet add package OpenTelemetry.Exporter.OpenTelemetryProtocol
6.9.2 Configure OTEL in Program.cs¶
Example (note: with OtlpExportProtocol.HttpProtobuf, the endpoint must include the /v1/traces path, because the .NET exporter does not append it for you):
using OpenTelemetry.Exporter;
using OpenTelemetry.Resources;
using OpenTelemetry.Trace;

builder.Services.AddOpenTelemetry()
    .WithTracing(tpb =>
    {
        tpb.SetResourceBuilder(
                ResourceBuilder.CreateDefault()
                    .AddService("SearchAPI"))
            .AddAspNetCoreInstrumentation()
            .AddHttpClientInstrumentation()
            .AddSqlClientInstrumentation()
            .AddOtlpExporter(o =>
            {
                o.Endpoint = new Uri("http://jaeger-collector.monitoring.svc.cluster.local:4318/v1/traces");
                o.Protocol = OtlpExportProtocol.HttpProtobuf;
            });
    });
Replace "SearchAPI" with your actual service name.
For Checkin API:
AddService("CheckinAPI")
6.9.3 Database Spans¶
OpenTelemetry automatically traces SQL queries when AddSqlClientInstrumentation is enabled (see the note at the end of this subsection for PostgreSQL/Npgsql specifics).
You will see spans like:
SELECT * FROM Hotels WHERE Id = $1
INSERT INTO SearchHistory …
Each includes:
• DB latency
• Response size
• Error details
• Connection time
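A LocalCloudLab-specific caveat: AddSqlClientInstrumentation only covers Microsoft.Data.SqlClient (SQL Server). For PostgreSQL, Npgsql 6+ emits spans through its own built-in ActivitySource, which you subscribe to by name. A minimal sketch of the extra line in the tracing pipeline:
.WithTracing(tpb => tpb
    .AddAspNetCoreInstrumentation()
    .AddSource("Npgsql")       // Npgsql's built-in ActivitySource; no extra instrumentation package needed
    .AddOtlpExporter());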
6.9.4 Custom Spans¶
Example (the tracer comes from the OpenTelemetry API shim; for these spans to be exported, register the same name with .AddSource("SearchAPI") in the WithTracing pipeline):
using OpenTelemetry.Trace;

var tracer = TracerProvider.Default.GetTracer("SearchAPI");
using var span = tracer.StartActiveSpan("CalculatePrice");
span.SetAttribute("hotelId", hotelId);
await _priceService.CalculateAsync();
6.9.5 TraceId Propagation to Logs¶
You already enrich logs with:
TraceId={TraceId} SpanId={SpanId}
Logs: Promtail → Loki → Grafana
Traces: OpenTelemetry → Jaeger → Grafana
Grafana can now correlate logs + traces end-to-end.
6.10 Adding Jaeger as a Data Source in Grafana¶
Grafana natively supports Jaeger.
6.10.1 In Grafana UI:¶
→ Configuration → Data Sources → Add data source → Jaeger
Set URL:
http://jaeger-query.monitoring.svc.cluster.local:16686
Click Save & Test.
Status should be:
Data source is working
6.10.2 Viewing Traces in Grafana¶
Go to:
Explore → Jaeger
You can:
• Search for specific services
• Filter by operation names
• View end-to-end latency
• Drill into span attributes
6.11 End-to-End Observability Validation¶
Now let’s validate that:
• Logs work
• Metrics work
• Traces work
• Correlation works
6.11.1 Generate Traffic¶
From your machine:
curl https://search.hershkowitz.co.il/health
curl https://search.hershkowitz.co.il/api/search?q=test
curl https://checkin.hershkowitz.co.il/health
6.11.2 Validate Metrics in Grafana Dashboards¶
Dashboards to check:
• Kubernetes / Compute Resources / Pod
• API Server dashboard
• Node Exporter dashboard
• .NET Runtime metrics dashboard (importable)
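You can also sanity-check API latency directly with PromQL. A sketch of a p99 query, assuming the ASP.NET Core OpenTelemetry metrics land in Prometheus under the conventional http_server_request_duration_seconds histogram (verify the exact metric name in the Prometheus UI before relying on it):
histogram_quantile(0.99,
  sum by (le) (
    rate(http_server_request_duration_seconds_bucket{namespace="search"}[5m])
  )
)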
6.11.3 Validate Logs (Loki)¶
In Grafana Explore:
{namespace="search"} |= "TraceId="
You should see logs from Search API with a TraceId.
6.11.4 Validate Traces (Jaeger)¶
In Jaeger UI:
Service → SearchAPI
Find Traces
You should see request flow:
Client → Envoy → Search API → PostgreSQL
6.11.5 Validate Log ↔ Trace Correlation¶
In Grafana Explore (Loki):
Pick a log from Search API → click “View Trace”.
Grafana should open:
Jaeger → TraceView → Spans
6.11.6 Validate Trace ↔ Log Correlation¶
Inside a span in Grafana's trace view, inspect:
• Events section
• Attributes (TraceId, SpanId)
Click "Logs for this span" → Grafana opens a Loki query scoped to that trace. (This button is driven by the "Trace to logs" settings on the Jaeger data source.)
6.12 Summary of Section 6¶
At this point, LocalCloudLab has a world-class observability suite:
✔ Prometheus for metrics
✔ Grafana for visualization
✔ Loki for logs
✔ Promtail for ingestion
✔ Jaeger for tracing
✔ OpenTelemetry for unified instrumentation
✔ Cross-system correlation between logs, traces, and metrics
You now have the ability to diagnose:
• Slow database queries
• API latency spikes
• Network bottlenecks
• Failing routes
• Distributed system issues
The observability stack you have built competes with:
• Datadog
• New Relic
• Elastic Observability
• AWS X-Ray
• Azure Monitor
At a fraction of the cost—because it is all open-source.
Next section (Section 7) will cover: PostgreSQL, Schema Management, Backups, and Failover Strategy
(End of Section 06 — Complete)