Loki and Logging-Operator – a Quick-start

In this article I will provide a working setup for log handling with Grafana Loki and the Kubernetes logging operator. You need a Kubernetes cluster with the kube-prometheus-stack Helm chart and the logging operator already installed. To keep it simple, there is no access control and we use filesystem storage for persistence.

Motivation

When setting up a Rancher-managed RKE2 cluster, one often uses the Helm charts provided by Rancher to install frameworks for monitoring and logging. The rancher-monitoring chart is basically the kube-prometheus-stack Helm chart. Although it is currently a few versions behind its upstream, you get all components for a ready-to-use metrics observability setup (Prometheus, Alertmanager and Grafana). With the rancher-logging chart things look different: what you get there is an (unconfigured) logging operator framework which lets you collect logs and forward them to a target, but the target itself is missing. As we already have Grafana in place, using Loki for log handling seems obvious. However, I did not find any working example setup, so here is one that should do the job.

The configurations in this article work out of the box, except for one dependency: in our setup, Loki is installed in the namespace platform. If you use another namespace, adjust the Loki URL accordingly.

Configuring the Logging Operator

The logging operator uses Flows to collect the logs and Outputs to forward them to a target. We want to collect all logs of the cluster and send them to one single output, so we need just one ClusterFlow and one ClusterOutput.

The ClusterFlow is quite simple:

apiVersion: logging.banzaicloud.io/v1beta1
kind: ClusterFlow
metadata:
  name: all-logs-to-loki
  namespace: cattle-logging-system
spec:
  globalOutputRefs:
    - loki

The ClusterOutput needs a few more details. This is mainly because the key to a usable Loki setup is choosing a good set of labels, and good mainly means limiting the combined label cardinality. Loki groups all logs into buckets, and for every existing label combination a separate bucket is created. So if your setup consists of four labels with 20 possible values each, you already end up with 20*20*20*20 = 160,000 potential buckets, which is far more than Loki can handle efficiently. There is no exact limit, but from my experience, try to keep the real cardinality below 1,000. Hence, I strongly recommend starting with a very limited number of labels and only extending it if it really turns out to be too small for your needs.

The ClusterOutput in my setup is:

# See https://kube-logging.dev/docs/configuration/plugins/outputs/loki/
# See https://kube-logging.dev/docs/configuration/plugins/outputs/buffer/

apiVersion: logging.banzaicloud.io/v1beta1
kind: ClusterOutput

metadata:
  # this is the name referenced by the ClusterFlow
  name: "loki"
  namespace: "cattle-logging-system"

spec:
  loki:
    # adapt the namespace in the URL if necessary
    url: http://loki-gateway.platform.svc.cluster.local
    buffer:
      compress: gzip
      flush_mode: interval
      flush_thread_count: 10
      # compromise between fast visibility of logs in Grafana and number of pushes
      flush_interval: 5s
    # nobody needs the thread label in the output.
    include_thread_label: false
    labels:
      namespace: $.kubernetes.namespace_name
      # We may even think of removing this one:
      node: $.kubernetes.host
      service_name: $.kubernetes.container_name
      stream: $.stream
    remove_keys:
      # Those keys are written to every single log line, so we try to save some space by removing what we don't need.
      - $.logtag
      - $.kubernetes.labels
      - $.kubernetes_namespace
      - $.kubernetes.docker_id
      - $.kubernetes.pod_id
      - $.kubernetes.container_hash
      - $.kubernetes.container_image
      - $.kubernetes.annotations

As you can see, this setup defines only four labels, all of them with very limited cardinality. For example, pod_name is not among them: pods are created and destroyed quite often, so their cardinality would quickly become large. If you are searching for a specific pod, use a line filter on the pod name instead.

Side note: as an alternative to labels and the content of the log lines themselves, Loki also offers structured metadata as a way to store attributes related to log output. Recent versions of Fluent Bit support it, so feel free to give it a try.

Now that we are ready to send logs to Loki, let's set up the target.

Loki

For the Loki installation we use the Loki Helm chart provided by Grafana. The installation via Helm is already covered by the chart's documentation, so we concentrate on the values you need to configure.

For clusters with limited log volume, filesystem storage is totally sufficient. Of course, if you have object storage available, feel free to use it; for larger log volumes it will be necessary anyway.
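
If you do go the object storage route, the storage section of the chart values changes along these lines. This is only a hedged sketch: the bucket names, endpoint and credentials are placeholders, and the field names are assumptions based on the grafana/loki chart documentation, so check them against your chart version.

loki:
  storage:
    type: s3
    bucketNames:
      # placeholder bucket names
      chunks: loki-chunks
      ruler: loki-ruler
      admin: loki-admin
    s3:
      # placeholder endpoint, region and credentials
      endpoint: https://s3.example.com
      region: eu-central-1
      accessKeyId: <access-key-id>
      secretAccessKey: <secret-access-key>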

Filesystem storage requires Loki to run in SingleBinary mode, which does not support scaling. The drawback is obvious: Loki cannot scale beyond the capacity of the node it runs on. But it keeps the setup simple. In our cluster we use Longhorn as the storage provider, so the volume Loki stores its data on is highly available and not bound to a certain node. Without Longhorn or a similar storage provider the setup will still work, but Loki is then pinned to the node where it came up on first start.
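
If Longhorn (or another provider) is not your cluster's default storage class, you can pin Loki's volume explicitly. A minimal sketch of such a values fragment, assuming the singleBinary.persistence fields of the grafana/loki chart (verify the field names against your chart version); the values below simply rely on the chart and cluster defaults and therefore omit it:

singleBinary:
  persistence:
    # assumption: pin the PVC to Longhorn instead of the default storage class
    storageClass: longhorn
    size: 10Gi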

The complete values.yaml of our setup looks like this:

deploymentMode: "SingleBinary"

# If you do not run RKE2 or use a different DNS setup, you need to adapt or remove this definition
global:
  dnsService: rke2-coredns-rke2-coredns

loki:
  commonConfig:
    replication_factor: 1
  auth_enabled: false
  monitoring:
    selfMonitoring:
      enabled: false
      grafanaAgent:
        installOperator: false
  storage:
    type: 'filesystem'
  schemaConfig:
    # As of now, this is the latest available storage schema.
    configs:
      - from: "2024-05-01"
        object_store: filesystem
        store: tsdb
        schema: v13
        index:
          prefix: index_
          period: 24h
  limits_config:
    max_line_size: 0
    ingestion_rate_mb: 500
    # In case you want to use them :-)
    allow_structured_metadata: true
    volume_enabled: true
    # Watch the usage of the volume and adjust, if needed.
    retention_period: 180d
  # Cool recent feature of Loki, tries to find patterns in your logs and groups them accordingly.
  # Needs a recent version of Grafana
  pattern_ingester:
    enabled: true
  ingester:
    # Depending on your log volume this may need tuning
    # wait up to 2 hours for chunks to be filled
    chunk_idle_period: 120m
    # uncompressed maximum block size per chunk
    chunk_block_size: 4194304
lokiCanary:
  enabled: false
test:
  enabled: false
write:
  replicas: 0
read:
  replicas: 0
backend:
  replicas: 0
singleBinary:
  replicas: 1
  service:
    annotations:
      prometheus.io/port: "3100"
      prometheus.io/scrape: "true"
# Deploys a set of dashboards and alert rules useful to check and tune the setup
monitoring:
  dashboards:
    enabled: true
  rules:
    enabled: true
  serviceMonitor:
    enabled: true

A few settings may need attention:

  • The values don’t include a namespace configuration. We do GitOps with Rancher Fleet; the namespace platform is set there in fleet.yaml (see the sketch after this list).
  • The volume size is not set in this config, so the default of 10Gi applies. Depending on your needs, set a different volume size or adjust the retention period.
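
For reference, a sketch of what the Fleet side could look like: a minimal fleet.yaml assuming Fleet's Helm fields and the grafana/loki chart, where the chart reference, the values file path and the 50Gi size are placeholders for your own setup.

# fleet.yaml (sketch)
defaultNamespace: platform
helm:
  repo: https://grafana.github.io/helm-charts
  chart: loki
  releaseName: loki
  valuesFiles:
    - values.yaml
  values:
    # example override: request a larger volume than the 10Gi default
    # (field names assumed from the grafana/loki chart)
    singleBinary:
      persistence:
        size: 50Gi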

You now need to tell Grafana about Loki. This is done using a datasource definition. Grafana can be configured to load datasources from config maps. The definition in our case is:

apiVersion: v1
kind: ConfigMap
metadata:
  name: loki-datasource
  namespace: cattle-monitoring-system
  labels:
    grafana_datasource: "1"
data:
  loki-datasource.yaml: |-
    apiVersion: 1
    datasources:
      - name: Loki
        type: loki
        access: proxy
        # Adjust if using a different namespace
        url: http://loki-gateway.platform.svc.cluster.local
        jsonData:
          timeout: 60

For Grafana to check for new datasources, alerts and dashboards in config maps, the following – strongly recommended – configuration has to be part of the values definition of the rancher-monitoring / kube-prometheus-stack chart, so add it if it is not yet there.

grafana:
  sidecar:
    dashboards:
      enabled: true
      searchNamespace: ALL
    datasources:
      enabled: true
      searchNamespace: ALL
    alerts:
      enabled: true
      searchNamespace: ALL

You should now see Loki as a data source in Grafana and be able to search logs or dig through them with the recently added Explore / Drilldown features.

Caveats

  • The setup does not contain any access control. As Loki is only reachable from within the Kubernetes cluster, this should be OK in many cases. But make sure that authentication is active for your Grafana UI.
  • There is no filtering of your logs, so they may contain sensitive data. If this is a problem, either the logging of your workloads or the ClusterFlow may need tuning (see the sketch after this list).
  • The default Loki dashboards don’t work well with the SingleBinary setup, so many visualizations show no data. Use the metrics explorer instead.
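
For the second point, a minimal sketch of what such filtering could look like in the ClusterFlow, using the logging operator's grep filter; the key and the pattern below are placeholders for whatever counts as sensitive in your workloads:

apiVersion: logging.banzaicloud.io/v1beta1
kind: ClusterFlow
metadata:
  name: all-logs-to-loki
  namespace: cattle-logging-system
spec:
  filters:
    - grep:
        exclude:
          # placeholder: drop every log line whose "log" field matches the pattern
          - key: log
            pattern: /password/
  globalOutputRefs:
    - loki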

Happy searching!
