Navigating Kubernetes Incident Response with Falco, CRIU and OpenFaaS

Mirko Schicchi
Fraktal
Sep 28, 2023

In the dynamic world of Kubernetes, containers are ephemeral entities, constantly being created, destroyed, and recreated. This transient nature, while beneficial for scalability and resilience, poses a significant challenge when it comes to forensic investigations. When a security incident occurs within a container, the ephemeral nature of Kubernetes can often erase crucial evidence before investigators have a chance to examine it.

We set out to solve this problem by building a container checkpointing solution. It allows us to automatically capture a snapshot of an affected container, preserving its state for detailed analysis. This snapshot can serve as a valuable piece of evidence, enabling forensic investigators to thoroughly examine the container without disrupting the attacker’s path. This not only aids in understanding the nature of the attack but also helps in devising robust countermeasures.

In this blog we delve into creating a simple, yet effective, incident response mechanism within Kubernetes, leveraging Falco for threat detection, CRIU for container snapshotting, and OpenFaaS for automating responses.

The solution

Falco will serve as our primary detection tool, identifying potential security threats within our Kubernetes clusters. When a threat is detected, Falco will trigger an OpenFaaS function that uses CRIU to take a snapshot of the compromised container, creating a streamlined response mechanism.

Solution architecture and event sequence.

In essence, our approach brings together real-time alerting, orchestrated response, and non-disruptive forensics. It empowers us to detect, respond to, and learn from security incidents without interrupting the attacker or compromising the integrity of our containerized applications.

This approach can be expanded into a complete k8s detection and response engine, where OpenFaaS functions are responsible for automating response activities based on the type and severity of the malicious behavior detected by Falco.

In part 2 of the blog, we will dive into conducting forensic analysis on the offending container’s snapshot using automation.

Overview of the main components

Before diving into the details of how the technologies we are going to use integrate with each other, let’s give a short description of the main components.

Detecting malicious activity with Falco

Falco is an open-source, cloud-native runtime security project that was originally developed by Sysdig. It is designed to detect anomalous activity in applications running in a Kubernetes environment. Falco works by continuously monitoring and detecting container, application, host, and network activity, and then comparing this activity to a set of rules that define malicious behavior. If any activity violates these rules, Falco generates an alert.

Falco leverages the Linux kernel’s system call information via eBPF (extended Berkeley Packet Filter) and the Kubernetes audit log to gain deep visibility into the behavior of the system. This allows it to provide detailed information about the offending process, including its command line, system call, and network activity.

Falco can be set up to send alerts to multiple outputs, including an HTTP output.

Sending alerts to a REST endpoint would be highly useful in our case, as it will allow us to seamlessly integrate Falco alerts with OpenFaaS functions.

OpenFaaS: Function as a Service

OpenFaaS is a serverless framework that allows you to package code into lightweight functions and deploy them in a Kubernetes cluster. It simplifies the deployment and scaling of functions, making it easy to build event-driven architectures. These functions can be triggered based on various events, such as HTTP requests, message queues, or custom events.

CRIU: Checkpoint/Restore in Userspace

CRIU (Checkpoint/Restore in Userspace) is a powerful tool that enables the checkpointing and restoration of running processes. By capturing the state of a container, including its memory, file system, and network connections, CRIU provides a mechanism for analysis, debugging, and recovery.

Setting up the Kubernetes cluster

Before we dive into the technical intricacies of integrating Falco, CRIU, and OpenFaaS for enhanced security, it’s essential to ensure that our Kubernetes cluster is appropriately configured. In this section, we’ll walk through the prerequisites and the steps needed to prepare our cluster for this robust security setup.

Kubernetes cluster deployment

First, you need a running Kubernetes cluster. In our case, we’ve deployed it on our local machine for development and testing purposes. You can choose your preferred method for cluster deployment, such as using Minikube, Kind, or a managed Kubernetes service like GKE or EKS, depending on your environment and requirements.

Container Runtime: CRI-O

Our security strategy hinges on the capabilities of CRI-O, a lightweight container runtime specifically designed for Kubernetes. To harness the full potential of CRI-O and its checkpoint features, ensure that you have CRI-O version 1.25.0 or higher installed on the nodes running the cluster and configured as the container runtime for the cluster itself.

To install or upgrade CRI-O, you can follow the official documentation for your specific operating system and Kubernetes distribution.

Additionally, to use checkpointing in combination with CRI-O, we need to enable CRIU support. To do so, the runtime needs to be started with the command-line option `--enable-criu-support=true`.
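If you prefer a persistent configuration over the command-line flag, the same setting can be placed in a CRI-O drop-in configuration file (the file name below is just an example):

```toml
# /etc/crio/crio.conf.d/05-enable-criu.conf
[crio.runtime]
enable_criu_support = true
```

After adding the file, restart CRI-O (for example with `systemctl restart crio`) for the change to take effect.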

Enabling ContainerCheckpoint feature gate

The possibility of taking checkpoints of containers is still an alpha feature in Kubernetes and so it is behind a so-called feature gate. To have access to this functionality, we need to enable the ContainerCheckpoint feature gate on the Kubernetes cluster. This feature gate is essential for exposing the checkpointing functionality to the kubelet and other core components.

To enable the feature gate, we need to run the kubelet service with the flag `--feature-gates=ContainerCheckpoint=true`.
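Alternatively, instead of passing the flag directly, the feature gate can be enabled through the kubelet configuration file (the path may vary by distribution):

```yaml
# /var/lib/kubelet/config.yaml (KubeletConfiguration)
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  ContainerCheckpoint: true
```

Restart the kubelet after changing the configuration.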

Building the solution

Let’s discuss the steps for creating our solution.

Creating an OpenFaaS function for CRIU snapshots

In our quest to bolster Kubernetes security through real-time forensics, we’ve integrated Falco and CRIU seamlessly. Now, let’s take a closer look at how we create an OpenFaaS function, aptly named “criu-snapshot,” to orchestrate this security dance. This function ensures that when Falco raises an alert, we can swiftly initiate a CRIU checkpoint.

To kick things off, we’ll create a new OpenFaaS function using the Python 3 runtime template. Before starting, make sure you have access to the `faas-cli` command-line tool to interact with OpenFaaS.

Run the following command to generate the skeleton structure of our function:

```shell
faas-cli new --lang python3 criu-snapshot
```

This command generates the initial structure for our function.
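On disk, the generated skeleton should look roughly like this (the template also produces a stack file named after the function):

```
criu-snapshot.yml        # OpenFaaS stack file describing the function
criu-snapshot/
├── handler.py           # function entry point we will edit next
└── requirements.txt     # Python dependencies for the function
```

Since our handler imports the `kubernetes` and `requests` libraries, remember to list them in `criu-snapshot/requirements.txt` so they are installed into the function image:

```
kubernetes
requests
```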

Now, let’s tailor the `handler.py` file within our newly created function to handle the orchestration. Below is the Python code for `handler.py`, which contains comments along the main operations performed.

```python
import json

import requests
import urllib3
from kubernetes import client, config


def handle(req):
    event = json.loads(req)

    # Check if event should trigger checkpoint
    tags = event.get('tags', [])
    if 'trigger-checkpoint' not in tags:
        return

    # Extract required fields from Falco output
    output_fields = event.get('output_fields', {})
    namespace = output_fields.get('k8s.ns.name')
    pod_name = output_fields.get('k8s.pod.name')
    container_id = output_fields.get('container.id')

    if namespace is None or pod_name is None or container_id is None:
        return

    # Load k8s config
    config.load_incluster_config()

    # Create a Kubernetes API client instance
    api_instance = client.CoreV1Api()

    # Get container name given its id
    container_name = get_container_name_by_id(api_instance, pod_name, container_id, namespace)

    # Retrieve IP address of node
    host_ip = get_node_host_ip(api_instance, pod_name, namespace)

    # Access the service account token
    token = read_sa_token()

    # Define headers with the bearer token
    headers = {
        "Authorization": f"Bearer {token}",
    }

    checkpoint_url = "https://{}:10250/checkpoint/{}/{}/{}".format(host_ip, namespace, pod_name, container_name)

    try:
        # Disable SSL certificate warnings
        # NOTE: This is appropriate for test setups only
        urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

        # Make the POST request to the TLS endpoint with headers
        response = requests.post(checkpoint_url, headers=headers, verify=False)

        # Check the response status code
        if response.status_code == 200:
            print("Checkpoint of container '{}' in pod '{}' in namespace '{}' has been created".format(container_name, pod_name, namespace))
            print(response.text)
        else:
            print(f"Request failed with status code: {response.status_code}")
            print("Response content:")
            print(response.text)

    except requests.exceptions.RequestException as e:
        print(f"Request error: {e}")


def read_sa_token():
    token_file_path = '/var/run/secrets/kubernetes.io/serviceaccount/token'

    try:
        with open(token_file_path, 'r') as token_file:
            return token_file.read().strip()
    except Exception as e:
        print(f"Error reading token file: {e}")
        return None


def get_container_name_by_id(api_instance, pod_name, container_id, namespace="default"):
    try:
        # Get the pod's details
        pod = api_instance.read_namespaced_pod(name=pod_name, namespace=namespace)

        # Find the container whose runtime ID matches the one reported by Falco
        for container_status in pod.status.container_statuses:
            if container_status.container_id.startswith(f"cri-o://{container_id}"):
                return container_status.name

    except Exception as e:
        print(f"Error: {e}")
    return None


def get_node_host_ip(api_instance, pod_name, namespace):
    try:
        # Get the pod's details
        pod = api_instance.read_namespaced_pod(name=pod_name, namespace=namespace)
        return pod.status.host_ip
    except Exception as e:
        print(f"Error: {e}")
        return None
```

When a security alert occurs, this function extracts relevant information about the affected container, connects to the Kubelet service, and triggers a checkpoint operation for the container where the security incident occurred. This checkpoint captures the container’s state for further investigation without interrupting its operation.
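To get a feel for the input the handler parses, here is a minimal, hypothetical Falco alert in the same shape as the `json_output` format (all field values below are made-up examples); extracting the fields the way `handle()` does yields the coordinates of the container to checkpoint:

```python
import json

# Hypothetical Falco alert in json_output shape; values are examples only.
sample_alert = json.dumps({
    "rule": "shell_in_container",
    "priority": "Warning",
    "tags": ["trigger-checkpoint"],
    "output_fields": {
        "k8s.ns.name": "default",
        "k8s.pod.name": "webserver",
        "container.id": "c70e9c3d3a15",
    },
})

event = json.loads(sample_alert)
if "trigger-checkpoint" in event["tags"]:
    fields = event["output_fields"]
    target = (fields["k8s.ns.name"], fields["k8s.pod.name"], fields["container.id"])
    print(target)  # → ('default', 'webserver', 'c70e9c3d3a15')
```

These three values are exactly what the function needs to build the kubelet checkpoint URL.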

Deploying the function

With our function code ready, we need to deploy it to our Kubernetes cluster. Before proceeding, ensure that you have an image repository to push this function to (we stored it in DockerHub in our case).

In your `criu-snapshot.yml` file (which was generated using the initial faas-cli command and may differ depending on the name assigned to the function), you’ll need to modify the image section to point to your chosen registry.
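For reference, the relevant part of the stack file looks something like this (the registry account `your-dockerhub-user` and the gateway address are placeholders you must replace with your own values):

```yaml
version: 1.0
provider:
  name: openfaas
  gateway: http://127.0.0.1:8080
functions:
  criu-snapshot:
    lang: python3
    handler: ./criu-snapshot
    image: your-dockerhub-user/criu-snapshot:latest
```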

A last step needs to be performed before deploying our serverless function: we need to install OpenFaaS in our cluster. If you’re new to OpenFaaS, refer to the official documentation for more details.

Once done, we can build and deploy the function using the following command:

```shell
faas-cli up -f criu-snapshot.yml
```

At this point we should see a new OpenFaaS pod being created in our cluster. This pod includes the container that is responsible for running our serverless function.

Configuring permissions for OpenFaaS service account

To allow OpenFaaS functions to call the Kubelet endpoint and initiate actions on the node, we must grant the necessary permissions to the service account used by the OpenFaaS pod. In this section, we’ll walk through the steps to set up the required permissions in your Kubernetes cluster.

Creating a cluster role

First, let’s create a Cluster Role that defines the specific permissions required. Save the following YAML configuration as `openfaas-role.yaml`:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: openfaas-role
rules:
- apiGroups: [""]
  resources: ["nodes/proxy"]
  verbs: ["create"]
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get"]
```

You can create this Cluster Role by applying the configuration using the `kubectl` command:

```shell
kubectl apply -f openfaas-role.yaml
```

Creating a cluster role binding

Now that we have defined the permissions, we need to associate them with the OpenFaaS service account. Create a Cluster Role Binding by saving the following YAML configuration as `openfaas-role-binding.yaml`:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: openfaas-role-binding
subjects:
- kind: ServiceAccount
  name: default
  namespace: openfaas-fn
roleRef:
  kind: ClusterRole
  name: openfaas-role
  apiGroup: rbac.authorization.k8s.io
```

Apply the Cluster Role Binding configuration:

```shell
kubectl apply -f openfaas-role-binding.yaml
```

With these permissions in place, the OpenFaaS service account within the `openfaas-fn` namespace will have the necessary access to interact with the kubelet endpoint and trigger actions on the node.

The serverless function will be accessible from inside the cluster at the endpoint `http://criu-snapshot.openfaas-fn:8080`.

This setup ensures that your OpenFaaS functions can seamlessly communicate with Kubernetes components, facilitating real-time security response capabilities, such as those triggered by Falco alerts and orchestrated by CRIU.

Configuring Falco for HTTP output

To ensure that Falco alerts are seamlessly sent to the HTTP endpoint hosted by our OpenFaaS function, and that the output is properly formatted in JSON, we need to make some essential configurations in the `falco.yaml` file. These configurations ensure that Falco’s alerts are efficiently streamed to our function for real-time processing.

In the `falco.yaml` file, add or modify the following entries:

```yaml
falco:
  http_output:
    enabled: true
    url: "http://criu-snapshot.openfaas-fn:8080"
  json_output: true
```

Since we’ve deployed Falco using the official Helm chart, you can include these configurations in a values file during deployment. This ensures that Falco is configured to interact with our OpenFaaS function seamlessly.

Creating a test rule

To validate the functionality of our security setup without triggering excessive Falco events, we’ve temporarily disabled all default Falco rules. This decision was made to streamline the testing process, as the default rules often generate a high volume of noisy alerts.

Instead, we’ve introduced a simple custom rule that triggers an alert whenever a shell is opened within a container. This test rule allows us to verify the end-to-end functionality of our security pipeline. Moreover, our test rule carries a custom tag, `trigger-checkpoint`, which the OpenFaaS function checks to decide whether a specific alert needs to trigger a checkpoint. The idea is that in a final implementation, only Falco alerts that need to trigger a container checkpoint would carry this tag.

Here’s the configuration for our custom rule:

```yaml
customRules:
  custom-rules.yaml: |-
    - rule: shell_in_container
      desc: notice shell activity within a container
      condition: >
        evt.type = execve and
        evt.dir = < and
        container.id != host and
        (proc.name = bash or
         proc.name = ksh or
         proc.name = sh)
      output: >
        shell in a container
        (user=%user.name container_id=%container.id container_name=%container.name
        shell=%proc.name parent=%proc.pname cmdline=%proc.cmdline)
      priority: WARNING
      tags: [trigger-checkpoint]
```

As with the previous Falco configuration, if you’re using the Falco Helm chart for deployment, you can seamlessly introduce your custom rule via a values file during the Helm deployment process.

Testing the final implementation

Once everything is set up and deployed, we can proceed to test the final implementation of our solution. We will trigger a Falco alert and verify that a CRIU checkpoint has been successfully created and stored on the node’s filesystem.

Here’s how we do it:

1. Deploy a simple Nginx server pod within your Kubernetes cluster.

2. To trigger the Falco alert, open a shell session within the Nginx pod.

3. Observe the Falco logs, where you’ll see the alert triggered by the shell activity:

```
{"hostname":"falco-lllbt","output":"12:15:43.935534493: Warning shell in a container (user=root container_id=c70e9c3d3a15 container_name=<NA>  shell=sh parent=runc cmdline=sh -c command -v bash >/dev/null && exec bash || exec sh) k8s.ns=default k8s.pod=webserver container=c70e9c3d3a15","priority":"Warning","rule":"shell_in_container","source":"syscall","tags":["trigger-checkpoint"],"time":"2023-09-28T12:15:43.935534493Z", "output_fields": {"container.id":"c70e9c3d3a15","container.name":null,"evt.time":1695903343935534493,"k8s.ns.name":"default","k8s.pod.name":"webserver","proc.cmdline":"sh -c command -v bash >/dev/null && exec bash || exec sh","proc.name":"sh","proc.pname":"runc","user.name":"root"}}
Checkpoint of container 'webserver' in pod 'webserver' in namespace 'default' has been created
{"items":["/var/lib/kubelet/checkpoints/checkpoint-webserver_default-webserver-2023-09-28T12:15:46Z.tar"]}
```

4. Check the node’s filesystem, where you’ll find the CRIU checkpoint stored in a designated location:

```
root@lima-k8s-cluster:/home/mirko.linux# ls -la /var/lib/kubelet/checkpoints/checkpoint-webserver_default-webserver-2023-09-28T12:15:46Z.tar
-rw------- 1 root root 8435712 Sep 28 12:15 /var/lib/kubelet/checkpoints/checkpoint-webserver_default-webserver-2023-09-28T12:15:46Z.tar
```

The tar file contains the image dump of the container, capturing all the information necessary for restoring the container in a forensic environment. While this aspect won’t be covered in this blog post, it serves as a topic for the next part of the article.
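As a quick sanity check before that deeper analysis, you can list the archive’s contents directly on the node. Here is a small sketch (the checkpoint directory below is the kubelet default used in our test; run it on the node itself, as elsewhere it simply finds no archives):

```python
import glob
import tarfile

def list_checkpoint_members(checkpoint_dir="/var/lib/kubelet/checkpoints"):
    """Return member names of the newest checkpoint tar, or [] if none exist."""
    archives = sorted(glob.glob(f"{checkpoint_dir}/checkpoint-*.tar"))
    if not archives:
        return []
    with tarfile.open(archives[-1]) as tar:
        return [member.name for member in tar.getmembers()]

# On the node, this prints the CRIU image files and container metadata
# bundled into the newest checkpoint archive.
print(list_checkpoint_members())
```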

This successful test reaffirms the power of our Kubernetes security strategy, combining real-time alerting, orchestrated responses, and non-disruptive forensics to safeguard your containerized applications.

Conclusion

Building an Incident Response Engine in Kubernetes is a practical step to enhance security within containerized environments. This post demonstrated how integrating Falco, CRIU and OpenFaaS can result in a streamlined and effective approach to identifying and responding to security threats.

This approach allows for quick detection of malicious activities through Falco, the ability to take snapshots of the running containers with CRIU, and an automated response mechanism powered by OpenFaaS, providing a basic, yet effective level of security to your Kubernetes clusters.

Implementing such a system requires careful consideration and adjustment to fit your organization’s specific environment and needs. Regularly updating and refining your approach is crucial to adapt to the evolving cybersecurity landscape and to protect your resources effectively. Keep it simple, stay secure!

About Fraktal

Fraktal is a cyber security professional services company. Headquartered in Helsinki, Finland, the company provides consultancy services in product security areas such as software security, firmware security and electronics. For more see www.fraktal.fi.


Cyber Security Consultant at Fraktal Ltd. Based in Turku, Finland.