Scalable logging infrastructure

For any Ops team, one of the goals is to deploy services with as little operational overhead as possible. This means designing infrastructure that is resilient and self-healing. Recently I had an opportunity to apply these principles to part of $EMPLOYER's core internal services: our logging stack.

The first part of the task was to identify what properties we required of this system:

  • Scalable
    • Being logging infrastructure, this system needed to support high volumes of ingested data as well as queries over data at rest.
  • Highly Available
    • We should at least be able to queue logs for later ingestion if some part of the system isn't working.
  • Resilient storage
    • Some logs, such as security logs, may need to be kept for long periods of time in a secure manner.
  • Low operational overhead
    • Though we need this system to be available, we don't want to spend too much time keeping it running.
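To make the "queue logs for later ingestion" requirement concrete, here's a rough sketch of the idea (my own illustration, not our actual shipper): records are buffered locally and handed off in batches, so a downstream outage doesn't immediately drop logs.

```go
package main

import "fmt"

// logBuffer queues records locally so that a downstream outage
// (e.g. the ingest stream being unavailable) doesn't drop logs.
type logBuffer struct {
	pending []string
	max     int
}

// add queues a record, reporting false if the buffer is full
// (at which point a real shipper would spill to disk or shed load).
func (b *logBuffer) add(rec string) bool {
	if len(b.pending) >= b.max {
		return false
	}
	b.pending = append(b.pending, rec)
	return true
}

// drain removes and returns up to n queued records for a batch send;
// the caller can re-add them if the send fails.
func (b *logBuffer) drain(n int) []string {
	if n > len(b.pending) {
		n = len(b.pending)
	}
	batch := b.pending[:n]
	b.pending = b.pending[n:]
	return batch
}

func main() {
	b := &logBuffer{max: 1000}
	b.add("app-1: started")
	b.add("app-1: listening on :8080")
	fmt.Println(b.drain(10))
}
```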

This is what we ended up with:

This gives us all of the features we were looking for. Within reasonable limits, all of these components scale to very large data volumes. By running the ECS cluster on top of an ASG spread across AZs, we get a high degree of availability for all components. S3, as we all know, has a somewhat absurd number of nines of data durability, and we can leverage versioning, access logs, and encryption at rest to ensure data integrity. Kinesis was the newest piece to me: highly scalable and fast, I think it will be core to similar infrastructures for me in the future.

All of this was of course rolled up into a reusable Terraform module, which allows our team to roll this out to different environments in a reproducible manner.

Using Falco to secure Docker containers

I’ve been looking for an excuse to use Sysdig since I first heard about it at a BSides BOS conference a couple of years ago. That came to a head recently at DockerCon in Seattle.

While there I had a chance to look at the new Sysdig monitoring for Docker, which is pretty damn cool. The more interesting piece, though, was their newly open-sourced tool Falco. Falco, in a nutshell, lets you create rules that will monitor for and alert on basically anything happening on your Linux system. So now I needed to find a reason to use it.

This brings us back to Docker. The security of Docker is something that is very interesting to me. One feature which I’ve looked at before, but always had a hard time figuring out how to use well, is seccomp. Seccomp profiles with Docker allow you to limit which syscalls your container is allowed to make to the underlying system. By default, Docker disables about 40 of the roughly 300 available syscalls. That’s pretty good, but let’s see if we can leverage Falco to do better.

Let’s start with Falco. Falco is container-aware, which means creating a rule is really simple.

- rule: container_syscall
  desc: Capture syscalls for any docker container
  priority: WARNING
  condition: container.id != host and syscall.type exists
  output: "%container.id:%syscall.type"

This rule simply states: if you see a container making a syscall, log the container’s ID and the syscall being made. With that rule in place, and Falco configured to log JSON, we can start the Falco daemon; running an Nginx Docker container then gives us the following output:

{"output":"15:07:59.736531417: Warning 7a68a113b2a4:clone","priority":"Warning","rule":"container_syscall","time":"2016-07-01T19:07:59.736531417Z"}
{"output":"15:07:59.736536327: Warning 7a68a113b2a4:set_robust_list","priority":"Warning","rule":"container_syscall","time":"2016-07-01T19:07:59.736536327Z"}
{"output":"15:07:59.736536539: Warning 7a68a113b2a4:set_robust_list","priority":"Warning","rule":"container_syscall","time":"2016-07-01T19:07:59.736536539Z"}

That’s awesome! We now have a log of all the syscalls being made by that Docker container. So what do we do with this?
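That JSON output is easy to post-process. As a rough sketch (the field names match the sample output above, but this parsing code is my own illustration, not part of Falco or falco2seccomp), here's how you might pull the container ID and syscall out of each event in Go:

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// falcoEvent mirrors the one field of Falco's JSON output we care about.
type falcoEvent struct {
	Output string `json:"output"`
}

// parseSyscall extracts the "container:syscall" pair from an event whose
// output string looks like "15:07:59.736531417: Warning 7a68a113b2a4:clone".
func parseSyscall(line string) (containerID, syscall string, ok bool) {
	var ev falcoEvent
	if err := json.Unmarshal([]byte(line), &ev); err != nil {
		return "", "", false
	}
	// The pair formatted by our rule is the last space-separated field.
	fields := strings.Fields(ev.Output)
	if len(fields) == 0 {
		return "", "", false
	}
	parts := strings.SplitN(fields[len(fields)-1], ":", 2)
	if len(parts) != 2 {
		return "", "", false
	}
	return parts[0], parts[1], true
}

func main() {
	line := `{"output":"15:07:59.736531417: Warning 7a68a113b2a4:clone","priority":"Warning","rule":"container_syscall","time":"2016-07-01T19:07:59.736531417Z"}`
	id, sc, ok := parseSyscall(line)
	fmt.Println(id, sc, ok) // 7a68a113b2a4 clone true
}
```

Collecting the unique syscall names per container ID across a whole log is then just a map.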

To help with this task I’ve released falco2seccomp. This is a pretty simple Go project that parses the output from Falco and generates a ready-to-go seccomp profile. Here’s an example:

falco2seccomp -log events.log -container-id 7a68a113b2a4
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "architectures": [
    "SCMP_ARCH_X86_64",
    "SCMP_ARCH_X86"
  ],
  "syscalls": [
    {
      "name": "set_robust_list",
      "action": "SCMP_ACT_ALLOW",
      "args": []
    },
    {
      "name": "gettid",
      "action": "SCMP_ACT_ALLOW",
      "args": []
    },
    ...
  ]
}

This is a ready-to-use, out-of-the-box seccomp profile for our Nginx container, limited to just the syscalls we actually saw the container make. Instead of blacklisting 40 syscalls, we’re allowing only 41. Again, awesome.
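If you're curious what generating a profile like this involves, here's a simplified sketch of the core idea (my own illustration of the technique, not the actual falco2seccomp source): take the set of observed syscall names and emit a whitelist profile where everything else falls through to SCMP_ACT_ERRNO.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// These structs mirror the Docker seccomp profile format shown above.
type seccompProfile struct {
	DefaultAction string        `json:"defaultAction"`
	Architectures []string      `json:"architectures"`
	Syscalls      []syscallRule `json:"syscalls"`
}

type syscallRule struct {
	Name   string   `json:"name"`
	Action string   `json:"action"`
	Args   []string `json:"args"`
}

// buildProfile turns a set of observed syscall names into a whitelist
// profile: anything not seen is denied by the default action.
func buildProfile(seen map[string]bool) seccompProfile {
	p := seccompProfile{
		DefaultAction: "SCMP_ACT_ERRNO",
		Architectures: []string{"SCMP_ARCH_X86_64", "SCMP_ARCH_X86"},
	}
	for name := range seen {
		p.Syscalls = append(p.Syscalls, syscallRule{
			Name:   name,
			Action: "SCMP_ACT_ALLOW",
			Args:   []string{},
		})
	}
	return p
}

func main() {
	seen := map[string]bool{"set_robust_list": true, "gettid": true}
	out, _ := json.MarshalIndent(buildProfile(seen), "", "  ")
	fmt.Println(string(out))
}
```

Once written to a file, the profile is applied with Docker’s standard flag, e.g. docker run --security-opt seccomp=nginx-profile.json nginx.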

So how can you use this in the real world? One thought would be to integrate it into a CI/CD pipeline: if your builds are done in a Docker container, your tests should exercise your code enough to generate the full list of required syscalls. Boom.