How sleep deprivation reduced our logging stack costs

Apparently, if our infrastructure team leader’s pregnant wife can’t sleep, neither should he. Little did he know it would lead to an extremely productive night.

 

As all good stories do, this one starts late on a Friday night.
All day I had been thinking about how to reduce the cost of our logging stack and, at the same time, improve the system.

 

Just to clarify, our previous stack included:

  • Redshift—the log data warehouse
  • A custom self-made web dashboard—for viewing the logs
  • SQS—asynchronous log-processing queue
  • S3—stores the log body and its stack trace
  • EC2 instances × 5—read messages from SQS and save the metadata (log level, short description, timestamp, and more) to Redshift and the error body to S3.

 

So at 3am I launched a Linux EC2 instance and installed the classic logging architecture: Logstash, Elasticsearch, and Kibana, also known as ELK, plus a few “wiring” configurations. An hour later I had a working logging system.

 

Now I only had to figure out how to recreate the architecture at scale for a production environment, while staying as “AWS managed” as possible with minimum custom code. So I combined the following services:

  • AWS Elasticsearch Service—managed Elasticsearch that comes with Kibana built in
  • Kinesis Firehose—a Logstash replacement that lets the flow stay asynchronous and has built-in support for Elasticsearch (and a few more tricks).
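In this architecture the application side just serializes each log entry as JSON and hands it to Firehose. A minimal sketch of a producer follows; the field names and the stream name are illustrative assumptions, not the exact ones we used:

```python
import json
import time

def make_log_record(level, message):
    """Build a JSON log record for Kinesis Firehose.

    Field names are illustrative; "timestamp" holds epoch
    milliseconds, which matters later for the Kibana date mapping.
    """
    record = {
        "loglevel": level,
        "message": message,
        "timestamp": int(time.time() * 1000),  # epoch ms
    }
    # Firehose delivers raw bytes; newline-delimited JSON keeps the
    # downstream Elasticsearch bulk inserts well separated.
    return (json.dumps(record) + "\n").encode("utf-8")

# Delivery would then be a single call, e.g. with boto3 (not run here;
# "logs-stream" is a placeholder name):
#   boto3.client("firehose").put_record(
#       DeliveryStreamName="logs-stream",
#       Record={"Data": make_log_record("ERROR", "something broke")})
```

Because Firehose buffers and batches on its own, the producer stays fire-and-forget, which is what keeps the flow asynchronous.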

 

A few minutes later I had a running EKK stack that was fully AWS managed, but I found that I needed to solve a few issues before it was fully production ready:

  1. Firehose doesn’t support datetime field mapping. As a result, Kibana did not display dates and couldn’t run timestamp queries.
  2. Retiring old data in Elasticsearch (TTL)
  3. Login authentication for Kibana
  4. Alerts in case something fails in the stack
  5. Analyzing logs and creating custom alerts.

 

That’s when the fun part started…

  1. Datetime field mapping—the fix is simple. I added a field named “timestamp” to the JSON, containing the value in epoch milliseconds, and defined a template in Elasticsearch that applies to all indexes named “logs-*” and maps the “timestamp” field as a date in epoch-millis format. Problem solved.
  2. Elasticsearch old-data retirement—Firehose inserts the data into Elasticsearch using a daily index with the name pattern {indexname}-yyyy-MM-dd, for example logs-2017-06-13. To keep N days of data, all I needed to do was delete indexes older than N days. That can be done with a simple Lambda that sends an HTTP request to delete the retired index (method: DELETE, URL: {elasticsearch-url}/{full-index-name}), plus a daily CloudWatch trigger rule that runs the Lambda, and the data is retired as wanted.
  3. Kibana login authentication—I launched a Linux EC2 instance, installed nginx, and created a user. I defined a reverse proxy so that every request reaching nginx is forwarded to the Kibana URL, but only after nginx asks for basic authentication. Next I assigned an Elastic IP to the instance and added that Elastic IP to the Elasticsearch access policy.
  4. Basic alerts—defined using Firehose monitoring plus an S3 bucket alert that fires whenever Firehose creates a failure file (Firehose dumps log files to S3 when it fails to insert into Elasticsearch).
  5. Analyzing logs and custom alerts—added Kinesis Analytics to analyze the stream in real time and defined alerts based on the analysis.
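The Elasticsearch template from fix #1 can be sketched roughly as follows. This assumes Elasticsearch 5.x-era template syntax, and the mapping type name (“log”) is a placeholder:

```python
import json

# Index template: every index matching "logs-*" gets a "timestamp"
# field mapped as a date stored in epoch milliseconds, so Kibana can
# display dates and run time-range queries. Names are illustrative.
TEMPLATE = {
    "template": "logs-*",
    "mappings": {
        "log": {
            "properties": {
                "timestamp": {"type": "date", "format": "epoch_millis"}
            }
        }
    },
}

# Installing it is one PUT against the cluster (not executed here):
#   PUT {elasticsearch-url}/_template/logs_template
#   with body json.dumps(TEMPLATE)
```

Because the template is registered once up front, every daily index Firehose creates afterwards picks up the mapping automatically.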

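The retention Lambda from fix #2 can be sketched like this. The endpoint, index prefix, and retention window below are assumptions, not the real values:

```python
import urllib.request
from datetime import date, timedelta

ES_URL = "https://my-es-domain.example.com"  # placeholder endpoint
RETENTION_DAYS = 14  # illustrative retention window

def retired_index_name(prefix, days, today=None):
    """Name of the daily index that falls out of the retention window,
    matching Firehose's {indexname}-yyyy-MM-dd naming pattern."""
    today = today or date.today()
    return "{}-{:%Y-%m-%d}".format(prefix, today - timedelta(days=days))

def lambda_handler(event, context):
    # A daily CloudWatch rule triggers this handler, so deleting the
    # single index that just aged out is enough per run.
    url = "{}/{}".format(ES_URL, retired_index_name("logs", RETENTION_DAYS))
    req = urllib.request.Request(url, method="DELETE")
    with urllib.request.urlopen(req) as resp:
        return resp.status
```

For example, with a 14-day window, a run on 2017-06-27 would DELETE the index logs-2017-06-13.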
 
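For the reverse proxy in fix #3, a minimal nginx configuration looks roughly like this; the hostname and the htpasswd path are placeholders, and AWS Elasticsearch serves Kibana under the /_plugin/kibana/ path:

```nginx
server {
    listen 80;

    location / {
        # Require basic authentication before anything reaches Kibana;
        # the user file is created with the htpasswd tool.
        auth_basic           "Kibana";
        auth_basic_user_file /etc/nginx/.htpasswd;

        # Forward authenticated requests to the Kibana endpoint
        # (placeholder hostname).
        proxy_pass https://my-es-domain.example.com/_plugin/kibana/;
    }
}
```

Combined with the Elastic IP whitelisted in the Elasticsearch access policy, nginx becomes the only path into Kibana.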

That’s it for now… everything described above got me a scalable logging system that is fully managed by AWS. To conclude, here’s a diagram of the system:

[System architecture diagram]

Nadav Steiner

Infrastructure Team Leader
