Centralized Logs With logstash

A while back in one of our daily standup meetings our Chief Architect used the time to stress a topic of importance. These are my notes from that meeting.

Since then we've made sure to treat visibility as a first-class concern. Recently I've made time to experiment with logstash as a means of gaining visibility into all the logs scattered across our distributed system. There is some really valuable information in there. And who likes the headache of figuring out which process on which VM has the details you're looking for? Or the amount of time that can take? Not me.

logstash, combined with Elasticsearch and Kibana 3, can erase those headaches. Apply directly to the infra.

logstash is essentially a pipelining tool. In a basic, centralized installation, a logstash agent known as the shipper reads input from one or more sources and outputs that text, wrapped in a JSON message, to a broker. The broker, typically Redis, caches the messages until another logstash agent, known as the collector, picks them up and sends them to another output. In the common example this output is Elasticsearch, where the messages are indexed and stored for searching. The Elasticsearch store is accessed via the Kibana web application, which lets you visualize and search through the logs. The entire system is scalable: many shippers can run on many different hosts, watching log files and shipping messages off to a cluster of brokers, while many collectors read those messages and write them to an Elasticsearch cluster.
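Here's a minimal sketch of that shipper/broker/collector pairing as two logstash config files. The file path, hostnames, and Redis key are placeholders I've invented for illustration, not values from any real deployment:

```
# shipper.conf -- runs on each application host
input {
  file {
    # watch the application's log files (path is a placeholder)
    path => "/var/log/app/*.log"
    type => "app"
  }
}
output {
  redis {
    # push each JSON event onto a Redis list acting as the broker
    host => "broker.example.com"
    data_type => "list"
    key => "logstash"
  }
}

# collector.conf -- runs near the Elasticsearch cluster
input {
  redis {
    # pop events off the same Redis list the shippers feed
    host => "broker.example.com"
    data_type => "list"
    key => "logstash"
  }
}
output {
  elasticsearch {
    host => "es.example.com"
  }
}
```

Scaling out is then mostly a matter of running more shippers and collectors against the same broker.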

Filters make it possible to transform the logs as they move through the pipeline, on either the shipper or the collector, whichever suits your needs better. As an example, an Apache HTTP log entry can have each element (request, response code, response size, etc.) parsed out into individual fields so they can be searched more seamlessly. Information can be dropped if it isn't important. Sensitive data can be masked. Messages can be tagged. The list goes on.
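For the Apache example, a grok filter with the stock COMBINEDAPACHELOG pattern does that field extraction. A sketch, with a mutate step thrown in purely to illustrate masking (the `clientip` field is one of the fields grok produces from that pattern):

```
filter {
  grok {
    # split an Apache access-log line into individual fields such as
    # request, response, bytes, referrer, and agent
    match => [ "message", "%{COMBINEDAPACHELOG}" ]
  }
  mutate {
    # example of masking sensitive data: overwrite the client IP
    replace => [ "clientip", "MASKED" ]
  }
}
```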

And it isn't limited to the File -> Shipper -> Broker -> Elasticsearch example. logstash ships with over 30 input plugins. Input can be read from files, Log4j sockets, TCP, UDP, Syslog, STOMP, even IRC and Twitter. Output can be sent to a slightly longer list including Elasticsearch, email, HTTP, Mongo, and even JIRA.
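So a shipper reading from, say, Syslog and IRC instead of files is just a different input block. The port, host, channel, and nick below are all invented placeholders:

```
input {
  syslog {
    # listen for syslog traffic on a non-privileged port
    port => 5514
  }
  irc {
    # sit in an ops channel and turn chat into events
    host => "irc.example.com"
    channels => [ "#ops" ]
    nick => "logstash-bot"
  }
}
```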

logstash is an incredibly powerful tool. It's open source and distributed under the Apache 2 license. It can meet large or small scaling needs. The entire stack mentioned above can even be run stand-alone, which can be nice for demos or, I imagine, even for managing logs on a developer workstation.
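In the monolithic-jar releases of that era, standing the whole stack up stand-alone was, as I recall, a single command along these lines; the jar and config file names are placeholders for whatever release you download:

```
java -jar logstash.jar agent -f logstash.conf -- web
```

The `agent` part runs the pipeline from your config, and `-- web` also starts the embedded web interface for browsing the results.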

Many excellent setup walk-throughs are available. My preferred ones are Centralized Setup with Event Parsing and the one in The Logstash Book.