Twelve-Factor Apps, Sidecars, and Big Data

Heroku’s twelve-factor methodology is a way of designing web applications so that they are portable across environments and across cloud providers. Sidecars are a design pattern that makes applications more portable within your own infrastructure. Both can be applied to the big data world, and there is a lot we can learn from them.

Historically in the big data world we have run a large number of long-running backing services which collectively we call Hadoop. These services provide execution frameworks, SQL engines, in-memory data stores, and file stores. They are generally run by an operations team, along with gateway hosts that have client configuration in place for connecting to them. A separate development team writes short-running applications which connect to these backing services and reshape data into a form that data analysts can use. The core Hadoop services are easy to find because they have standard client configurations, but many of the newer applications and non-core services have no easy-to-find entry point, and so there is a disconnect between the development and operations teams.

The shift to cloud, serverless, and streaming-first architectures is changing the way we work, but this disconnect between developers and operations teams remains. In fact, it is getting worse. With containers and serverless applications we no longer have gateway servers provided with the required configuration in place, so developers now have to maintain the configuration pointing at services that another team provides.

By taking on board the lessons from both twelve-factor apps and the sidecar pattern, we can improve the way we work, making things easier and less frustrating for everybody.

The sidecar pattern is relatively new, and fits well with self-managed infrastructure and good service discovery tooling and patterns. With the sidecar pattern, we build our applications to run in containers, and alongside each application container we run a separate container providing a proxy that is aware of the service discovery systems we have in place. Any outbound connection from the application goes to the proxy alongside it, which routes the traffic to the actual service. For this to work, we need the operations and infrastructure teams to provide good service discovery and to register all of their backing services in it; this is no small ask, and if done badly it will make it harder to move between cloud providers or container platforms. Nevertheless, when done well at scale across the entire company this is a nice pattern for building applications.
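
As a minimal sketch, assume an HTTP sidecar proxy listening on a local port (the port, route, and service name here are illustrative, not a fixed convention). The application only ever addresses localhost; the proxy resolves the real service location through service discovery:

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

// All outbound traffic goes to the sidecar on localhost; the proxy, not
// the app, knows where "schema-registry" actually lives. The port and
// route are assumptions for illustration.
val request = HttpRequest.newBuilder()
  .uri(URI.create("http://localhost:15001/schema-registry/subjects"))
  .build()
val response = HttpClient.newHttpClient()
  .send(request, HttpResponse.BodyHandlers.ofString())
println(response.statusCode())
```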

The twelve-factor methodology also requires the support of operations and infrastructure teams. However, there is less overhead, as no sidecar containers are required and it is up to the individual teams how they provide services which adhere to the twelve factors. There is also more onus on developers to adhere to certain standards in how applications are built.

Let’s run through these twelve factors and how they might apply to big data applications.

  1. One codebase tracked in revision control, many deploys - there should be a one-to-one mapping between codebases and apps, and each codebase can be deployed in many locations. Immediately the definition of an app comes into question in the big data world: are multiple pipelines a single application? We have traditionally used a large monorepo with all data pipelines in a single repository, and whilst this works well at small scale, we need to rethink how we separate pipelines at larger scale. This is even more true when we start mixing different styles of applications in a single repository. With the move towards streaming applications and microservices rather than data pipelines, it is becoming clearer where the boundaries of applications lie.

  2. Explicitly declare and isolate dependencies - i.e. do not rely on system libraries being in place, but declare all dependencies within your application. This has been best practice for a long time, yet it is at odds with some of the commercial Hadoop distributions, where you are expected to use their provided versions of Spark in order to benefit from their data lineage and security features. That aside, whilst some data engineers prefer all required libraries to be made available for them, we see this as an anti-pattern and require all applications to declare their dependencies. This is especially important on multi-tenant Hadoop clusters, where we cannot expect operations teams to provide every library that every team wants to use. The increasing use of containers for running microservices forces us into declaring dependencies, and we are seeing the benefits of this.
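
As an example, a minimal build.sbt might declare everything the application needs, reserving Provided scope for what the cluster runtime genuinely supplies (the versions are illustrative):

```scala
// build.sbt - declare every dependency the application needs rather than
// relying on libraries pre-installed on the cluster. Versions are
// illustrative, not recommendations.
libraryDependencies ++= Seq(
  // supplied by the cluster at runtime, but still declared for the build
  "org.apache.spark" %% "spark-sql"    % "2.4.8" % Provided,
  // bundled with the application itself
  "org.apache.hbase" %  "hbase-client" % "2.4.17",
  "com.typesafe"     %  "config"       % "1.4.2"
)
```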

  3. Store config in the environment - storing shared configuration in application codebases is a violation of twelve-factor and makes it difficult to move between environments and datacenters. This also links to the first factor of having one codebase per app: if we duplicate configuration in each repository, it becomes hard to change. We should not be storing hostnames or passwords in application codebases. We have previously relied on operations teams to provide configuration files on gateway hosts, stored in predictable locations in /etc. However, this requires the operations teams to maintain configuration that they do not care about, and it can be slow to change these files if change management processes apply. Microservices in containers allow application developers to deal with configuration themselves without involving other teams; however, it is important that we also provide service discovery and secret management services, otherwise we will quickly end up in a mess.
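
A minimal sketch in Scala, reading deploy-specific settings from the environment (the variable names are assumptions for illustration):

```scala
// Deploy-specific settings come from the environment, not the codebase.
val kafkaBrokers = sys.env.getOrElse("KAFKA_BROKERS", "localhost:9092")
val hbaseQuorum  = sys.env.getOrElse("HBASE_ZOOKEEPER_QUORUM", "localhost")
val dbPassword   = sys.env("DATABASE_PASSWORD") // fail fast if missing
```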

  4. Treat backing services as attached resources - there should be no code changes required to replace a failed database server with a new one, and the code should not care where that database server runs. For core Hadoop services this means using the system-provided configuration files rather than your own; for other services it means reading the locations of services from environment variables. This is closely linked to factor three: if we do not store configuration in the codebase, we are not tying our applications to specific database instances, which gives the database administrators more freedom to move their services without affecting our apps. We need good service discovery tools to adhere to this.
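
For example, the application can hold only a handle to the resource, so replacing a failed database is a deploy-time change rather than a code change (DATABASE_URL is an assumed variable name):

```scala
import java.sql.{Connection, DriverManager}

// The app holds a handle, not a hard-coded host: swapping a failed
// database for a new one means updating DATABASE_URL in the deploy.
val connection: Connection =
  DriverManager.getConnection(sys.env("DATABASE_URL"))
```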

  5. Strictly separate build, release and run stages - in the case of Java and Scala applications, the fact that these are compiled languages means we inherently require a build step, and it has been best practice for a long time to store artifacts in an artifact repository like Nexus or Artifactory. This factor cements that practice and makes it a requirement for all applications, not just compiled ones. Big data applications are no different from any other application in this regard. It is important to note that the build step should not target a specific environment; environment-specific configuration should only be applied at the release stage.
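
In sbt terms, the build stage publishes an immutable, environment-agnostic artifact, and the release stage later pairs a specific version with deploy-time configuration (the repository URL below is a placeholder):

```scala
// build.sbt - the build stage produces a versioned, environment-agnostic
// artifact; environment-specific config is applied later, at release.
version   := "1.3.0" // released artifacts are immutable, never overwritten
publishTo := Some("releases" at "https://nexus.example.com/repository/releases")
```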

  6. Execute the app as one or more stateless processes - a twelve-factor app never assumes that anything will be cached on disk or in memory for subsequent transactions. Whilst this makes development much easier, in the case of high-throughput streaming applications we often violate this rule in practice by storing local state such as lookup caches and running counts. However, where we can avoid storing local state we should do so.
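
A minimal sketch, assuming a hypothetical StateStore interface backed by HBase, Redis, or similar: running counts live in the external store, so any instance can be stopped and replaced without losing them:

```scala
// StateStore is a hypothetical interface over an external store such as
// HBase or Redis; it is not a real library API.
trait StateStore {
  def increment(key: String, by: Long): Long
}

// No local cache: the count survives restarts and rebalances, and any
// process instance can handle any event.
def recordEvent(store: StateStore, userId: String): Long =
  store.increment(s"events:$userId", 1L)
```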

  7. Export services via port binding - the twelve-factor app is self-contained; if we need to expose anything on the network, we should do so directly by binding to a port rather than relying on an external web server such as Tomcat. In the modern world of microservices this should be standard procedure anyway, although we frequently expose data by pushing it onto message buses like Kafka rather than exposing it directly.
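
As a sketch of a self-contained service, the JDK's built-in HTTP server is enough to bind a port directly, with no external container (the PORT variable and /health route are assumptions):

```scala
import java.net.InetSocketAddress
import com.sun.net.httpserver.{HttpExchange, HttpHandler, HttpServer}

// Self-contained service: bind a port directly rather than deploying
// into an external web server such as Tomcat.
object HealthEndpoint extends App {
  val port   = sys.env.getOrElse("PORT", "8080").toInt
  val server = HttpServer.create(new InetSocketAddress(port), 0)
  server.createContext("/health", new HttpHandler {
    def handle(exchange: HttpExchange): Unit = {
      val body = "ok".getBytes("UTF-8")
      exchange.sendResponseHeaders(200, body.length.toLong)
      exchange.getResponseBody.write(body)
      exchange.close()
    }
  })
  server.start()
}
```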

  8. Concurrency - scale out by adding additional processes. For real-time streaming applications this is relatively easy, assuming the incoming data is sharded somehow. For more batch-oriented applications we cannot always run separate application processes, but in the case of Spark and YARN applications this factor translates into running more containers rather than larger containers. Keeping our containers small makes us good citizens, and we can use resource pools to control whether we want jobs to complete quickly with lots of containers or take longer but leave more resources for others.
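
In Spark, for instance, this might translate into asking for more, smaller executors (the numbers below are illustrative, not a recommendation):

```scala
import org.apache.spark.sql.SparkSession

// Scale out with more, smaller executors rather than fewer large ones.
val spark = SparkSession.builder()
  .appName("many-small-containers")
  .config("spark.executor.instances", "20") // more processes
  .config("spark.executor.cores", "2")      // each kept small
  .config("spark.executor.memory", "2g")
  .getOrCreate()
```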

  9. Maximise robustness with fast startup and graceful shutdown - applications should be disposable, able to be stopped and started at a moment's notice. Hadoop is designed so that we can stop individual worker servers and applications should continue running, yet so often our own code does not handle this well. Our use of Java and our preference for short-lived batch applications further hold us back in this regard. However, through defensive programming and storing state in HBase or similar we can do a lot better. Newer streaming technologies like Kafka that support exactly-once semantics, and the frameworks starting to take advantage of this, will make a big difference in our ability to adhere to this factor in the future.
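
A minimal sketch of graceful shutdown in a long-running worker, using a JVM shutdown hook to stop the main loop cleanly on SIGTERM:

```scala
object StreamWorker extends App {
  @volatile private var running = true

  // On SIGTERM, stop accepting work so the current batch can finish
  // and state can be flushed before the process exits.
  sys.addShutdownHook {
    running = false
  }

  while (running) {
    // poll, process, commit offsets...
    Thread.sleep(100)
  }
}
```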

  10. Dev/prod parity - keep development, staging and production as similar as possible. Because of the difficulty of running large development and test Hadoop environments, many organisations simply don't, and instead run their test and development applications on production clusters. Whilst this mostly solves the problem of keeping dev and prod in parity, it goes against many other recommendations, even if adequate security is in place. More recently, the introduction of Spark and the move to streaming-first microservices are making it easier to run development environments locally, but we need to take care to follow the other factors to help us maintain parity between these environments.

  11. Treat logs as event streams - a twelve-factor app should never concern itself with the routing of its logs. Rather than writing to log files, we should write to stdout and let the execution environment capture and route the stream. For Spark and YARN applications running in distributed mode we already have job history servers that do a good job of collecting logs and metrics, but we need the correct configuration in place to use them. For standalone microservices we typically use containers and rely on the container orchestration service to manage logs. Where we run standalone applications outside containers we often do a poor job of log management, relying on log4j or syslog; these are far from ideal, but can be made to work if we maintain them well.
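
A sketch of the simplest possible approach: emit structured log lines to stdout and leave routing and storage to the platform (the line format here is an arbitrary choice):

```scala
// Write log events to stdout; routing and storage are someone else's job.
def logEvent(level: String, message: String): Unit =
  println(s"""{"ts":"${java.time.Instant.now}","level":"$level","msg":"$message"}""")

logEvent("INFO", "pipeline started")
```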

  12. Admin processes - these should follow all twelve factors, just like the main app. They should live in the same codebase as the app, and specific released versions should be run in an identical environment with the same dependency management. This guarantees repeatability and reduces surprises.

We’ve mentioned or alluded to service discovery multiple times in this post, and if you take one thing away it should be this: introducing service discovery early on will save you from many issues in the future. Introducing Consul for service discovery and Vault for secret management should be relatively trivial for a team capable of running a big data platform, and is something your operations team and developers should investigate together to save yourselves from future problems.
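
As a closing sketch, querying Consul's HTTP API for the healthy instances of a backing service needs nothing more than the JDK ("hbase-master" is an example service name; CONSUL_HTTP_ADDR is the address variable Consul's own tooling conventionally uses):

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

// Ask Consul for the healthy instances of a service instead of storing
// its location in the codebase.
val consul = sys.env.getOrElse("CONSUL_HTTP_ADDR", "http://localhost:8500")
val request = HttpRequest.newBuilder()
  .uri(URI.create(s"$consul/v1/health/service/hbase-master?passing=true"))
  .build()
val response = HttpClient.newHttpClient()
  .send(request, HttpResponse.BodyHandlers.ofString())
println(response.body()) // JSON array of healthy instances with addresses
```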