What are the different OTP supervisor restart strategies useful for?

April 28th, 2015

OTP supervisors can restart their children using one of four strategies. To better understand which one to use in a given scenario, I wanted to summarise a few patterns I’ve seen.

One for one

You will use this strategy at least once in every OTP application. When a child dies, only that child is restarted. It’s how you set up supervisor hierarchies and start the workers and supervisors that don’t need to be pooled or started dynamically.
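As a minimal sketch of the strategy (not taken from phl.ink), the supervisor below starts two independent workers. The module names and the :counter_a and :counter_b process names are made up for illustration, and it assumes the child-spec API from newer Elixir releases (Supervisor.init/2 and Supervisor.child_spec/2) rather than the worker/supervise helpers used elsewhere in this post.

```elixir
defmodule OneForOneSketch.Counter do
  # Hypothetical worker: an Agent holding a counter, registered under `name`.
  use Agent

  def start_link(name) do
    Agent.start_link(fn -> 0 end, name: name)
  end
end

defmodule OneForOneSketch.Sup do
  use Supervisor

  def start_link(_opts \\ []) do
    Supervisor.start_link(__MODULE__, [], name: __MODULE__)
  end

  @impl true
  def init([]) do
    # Two children from the same module need distinct ids.
    children = [
      Supervisor.child_spec({OneForOneSketch.Counter, :counter_a}, id: :counter_a),
      Supervisor.child_spec({OneForOneSketch.Counter, :counter_b}, id: :counter_b)
    ]

    # Only the child that died is restarted; its siblings are left alone.
    Supervisor.init(children, strategy: :one_for_one)
  end
end
```

After OneForOneSketch.Sup.start_link(), killing :counter_a with Process.exit(Process.whereis(:counter_a), :kill) should give :counter_a a new pid while :counter_b keeps the one it had.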

Simple one for one

You use this when you need to create and supervise processes of the same type. You can create a pool of processes when the supervisor starts, or have them created on demand when something happens, e.g. when accepting a new TCP connection.

You can see this in action in the cache in phl.ink, a simple Phoenix URL shortener I made.

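defmodule Phlink.Cache.UrlCacheSupervisor do
  use Supervisor

  def start_link do
    # Register under the module name so start_child/1 can find the supervisor.
    Supervisor.start_link(__MODULE__, [], name: __MODULE__)
  end

  # Starts one UrlCache child; [url] is appended to the template's arguments.
  def start_child(url) do
    Supervisor.start_child(__MODULE__, [url])
  end

  def init([]) do
    # A template for the children, not a list of children to start right away.
    children = [
      worker(Phlink.Cache.UrlCache, [])
    ]

    supervise(children, strategy: :simple_one_for_one)
  end
end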

(source)

The children list in the init/1 function is a template for the child processes to create. The supervisor won’t actually start any children until you tell it to. When the Mapper process needs to cache the URL associated with a short code, it calls start_child(url), which returns the pid of the started process. The Mapper stores this pid in a dictionary so it can avoid a database lookup the next time the short code is requested.
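To make the Mapper side of this concrete, here is a rough sketch of the idea. It is not the phl.ink code: the module names and the lookup_in_db/1 helper are hypothetical, and it assumes Elixir 1.6 or later, where DynamicSupervisor covers the on-demand use case that :simple_one_for_one serves above.

```elixir
defmodule CacheSketch.UrlCache do
  # Hypothetical cache process: an Agent holding a single URL.
  use Agent

  def start_link(url), do: Agent.start_link(fn -> url end)
  def url(pid), do: Agent.get(pid, & &1)
end

defmodule CacheSketch.UrlCacheSupervisor do
  use DynamicSupervisor

  def start_link(_opts \\ []) do
    DynamicSupervisor.start_link(__MODULE__, [], name: __MODULE__)
  end

  # Starts one cache process for `url` and returns {:ok, pid}.
  def start_child(url) do
    DynamicSupervisor.start_child(__MODULE__, {CacheSketch.UrlCache, url})
  end

  @impl true
  def init([]), do: DynamicSupervisor.init(strategy: :one_for_one)
end

defmodule CacheSketch.Mapper do
  use GenServer

  def start_link(_opts \\ []) do
    GenServer.start_link(__MODULE__, %{}, name: __MODULE__)
  end

  def get_url(shortcode), do: GenServer.call(__MODULE__, {:get_url, shortcode})

  @impl true
  def init(pids), do: {:ok, pids}

  @impl true
  def handle_call({:get_url, shortcode}, _from, pids) do
    case Map.fetch(pids, shortcode) do
      {:ok, pid} ->
        # Cache hit: ask the cache process, no database lookup.
        {:reply, CacheSketch.UrlCache.url(pid), pids}

      :error ->
        # Cache miss: look the URL up, start a cache process, remember its pid.
        url = lookup_in_db(shortcode)
        {:ok, pid} = CacheSketch.UrlCacheSupervisor.start_child(url)
        {:reply, url, Map.put(pids, shortcode, pid)}
    end
  end

  # Stand-in for a real database query (assumption for this sketch).
  defp lookup_in_db(shortcode), do: "http://example.com/" <> shortcode
end
```

With both processes started, calling CacheSketch.Mapper.get_url("abc") twice should start only one cache process. The sketch ignores what happens when a cache process dies and the stored pid goes stale; a real Mapper would need to monitor its children or look them up another way.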

:simple_one_for_one and :one_for_one are the most common strategies used.

Rest for one

This is the strategy to use when processes have one-way dependencies. An example would be a chain of processes providing data to each other in a pipeline. If you had five processes in the chain and the first one died, the rest that depend on it must be restarted. If the third one dies, you only need to restart the third, fourth and fifth; the first two can continue.
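The pipeline case can be sketched with three stages instead of five. The module and stage names here are invented for the example, and it assumes the newer Supervisor.init/2 API rather than the supervise helper used in this post.

```elixir
defmodule PipelineSketch.Stage do
  # Hypothetical pipeline stage: an Agent registered under `name`.
  use Agent

  def start_link(name) do
    Agent.start_link(fn -> [] end, name: name)
  end
end

defmodule PipelineSketch.Sup do
  use Supervisor

  def start_link(_opts \\ []) do
    Supervisor.start_link(__MODULE__, [], name: __MODULE__)
  end

  @impl true
  def init([]) do
    # Start order is dependency order: each stage depends on the ones before it.
    children =
      for name <- [:stage_1, :stage_2, :stage_3] do
        Supervisor.child_spec({PipelineSketch.Stage, name}, id: name)
      end

    # If a stage dies, it and every stage started after it are restarted.
    Supervisor.init(children, strategy: :rest_for_one)
  end
end
```

Killing :stage_2 should leave :stage_1 with its original pid, while :stage_2 and :stage_3 both come back with new ones.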

Another example of this would be a registry process that maps some external input to the PID of the process that handles it.

defmodule Phlink.Cache.Supervisor do
  use Supervisor

  def start_link do
    Supervisor.start_link(__MODULE__, [])
  end

  def init([]) do
    # Order matters: with :rest_for_one, children listed after a crashed
    # child are terminated and restarted along with it.
    children = [
      worker(Phlink.Cache.Mapper, []),
      supervisor(Phlink.Cache.UrlCacheSupervisor, [])
    ]

    supervise(children, strategy: :rest_for_one)
  end
end

(source)

If the Mapper process dies, there’s no way to get at the processes it was mapping to, so :rest_for_one terminates and restarts UrlCacheSupervisor, which in turn terminates the children caching the URLs.

One for all

This is the strategy to use when you start several processes that depend on each other to get the work done. An example of this would be a sync process that talks to another node. When a new node is added, you’d start sending and receiving processes locally to handle the two-way synchronisation. If either the sender or the receiver died, you could bring everything back to a known state by using this strategy and having the supervisor restart both.
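A small sketch of the sender/receiver pairing, under the same assumptions as before: the module names and the :sync_sender and :sync_receiver process names are hypothetical, and it uses the newer Supervisor.init/2 API.

```elixir
defmodule SyncSketch.Peer do
  # Hypothetical stand-in for a sender or receiver process.
  use Agent

  def start_link(name) do
    Agent.start_link(fn -> :idle end, name: name)
  end
end

defmodule SyncSketch.Sup do
  use Supervisor

  def start_link(_opts \\ []) do
    Supervisor.start_link(__MODULE__, [], name: __MODULE__)
  end

  @impl true
  def init([]) do
    children = [
      Supervisor.child_spec({SyncSketch.Peer, :sync_sender}, id: :sync_sender),
      Supervisor.child_spec({SyncSketch.Peer, :sync_receiver}, id: :sync_receiver)
    ]

    # If either child dies, the other is terminated and both are restarted.
    Supervisor.init(children, strategy: :one_for_all)
  end
end
```

Killing either :sync_sender or :sync_receiver should result in both being restarted with fresh state and new pids.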

There’s an example of this in Riak Core.

%% {one_for_all,10,10}: restart every child if one dies, and give up after
%% more than 10 restarts within 10 seconds.
init([]) ->
    {ok,{ {one_for_all,10,10},
         [?CHILD(riak_core_handoff_receiver_sup,supervisor),
          ?CHILD(riak_core_handoff_sender_sup,supervisor),
          ?CHILD(riak_core_handoff_listener_sup,supervisor),
          ?CHILD(riak_core_handoff_manager,worker)
         ]}}.

(source)

Summary

The four restart strategies in OTP have different use cases and give you a lot of power in deciding how to structure your app and handle failure. You’ll mostly use :one_for_one and :simple_one_for_one, occasionally :rest_for_one, and rarely :one_for_all.

When to use one_for_one

When to use simple_one_for_one

You could have start_child guard against too many children but it’s worth considering something like poolboy for that.
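One way such a guard might look, as a sketch only: it assumes Elixir’s DynamicSupervisor (1.6 or later) and its max_children option instead of the :simple_one_for_one supervisor shown earlier. The module names, the limit of 2 and the :too_many_children error are made up for the example.

```elixir
defmodule CappedSketch.Worker do
  # Hypothetical worker: an Agent holding a URL.
  use Agent

  def start_link(url), do: Agent.start_link(fn -> url end)
end

defmodule CappedSketch.Sup do
  use DynamicSupervisor

  # Arbitrary limit chosen for the example.
  @max_children 2

  def start_link(_opts \\ []) do
    DynamicSupervisor.start_link(__MODULE__, [], name: __MODULE__)
  end

  # Returns {:ok, pid}, or {:error, :too_many_children} once the limit is hit.
  def start_child(url) do
    case DynamicSupervisor.start_child(__MODULE__, {CappedSketch.Worker, url}) do
      {:error, :max_children} -> {:error, :too_many_children}
      other -> other
    end
  end

  @impl true
  def init([]) do
    # The supervisor itself enforces the limit, so the check cannot race
    # with concurrent callers.
    DynamicSupervisor.init(strategy: :one_for_one, max_children: @max_children)
  end
end
```

With this limit, the first two start_child/1 calls should return {:ok, pid} and the third {:error, :too_many_children}. It only caps the number of children; it does not queue or check out workers the way a pool library does.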

Be careful about creating a “thundering herd” if you have many children that all do something when they are terminated by the supervisor.

When to use rest_for_one

When to use one_for_all

Seen any other patterns in the wild? Add them to the comments as I’d like to hear about ones I haven’t seen yet!