Why is my test failing when I use cast?

November 13th, 2014

A little while ago Paul Wilson tweeted about some troubles he was having while converting an example from the Erlang and OTP in Action book to Elixir and adding tests. I’d just been doing the same thing and I wanted to expand on my reply to him here.

Background

The example in the book is a caching application that uses one Erlang process per element to store the cached value. There’s a mapping from the key to the pid of the process holding the data.

This design works well because processes are cheap and easy to distribute in Erlang. The timeout code provided as part of GenServer is also handy for implementing cache expiry.

The problem

The test is:

test "delete works" do
  SimpleCache.insert :test, self
  assert { :ok, self() } == SimpleCache.lookup :test
  SimpleCache.delete :test
  assert { :error, :not_found } == SimpleCache.lookup :test
end

(code here)

i.e. Insert a value into the cache, assert it’s present, delete it from the cache, assert it’s gone. The test failure is:

 1) test delete works (SimpleCacheTest)
    test/simple_cache_test.exs:24
    ** (exit) exited in: GenServer.call(#PID<0.276.0>, :fetch, 5000)
        ** (EXIT) normal
    stacktrace:
      (elixir) lib/gen_server.ex:356: GenServer.call/3
      (simple_cache) lib/simple_cache.ex:36: SimpleCache.lookup/1
      test/simple_cache_test.exs:28

So what’s happening?

Because the implementation code uses cast for the delete (as you can see here) there’s a race condition. The server for the key that’s just been deleted isn’t dead but it’s on it’s way out and we call lookup from our test before it’s had a chance to properly terminate. The message we get back is the exit message that the supervisor should be handling.

Tracing

I’m still learning the fundamentals of Elixir, Erlang, and OTP so I wanted to verify my theory above was true. Thanks to the fantastic Erlang in Anger book I knew a little bit about how to trace messages. After a little messing about I found that adding:

  test "delete works" do
    :dbg.tracer
    :dbg.p self()
    :dbg.p :new
    # ...
  end

To the test and:

config :logger, :handle_otp_reports, true
config :logger, :handle_sasl_reports, true

To config/test.exs allowed me to see the messages and have them formatted in reasonably readably. I’ve added recon and Dbg to my list of things to checkout but this is enough for now.

The raw output with the corresponding calls:

SimpleCache.insert :test, self
(<0.199.0>) <0.160.0> ! {'$gen_call',{<0.199.0>,#Ref<0.0.0.1681>},
                                     {start_child,[<0.199.0>,86400]}}
(<0.202.0>) init ! {<0.202.0>,{get_argument,generic_debug}}
(<0.202.0>) << {init,error}
(<0.202.0>) <0.160.0> ! {ack,<0.202.0>,{ok,<0.202.0>}}
(<0.199.0>) << {#Ref<0.0.0.1681>,{ok,<0.202.0>}}

So process <0.199.0> is our test process and <0.202.0> is the SimpleCache.Element process created to hold the value.

assert { :ok, self() } == SimpleCache.lookup :test
(<0.199.0>) <0.202.0> ! {'$gen_call',{<0.199.0>,#Ref<0.0.0.1682>},fetch}
(<0.202.0>) << {'$gen_call',{<0.199.0>,#Ref<0.0.0.1682>},fetch}
(<0.202.0>) <0.199.0> ! {#Ref<0.0.0.1682>,{ok,<0.199.0>}}
(<0.199.0>) << {#Ref<0.0.0.1682>,{ok,<0.199.0>}}

Now the cast:

SimpleCache.delete :test
(<0.199.0>) <0.202.0> ! {'$gen_cast',delete}
(<0.202.0>) << {'$gen_cast',delete}

But before the server gets a chance to process the cast fetch is called again via the lookup API function.

assert { :error, :not_found } == SimpleCache.lookup :test
(<0.199.0>) <0.202.0> ! {'$gen_call',{<0.199.0>,#Ref<0.0.0.1683>},fetch}
(<0.202.0>) << {'$gen_call',{<0.199.0>,#Ref<0.0.0.1683>},fetch}

And instead of the expected reply we get the DOWN message because the server has just exited.

(<0.199.0>) << {'DOWN',#Ref<0.0.0.1683>,process,<0.202.0>,normal}

Which causes the test to fail.

Solution

So far I’ve come across three ways to get the test to pass, one bad and two good. Which one to pick depends a little on how your code needs to work.

1. Sleep in the test (bad)

A naive way to make the test pass is to sleep long enough for the Element process to terminate.

Adding :timer.sleep 100 after the delete call makes the test pass but this doesn’t actually fix the problem and we don’t want slower tests just because we’re waiting for things to synchronise.

(There’s also no guarantee that any sleep we put in will be long enough so spurious test failures are possible. Extremely unlikely in this case but not in others.)

2. Convert cast to call

A better way to make the test pass is to convert the asynchronous cast to a synchronous call so the delete won’t return until the process has terminated. You can see the commit that does that here.

diff --git a/apps/simple_cache/lib/simple_cache/element.ex b/apps/simple_cache/lib/simple_cache/element.ex
index 35f3823..c96975b 100644
--- a/apps/simple_cache/lib/simple_cache/element.ex
+++ b/apps/simple_cache/lib/simple_cache/element.ex
@@ -24,7 +24,7 @@ defmodule SimpleCache.Element do
 
   def replace(pid, value), do: GenServer.cast(pid, { :replace, value })
 
-  def delete(pid), do: GenServer.cast(pid, :delete)
+  def delete(pid), do: GenServer.call(pid, :delete)
 
   ## Callbacks
 
@@ -53,7 +53,7 @@ defmodule SimpleCache.Element do
     { :reply, { :ok, value }, state, time_left(start_time, lease_time) }
   end
 
-  def handle_cast(:delete, state), do: { :stop, :normal, state }
+  def handle_call(:delete, _from, state), do: { :stop, :normal, :ok, state }
 
   def handle_cast({:replace, value}, %State{start_time: start_time, lease_time: lease_time} = state) do
     { :noreply, %{state | value: value}, time_left(start_time, lease_time) }

(It’s vitally important to add the :ok as the third element of the return tuple in the handle_call or OTP won’t go through the correct code path.)

So is this a good solution? In this case I think it is. The test highlighted a race condition that could happen to client code as well and this will prevent it.

(Disclaimer: I know I’m not good at thinking concurrently yet so I could be wrong or missing something here. Please let me know if I am!)

That said, Chapter 9 of Erlang and OTP in Action makes a good case for why you want to use cast when you can and I’d highly recommend reading it.

There’s another good example of this race condition in the Elixir Mix and OTP guide on the ETS page here that’s definitely worth reading too.

3. Using events

I came across one more potential solution as I worked through the book, it’s using GenEvent to signal when a process has performed the action being tested. You can see an example of this in the test I wrote for the ResourceDiscovery application from the book.

defmodule ResourceDiscoveryTest do
  use ExUnit.Case

  defmodule TestEventHandler do
    use GenEvent

    def add_handler(parent) do
      ResourceDiscovery.Event.add_handler(__MODULE__, [parent])
    end

    def remove_handler do
      ResourceDiscovery.Event.remove_handler(__MODULE__, [])
    end

    def handle_event(:resources_traded, [parent] = state) do
      send parent, :resources_traded
      { :ok, state }
    end
  end

  setup do
    TestEventHandler.add_handler(self)
    on_exit fn ->
      TestEventHandler.remove_handler
    end
  end

  test "can add a local resource" do
    assert :ok = ResourceDiscovery.add_target_resource_type(:test)
    assert :ok = ResourceDiscovery.add_local_resource(:test, self)
    assert :ok = ResourceDiscovery.trade_resources
    assert_receive :resources_traded
    assert { :ok, Set.put(HashSet.new, self) } ==
           ResourceDiscovery.fetch_resources(:test)
  end
end

(code here)

This is testing aysnchronously trading knowledge about available resources so it doesn’t make sense to convert to a call. When resources have been traded the code fires an event. (code here)

The assert_receive in the test will timeout after 100ms if it doesn’t receive the message so the test will fail if something is wrong with the code. The test only waits until it gets the event before checking the system state is correct so it’s as fast as it can be.

It does feel like doing a lot just for a test in this case. If this was the only use case for the event the perhaps it would be possible to create a macro that only adds the event code when testing.

Summary

There are three ways I’ve discovered to fix this particular race condition in tests:

Sleeping

Not a great solution even if it works

Converting cast to call

If this fits your application then it’s the way to go. You shouldn’t do this just to make testing easier though as you want your system to be as decoupled as possible.

Using GenEvent to synchronise

This is a heavier solution as it requires another process, module, and more test code. I do find it easier to reason about and there’s a good chance the event logging will be useful in production. It may be the only solution when you know the code must use cast.

Further investigation

There’s a open source ebook on testing here that has pointers to many Erlang testing resources that could also be useful when testing Elixir code.

I plan to learn more about all the tools mentioned in that book as well as how ExUnit is implemented itself and I’ll share what I find here.