Why is my test failing when I use cast?
November 13th, 2014
A little while ago Paul Wilson tweeted about some troubles he was having while converting an example from the Erlang and OTP in Action book to Elixir and adding tests. I’d just been doing the same thing and I wanted to expand on my reply to him here.
Background
The example in the book is a caching application that uses one Erlang process per element to store the cached value. There’s a mapping from the key to the pid of the process holding the data.
This design works well because processes are cheap and easy to distribute in Erlang. The timeout code provided as part of GenServer
is also handy for implementing cache expiry.
The problem
The test is:
"delete works" do
test SimpleCache.insert :test, self
:ok, self() } == SimpleCache.lookup :test
assert { SimpleCache.delete :test
:error, :not_found } == SimpleCache.lookup :test
assert { end
i.e. Insert a value into the cache, assert it’s present, delete it from the cache, assert it’s gone. The test failure is:
1) test delete works (SimpleCacheTest)
test/simple_cache_test.exs:24
** (exit) exited in: GenServer.call(#PID<0.276.0>, :fetch, 5000)
** (EXIT) normal
stacktrace:
(elixir) lib/gen_server.ex:356: GenServer.call/3
(simple_cache) lib/simple_cache.ex:36: SimpleCache.lookup/1
test/simple_cache_test.exs:28
So what’s happening?
Because the implementation code uses cast
for the delete
(as you can see here) there’s a race condition. The server for the key that’s just been deleted isn’t dead but it’s on it’s way out and we call lookup
from our test before it’s had a chance to properly terminate. The message we get back is the exit message that the supervisor should be handling.
Tracing
I’m still learning the fundamentals of Elixir, Erlang, and OTP so I wanted to verify my theory above was true. Thanks to the fantastic Erlang in Anger book I knew a little bit about how to trace messages. After a little messing about I found that adding:
"delete works" do
test :dbg.tracer
:dbg.p self()
:dbg.p :new
# ...
end
To the test and:
:logger, :handle_otp_reports, true
config :logger, :handle_sasl_reports, true config
To config/test.exs
allowed me to see the messages and have them formatted in reasonably readably. I’ve added recon and Dbg to my list of things to checkout but this is enough for now.
The raw output with the corresponding calls:
SimpleCache.insert :test, self
<0.199.0>) <0.160.0> ! {'$gen_call',{<0.199.0>,#Ref<0.0.0.1681>},
(<0.199.0>,86400]}}
{start_child,[<0.202.0>) init ! {<0.202.0>,{get_argument,generic_debug}}
(<0.202.0>) << {init,error}
(<0.202.0>) <0.160.0> ! {ack,<0.202.0>,{ok,<0.202.0>}}
(<0.199.0>) << {#Ref<0.0.0.1681>,{ok,<0.202.0>}} (
So process <0.199.0>
is our test process and <0.202.0>
is the SimpleCache.Element
process created to hold the value.
:ok, self() } == SimpleCache.lookup :test assert {
<0.199.0>) <0.202.0> ! {'$gen_call',{<0.199.0>,#Ref<0.0.0.1682>},fetch}
(<0.202.0>) << {'$gen_call',{<0.199.0>,#Ref<0.0.0.1682>},fetch}
(<0.202.0>) <0.199.0> ! {#Ref<0.0.0.1682>,{ok,<0.199.0>}}
(<0.199.0>) << {#Ref<0.0.0.1682>,{ok,<0.199.0>}} (
Now the cast
:
SimpleCache.delete :test
<0.199.0>) <0.202.0> ! {'$gen_cast',delete}
(<0.202.0>) << {'$gen_cast',delete} (
But before the server gets a chance to process the cast
fetch
is called again via the lookup
API function.
:error, :not_found } == SimpleCache.lookup :test assert {
<0.199.0>) <0.202.0> ! {'$gen_call',{<0.199.0>,#Ref<0.0.0.1683>},fetch}
(<0.202.0>) << {'$gen_call',{<0.199.0>,#Ref<0.0.0.1683>},fetch} (
And instead of the expected reply we get the DOWN
message because the server has just exited.
<0.199.0>) << {'DOWN',#Ref<0.0.0.1683>,process,<0.202.0>,normal} (
Which causes the test to fail.
Solution
So far I’ve come across three ways to get the test to pass, one bad and two good. Which one to pick depends a little on how your code needs to work.
1. Sleep in the test (bad)
A naive way to make the test pass is to sleep long enough for the Element
process to terminate.
Adding :timer.sleep 100
after the delete
call makes the test pass but this doesn’t actually fix the problem and we don’t want slower tests just because we’re waiting for things to synchronise.
(There’s also no guarantee that any sleep we put in will be long enough so spurious test failures are possible. Extremely unlikely in this case but not in others.)
2. Convert cast
to call
A better way to make the test pass is to convert the asynchronous cast
to a synchronous call
so the delete
won’t return until the process has terminated. You can see the commit that does that here.
diff --git a/apps/simple_cache/lib/simple_cache/element.ex b/apps/simple_cache/lib/simple_cache/element.ex
index 35f3823..c96975b 100644--- a/apps/simple_cache/lib/simple_cache/element.ex
+++ b/apps/simple_cache/lib/simple_cache/element.ex
@@ -24,7 +24,7 @@ defmodule SimpleCache.Element do
def replace(pid, value), do: GenServer.cast(pid, { :replace, value })
- def delete(pid), do: GenServer.cast(pid, :delete)
+ def delete(pid), do: GenServer.call(pid, :delete)
## Callbacks
@@ -53,7 +53,7 @@ defmodule SimpleCache.Element do
{ :reply, { :ok, value }, state, time_left(start_time, lease_time) }
end
- def handle_cast(:delete, state), do: { :stop, :normal, state }
+ def handle_call(:delete, _from, state), do: { :stop, :normal, :ok, state }
def handle_cast({:replace, value}, %State{start_time: start_time, lease_time: lease_time} = state) do { :noreply, %{state | value: value}, time_left(start_time, lease_time) }
(It’s vitally important to add the :ok
as the third element of the return tuple in the handle_call or OTP won’t go through the correct code path.)
So is this a good solution? In this case I think it is. The test highlighted a race condition that could happen to client code as well and this will prevent it.
(Disclaimer: I know I’m not good at thinking concurrently yet so I could be wrong or missing something here. Please let me know if I am!)
That said, Chapter 9 of Erlang and OTP in Action makes a good case for why you want to use cast
when you can and I’d highly recommend reading it.
There’s another good example of this race condition in the Elixir Mix and OTP guide on the ETS page here that’s definitely worth reading too.
3. Using events
I came across one more potential solution as I worked through the book, it’s using GenEvent
to signal when a process has performed the action being tested. You can see an example of this in the test I wrote for the ResourceDiscovery
application from the book.
defmodule ResourceDiscoveryTest do
use ExUnit.Case
defmodule TestEventHandler do
use GenEvent
def add_handler(parent) do
ResourceDiscovery.Event.add_handler(__MODULE__, [parent])
end
def remove_handler do
ResourceDiscovery.Event.remove_handler(__MODULE__, [])
end
def handle_event(:resources_traded, [parent] = state) do
:resources_traded
send parent, :ok, state }
{ end
end
do
setup TestEventHandler.add_handler(self)
fn ->
on_exit TestEventHandler.remove_handler
end
end
"can add a local resource" do
test :ok = ResourceDiscovery.add_target_resource_type(:test)
assert :ok = ResourceDiscovery.add_local_resource(:test, self)
assert :ok = ResourceDiscovery.trade_resources
assert :resources_traded
assert_receive :ok, Set.put(HashSet.new, self) } ==
assert { ResourceDiscovery.fetch_resources(:test)
end
end
This is testing aysnchronously trading knowledge about available resources so it doesn’t make sense to convert to a call
. When resources have been traded the code fires an event. (code here)
The assert_receive
in the test will timeout after 100ms if it doesn’t receive the message so the test will fail if something is wrong with the code. The test only waits until it gets the event before checking the system state is correct so it’s as fast as it can be.
It does feel like doing a lot just for a test in this case. If this was the only use case for the event the perhaps it would be possible to create a macro that only adds the event code when testing.
Summary
There are three ways I’ve discovered to fix this particular race condition in tests:
Sleeping
Not a great solution even if it works
Converting cast
to call
If this fits your application then it’s the way to go. You shouldn’t do this just to make testing easier though as you want your system to be as decoupled as possible.
Using GenEvent to synchronise
This is a heavier solution as it requires another process, module, and more test code. I do find it easier to reason about and there’s a good chance the event logging will be useful in production. It may be the only solution when you know the code must use cast
.
Further investigation
There’s a open source ebook on testing here that has pointers to many Erlang testing resources that could also be useful when testing Elixir code.
I plan to learn more about all the tools mentioned in that book as well as how ExUnit
is implemented itself and I’ll share what I find here.