Thursday, 4 May 2017

Plotting Binomial Data

I’m finally “passionate” enough to write a new blog post after a year of pseudo-productivity on my thesis. That or my funding has run out and I’m just procrastinating more than ever. One or the other.


Anyway, I just returned from the European Cetacean Society conference in Denmark, which was lovely. Seriously, they have ginger-flavoured Pepsi, no wonder Danes are rated the happiest people on earth (despite the high suicide rates…). In general, it was a great conference. Lots of good work, nice people and a prodigious amount of adult beverages.


One thing I did notice, however, was a disturbing trend with plots in both the oral and poster presentations. Many people did not plot both their data and the model results. Gasp! The horror!

 
Not plotting data with predictions 


In all seriousness though, it’s important to show 1) the data, 2) the model, and 3) the confidence intervals of both (where applicable). It’s also important to do this on the scale that makes sense for your audience. The reasons for this are numerous. First, humans are visual; we can pick up patterns very quickly. Second, properly presenting data allows both the author and the reader to assess almost instantaneously how well the model fits the data. Plotting data properly also allows the author to “idiot check” the results throughout the analysis. This can prevent headaches or embarrassment later on.


I get it: learning R is an uphill battle. I’ve been there. Analysing complex, messy data is a challenge to begin with, and adding an extra output to deal with is frustrating. This is particularly true for anyone who isn’t wildly in love with stats or coding (many biologists). The challenge is even worse with binomial data, which are common outputs in bioacoustic surveys. Still, plotting your data properly throughout will make your life easier.


The goal of this post is to slowly walk through the entire process of plotting binomial data with model outputs. Here I use simulated data so that the results should be the same for everybody, and I attempt to tie each step back to biological principles. We will model the simulated data using a generalised additive model (GAM). This is a fairly advanced technique, so there will necessarily be more jargon in this post than I would ideally like to include. Apologies in advance.


As you follow along, I strongly encourage you to write each line of code yourself rather than copy/paste. Typing the commands out will help reinforce them and ultimately make you a better coder, because you will start to debug your own work.