## Calculating the Mode in R, missing values edition

It is not difficult to write a function in R to calculate the mode (as a measure of central tendency). The function I posted a while ago did not work with missing values. Here’s a tweak that does work with NA in the data.
```
Mode = function(x) {
  ux = na.omit(unique(x))
  return(ux[which.max(tabulate(match(x, ux)))])
}
```
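
A quick way to check the behaviour is to feed the function a vector in which NA is the most frequent entry. The sketch below repeats the definition so that it runs on its own; the test vectors are made up for illustration. Note that `which.max()` returns the first maximum, so in case of a tie the value that appears first in the data is returned.

```r
# Same definition as above, repeated so the example is self-contained
Mode <- function(x) {
  ux <- na.omit(unique(x))                 # candidate values, without NA
  ux[which.max(tabulate(match(x, ux)))]    # tabulate() ignores the NA matches
}

# NA occurs three times, but the most frequent observed value is returned
Mode(c(1, 2, 2, NA, NA, NA, 3))

# Tie between "a" and "b": the value seen first in the data wins
Mode(c("a", "b", NA, "b", "a"))
```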

## Using Splines to Fill Holes (Interpolation)

Following a comment I recently received on using moving averages, I wanted to share more widely how we can use splines to fill holes in (time series) data. In contrast to the moving-average approach I posted earlier, splines let us keep the observed values and interpolate only where values are missing. What is more, splines are in many cases more appropriate, and they are surprisingly easy to use.

First we need some data with missing values (shamelessly nicked from Kirk, although modified to protect the original should this be necessary):

z <- c(-1.1484, -1.3842, -1.5985, -1.0626, -1.3413, -1.2341, -1.1269, NA, -0.7411, -0.7840, -0.6125, -0.8912, NA, NA, -1.1912, -1.7271, -1.0841, -0.9555, -0.9555, -0.6554, -0.4196, -0.5268, -0.3767, -0.2695, 0.2019, NA, NA, NA, 1.1880, 0.9736, 0.7807, 0.5878, 1.3594, 0.6306, 1.3809, NA, NA, NA, NA, NA, 1.5738, 1.6595, 1.1665, 0.9950, 1.7238, 1.1022, 1.1451, NA, 0.7807, NA, NA, NA, NA, NA, NA, NA, 0.8450, 0.5449, 0.2662, 0.8021, 0.4806, 0.1376, 0.5449, 0.2019, 0.4592)

Let’s start by looking at the data.

plot(z, type="l", bty="n")

To identify the location of the missing values, we can use the following code (it also keeps things slightly simpler below):

miss <- which(is.na(z))

The core of the approach presented here is the splinefun() function, which is part of base R (the stats package). It returns a function that performs the interpolation; here I called it a().

a <- splinefun(z)

We can now use this function to get the interpolated value at any point along the curve. For instance, here I plot the values for each integer position. Note that these points are drawn on top of the observed values, and they also fill the holes. 65 is the length of the data vector.

points(1:65, a(1:65), col=2)

We can highlight the missing values (now interpolated) by plotting them in a different colour:

points(miss, a(miss), col=5)

And here’s how to identify a single point: the value for y at x=14.

points(14, a(14), col=3)

This figure has little value beyond demonstrating how splines can be used. The code here could quite easily be changed to replace missing values with interpolated ones.
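
As a minimal sketch of that last step, the following fills only the missing positions and leaves the observed values untouched. The short series `y` and the names `gaps`, `f` and `y_filled` are made up for illustration; here I pass only the observed points to `splinefun()` explicitly, so the example does not rely on how the function treats NA.

```r
# Short illustrative series (hypothetical values, not the data above)
y <- c(1.0, 1.4, NA, 2.1, 2.0, NA, NA, 1.2, 0.9)

ok   <- !is.na(y)        # positions with observed values
gaps <- which(!ok)       # positions of the holes

# Fit the spline through the observed points only
f <- splinefun(which(ok), y[ok])

# Copy the series and overwrite only the missing positions
y_filled <- y
y_filled[gaps] <- f(gaps)

y_filled
```

The same pattern should work for the longer series in this post, with `z` in place of `y`.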

To finish off, here’s a demonstration that the spline function is quite capable of dealing with fine-grained interpolations:

points(seq(1,65,.5), a(seq(1,65,.5)), col=3)
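
To see what happens on such a finer grid without a plot, here is a small sketch on a toy series (hypothetical values, not the data above). It evaluates the spline at half steps and checks that the curve still passes through every observed point, which is what we expect from an interpolating spline.

```r
# Toy series with one hole (hypothetical values, for illustration only)
y  <- c(0.2, 0.5, NA, 1.1, 0.9, 0.4)
ok <- !is.na(y)
f  <- splinefun(which(ok), y[ok])

# Half-step grid over the whole series
grid <- seq(1, length(y), by = 0.5)
fine <- f(grid)

# The curve reproduces the observed values at the observed positions ...
all.equal(f(which(ok)), y[ok])

# ... and gives a finite value everywhere on the grid, including the hole
all(is.finite(fine))
```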

## Moving average to fill holes (interpolation)

One way moving averages can be used is to fill holes in time series. Consider the standard situation at the top of the illustration: the moving average simply averages within a given window. The value for the dot in red is replaced by the average of all values in the red box; the value for the dot in green is replaced by the average of all values in the green box; and so on.

With holes in the data (situation at the bottom), we can use all the data available in a given window. This means that sometimes we use many numbers, and sometimes we have just a few. In this example, the average calculated for the red and the green position is the same; the green line marks the position of the absent green dot.

For a numerical example, consider the following time series with plenty of holes as data:

`data <- c(NA,.223,NA,NA,.359,NA,NA,.302,NA,NA,NA,NA,.260,NA,.391)`

Using the following function, I simply average over the holes, using as much data as available in a given time window.

```
namav <- function(x, k = 3) {
  x <- c(rep(NA, k), x, rep(NA, k))  # add NA on both sides
  n <- length(x)
  return(sapply((k + 1):(n - k), function(i)
    sum(x[(i - k):(i + k)], na.rm = TRUE) /
      (2 * k + 1 - sum(is.na(x[(i - k):(i + k)])))))
}
```

The value for k determines the width of the time window. The following graph illustrates how different values for k play out in these data. With k=0, only the actual data are shown. With k=1, the data also carry over to the point before and the point after; with k=2 we get a contiguous time series: each point is the average of all available data points within a (moving) window that includes two values before and two values after, and the point itself, of course. The higher the value of k, the closer we get to the mean across all time points (i.e. a flat line).
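
To make the effect of k concrete without the graph, here is a sketch that computes the smoothed series for k = 0, 1 and 2. It uses `namav_mean()`, a hypothetical alternative to the function above that I wrote for this example: the window logic is the same, but it uses `mean(..., na.rm = TRUE)` and clamps the window at the ends of the series instead of padding with NA. A window that contains no data at all gives NaN.

```r
# Hypothetical alternative to namav(): same window logic, written with mean()
namav_mean <- function(x, k = 3) {
  n <- length(x)
  sapply(seq_len(n), function(i) {
    # Clamp the window to the ends of the series instead of padding with NA
    window <- x[max(1, i - k):min(n, i + k)]
    mean(window, na.rm = TRUE)   # NaN when the window holds no data at all
  })
}

data <- c(NA, .223, NA, NA, .359, NA, NA, .302, NA, NA, NA, NA, .260, NA, .391)

# One column per value of k, one row per time point
smoothed <- sapply(0:2, function(k) namav_mean(data, k))
colnames(smoothed) <- paste0("k=", 0:2)
round(smoothed, 3)
```

With k=1, positions 10 and 11 remain NaN because no observed value falls within their windows; with k=2 every window contains at least one observation.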