Using Splines to Fill Holes (Interpolation)

Following a comment I recently had on using moving averages, I wanted to share more widely how we can use splines to fill holes in (time series) data. Contrary to the moving-average based approach I posted earlier, using splines we can keep the observed values and just interpolate where there are missing values. What is more, in many cases splines are more appropriate, and actually surprisingly easy to use.

First we need some data with missing values (shamelessly nicked from Kirk, although modified to protect the original should this be necessary):
splines1

z <- c(-1.1484, -1.3842, -1.5985, -1.0626, -1.3413, -1.2341, -1.1269, NA, -0.7411, -0.7840, -0.6125, -0.8912, NA, NA, -1.1912, -1.7271, -1.0841, -0.9555, -0.9555, -0.6554, -0.4196, -0.5268, -0.3767, -0.2695, 0.2019, NA, NA, NA, 1.1880, 0.9736, 0.7807, 0.5878, 1.3594, 0.6306, 1.3809, NA, NA, NA, NA, NA, 1.5738, 1.6595, 1.1665, 0.9950, 1.7238, 1.1022, 1.1451, NA, 0.7807, NA, NA, NA, NA, NA, NA, NA, 0.8450, 0.5449, 0.2662, 0.8021, 0.4806, 0.1376, 0.5449, 0.2019, 0.4592)

First, we’re looking at the data.

plot(z, type="l", bty="n")

To identify the location of the missing values, we can use the following code (it also keeps things slightly simpler below):

miss <- which(is.na(z))

The core of the approach presented here is the splines function, part of R. It defines a function, here I called it a().

a <- splinefun(z)

We can now use this function to get individual points using splines to interpolate the curve. For instance, here I plot the values for each integer. Note that this plots on top of the observed values, but easily fills the holes. 65 is the length of the data vector.

points(1:65, a(1:65), col=2)

We can highlight the missing values (now interpolated) by plotting them in a different colour:

points(miss, a(miss), col=5)

And here’s how to identify a single point: the value for y at x=14.

points(14, a(14), col=3)

This figure has not much value except for demonstrating how splines can be used. The code here could be quite easily changed to replace missing values with interpolated ones.

splines2

To finish off, here’s a demonstration that the splines function is quite capable of dealing with fine-graded interpolations:

points(seq(1,65,.5), a(seq(1,65,.5)), col=3)

Moving averages or LOESS?

Moving averages can be quite useful when visualizing data, or to fill holes in missing data. I suspect one reason for the popularity is that Excel can produce them (one of the trend lines it offers). Another reason may be that the concept of moving averages isn’t difficult: it’s just a repeated application of means.

In this post, I want to advocate the use of smoothed trend lines (LOESS) in many of the applications moving averages are used. Like moving averages, we’re talking of a non-parametric method, but unlike moving averages we’re less dependent on analysts deciding on the level of smoothness since there are adequate automatic methods.

Let’s look at an example. We begin with defining the data:

data <- c(1.3, 1.5, 1.9, 1.6, 1.7, 0.9, 1.8, 1.9, 2.4, 2.3, 1.8)

Then we define the moving average function, as this isn’t built into R.

mav <- function(x,n=5){filter(x,rep(1/n,n), sides=2)}

We can now calculate the moving average, but as ever, it’s better to have look:

mav(data) # just the moving averages
plot(data) # plot the data (as circles)
lines(mav(data)) # plot the moving average (as line)

What does it look like with a LOESS trendline?

scatter.smooth(data)

Yes, that’s a built-in function, and yes, it works out of the box with no additional argument. If needed (normally there is no need, as Luke Keele demonstrates), the span can be set manually, and for more complex applications the sister function loess.smooth() can be quite useful, too.

loess-eg

Fox, John. 2000. Nonparametric Simple Regression: Smoothing Scatterplots. Quantitative Applications in the Social Sciences 130. Thousand Oaks: Sage Publications.

Keele, Luke. 2006. “How to Be Smooth: Automated Smoothing in Political Science.”

Moving average to fill holes (interpolation)

One way moving averages can be used, is to fill holes in time series. Consider the standard situation at the top of the illustration: the moving average simply averages within a given window. The value for the dot in red is replaced by the average of all values in the red box; the value for the dot in green is replaced by the average of all values in the green box; and so on.

namav

With holes in the data (situation at the bottom), we can use all the data available in a given window. This means that sometimes we use many numbers, and sometimes we have just a few. In this example, the average calculated for the red and green position is the same; with the green line indicating the here absent green dot.

For a numerical example, consider the following time series with plenty of holes as data:

data <- c(NA,.223,NA,NA,.359,NA,NA,.302,NA,NA,NA,NA,.260,NA,.391)

Using the following function, I simply average over the holes, using as much data as available in a given time window.

namav <- function(x,k=3){
x <- c(rep(NA, k),x,rep(NA,k)) # add NA on both sides
n <- length(x)
return(sapply((k+1):(n-k), function(i) sum(x[(i-k):(i+k)],na.rm=TRUE)/(2*k+1-sum(is.na(x[(i-k):(i+k)])))))
}

The value for k determines the width of the time window. The following graph illustrates how different values for k play out in these data. With k=0, only the actual data are shown. With k=1, the data are also applied to the point before and after; with k=2 we get a contiguous time series: Each point is the average of all available data points within a (moving) window that includes two values before and two values after, and the point itself, of course. The higher the value of k, the closer we get to the mean across all time points (i.e. a flat line).

namavPl

Moving Averages in R

To the best of my knowledge, R does not have a built-in function to calculate moving averages. Using the filter function, however, we can write a short function for moving averages:

mav <- function(x,n=5){stats::filter(x,rep(1/n,n), sides=2)}

We can then use the function on any data: mav(data), or mav(data,11) if we want to specify a different number of data points than the default 5; plotting works as expected: plot(mav(data)).

In addition to the number of data points over which to average, we can also change the sides argument of the filter functions: sides=2 uses both sides, sides=1 uses past values only.