Using Splines to Fill Holes (Interpolation)

Following a comment I recently had on using moving averages, I wanted to share more widely how we can use splines to fill holes in (time series) data. Contrary to the moving-average based approach I posted earlier, using splines we can keep the observed values and just interpolate where there are missing values. What is more, in many cases splines are more appropriate, and actually surprisingly easy to use.

First we need some data with missing values (shamelessly nicked from Kirk, although modified to protect the original should this be necessary):
splines1

z <- c(-1.1484, -1.3842, -1.5985, -1.0626, -1.3413, -1.2341, -1.1269, NA, -0.7411, -0.7840, -0.6125, -0.8912, NA, NA, -1.1912, -1.7271, -1.0841, -0.9555, -0.9555, -0.6554, -0.4196, -0.5268, -0.3767, -0.2695, 0.2019, NA, NA, NA, 1.1880, 0.9736, 0.7807, 0.5878, 1.3594, 0.6306, 1.3809, NA, NA, NA, NA, NA, 1.5738, 1.6595, 1.1665, 0.9950, 1.7238, 1.1022, 1.1451, NA, 0.7807, NA, NA, NA, NA, NA, NA, NA, 0.8450, 0.5449, 0.2662, 0.8021, 0.4806, 0.1376, 0.5449, 0.2019, 0.4592)

First, we’re looking at the data.

plot(z, type="l", bty="n")

To identify the location of the missing values, we can use the following code (it also keeps things slightly simpler below):

miss <- which(is.na(z))

The core of the approach presented here is the splines function, part of R. It defines a function, here I called it a().

a <- splinefun(z)

We can now use this function to get individual points using splines to interpolate the curve. For instance, here I plot the values for each integer. Note that this plots on top of the observed values, but easily fills the holes. 65 is the length of the data vector.

points(1:65, a(1:65), col=2)

We can highlight the missing values (now interpolated) by plotting them in a different colour:

points(miss, a(miss), col=5)

And here’s how to identify a single point: the value for y at x=14.

points(14, a(14), col=3)

This figure has not much value except for demonstrating how splines can be used. The code here could be quite easily changed to replace missing values with interpolated ones.

splines2

To finish off, here’s a demonstration that the splines function is quite capable of dealing with fine-graded interpolations:

points(seq(1,65,.5), a(seq(1,65,.5)), col=3)

Moving averages or LOESS?

Moving averages can be quite useful when visualizing data, or to fill holes in missing data. I suspect one reason for the popularity is that Excel can produce them (one of the trend lines it offers). Another reason may be that the concept of moving averages isn’t difficult: it’s just a repeated application of means.

In this post, I want to advocate the use of smoothed trend lines (LOESS) in many of the applications moving averages are used. Like moving averages, we’re talking of a non-parametric method, but unlike moving averages we’re less dependent on analysts deciding on the level of smoothness since there are adequate automatic methods.

Let’s look at an example. We begin with defining the data:

data <- c(1.3, 1.5, 1.9, 1.6, 1.7, 0.9, 1.8, 1.9, 2.4, 2.3, 1.8)

Then we define the moving average function, as this isn’t built into R.

mav <- function(x,n=5){filter(x,rep(1/n,n), sides=2)}

We can now calculate the moving average, but as ever, it’s better to have look:

mav(data) # just the moving averages
plot(data) # plot the data (as circles)
lines(mav(data)) # plot the moving average (as line)

What does it look like with a LOESS trendline?

scatter.smooth(data)

Yes, that’s a built-in function, and yes, it works out of the box with no additional argument. If needed (normally there is no need, as Luke Keele demonstrates), the span can be set manually, and for more complex applications the sister function loess.smooth() can be quite useful, too.

loess-eg

Fox, John. 2000. Nonparametric Simple Regression: Smoothing Scatterplots. Quantitative Applications in the Social Sciences 130. Thousand Oaks: Sage Publications.

Keele, Luke. 2006. “How to Be Smooth: Automated Smoothing in Political Science.”