To the best of my knowledge, R does not have a built-in function to calculate moving averages. Using the filter function, however, we can write a short function for moving averages:
mav <- function(x,n=5){stats::filter(x,rep(1/n,n), sides=2)}
We can then use the function on any data: mav(data), or mav(data,11) if we want to specify a different number of data points than the default 5; plotting works as expected: plot(mav(data)).
In addition to the number of data points over which to average, we can also change the sides argument of the filter functions: sides=2 uses both sides, sides=1 uses past values only.
I am looking for an efficient way to “interpolate” missing values in a time series, that contain a variable number of missing values (missing values are represented by NAs in the data). I have been working with rollapply, but have found that the width parameter needs to be the length of the longest string of NAs plus one, otherwise some NAs remain. This can be a porblem since the longer you make the width of the rolling average window, the more “smoothing” occurs in your data. In his particualr case, I want to leave the measured values in the time series as they are, and only fill in the NAs with interpolated values.
1.1, 2.2, NA, 3.3, 4.4, 5.5, NA, NA, NA, 6.6, 7.7
Should fill the first NA with the mean of 2.2 and 3.3, while the last block of 3 NAs should all fill the mean of 5.5 and 6.6.
Any suggestions would be much appreciated.
There’s no reason you have to use the moving average on its own. Here’s a procedure that does what you want. (1) calculate moving averages with different spans. (2) use the code in the following post and adjust it to replace the NA. This will be a reiterative process, where you first replace single NA (e.g. the NA between 2.2 and 3.3), and then longer series of NA (e.g. the 3 NA between 5.5 and 6.6) etc. until the longest gap is filled.
Thank you for the reply Didier. Is there a way to put up a small datafile (ts.txt) that the following code will work with?
# datafill example
require(xts)
require(TTR)
require(PerformanceAnalytics)
require(RColorBrewer)
function(x,n=5){filter(x,rep(1/n,n), sides=2)}
dd <- read.table("~/Downloads/ts.txt", skip=1) # read in data without first row
dd <- dd[,-1] # remove first column
colnames(dd) <- c("temp", "date", "time") # name the three remaining columns
dt <- as.POSIXct(paste(dd$date, dd$time))
tempdata <- xts(dd$temp, order.by = dt)
colnames(tempdata) <- "temp"
tempdata$temp_missing <- tempdata$temp
# create a column with some missing values
tempdata$temp_missing[10,1] <- NA
tempdata$temp_missing[15:16,1] <- NA
tempdata$temp_missing[30:32,1] <- NA
tempdata$temp_missing[40:44,1] <- NA
tempdata$temp_missing[54:60,1] <- NA
Thanks for the code and the sample data you’ve sent separately. I see two avenues:
(1) you use the code I’ve put together here [https://druedin.com/2013/06/15/moving-average-to-fill-holes-interpolation/] iteratively. namav(variable,0) does nothing to the data. namav(variable,1) uses the smallest span, etc. Consider the following code:
plot(namav(variable,0), type=”l”) # just the data
for(i in 1:4) lines(namav(variable,i), col=i) # with your sample data, a span of 4 will fill all holes
By using iteratively I mean that you can calculate namav(variable,1) first, use the corresponding values to fill in the gaps: variable[which(is.na(variable))] <- namav(variable,1)[which(is.na(variable))] where which(is.na(variable)) identifies the positions with missing data. If you repeat this with namav(variable,2) etc., you should end up with a complete data set, where the original data are maintained, and shorter moving averages are used if possible.
To identify the maximum NA run, you can use the R function rle() as follows. Note that rle() doesn't consider consecutive NA as equivalent, so I first replace them with a clearly implausible value: variable2 <- NA; variable2[is.na(variable)] <- -999; max(rle(as.numeric(variable2))$lengths) should then give me longest run.
(2) You forget about moving averages, and use splines. Compare [https://druedin.com/2014/09/13/moving-averages-or-loess/]. This is quite flexible, and in my view in many ways preferable (not only because of the simpler code).
miss <- which(is.na(variable)) # missing; for easier code below
a <- splinefun(variable) # spline function for your data
plot(as.numeric(variable), type="l") # just plot the data; as.numeric() to use the generic plot
points(1:70, a(1:70), col=2) # plot everything; sample data has length 70
points(miss, a(miss), col=5) # plot all missing; this can easily be modified to fill in the gaps in the data (rather than plotting)
points(15, a(15), col=3) # plot a single point (missing)
In either case, you need to think carefully about how you want to deal with uncertainty. By replacing missing values with (whatever form of) averages, you imply that you have measured these points… Multiple imputation is a way out of this, but depending on the analysis you have in mind not so easy to use.
What library/package is required for rollmean ro work?
zoo
Hi Didier,
Thanks for this function, this is very helpful.
I have monthly time series data and am interested in calculating changes over a periodic basis, as I change n. Accordingly, I’d like to calculate differences on a month-to-month time frame, thus setting n=1. I’d also like the function to calculate year-on-year differences, thus setting n=12. Since I’m an R beginner, would you be able to show me how to amend your function to do this?
Kind regards,
Kyle
Thanks for your interest. I’m not sure I understand your data structure and problem. You seem to have a time series with monthly observations. First, let me generate 36 data points by drawing from a normal distribution:
data <- rnorm(36)
I then modified the moving average function to only use past values (rather than looking into the past and the future) by setting sides=1:mav <- function(x,n=5){filter(x,rep(1/n,n), sides=1)}
This way I can calculate the moving average as you seem to have done already:yearly <- mav(data, n=12)
. If I understand you correctly, you then want to calculate the difference between any point in time and the situation a year earlier on this moving average. We can do this usingsapply
(see also here and here.):len <- length(yearly)
and thensapply(13:len, function(i) {yearly[i] - yearly[i-12]})
.Thank you Didier, this was very helpful. The only thing I would add is that in the current syntax, there could be some conflict with other packages that use the function filter. I would just replace ‘filter’ by ‘stats::filter’
Thanks, François! You’re right, using ‘stats::filter’ explicitly is safer (I have modified the post to reflect this), although package authors should avoid package names from the base packages for their own functions…