Moving Averages in R

To the best of my knowledge, R does not have a built-in function to calculate moving averages. Using the filter function, however, we can write a short function for moving averages:

mav <- function(x,n=5){stats::filter(x,rep(1/n,n), sides=2)}

We can then use the function on any data: mav(data), or mav(data,11) if we want to specify a different number of data points than the default 5; plotting works as expected: plot(mav(data)).

In addition to the number of data points over which to average, we can also change the sides argument of the filter functions: sides=2 uses both sides, sides=1 uses past values only.

26 thoughts on “Moving Averages in R

  1. I am looking for an efficient way to “interpolate” missing values in a time series, that contain a variable number of missing values (missing values are represented by NAs in the data). I have been working with rollapply, but have found that the width parameter needs to be the length of the longest string of NAs plus one, otherwise some NAs remain. This can be a porblem since the longer you make the width of the rolling average window, the more “smoothing” occurs in your data. In his particualr case, I want to leave the measured values in the time series as they are, and only fill in the NAs with interpolated values.

    1.1, 2.2, NA, 3.3, 4.4, 5.5, NA, NA, NA, 6.6, 7.7

    Should fill the first NA with the mean of 2.2 and 3.3, while the last block of 3 NAs should all fill the mean of 5.5 and 6.6.

    Any suggestions would be much appreciated.

  2. There’s no reason you have to use the moving average on its own. Here’s a procedure that does what you want. (1) calculate moving averages with different spans. (2) use the code in the following post and adjust it to replace the NA. This will be a reiterative process, where you first replace single NA (e.g. the NA between 2.2 and 3.3), and then longer series of NA (e.g. the 3 NA between 5.5 and 6.6) etc. until the longest gap is filled.

    1. Thank you for the reply Didier. Is there a way to put up a small datafile (ts.txt) that the following code will work with?
      # datafill example

      require(xts)
      require(TTR)
      require(PerformanceAnalytics)
      require(RColorBrewer)

      function(x,n=5){filter(x,rep(1/n,n), sides=2)}

      dd <- read.table("~/Downloads/ts.txt", skip=1) # read in data without first row
      dd <- dd[,-1] # remove first column

      colnames(dd) <- c("temp", "date", "time") # name the three remaining columns

      dt <- as.POSIXct(paste(dd$date, dd$time))
      tempdata <- xts(dd$temp, order.by = dt)
      colnames(tempdata) <- "temp"
      tempdata$temp_missing <- tempdata$temp

      # create a column with some missing values
      tempdata$temp_missing[10,1] <- NA
      tempdata$temp_missing[15:16,1] <- NA
      tempdata$temp_missing[30:32,1] <- NA
      tempdata$temp_missing[40:44,1] <- NA
      tempdata$temp_missing[54:60,1] <- NA

      1. Thanks for the code and the sample data you’ve sent separately. I see two avenues:

        (1) you use the code I’ve put together here [https://druedin.com/2013/06/15/moving-average-to-fill-holes-interpolation/] iteratively. namav(variable,0) does nothing to the data. namav(variable,1) uses the smallest span, etc. Consider the following code:

        plot(namav(variable,0), type=”l”) # just the data
        for(i in 1:4) lines(namav(variable,i), col=i) # with your sample data, a span of 4 will fill all holes

        By using iteratively I mean that you can calculate namav(variable,1) first, use the corresponding values to fill in the gaps: variable[which(is.na(variable))] <- namav(variable,1)[which(is.na(variable))] where which(is.na(variable)) identifies the positions with missing data. If you repeat this with namav(variable,2) etc., you should end up with a complete data set, where the original data are maintained, and shorter moving averages are used if possible.

        To identify the maximum NA run, you can use the R function rle() as follows. Note that rle() doesn't consider consecutive NA as equivalent, so I first replace them with a clearly implausible value: variable2 <- NA; variable2[is.na(variable)] <- -999; max(rle(as.numeric(variable2))$lengths) should then give me longest run.

        (2) You forget about moving averages, and use splines. Compare [https://druedin.com/2014/09/13/moving-averages-or-loess/]. This is quite flexible, and in my view in many ways preferable (not only because of the simpler code).

        miss <- which(is.na(variable)) # missing; for easier code below
        a <- splinefun(variable) # spline function for your data
        plot(as.numeric(variable), type="l") # just plot the data; as.numeric() to use the generic plot
        points(1:70, a(1:70), col=2) # plot everything; sample data has length 70
        points(miss, a(miss), col=5) # plot all missing; this can easily be modified to fill in the gaps in the data (rather than plotting)
        points(15, a(15), col=3) # plot a single point (missing)

        In either case, you need to think carefully about how you want to deal with uncertainty. By replacing missing values with (whatever form of) averages, you imply that you have measured these points… Multiple imputation is a way out of this, but depending on the analysis you have in mind not so easy to use.

  3. Hi Didier,

    Thanks for this function, this is very helpful.

    I have monthly time series data and am interested in calculating changes over a periodic basis, as I change n. Accordingly, I’d like to calculate differences on a month-to-month time frame, thus setting n=1. I’d also like the function to calculate year-on-year differences, thus setting n=12. Since I’m an R beginner, would you be able to show me how to amend your function to do this?

    Kind regards,
    Kyle

    1. Thanks for your interest. I’m not sure I understand your data structure and problem. You seem to have a time series with monthly observations. First, let me generate 36 data points by drawing from a normal distribution: data <- rnorm(36) I then modified the moving average function to only use past values (rather than looking into the past and the future) by setting sides=1: mav <- function(x,n=5){filter(x,rep(1/n,n), sides=1)} This way I can calculate the moving average as you seem to have done already: yearly <- mav(data, n=12). If I understand you correctly, you then want to calculate the difference between any point in time and the situation a year earlier on this moving average. We can do this using sapply (see also here and here.): len <- length(yearly) and then sapply(13:len, function(i) {yearly[i] - yearly[i-12]}).

  4. Thank you Didier, this was very helpful. The only thing I would add is that in the current syntax, there could be some conflict with other packages that use the function filter. I would just replace ‘filter’ by ‘stats::filter’

    1. Thanks, François! You’re right, using ‘stats::filter’ explicitly is safer (I have modified the post to reflect this), although package authors should avoid package names from the base packages for their own functions…

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.