## Custom Tables of Descriptive Statistics in R

Here’s how we can quite easily and flexibly create tables of descriptive statistics in R. Of course, we can simply use `summary(variable_name)`, but this is not what you’d include in a manuscript — so not what you want when compiling a document in knitr/Rmarkdown.

First, we identify the variables we want to summarize. Often our database includes many more variables:

`vars <- c("variable_1", "variable_2", "variable_3")`

Note that these are the variable names in quotes. Second, we use `lapply()` to calculate whatever summary statistic we want. This is where the flexibility kicks in: have you ever tried to include, say, an interpolated median in such a table? In R it's just as easy as the mean. Here's an example with the mean, minimum, maximum, and median:

`v_mean <- lapply(dataset[vars], mean, na.rm=TRUE)`
`v_min <- lapply(dataset[vars], min, na.rm=TRUE)`
`v_max <- lapply(dataset[vars], max, na.rm=TRUE)`
`v_med <- lapply(dataset[vars], median, na.rm=TRUE)`

Too many digits? We can use `round()` to get rid of them. The `kable()` command we'll use in a minute actually has a `digits` argument that in principle allows rounding at the very end, but it has often failed me. Rounding:

`v_mean <- round(as.numeric(v_mean), 2)`

Now we only need to bring the different summary statistics together:

`v_tab <- cbind(mean=v_mean, min=v_min, max=v_max, median=v_med)`

`rownames(v_tab) <- c("Variable 1", "A description of variable 2", "Variable 3")`

and we use `kable()` to generate a decent table:

`kable(v_tab)`

If this looks complicated, bear in mind that with no additional work you can change the order of the variables and include any summary statistics. That’s table A1 in the appendix sorted.
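To see the whole recipe in one place, here's a self-contained sketch using the built-in `mtcars` data as a stand-in for your own dataset (the variable choices and row labels are just for illustration; `knitr` is assumed to be installed for the final step):

```r
# Variables to summarize from the built-in mtcars data
vars <- c("mpg", "hp", "wt")

# One lapply() call per summary statistic
v_mean <- lapply(mtcars[vars], mean, na.rm = TRUE)
v_min  <- lapply(mtcars[vars], min, na.rm = TRUE)
v_max  <- lapply(mtcars[vars], max, na.rm = TRUE)
v_med  <- lapply(mtcars[vars], median, na.rm = TRUE)

# Round the means, then bind everything into one table
v_tab <- cbind(mean = round(as.numeric(v_mean), 2),
               min = v_min, max = v_max, median = v_med)
rownames(v_tab) <- c("Miles per gallon", "Horsepower", "Weight")

# Render the table (only if knitr is available)
if (requireNamespace("knitr", quietly = TRUE)) print(knitr::kable(v_tab))
```

Reordering the rows or adding another statistic is just a matter of editing `vars` or adding another `lapply()` line.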

## Set Encoding When Importing CSV Data in R Under Windows

This is another simple thing I keep on looking up: how to set the encoding when importing CSV data in R under Windows. I need this when my data file is in UTF-8 (pretty standard these days), but I’m using R under Windows; or when I have a Windows-encoded file when using R elsewhere. The default encoding in Windows is not UTF-8, and R uses the default encoding — well, by default. Typically this is not an issue unless my data file contains accented characters in strings, which can lead to garbled text when the wrong encoding is set/assumed.

The solution is quite simple: add an `encoding` argument to the `read.csv()` command, like this:

`x <- read.csv("datafile.csv", encoding="Windows-1252")`

or like this:

`x <- read.csv("datafile.csv", encoding="UTF-8")`
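One caveat worth knowing: `encoding` only declares how the strings in the file are marked, while the related `fileEncoding` argument actually converts from the file's encoding to R's native encoding, which is often what you want when the file was written on a different platform. Here's a minimal round-trip sketch (the file name and data are made up; everything is base R):

```r
# Write a small CSV with accented characters as UTF-8 to a temp file
tmp <- tempfile(fileext = ".csv")
df  <- data.frame(name = c("Zoë", "José"), stringsAsFactors = FALSE)
write.csv(df, tmp, row.names = FALSE, fileEncoding = "UTF-8")

# Read it back, converting from UTF-8 to R's native encoding
x <- read.csv(tmp, fileEncoding = "UTF-8", stringsAsFactors = FALSE)
```

If accented characters still come out garbled with `encoding`, trying `fileEncoding` instead is usually the fix.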

## How to Run a Regression on a Subset in R

Sometimes we need to run a regression analysis on a subset or sub-sample. That’s quite simple to do in R. All we need is the `subset` command. Let’s look at a linear regression:

`lm(y ~ x + z, data=myData)`

Rather than run the regression on all of the data, let’s do it for only women, or only people with a certain characteristic:

`lm(y ~ x + z, data=subset(myData, sex=="female"))`

`lm(y ~ x + z, data=subset(myData, age > 30))`

The `subset()` command takes the data set and a condition that identifies the subset.
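A concrete sketch with the built-in `mtcars` data (using the transmission dummy `am` as the grouping variable, analogous to the `sex` example above):

```r
# Regress fuel economy on weight: once for all cars,
# once only for manual-transmission cars (am == 1)
m_all    <- lm(mpg ~ wt, data = mtcars)
m_manual <- lm(mpg ~ wt, data = subset(mtcars, am == 1))

# The subset model is fitted on fewer observations
nobs(m_all)     # all 32 cars
nobs(m_manual)  # only the 13 manual cars
```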

## How to Calculate the Mode in R

This is something I keep looking up, because for whatever reason R does not come with a built-in function to calculate the mode. (The `mode()` function does something else, not what I’d expect given that there are `mean()` and `median()`…) It’s quite easy to write a short function to calculate the mode in R:

```
Mode <- function(x) {
  uni <- unique(x)
  uni[which.max(tabulate(match(x, uni)))]
}
```
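A quick usage example (repeating the definition so the snippet runs on its own):

```r
# Mode: the most frequent value in a vector
Mode <- function(x) {
  uni <- unique(x)
  uni[which.max(tabulate(match(x, uni)))]
}

Mode(c(1, 2, 2, 3, 2, 5))    # 2 occurs most often
Mode(c("a", "b", "b", "c"))  # works for character vectors too
```

Note that in case of a tie, this returns the first of the tied values in order of appearance.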

## Predicted Probabilities in R

I was recently asked how to calculate predicted probabilities in R. Of course we could calculate them by hand, but it's usually easier to let R do it. The command we need is `predict()`; here's how to use it.

First we need to run a regression model. In this example, I predict whether a person voted in the previous election (binary dependent variable) with variables on education, income, and age. I use logistic regression:

`m <- glm(voted ~ edu + income + age, family="binomial", data=voting)`

Now comes the not so obvious part: we need to specify the cases we are interested in. We’re using `with()` and `data.frame()` to do so. In this case, I’m interested in the predicted probabilities of two people, both with average (mean) education and income, but one aged 20 and the other aged 60. In this step, we need to cover all the independent variables.

`newdata <- with(voting, data.frame(age = c(20, 60), edu=mean(edu, na.rm=TRUE), income=mean(income, na.rm=TRUE)))`

Finally we can get the predictions:

`predict(m, newdata, type="response")`

That’s our model `m` and `newdata` we’ve just specified. `type="response"` calculates the predicted probabilities. We get

```
        1         2
0.3551121 0.6362611
```

So 36% for the person aged 20, and 64% for the person aged 60.

Often, however, a picture will be more useful. The logic is the same. We use the same model, and ask R to predict for every age from 18 to 90 (I guess you don’t want to do this by hand).

`newdata2 <- with(voting, data.frame(age = 18:90, edu=mean(edu, na.rm=TRUE), income=mean(income, na.rm=TRUE)))`

We once again use `predict()`, but this time also ask for standard errors.

`preds <- predict(m, newdata2, type="response", se.fit=TRUE)`

For the plot, I want the predicted probabilities +/- 1.96 standard errors (that’s the 95% confidence interval; use `qnorm(0.975)` if 1.96 is not precise enough). I extract and calculate the values for each line separately to better understand the code.

```
predf <- preds$fit                          # predicted
lower <- preds$fit - (1.96 * preds$se.fit)  # lower bounds
upper <- preds$fit + (1.96 * preds$se.fit)  # upper bounds
```

Finally the plot: It’s a simple line plot of the predicted probabilities plotted against the age (18 to 90). I choose not to show the borders of the plot, and then use `lines()` twice to add the lower and upper bounds.

```
plot(18:90, predf, type="l", ylab="Predicted Probability to Vote",
     xlab="Age", bty="n")
lines(18:90, lower, lty=2)
lines(18:90, upper, lty=2)
```
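Since the `voting` data isn't shown here, the following is a self-contained sketch with simulated data so the whole workflow can be run end to end. The variable names (`voted`, `edu`, `income`, `age`) follow the post; the simulated relationship itself is purely illustrative:

```r
# Simulated stand-in for the 'voting' data (illustrative only)
set.seed(42)
n <- 1000
voting <- data.frame(
  edu    = rnorm(n, 12, 3),
  income = rnorm(n, 50, 15),
  age    = sample(18:90, n, replace = TRUE)
)
# Turnout made to depend positively on age and education
p <- plogis(-2 + 0.03 * voting$age + 0.05 * voting$edu)
voting$voted <- rbinom(n, 1, p)

# Logistic regression, as in the post
m <- glm(voted ~ edu + income + age, family = "binomial", data = voting)

# Predictions for every age from 18 to 90, other variables at their means
newdata2 <- with(voting, data.frame(age = 18:90,
                                    edu = mean(edu, na.rm = TRUE),
                                    income = mean(income, na.rm = TRUE)))
preds <- predict(m, newdata2, type = "response", se.fit = TRUE)

predf <- preds$fit                          # predicted
lower <- preds$fit - (1.96 * preds$se.fit)  # lower bounds
upper <- preds$fit + (1.96 * preds$se.fit)  # upper bounds

# Line plot with dashed confidence bounds
plot(18:90, predf, type = "l",
     ylab = "Predicted Probability to Vote", xlab = "Age", bty = "n")
lines(18:90, lower, lty = 2)
lines(18:90, upper, lty = 2)
```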