Combining two strings in R when they are in a vector

Joining two strings in R is easily done with the paste() function. If the strings are elements of a single vector, however, paste() does not behave as we might expect: by default it returns one string per element rather than a single combined string.

To illustrate this, take a vector of characters or symbols, and a set of values you want to convert into these characters. The letters “A” to “F” may be more transparent for the example, but it was the bars that led me to this problem: _ ▁ ▂ ▃ ▅ ▇. I write the bars as Unicode escape sequences.

characters = c("A", "B", "C", "D", "E", "F")  # letters, easier to read
characters = c("_", "\u2581", "\u2582", "\u2583", "\u2585", "\u2587")  # bars; this overwrites the line above

Let’s start with a couple of random values:

values = runif(5)

In my example, I got:

[1] 0.9568333 0.4533342 0.6775706 0.5726334 0.1029247

Then we want to select one of the characters or symbols based on these values, picking a taller bar for larger values, which effectively creates our own small bar chart (sparkline).

selected = characters[round(values * 5)+1]
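To see how the values map to symbols, it can help to look at the intermediate index. A small sketch, using the five values printed above, hard-coded so the result is repeatable; the names `values_example`, `idx` and `selected_example` are mine and not part of the original code:

```r
characters <- c("_", "\u2581", "\u2582", "\u2583", "\u2585", "\u2587")
# the five values from the example output above, hard-coded for repeatability
values_example <- c(0.9568333, 0.4533342, 0.6775706, 0.5726334, 0.1029247)

# values in [0, 1] are scaled to 0..5 and rounded; +1 because R indexes from 1
idx <- round(values_example * 5) + 1
idx                                   # 6 3 4 4 2
selected_example <- characters[idx]   # one symbol per value
```

One consequence of using round() here: the lowest and the highest symbol each cover only about half as wide a range of values as the others (roughly below 0.1 and above 0.9). For a quick sparkline that is probably fine.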

This works well, except that the result is a character vector with one element per symbol rather than a single string containing all the symbols:

"▇" "▂" "▃" "▃" "▁"

My next step was paste(), or rather its shorthand paste0(), which uses no separator. Called like this, however, it still returns one string per element instead of a single string:

paste0(selected)

[1] "▇" "▂" "▃" "▃" "▁"

Enter library(stringr) and the str_c() function, whose collapse argument joins the elements into one string:

str_c(selected, sep="", collapse="")

[1] "▇▂▃▃▁"

There we go, a single string…
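For what it's worth, base R can do the same without an extra package: paste() and paste0() also accept a collapse argument. A minimal sketch, with the selected symbols hard-coded (as `selected_example`, my name) so that it runs on its own:

```r
selected_example <- c("\u2587", "\u2582", "\u2583", "\u2583", "\u2581")

# without collapse: still five separate strings
length(paste0(selected_example))

# with collapse: the elements are joined into one string
spark <- paste0(selected_example, collapse = "")
length(spark)
spark
```

Whether you prefer str_c() or paste0() is mostly a matter of taste; the collapse argument does the work in both.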

Post-Doc position/ Quantitative Social Scientist/ Migration in Potsdam

Jasper Tjaden is looking for a post doc to work with him in Potsdam starting October 2022. The contract is initially two years with the possibility of extension.

Interested candidates should send their CV and a short (one paragraph) motivation (in the body of the email).

Candidates should have:

  • Interest in migration/ integration studies (!)
  • Advanced R/ Stata skills
  • Good understanding of econometrics
  • Good understanding of causal inference
  • Experience collecting data/ survey methodology
  • Interest in/ experience with digital data (Facebook/ Google etc.)
  • Team player
  • Solid publication record

Prof. Dr. Jasper Tjaden

Faculty of Economic and Social Sciences
Professor of Applied Social Research & Public Policy

https://jaspertjaden.com/

Plotting a normal distribution in R

Apparently there are some unnecessarily complicated tutorials out there on how to draw a normal distribution (or another probability distribution) in R. There is no need for a loop; in fact, a single line of code is enough:

curve(dnorm(x, 0, 1), from=-4, to=4)

That’s the density of a normal distribution with mean 0 and standard deviation 1, plotted from -4 to +4.
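What curve() does here is evaluate the expression on a grid of x values and connect the points. A rough sketch of the same plot drawn by hand, assuming a grid of 101 points (the default of curve()'s n argument); the variable names `grid` and `dens` are illustrative:

```r
# grid of x values from -4 to 4 (101 points, matching curve()'s default n)
grid <- seq(-4, 4, length.out = 101)
dens <- dnorm(grid, mean = 0, sd = 1)   # density at each grid point

# type = "l" connects the points with a line, as curve() does
plot(grid, dens, type = "l", xlab = "x", ylab = "dnorm(x, 0, 1)")
```

This is only meant to show what the one-liner saves you from typing.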

It’s just as easy to draw normal distributions with different means or standard deviations. Here’s one with mean 2, drawn without a box around the plot (bty="n") and without an x-axis label (xlab=""):

curve(dnorm(x, 2, 1), from=1, to=7, bty="n", xlab="")

How about a beta distribution? It’s not more difficult (assuming you know your shape parameters):

curve(dbeta(x, 10, 2), from=0, to=1, bty="n", xlab="")

So from now on, if you need to visualize your priors to get a feel for whether they constitute reasonable distributions, remember it’s just one line in R.
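To compare two candidate priors in the same figure, curve() has an add argument that draws on top of the existing plot. A small sketch; the second prior, Beta(2, 2), and the legend labels are my own choices for illustration:

```r
# the first call sets up the plot; Beta(10, 2) is the taller of the two densities,
# so the default y range leaves enough room for the second curve
curve(dbeta(x, 10, 2), from = 0, to = 1, bty = "n", xlab = "", ylab = "density")

# add = TRUE draws onto the existing plot instead of starting a new one
curve(dbeta(x, 2, 2), from = 0, to = 1, add = TRUE, lty = 2)

legend("topleft", legend = c("Beta(10, 2)", "Beta(2, 2)"), lty = c(1, 2), bty = "n")
```

If the second density were taller than the first, you would need to set ylim in the first call, because the plotting region is fixed by whichever curve is drawn first.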

How to add text labels to a scatter plot in R?

Adding text labels to a scatter plot in R is easy. The basic function is text(), and here’s a reproducible example of how you can use it to create these plots:

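[Figure: Adding text to a scatter plot in R. Two scatter plots side by side, with labels below the points on the left and labels in place of the points on the right.]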

For the example, I’m creating random data, so your plots will look different from mine. In this fictitious example, I look at the relationship between a policy indicator and performance. It is conventional to put the outcome variable on the Y axis and the predictor on the X axis, but in this example there’s no relationship to reality anyway… I chose min and max values for the random variables because I jotted down this code as an explanation for a replication. We have 25 observations, for 25 units I call “cantons”. The third line creates a character vector with the letters “A” to “Y” (ASCII codes 65 to 89); these are the labels!

policy = runif(25, min=0.4, max=0.7)
perfor = runif(25, min=500, max=570)
canton = sapply(65:89, function(x) rawToChar(as.raw(x)))
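The same labels are also available from R's built-in LETTERS constant, which may be easier to read. A quick check that the two approaches agree (`canton_alt` is just an illustrative name):

```r
canton <- sapply(65:89, function(x) rawToChar(as.raw(x)))  # as in the line above
canton_alt <- LETTERS[1:25]                                 # built-in "A" to "Y"
identical(canton, canton_alt)                               # TRUE
```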

For the scatter plot on the left, we use plot(). Then we add the trend line with abline() and lm(). To add the labels, we use text(): the first argument gives the X value of each point, the second argument the Y value (so R knows where to place the text), and the third argument is the corresponding label. The argument pos=1 tells R to draw the label underneath the point; pos=2, 3, and 4 place it to the left of, above, and to the right of the point.

plot(policy ~ perfor, bty="n", ylab="Policy Indicator", xlab="Performance", main="Policy and Performance")
abline(lm(policy ~ perfor), col="red")
text(perfor, policy, canton, pos=1)

The scatter plot on the right is similar, but here we actually plot the labels instead of the dots. There are two differences in the code. First, we add type="n" to create the scatter plot without drawing any circles (an empty plot, if you will). Second, when we add the text in the third line of the code, we leave out pos=1, because we want the labels centered exactly where the points are.

plot(policy ~ perfor, bty="n", type="n", ylab="Policy Indicator", xlab="Performance", main="Policy and Performance")
abline(lm(policy ~ perfor), col="red")
text(perfor, policy, canton)
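To get both panels next to each other in a single figure, one common approach is par(mfrow = ...). I don't claim this is how the original figure was assembled; it is a sketch under that assumption, with a fixed seed (my addition) so the random data are repeatable:

```r
set.seed(42)  # assumption: any fixed seed, only to make the random data repeatable
policy <- runif(25, min = 0.4, max = 0.7)
perfor <- runif(25, min = 500, max = 570)
canton <- LETTERS[1:25]

old_par <- par(mfrow = c(1, 2))  # one row, two columns of plots

# left panel: points with labels underneath
plot(policy ~ perfor, bty = "n", ylab = "Policy Indicator", xlab = "Performance")
abline(lm(policy ~ perfor), col = "red")
text(perfor, policy, canton, pos = 1)

# right panel: labels instead of points
plot(policy ~ perfor, bty = "n", type = "n", ylab = "Policy Indicator", xlab = "Performance")
abline(lm(policy ~ perfor), col = "red")
text(perfor, policy, canton)

par(old_par)  # restore the previous layout
```

Saving and restoring the graphical parameters keeps later plots from inheriting the two-column layout.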

Calculating VIF by hand

A widespread measure of multicollinearity is the VIF (short for variance inflation factor). Multicollinearity describes the situation when the predictor variables in a multiple regression model are highly correlated, which is usually not desirable (assuming you haven’t gone Bayesian yet).

In R, the VIF can easily be calculated with the vif() function in the car package. It’s actually not difficult to do it by hand, which incidentally helps understand what we measure with the VIF, why there is no different VIF for logistic regression models (the outcome never enters the calculation), and why the VIF is better than looking at bivariate correlations between predictors.

We start with some random data to run the multiple regression model. Here we create one outcome (y) and three predictor variables (x, z, a), full of random numbers. That’ll do for a demonstration.

x = runif(50) 
y = runif(50)
z = runif(50)
a = runif(50)

Here’s a simple OLS model:


m = lm(y ~ x + z + a)

If you have the car package installed, you can easily calculate the VIF:

library(car)
vif(m)

To do it by hand, though, we run a linear regression model (OLS) for each of the predictors. Here’s the code for predictor x. One of the predictors becomes the outcome variable (here x), and the other predictors remain predictors. The variable used as the outcome previously (y) does not appear here.

mx = lm(x ~ z + a)

The VIF of x is simply 1/(1-R²), where R² is taken from this auxiliary model. In R, we can run the following:

1/(1-summary(mx)$r.squared)
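The same step can be repeated for z and a. Here is a minimal sketch that loops over all three predictors; the data frame `predictors`, the names `vif_by_hand` and `vif_from_cor`, and the fixed seed are my additions for illustration. As a cross-check it uses the fact that the VIFs equal the diagonal of the inverse of the predictors' correlation matrix:

```r
set.seed(1)  # assumption: fixed seed so the numbers are repeatable
predictors <- data.frame(x = runif(50), z = runif(50), a = runif(50))

vif_by_hand <- sapply(names(predictors), function(v) {
  # regress predictor v on all the other predictors (the outcome y is not involved)
  others <- setdiff(names(predictors), v)
  aux <- lm(reformulate(others, response = v), data = predictors)
  1 / (1 - summary(aux)$r.squared)
})
vif_by_hand

# cross-check: diagonal of the inverse correlation matrix of the predictors
vif_from_cor <- diag(solve(cor(predictors)))
all.equal(unname(vif_by_hand), unname(vif_from_cor))
```

With purely random predictors like these, all three values should be close to 1. If car is installed, vif() applied to a model with these predictors should return the same numbers.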