Week 4 Activity

Correlation vs. Causation, Crosstabulation 2.0

I recommend that you attempt to use a Do-File for the activity this week. Directions can be found below.
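
If you have not set up a do-file before, a minimal sketch might look like the following. The file name "week4_activity.log" is a placeholder chosen for illustration; the dataset path is the one used later in this activity. Adjust both to match your own setup.

```stata
* Week 4 activity do-file (sketch; the log file name is a placeholder)
capture log close                              // close any log left open from earlier
log using "week4_activity.log", replace text   // save the output as a plain-text log

use "L:\stats 2020.dta", clear                 // open the dataset, clearing any data in memory

tab1 sex partners                              // add each command from the activity as you go

log close                                      // stop logging at the end
```

Running the do-file from top to bottom reproduces every output in order, which makes it easier to copy results into your journal.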

This week, you will begin to look more carefully at the relationships between variables…

Here you will find directions for your assignment for week 4. You’ll be using Stata commands to enrich your understanding of correlation. You’ve familiarized yourself with crosstabs in weeks 2 and 3. We’ll begin there. At this point in the semester, you’ll be invited to begin working with variables you’re interested in. You can do this by familiarizing yourself with the GSS codebook and with the techniques you’ve learned in Stata already. Take some time to play around here.

While this is presumptuous, I’d recommend that you begin with a simple hypothesis: that there could reasonably be a relationship between a demographic variable and an attitude or behavior variable. For example, you could reasonably anticipate a relationship between “sex” and “number of sexual partners reported in the past year.” So you’d select “sex” and “partners.” You’ll lock in your decision in a couple of weeks, but if you’d like to play around with your variables in the activities, that might be worth your time.

Let’s begin with a simple cross tab and build from there.

First, you need to open your dataset:

use "L:\stats 2020.dta"

Let’s imagine that we’re interested in the two variables “sex” and “partners.” Remember, we can produce frequency distributions for multiple variables by using the Stata command "tab1".

tab1 sex partners

Now we can create a cross tab of these two variables by using the command:

tab sex partners

But this is a bit difficult to interpret.

Why is this output difficult to interpret?

Sometimes when we are interested in relationships between variables, it makes sense to recode a variable to make it simpler. In this case, the variable “partners” is divided into 9 categories. But what if we were only interested in whether the respondents (1) were celibate, (2) had only one partner, or (3) had multiple partners? This way we could reduce the variable to 3 categories.

Simplifying variables comes at a cost. What are some problems this could introduce? What questions would you ask yourself when determining whether or not it was appropriate to recode in this way?

First we’ll use the “nol” option (short for “nolabel”) to see the numeric codes behind the categories of the “partners” variable.

tab partners

tab partners,nol
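
If switching between the labeled and unlabeled tables is awkward, two other built-in commands can show the codes and labels together. This is an optional sketch, not a step from the Longest text.

```stata
* Optional: view numeric codes and their labels side by side
codebook partners     // lists each value of partners next to its label

numlabel, add         // prefixes every value label in memory with its numeric code
tab partners          // the table now shows entries such as "0. ..." and "1. ..."
numlabel, remove      // undo the prefixes when you are done
```

Note that "numlabel, add" changes the labels for every labeled variable in memory until you run "numlabel, remove".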

Now that we can “see” the field, we can create a new variable by recoding. We’ll break this into pieces. Follow along on pp. 133-135 of the Longest text.

We want to keep the category “no partners” as is, so the first rule will read (0=0). We also want to keep the category “1 partner” as is, so the second rule will read (1=1). We want to collapse everyone who reported more than one sexual partner into one category, so the third rule will read (2/7=3). This signals to Stata that we’d like to recode all values from 2 through 7 as “3” in the new variable. You’ll notice that I have not collapsed the value “9,” which is the “Don’t know” category. We will keep it as is for now with the rule (9=9); later we will learn how to drop those cases if necessary. At the end, we add the option “gen()” so that Stata generates a new variable instead of overwriting the original. All together that reads:

recode partners (0=0) (1=1)(2/7=3)(9=9), gen(partnersrecoded)

This will generate a new variable called “partnersrecoded.” We can now check in on that new variable!

tab partnersrecoded
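
The new variable shows only the numbers 0, 1, 3, and 9 in the table, because the recode rules above do not attach value labels. One way to make the output easier to read is to define and attach labels yourself. In this sketch the label name "partners3lbl" and the label wording are my own assumptions; choose whatever wording fits your journal.

```stata
* Attach readable labels to the recoded variable
* (the label name partners3lbl and the label text are illustrative choices)
label define partners3lbl 0 "No partners" 1 "One partner" 3 "Multiple partners" 9 "Don't know"
label values partnersrecoded partners3lbl

tab partnersrecoded   // the table now shows the labels instead of bare numbers
```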

Include your output and a brief interpretation of the output in your journal.

Now we can create a crosstab that’s a bit easier to interpret, for the purposes of our particular focus.

tab sex partnersrecoded

We could simplify it further, by creating a variable that marks whether the respondent was sexually active or sexually inactive in the prior year. Try to create the syntax on your own before checking mine below…

Great work! I created the syntax below.

recode partners (0=0)(1/9=1),gen(sexactiverecoded)

We may now want to crosstabulate this new variable with “sex.”

tab sex sexactiverecoded,row

Interpret the output in a sentence or two. What might you say about this output, based on your readings in Salkind & Frey?

As you might imagine, this information might be easier to understand by visualizing the data. Skip ahead to pg. 151 in the Longest text to follow along about how we might create a multivariate bar graph to display this data.

Without my directions here, try to create a multivariate bar chart of the variables “sex” and “sexactiverecoded.” Be strategic about how you label the chart. Save the output to your journal and interpret it in a sentence or two.
With all the outputs you’ve examined, involving the variables “sex”, “partners”, “partnersrecoded”, and “sexactiverecoded”, can you identify an output that could be easily misread, misinterpreted, or manipulated?

Now skip ahead to pg. 184 in the Longest text to follow along. We’re going to learn one more tool for interpreting correlation: scatterplots (sometimes called “scattergrams”). We’ll work with the original “partners” variable first, to look at the measures of central tendency and variability.

Why is this variable more appropriate for these techniques than the recoded variables?

sum partners, det

Now we could produce a scatterplot of “partners” against “sex,” but this would look a bit odd. Let’s try it.

scatter sex partners

Why is this not particularly interesting to interpret?
Can you imagine a variable that might be more interesting than “sex”?

Let’s try “age.”

scatter age partners
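
Because “partners” takes only a handful of distinct values, many respondents land on exactly the same point, and the plot can hide how many cases sit at each spot. One possible remedy, offered as an optional sketch and not a step from the Longest text, is the “jitter()” option, which adds a small amount of random noise to each marker. The value 5 is an arbitrary choice; try a few values and compare.

```stata
* Optional: spread out overlapping points so dense areas are visible
scatter age partners, jitter(5)   // 5 is an arbitrary amount of noise
```

Jitter changes only how the points are drawn, not the underlying data.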

Now this output gives us something to interpret!

How would you describe the relationship between these two variables for the GSS respondents? Use the Salkind & Frey chapter to assist you.

Finally, we’ll use the “correlate” command, abbreviated “corr,” to interpret the data further. Turn to page 188 in the Longest text to follow along.
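
As a starting point, the command might be applied to the two variables from the scatterplot as sketched below. Keep in mind that respondents coded 9 (“Don’t know”) on “partners” are treated as the number 9 unless you exclude them, so the last line shows one assumed way to leave them out.

```stata
* Correlation between age and number of partners (sketch)
corr age partners                     // correlation using all non-missing cases

pwcorr age partners, obs sig          // same coefficient, plus the number of cases and a p-value

corr age partners if partners != 9    // leaves out the "Don't know" category
```

Compare the coefficients with and without the “Don’t know” cases and note in your journal whether the difference changes your interpretation.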