Logistic regression and contingency tables give the same odds ratios: a worked example

A three-way contingency table can also be interpreted as a logistic regression with two binary independent variables and their interaction. I have just encountered a very good example illustrating how logistic regression and contingency tables yield the same odds ratios.

The study uses the 2015 dataset from the Youth Risk Behavior Surveillance System (YRBSS), a national school-based survey conducted by the CDC on more than 10,000 students in the United States.

We start with a plausible question: are kids in the sexual minority (LGBT) group who use the computer more often (meaning > 5 hours) more likely to attempt suicide?

First of all, this question is incomplete, because it does not say who the comparison group is. There are 3 ways to complete it:
  1. Is the LGBT group who use the computer more often more likely to attempt suicide than the majority group who also use the computer more often?
  2. Is the LGBT group who use the computer more often more likely to attempt suicide than the LGBT group who use the computer less often?
  3. Within each group (the LGBT group and the majority group), compare kids who use PC more often with those who use PC less often: in which group is the association with suicide attempts larger? In other words, assuming both LGBT and majority kids are more likely to attempt suicide if they use PC more often, which group is "more" more likely?
(I hope these three don't read the same ...)

The three slope coefficients (two main effects and one interaction) of a single logistic regression, given further below, answer all three questions.
But let's look at them separately first:

1. For the kids who use computers more often, is the LGBT group more likely to attempt suicide than the majority group?

This is a 2-by-2 table of (Attempt Suicide Y/N) by (LGBT/Majority Group). Among the kids who use PC more often, 26% of the 530 LGBT kids have attempted suicide, while only 8% of the 4,583 majority kids have. So among the use-PC-more-often kids, the LGBT group has higher odds of attempting suicide than the majority group, with OR = 4.271 (a).

Then how about the use-PC-less-often kids? Is the situation similar?
Among the use-PC-less-often kids, 33% of the 469 LGBT kids and 5% of the 6,736 majority kids have attempted suicide. Here the gap between LGBT kids and majority kids is even larger, with OR = 8.720 (b).

So the answer to question 1 is: (Whether they use PC more often or less often, the LGBT group is more likely to attempt suicide than the majority group.)

Moreover, we can already see part of the answer to question 3 from here: (The LGBT-vs-majority OR is about twice as large among the use-PC-less-often kids (8.720) as among the use-PC-more-often kids (4.271), so the association between using PC more often and attempting suicide is not the same in the two groups. Question 2 below shows that the premise of question 3 holds only for the majority group: within the LGBT group, using PC more often goes with lower, not higher, odds.)

In other words, the association between being in the LGBT group and higher odds of attempting suicide is stronger among those who use PC less often. One possible intuition (a speculation, not something these tables can confirm): since many other things can trouble the LGBT kids, their risk is high whether or not they use PC heavily. For the majority kids, using PC so often does mean something, and it has a stronger association with actions like attempting suicide than it does for the LGBT kids.

*** ***
2. Within the LGBT group, are the kids who use the computer more often more likely to attempt suicide than the kids who use the computer less often?

Among the 469 LGBT kids who use PC less often, 33% have attempted suicide; among the 530 LGBT kids who use PC more often, 26% have attempted suicide (the same rates quoted under question 1). Within the LGBT group, the OR for using PC more often vs. less often is 0.699 (c). Since OR < 1, LGBT kids who use PC more often have lower odds of attempting suicide than LGBT kids who use PC less often.

So the answer to question 2 is: (No, the opposite. Within the LGBT group, using PC more often is associated with lower odds of attempting suicide than using PC less often.)

On the other hand, among the 6,736 majority kids who use PC less often, only 5.4% have attempted suicide, while among the 4,583 majority kids who use PC more often, 7.6% have. For the majority group, using PC more often is associated with higher odds of attempting suicide, with OR = 1.427 (d).

Again, this tells us that the association between being in the LGBT group and higher odds of attempting suicide is stronger among those who use PC less often.
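As a rough check, the four odds ratios can be recomputed from the attempt rates quoted above. This is only a sketch: the rates are rounded, survey-weighted percentages, so the results approximate (a) to (d) rather than reproduce them exactly. The function and variable names are made up for this illustration.

```python
def odds(p):
    """Convert a probability into odds."""
    return p / (1.0 - p)


def odds_ratio(p_group, p_reference):
    """Odds ratio of p_group relative to p_reference."""
    return odds(p_group) / odds(p_reference)


# Attempt rates as quoted in the text (rounded percentages, so approximate).
rate = {
    ("lgbt", "more"): 0.26,
    ("lgbt", "less"): 0.33,
    ("majority", "more"): 0.076,
    ("majority", "less"): 0.054,
}

# (a) LGBT vs. majority among kids who use PC more often
or_a = odds_ratio(rate[("lgbt", "more")], rate[("majority", "more")])
# (b) LGBT vs. majority among kids who use PC less often
or_b = odds_ratio(rate[("lgbt", "less")], rate[("majority", "less")])
# (c) more often vs. less often within the LGBT group
or_c = odds_ratio(rate[("lgbt", "more")], rate[("lgbt", "less")])
# (d) more often vs. less often within the majority group
or_d = odds_ratio(rate[("majority", "more")], rate[("majority", "less")])

for label, value in [("a", or_a), ("b", or_b), ("c", or_c), ("d", or_d)]:
    print(f"OR ({label}) is approximately {value:.2f}")
```

The recomputed values land close to the reported 4.271, 8.720, 0.699 and 1.427; the small gaps come from rounding the percentages.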


*** ***
The logistic regression model (weighted) to address these questions is:
logit(P(Attempt Suicide)) = beta0 + beta1 (LGBT) + beta2 (Use PC more often) + beta3 (LGBT) (Use PC more often)
Parameter estimates:
logit(P(Attempt Suicide)) = -2.857 + 2.166 (LGBT) + 0.356 (Use PC more often) - 0.714 (LGBT) (Use PC more often)
(Use PC more often) is a binary variable with 0: Less often, 1: More often
(LGBT) is a binary variable with 0: Majority group, 1: LGBT group
  1. beta0: log odds of attempting suicide for the majority group who use PC less often;
  2. beta0 + beta1: log odds for the LGBT group who use PC less often;
  3. beta0 + beta2: log odds for the majority group who use PC more often;
  4. beta0 + beta1 + beta2 + beta3: log odds for the LGBT group who use PC more often
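The four sums in the list above can be turned back into probabilities with the inverse logit. Here is a minimal sketch using the estimates quoted in the text; the helper names are invented for this example.

```python
import math

# Coefficient estimates quoted in the text.
beta0, beta1, beta2, beta3 = -2.857, 2.166, 0.356, -0.714


def log_odds(lgbt, pc_more):
    """Model log odds for one cell; lgbt and pc_more are 0/1 indicators."""
    return beta0 + beta1 * lgbt + beta2 * pc_more + beta3 * lgbt * pc_more


def inv_logit(x):
    """Map log odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


# Probability of attempting suicide in each of the four cells.
prob = {
    (lgbt, pc_more): inv_logit(log_odds(lgbt, pc_more))
    for lgbt in (0, 1)
    for pc_more in (0, 1)
}

for (lgbt, pc_more), p in sorted(prob.items()):
    group = "LGBT" if lgbt else "majority"
    use = "more often" if pc_more else "less often"
    print(f"{group}, uses PC {use}: {100 * p:.1f}%")
```

With two binary predictors plus their interaction the model has one parameter per cell, so these fitted probabilities are just the cell percentages from the tables (about 5.4%, 7.6%, 33% and 26%).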

(2) - (1) = beta1: the log OR for LGBT vs. majority group among those who use PC less often.
exp(beta1) = exp(2.166) = 8.72, the same as the OR (b) obtained in question 1.
This is the effect of (LGBT) while (Use PC more often) is at its baseline.

(3) - (1) = beta2: the log OR for using PC more often vs. less often among the majority group.
exp(beta2) = exp(0.356) = 1.43, the same as the OR (d) in question 2.
This is the effect of (Use PC more often) while (LGBT) is at its baseline.

(4) - (2) = beta2 + beta3: the log OR for using PC more often vs. less often among the LGBT group.
exp(beta2 + beta3) = exp(0.356 - 0.714) = 0.699, the same as the OR (c) in question 2.

(4) - (3) = beta1 + beta3: the log OR for LGBT vs. majority among kids who use PC more often.
exp(beta1 + beta3) = exp(2.166 - 0.714) = 4.272, the same as the OR (a) obtained in question 1.
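The four exponentials above are easy to verify. A short sketch, using only the coefficient estimates from the text (variable names are illustrative):

```python
import math

# Slope estimates quoted in the text.
beta1, beta2, beta3 = 2.166, 0.356, -0.714

# LGBT vs. majority, among kids who use PC less often: OR (b)
or_b = math.exp(beta1)
# More often vs. less often, within the majority group: OR (d)
or_d = math.exp(beta2)
# More often vs. less often, within the LGBT group: OR (c)
or_c = math.exp(beta2 + beta3)
# LGBT vs. majority, among kids who use PC more often: OR (a)
or_a = math.exp(beta1 + beta3)

print(f"(a) {or_a:.3f}  (b) {or_b:.3f}  (c) {or_c:.3f}  (d) {or_d:.3f}")
```

Any differences in the third decimal place from the table-based ORs come from the coefficients being rounded to three decimals.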

Since the interaction term has a negative estimate (-0.714), the log OR for LGBT vs. majority is 0.714 lower among kids who use PC more often than among kids who use PC less often; on the OR scale that is a factor of exp(-0.714) = 0.49. In other words, in the use-PC-less-often group the odds of attempting suicide are markedly higher for LGBT vs. majority (OR 8.72 compared with 4.27), as the answer to question 3 above states.
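Another way to read the interaction coefficient is as a log ratio of odds ratios: exp(beta3) should equal both (a) / (b) and (c) / (d). The sketch below checks this with the rounded values reported in the text, so agreement is only approximate; the variable names are invented for this example.

```python
import math

# Interaction estimate and the four table-based ORs, as quoted in the text.
beta3 = -0.714
or_a, or_b, or_c, or_d = 4.271, 8.720, 0.699, 1.427

# Ratio of the LGBT-vs-majority ORs: PC more often over PC less often.
ratio_lgbt_vs_majority = or_a / or_b
# Ratio of the more-vs-less-often ORs: LGBT group over majority group.
ratio_pc_use = or_c / or_d
# The same quantity obtained from the fitted interaction coefficient.
exp_beta3 = math.exp(beta3)

print(f"(a)/(b) = {ratio_lgbt_vs_majority:.3f}")
print(f"(c)/(d) = {ratio_pc_use:.3f}")
print(f"exp(beta3) = {exp_beta3:.3f}")
```

All three come out at roughly 0.49, which is why the same interaction term describes the comparison in question 3 from either direction.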
