2 min read

Differences within genders are greater than those between genders

As Titled. Borrowing the (NHANES)[https://www.cdc.gov/nchs/nhanes/index.htm] for this one as the licensing is about 1000% easier to deal with than Health Survey for England. Years 2009-10 and 2011-12 are in the {NHANES} package.

Partially this is about reducing a distribution to a point. You can say “the average man is taller than the average woman”.

data = NHANES::NHANESraw %>%  
  filter(!is.na(Height)) %>% 
  filter(Age>=18) #Adults for simplicity

data %>% 
  group_by(Gender) %>% 
  summarise(`Height (m)` = weighted.mean(Height, WTINT2YR)/100) %>% 
  knitr::kable(digits = 2)
Gender Height (m)
female 1.62
male 1.76

But the distribution of heights looks more like this:

data %>% 
  mutate(Height = Height/100) %>%  #cm to m
  ggplot(aes(x = Height, colour = Gender, weight = WTINT2YR)) + #geom_density accepts survey weight as an aes!
    geom_density() +
    ggthemes::scale_colour_few()

i.e. being below 1.62m & male is possible, being over 1.76m and female is possible.

data %>% 
  filter(Gender == "female") %>% 
  mutate(tall = Height > 176) %>% 
  summarise(tall_women  = weighted.mean(tall, WTINT2YR)) %>% 
  mutate(tall_women = scales::percent(tall_women, accuracy = 0.1)) %>% 
  knitr::kable()
tall_women
2.7%
data %>% 
  filter(Gender == "male") %>% 
  mutate(short = Height < 162) %>% 
  summarise(short_men  = weighted.mean(short, WTINT2YR)) %>% 
  mutate(short_men = scales::percent(short_men, accuracy = 0.1)) %>% 
  knitr::kable()
short_men
3.5%

As a contrived example, we could have a 50 men and 50 women on a double-decker bus and 1 or 2 men would be below the female average height, and 1 or 2 women would be above male average height.

A very contrived example, American data and public transit hasn’t really caught on there.

While I poked the {ggplot2} (geom_density documentation)[https://ggplot2.tidyverse.org/reference/geom_density.html], I saw that position=fill gives conditional densities. e.g. if I plucked a random person and knew their height, how likely is it that they’re male/female?

data %>% 
  ggplot(aes(x=Height, fill=Gender, weight=WTINT2YR)) +
  geom_density(position="fill") +
  ggthemes::scale_fill_few() +
  scale_y_continuous(labels=scales::percent)

Either I’ve ducked that one up, or there’s a bump there that corresponds to something significant in feet and inches.

To really hammer the point, if I don’t split by gender then the distribution is not bi-modal.

data %>% 
  ggplot(aes(x=Height)) + geom_density()

There’s not a really obvious “male” peak and an obvious “female” peak.

This stuff matters when you can’t reasonably assume that someone is approximately average. In basketball, height helps. Female athletes who are just as many \(\sigma\) from the average as the male athletes don’t deserve the sh!t they get for being tall.

And if a study said “the average response was …” it’s totally expected that your response might be plus or minus a \(\sigma\) or two.