How to manually change the order of legend and plot means and error bars

It is good practice to let the legend follow the trend of the data. In the figure below, for example, the legend is also ordered from "Top 20%" to "Bottom 80%".
Another example in which the legend is ordered from high to low by the values in 2010:

The rest are technical notes.


Depending on how the code is organized in ggplot2, there are different ways to change the order:

1. Sometimes just reordering the levels of the factor is enough:
library(ggplot2)

# reorder the factor levels; the legend follows this order
m4_mean_all_2$Group <- factor(m4_mean_all_2$Group,
                              levels =c('Top 20%','In Total', 'Bottom 80%'))

# plot the top20 by group
ggplot(data=m4_mean_all_2, aes(x=year, y=wm, group=Group,
                               color=Group, shape=Group))+
  geom_point(size=4)+
  geom_line(size=1)+
  geom_errorbar(aes(ymin=CI_low, ymax=CI_high), width=.2)+
  ggtitle("Urban Density Change Over Time in Colombia") +
  labs(x="Year", y="Average City Density", color = "", shape="")

# Moreover, labs(x="Year", y="Average City Density", color = "", shape="") sets the color and shape legend titles to empty strings, which suppresses the default legend title for each.
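
When the legend should follow the values in a particular year, the level order can be derived from the data instead of typed by hand. The following is only a minimal sketch in base R; the data frame `toy` and its columns `Region`, `year`, and `value` are made up for illustration and are not the data used in the figures.

```r
# Hypothetical toy data: one value per region and year
toy <- data.frame(
  Region = rep(c("A", "B", "C"), each = 2),
  year   = rep(c(2000, 2010), times = 3),
  value  = c(1, 5, 2, 9, 3, 7)
)

# Rank the regions by their 2010 value, highest first
last_year <- toy[toy$year == 2010, ]
region_levels <- as.character(
  last_year$Region[order(last_year$value, decreasing = TRUE)]
)

# Use that ranking as the factor level order;
# by default ggplot2 lists discrete legend entries in level order
toy$Region <- factor(toy$Region, levels = region_levels)
levels(toy$Region)
```

The reordered factor can then be passed to ggplot in the same way as `m4_mean_all_2$Group` above.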

2. Sometimes releveling the factor doesn't work; then we need to set the legend order manually using "scale_..._discrete(breaks = ...)" in ggplot.
This may be because, in this case, the statistics shown in the figure are computed directly with "stat_summary".
# manually assigned order
region_order <- c("Land-Rich Developed Countries (LRDC)",
                         "Sub-Saharan Africa (SSA)" ,
                         "Western Asia and North Africa (WANA)",
                         "East Asia and the Pacific (EAP)" ,
                         "Latin America and the Caribbean (LAC)",
                         "Southeast Asia (SEA)" ,
                         "Europe and Japan (E&J)" ,
                         "South and Central Asia (SCA)"
                         )

ggplot(data=univ_long, aes(year, Population, group=Region, color=Region, shape=Region))+
  stat_summary(fun = median, geom='line') +  # 'fun' replaces the deprecated 'fun.y' (ggplot2 >= 3.3.0)
  stat_summary(fun = median, geom='point', size = 3) +
  labs(x="Year", y="Median City Population")+
  scale_color_discrete(breaks = region_order) +
  scale_shape_discrete(breaks = region_order)
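
To try the "breaks" approach without the original data, here is a small self-contained sketch. The data frame `toy`, its values, and the vector `legend_order` are hypothetical; it assumes ggplot2 3.3.0 or later, where `stat_summary()` takes `fun` rather than `fun.y`.

```r
library(ggplot2)

# Hypothetical toy data: three regions, three cities per region and year
toy <- data.frame(
  Region     = rep(c("A", "B", "C"), each = 6),
  year       = rep(c(2000, 2010), times = 9),
  Population = c(1, 2, 3, 4, 5, 6,
                 10, 20, 30, 40, 50, 60,
                 4, 8, 12, 16, 20, 24)
)

# Desired legend order, set by hand (highest 2010 median first)
legend_order <- c("B", "C", "A")

p <- ggplot(toy, aes(year, Population, group = Region,
                     color = Region, shape = Region)) +
  stat_summary(fun = median, geom = "line") +
  stat_summary(fun = median, geom = "point", size = 3) +
  labs(x = "Year", y = "Median City Population") +
  # the same breaks on both scales keep the color and shape legends merged
  scale_color_discrete(breaks = legend_order) +
  scale_shape_discrete(breaks = legend_order)

p
```

Only the legend order changes here; the mapping of colors and shapes to regions still follows the factor levels (or alphabetical order for character columns).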

How to prepare the dataset:
The helper function in this cookbook is very helpful. It shows how to use ddply inside a function to return the statistics we need. A data frame must be supplied to this function, and variable names must be passed as quoted strings, e.g. "var".
As a grouping argument, c("year", "PDETS") is the same as .(year, PDETS).
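
A quick way to convince yourself that the two forms of the grouping argument agree is to run both on a small data frame. This is just a sketch: the values in `toy` are invented, and only the column names are borrowed from the call used later in this post.

```r
library(plyr)

# Hypothetical toy data with two grouping columns
toy <- data.frame(
  year    = rep(c(2000, 2010), each = 4),
  PDETS   = rep(c("yes", "no"), times = 4),
  Density = c(10, 20, 30, 40, 50, 60, 70, 80)
)

# Grouping variables given as a character vector ...
by_string <- ddply(toy, c("year", "PDETS"), summarise, m = mean(Density))

# ... and as unquoted names wrapped in .()
by_quoted <- ddply(toy, .(year, PDETS), summarise, m = mean(Density))

all.equal(by_string, by_quoted)
```

The character form is the convenient one inside a wrapper function, because the column names can be passed in as ordinary arguments.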

Example

# a self-defined function to calculate the weighted mean, sd, se, and CIs; it could be improved, as I simply set na.rm = T.
weighted_Output <- function(x, w){
  N = sum(!is.na(x))
  wm = weighted.mean(x, w, na.rm = T) # weighted mean
  m = mean(x, na.rm = T) 
  m_of_w = mean(w, na.rm = T) # mean of the weights
  median = median(x, na.rm = T)
  if (N > 1) {
    sd = sqrt(sum(w*(x - wm)^2, na.rm = T)/(N-1)/m_of_w)
  } else {
    sd = NaN
  }
  se = sd/sqrt(N)
  CI_low = wm - qt(0.975, df=N)*se
  CI_high = wm + qt(0.975, df=N)*se
  return(list(n=N, m=m, wm=wm, sd=sd, se=se, CI_low = CI_low, CI_high = CI_high))
}
# weighted_Output(Sample200$DensityUET3, Sample200$UrbanExtentT3 )
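
One sanity check on the weighted sd formula used above: when all weights are equal, it should reduce to the ordinary sample standard deviation. The sketch below repeats the formula on made-up numbers (`x` and `w` are hypothetical) instead of calling `weighted_Output`, so that it runs on its own.

```r
# Hypothetical data with equal weights
x <- c(2, 4, 4, 5, 7)
w <- rep(3, length(x))

N  <- sum(!is.na(x))
wm <- weighted.mean(x, w)

# Same expression as in weighted_Output():
# weighted sum of squares, divided by (N - 1) and by the mean weight
wsd <- sqrt(sum(w * (x - wm)^2) / (N - 1) / mean(w))

all.equal(wsd, sd(x))  # TRUE: equal weights cancel out
```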

library(plyr)  # provides ddply

# supplying variables to ddply is indeed complicated; the data frame must be passed in explicitly.
# output e.g.: Region  n  wm  sd  se  CI_low  CI_high
get_WM_and_CI <- function(dataset, fac, var, w){
  CI_output <- ddply(dataset, fac, 
                     function(dataset, var){
                       w_output = weighted_Output(dataset[,var], dataset[,w])
                       n = w_output$n
                       wm = w_output$wm
                       sd = w_output$sd
                       se = w_output$se
                       CI_low = w_output$CI_low
                       CI_high = w_output$CI_high
                       return(c(n, wm, sd, se, CI_low, CI_high))},
                     var)
  # This is for 0 or only 1 fac. variable
  names(CI_output) <- c('Region', 'n', 'wm', 'sd', 'se', 'CI_low', 'CI_high')
  return(CI_output)
}

# for more than one grouping variable, use the names in fac directly as column names
get_WM_and_CI2 <- function(dataset, fac, var, w){
  CI_output <- ddply(dataset, fac, 
                     function(dataset, var){
                       w_output = weighted_Output(dataset[,var], dataset[,w])
                       n = w_output$n
                       wm = w_output$wm
                       sd = w_output$sd
                       se = w_output$se
                       CI_low = w_output$CI_low
                       CI_high = w_output$CI_high
                       return(c(n, wm, sd, se, CI_low, CI_high))},
                     var)
  names(CI_output) <- c(paste(fac),'n', 'wm', 'sd', 'se', 'CI_low', 'CI_high')
  return(CI_output)
}

m4_mean1 <- get_WM_and_CI2(Co_long, c("year", "PDETS"), "Density", "UE")
In this way, I rewrote all the code for the figures in the Colombia paper.
