How to manually change the order of legend and plot means and error bars

It is good practice to let the legend follow the trend of the data. In the figure below, for example, the legend is ordered from "Top 20%" to "Bottom 80%", just like the data.
Another example in which the legend is ordered from high to low by the values in 2010:

The rest are technical notes.


Depending on how the code is organized in ggplot2, there are different ways to change the order:

1. Sometimes just reordering the levels of the factor is enough:
library(ggplot2)

# reorder the factor levels; ggplot2 lists the legend keys in level order
m4_mean_all_2$Group <- factor(m4_mean_all_2$Group,
                              levels =c('Top 20%','In Total', 'Bottom 80%'))

# plot the top20 by group
ggplot(data=m4_mean_all_2, aes(x=year, y=wm, group=Group,
                               color=Group, shape=Group))+
  geom_point(size=4)+
  geom_line(size=1)+
  geom_errorbar(aes(ymin=CI_low, ymax=CI_high), width=.2)+
  ggtitle("Urban Density Change Over Time in Colombia") +
  labs(x="Year", y="Average City Density", color = "", shape="")

# Moreover, labs(x="Year", y="Average City Density", color = "", shape="") sets the color and shape legend titles to empty strings, which suppresses the default legend title for each.
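When the legend should follow the values in a reference year (as in the 2010 example above), the level order can be computed instead of typed by hand. Below is a minimal sketch in base R; the data frame `toy` and all its values are made up for illustration.

```r
# Toy data (hypothetical values, for illustration only)
toy <- data.frame(
  year  = rep(c(1990, 2000, 2010), times = 3),
  Group = rep(c("A", "B", "C"), each = 3),
  value = c(5, 6, 7,   9, 10, 12,   1, 2, 3),
  stringsAsFactors = FALSE
)

# Values in the reference year, sorted from high to low
ref <- toy[toy$year == 2010, ]
ordered_groups <- as.character(ref$Group[order(ref$value, decreasing = TRUE)])

# Use that order as the factor levels; the legend then follows the level order
toy$Group <- factor(toy$Group, levels = ordered_groups)
levels(toy$Group)  # "B" "A" "C"
```

The reordered `toy$Group` can then be mapped to color and shape in the same way as `Group` in the plot above.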

2. Sometimes releveling the factor doesn't work; then we need to set the order manually using "scale_..._discrete(breaks = ...)" in ggplot2.
This may be because, in this case, the statistics shown in the figure are computed directly by "stat_summary".
# manually assigned order
region_order <- c("Land-Rich Developed Countries (LRDC)",
                         "Sub-Saharan Africa (SSA)" ,
                         "Western Asia and North Africa (WANA)",
                         "East Asia and the Pacific (EAP)" ,
                         "Latin America and the Caribbean (LAC)",
                         "Southeast Asia (SEA)" ,
                         "Europe and Japan (E&J)" ,
                         "South and Central Asia (SCA)"
                         )

ggplot(data=univ_long, aes(year, Population, group=Region, color=Region, shape=Region))+
  # 'fun' replaces the deprecated 'fun.y' argument (ggplot2 >= 3.3.0)
  stat_summary(fun = median, geom='line') +
  stat_summary(fun = median, geom='point', size = 3) +
  labs(x="Year", y="Median City Population")+
  scale_color_discrete(breaks = region_order) +
  scale_shape_discrete(breaks = region_order)
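The vector passed to `breaks` does not have to be typed out. As a rough sketch, it can be derived from the same medians that `stat_summary` draws. The data frame `toy`, the region names, and the numbers below are hypothetical; the plot needs ggplot2 3.3.0 or later because of the `fun` argument.

```r
library(ggplot2)

# Toy data (hypothetical): three regions, two cities each, two years
toy <- data.frame(
  year       = rep(c(2000, 2010), each = 6),
  Region     = rep(c("North", "South", "West"), times = 4),
  Population = c(10, 40, 20, 12, 44, 22,
                 15, 60, 30, 17, 64, 32),
  stringsAsFactors = FALSE
)

# Median per region in the latest year, sorted from high to low
latest <- toy[toy$year == max(toy$year), ]
med <- tapply(latest$Population, latest$Region, median)
region_order <- names(med)[order(med, decreasing = TRUE)]

# Pass the computed order to both scales so the merged legend stays consistent
p <- ggplot(toy, aes(year, Population, group = Region,
                     color = Region, shape = Region)) +
  stat_summary(fun = median, geom = "line") +
  stat_summary(fun = median, geom = "point", size = 3) +
  scale_color_discrete(breaks = region_order) +
  scale_shape_discrete(breaks = region_order)
```

Printing `p` draws the figure; the color and shape scales are given the same `breaks` so that both describe the legend in the same order.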

How to prepare the dataset:
The helper function in this cookbook is very helpful. It shows how to use ddply inside a function to return the statistics we need. A data frame must be supplied to this function, and variable names are passed as quoted strings, e.g. "var".
As a grouping argument, c("year", "PDETS") is equivalent to .(year, PDETS).
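The equivalence of the two grouping forms can be checked on a small example. This is only a sketch: the data frame `toy` and its values are hypothetical, with column names borrowed from this post, and it assumes the plyr package is installed.

```r
library(plyr)

# Toy data (hypothetical values); column names mirror the ones used in this post
toy <- data.frame(
  year    = c(2000, 2000, 2010, 2010),
  PDETS   = c("yes", "no", "yes", "no"),
  Density = c(100, 80, 120, 90),
  stringsAsFactors = FALSE
)

# Grouping variables given as strings vs. with plyr's .() helper
by_string <- ddply(toy, c("year", "PDETS"), summarise, m = mean(Density))
by_dot    <- ddply(toy, .(year, PDETS), summarise, m = mean(Density))
same <- isTRUE(all.equal(by_string, by_dot))

# A column name held in a string selects the column with [ , ]
var <- "Density"
overall_mean <- mean(toy[, var])
```

The string form is the convenient one inside a function, because the grouping variables can then be passed in as ordinary character arguments.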

Example

# A self-defined function to calculate the weighted mean, sd, se, and CIs.
# It could be improved, as na.rm = T is hard-coded.
weighted_Output <- function(x, w){
  N = sum(!is.na(x))
  wm = weighted.mean(x, w, na.rm = T) # weighted mean
  m = mean(x, na.rm = T) 
  m_of_w = mean(w, na.rm = T) # mean of the weights
  median = median(x, na.rm = T)
  if (N > 1) {
    sd = sqrt(sum(w*(x - wm)^2, na.rm = T)/(N-1)/m_of_w)
  } else {
    sd = NaN
  }
  se = sd/sqrt(N)
  # t quantile with N - 1 degrees of freedom, consistent with the (N - 1) used in sd
  CI_low = wm - qt(0.975, df=N-1)*se
  CI_high = wm + qt(0.975, df=N-1)*se
  return(list(n=N, m=m, wm=wm, sd=sd, se=se, CI_low = CI_low, CI_high = CI_high))
}
# weighted_Output(Sample200$DensityUET3, Sample200$UrbanExtentT3 )
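A quick sanity check of the formulas used in weighted_Output, as a sketch on made-up numbers: the weighted mean is sum(w*x)/sum(w), and with equal weights the weighted sd formula should reduce to the ordinary sample sd. The helper `weighted_sd` is a hypothetical name introduced here only to isolate the sd line; it is not part of the original code.

```r
x <- c(2, 4, 6, 8)   # hypothetical values
w <- c(1, 1, 3, 3)   # hypothetical weights

# Weighted mean by hand: sum(w * x) / sum(w)
wm_manual <- sum(w * x) / sum(w)
mean_matches <- isTRUE(all.equal(wm_manual, weighted.mean(x, w)))

# Same sd formula as in weighted_Output, without the NA handling
weighted_sd <- function(x, w) {
  N  <- length(x)
  wm <- weighted.mean(x, w)
  sqrt(sum(w * (x - wm)^2) / (N - 1) / mean(w))
}

# With equal weights, the weighted sd equals the ordinary sample sd
w_equal <- rep(2, length(x))
sd_matches <- isTRUE(all.equal(weighted_sd(x, w_equal), sd(x)))
```

Dividing by the mean of the weights rescales them so that they average to one, which is why the equal-weights case falls back to sd(x).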

library(plyr)  # provides ddply

# Supplying variables to ddply is somewhat complicated: the data frame must be
# passed in, and the variable names are given as strings.
# output columns: Region  n  wm  sd  se  CI_low  CI_high
get_WM_and_CI <- function(dataset, fac, var, w){
  CI_output <- ddply(dataset, fac, 
                     function(dataset, var){
                       w_output = weighted_Output(dataset[,var], dataset[,w])
                       n = w_output$n
                       wm = w_output$wm
                       sd = w_output$sd
                       se = w_output$se
                       CI_low = w_output$CI_low
                       CI_high = w_output$CI_high
                       return(c(n, wm, sd, se, CI_low, CI_high))},
                     var)
  # This is for 0 or only 1 fac. variable
  names(CI_output) <- c('Region', 'n', 'wm', 'sd', 'se', 'CI_low', 'CI_high')
  return(CI_output)
}

# for more than one grouping variable, use the grouping variable names directly as column names
get_WM_and_CI2 <- function(dataset, fac, var, w){
  CI_output <- ddply(dataset, fac, 
                     function(dataset, var){
                       w_output = weighted_Output(dataset[,var], dataset[,w])
                       n = w_output$n
                       wm = w_output$wm
                       sd = w_output$sd
                       se = w_output$se
                       CI_low = w_output$CI_low
                       CI_high = w_output$CI_high
                       return(c(n, wm, sd, se, CI_low, CI_high))},
                     var)
  names(CI_output) <- c(paste(fac),'n', 'wm', 'sd', 'se', 'CI_low', 'CI_high')
  return(CI_output)
}

m4_mean1 <- get_WM_and_CI2(Co_long, c("year", "PDETS"), "Density", "UE")
In this way, I rewrote all the code for the figures in the Colombia paper.
