This document studies which weather events are most harmful to population health and most economically damaging. It uses the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database, which tracks characteristics of major storms and weather events in the United States. The results can be used by public decision-makers who are responsible for preparing for severe weather events and need to prioritize resources across different types of events. The results show that tornadoes are the most harmful to public health, and that floods, hurricanes and tornadoes cause the greatest economic damage.
We load the data from the compressed CSV file using the bzfile() and read.csv() functions.
data <- read.csv(bzfile("repdata_data_StormData.csv.bz2"))
It is important to process and normalize the EVTYPE field because it contains many typos and values written in inconsistent upper and lower case. Because there are too many such problems to fix individually, we only show an example of how to clean the field, and we create the EVTYPE2 variable to store the corrected data.
data_clean <- data[!is.na(data$EVTYPE) & !is.na(data$FATALITIES)
& !is.na(data$INJURIES), ]
data_clean$EVTYPE2 <- toupper(data_clean$EVTYPE)
data_clean$EVTYPE2 <- gsub("COLD AIR FUNNELS", "COLD AIR FUNNEL",
data_clean$EVTYPE2)
data_clean$EVTYPE2 <- gsub("SEVERE THUNDERSTORMS", "SEVERE THUNDERSTORM",
data_clean$EVTYPE2)
# We replace 'h' with 'H', 'm' with 'M' and 'k' with 'K' to normalize
# the damage exponents in PROPDMGEXP and CROPDMGEXP
data_clean$PROPDMGEXP <- gsub("h", "H", data_clean$PROPDMGEXP)
data_clean$PROPDMGEXP <- gsub("m", "M", data_clean$PROPDMGEXP)
data_clean$PROPDMGEXP <- gsub("k", "K", data_clean$PROPDMGEXP)
data_clean$CROPDMGEXP <- gsub("h", "H", data_clean$CROPDMGEXP)
data_clean$CROPDMGEXP <- gsub("m", "M", data_clean$CROPDMGEXP)
data_clean$CROPDMGEXP <- gsub("k", "K", data_clean$CROPDMGEXP)
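The EVTYPE clean-up above fixes individual values one gsub() call at a time. As a minimal sketch of how the same idea could scale, the snippet below first applies generic normalization (case, surrounding whitespace, repeated spaces) and then walks through a small pattern table. The toy labels and the `fixes` table are assumptions made for illustration; they are not taken from the NOAA file and do not represent a complete mapping.

```r
# Toy event labels (hypothetical, not read from the NOAA file)
evtype <- c(" Tstm Wind", "TSTM  WIND", "thunderstorm winds",
            "Cold Air Funnels", "FLOOD")

# Step 1: generic normalization - upper case, trim, collapse repeated spaces
normalized <- toupper(trimws(evtype))
normalized <- gsub("\\s+", " ", normalized)

# Step 2: hypothetical table of regex pattern -> canonical label
fixes <- c(
  "^TSTM WIND$"          = "THUNDERSTORM WIND",
  "^THUNDERSTORM WINDS$" = "THUNDERSTORM WIND",
  "^COLD AIR FUNNELS$"   = "COLD AIR FUNNEL"
)
for (pattern in names(fixes)) {
  normalized <- sub(pattern, fixes[[pattern]], normalized)
}

print(unique(normalized))
```

Keeping the corrections in one table makes it easier to review which variants have been merged and to extend the list later.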
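We need to compute the total economic damage by multiplying PROPDMG and CROPDMG by the multipliers encoded in PROPDMGEXP and CROPDMGEXP, and then adding the two amounts. We create a function for the conversion.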
# Function to convert a damage exponent code (PROPDMGEXP or CROPDMGEXP)
# to the corresponding multiplier
convert_multiplier <- function(exp) {
  switch(exp,
         "H" = 100,
         "K" = 1000,
         "M" = 1e6,
         "B" = 1e9,
         1) # we multiply by 1 if any other string is found in EXP
}
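# We apply the function to calculate the total property and crop damage.
# We use data_clean (not data) so that the normalized exponents are applied
# and the rows line up with the filtered data set.
# The result is expressed in millions of USD, rounded to one decimal place.
data_clean$ECON_DAMAGE <- round((data_clean$PROPDMG *
    sapply(data_clean$PROPDMGEXP, convert_multiplier) +
    data_clean$CROPDMG * sapply(data_clean$CROPDMGEXP, convert_multiplier)) / 1000000, 1)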
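The row-by-row conversion above can also be written as a single vectorized lookup, which avoids calling a function once per record. The following is only a sketch under stated assumptions: the `exp_codes` and `amounts` vectors are made-up toy values, and the named vector `multipliers` is a hypothetical helper that mirrors the cases handled by convert_multiplier.

```r
# Hypothetical exponent codes, including blank, unexpected and missing values
exp_codes <- c("K", "m", "B", "", "h", "?", NA)
amounts   <- c(25, 2.5, 1, 10, 3, 7, 4)

# Named vector playing the role of the switch() branches
multipliers <- c(H = 100, K = 1e3, M = 1e6, B = 1e9)

# Look up all codes at once; codes that match no name come back as NA
mult <- unname(multipliers[toupper(exp_codes)])
mult[is.na(mult)] <- 1  # same fallback as the default branch of switch()

damage_usd <- amounts * mult
print(damage_usd)
```

Because toupper() is applied inside the lookup, lower-case codes such as "m" and "h" are handled without a separate gsub() pass.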
We create a new field HEALTH_DAMAGE to store FATALITIES+INJURIES data, because together they constitute an indicator of danger to public health.
data_clean$HEALTH_DAMAGE <- data_clean$FATALITIES+data_clean$INJURIES
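To make the combined indicator concrete, here is a small sketch on a toy data frame showing how HEALTH_DAMAGE is built and then summed per event type. The records in `toy` are invented for illustration and are not NOAA figures.

```r
# Hypothetical toy records, not NOAA data
toy <- data.frame(
  EVTYPE2    = c("TORNADO", "FLOOD", "TORNADO", "HEAT", "FLOOD"),
  FATALITIES = c(2, 0, 1, 3, 1),
  INJURIES   = c(10, 4, 5, 0, 2)
)

# Combined indicator: fatalities plus injuries per record
toy$HEALTH_DAMAGE <- toy$FATALITIES + toy$INJURIES

# Sum per event type, then sort from most to least harmful
totals <- aggregate(HEALTH_DAMAGE ~ EVTYPE2, data = toy, sum)
totals <- totals[order(-totals$HEALTH_DAMAGE), ]
print(totals)
```

Note that a simple sum weights one fatality the same as one injury; that is a property of this indicator rather than of the data.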
First we answer this question: Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
library(ggplot2)
health_damage <- aggregate(HEALTH_DAMAGE~EVTYPE2, data=data_clean, sum)
health_damage_ordered <- health_damage[order(-health_damage$HEALTH_DAMAGE),]
# we only want to see top 20 rows
top_20_damage <- head(health_damage_ordered, 20)
# Create the bar plot using ggplot2
ggplot(top_20_damage, aes(x = reorder(EVTYPE2, -HEALTH_DAMAGE), y = HEALTH_DAMAGE)) +
geom_bar(stat = "identity", fill = "lightblue", color = "blue") +
labs(title = "Top 20 Total Health Damage by Event Type",
subtitle="Event types are sorted according to the total health damage
(fatality and injury damage) in counts.",
x = "Event Type",
y = "Total Health Damage") +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10))
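From the analysis we see that the TORNADO event type has the highest combined total of fatalities and injuries.
Next we answer the second question: Across the United States, which types of events have the greatest economic consequences?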
econ_damage <- aggregate(ECON_DAMAGE~EVTYPE2, data=data_clean, sum)
econ_damage_ordered <- econ_damage[order(-econ_damage$ECON_DAMAGE),]
# we only want to see top 20 rows
top_20_econ_damage <- head(econ_damage_ordered, 20)
# Create the bar plot using ggplot2
ggplot(top_20_econ_damage, aes(x = reorder(EVTYPE2, -ECON_DAMAGE), y = ECON_DAMAGE)) +
geom_bar(stat = "identity", fill = "lightblue", color = "blue") +
labs(title = "Top 20 Total Economic Damage by Event Type",
subtitle="Event types are sorted according to the total economic damage
(property and crop damage) in million USDs.",
x = "Event Type",
y = "Total Economic Damage") +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10))
From the analysis we see that the FLOOD, HURRICANE and TORNADO event types have the highest total economic damage (property and crop damage combined).
This report is limited in its findings because the EVTYPE field is not fully normalized: its values still contain many duplicates and misspellings. The approach taken here is intended for the purposes of this assignment only.