From Erica, 2 Months ago, written in R.
---
title: "HW2"
author: "Erica Slogar"
date: "02-27-24"
output: pdf_document
---

# Set up

Move this script and the `cnty_COVID_cases.csv` dataset into a folder and then create a new RProject in that folder. This step sets the working directory to the correct location to make loading data easier.

Load the tidyverse package here. Remember to hide the package loading messages.

```{r,message=FALSE}
library(tidyverse)
```


Now load the `cnty_COVID_cases.csv` dataset. Remember to hide the data loading messages. This is a dataset from August 2020 about the number of new COVID cases in the past 90 days by county. It is an excerpt from my colleague's paper about the [effect of altitude on COVID prevalence](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0245055). Reading the paper is not required for this assignment.

```{r,message=FALSE}
cov_cnty <- read_csv("cnty_COVID_cases.csv")
```

# 1.

## 1A. Which State has the fewest rows (counties)?

> Hint: use `count()` and `arrange()`

```{r}
#summary(cov_cnty)
cov_cnty %>%
  count(State) %>%
  arrange(n) %>%
  head(n = 1)
```
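
As a quick illustration of the hint, the sketch below runs `count()` and `arrange()` on a small made-up tibble. The `toy` data and its state labels are hypothetical and are not taken from `cnty_COVID_cases.csv`.

```r
library(dplyr)

# Hypothetical toy data: one row per county (not from the real dataset)
toy <- tibble(
  State  = c("AA", "AA", "AA", "BB", "CC", "CC"),
  County = c("a1", "a2", "a3", "b1", "c1", "c2")
)

# count() collapses to one row per State with the row count in column n;
# arrange(n) then sorts ascending, so the first row has the fewest counties
fewest <- toy %>%
  count(State) %>%
  arrange(n) %>%
  head(n = 1)

fewest
```

Note that `head(n = 1)` returns a single row even when several states tie for the fewest counties. If ties matter, `slice_min(n, n = 1)` is one alternative that keeps them.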

## 1B. Which State has the most rows (counties)?

```{r}
cov_cnty %>%
  count(State) %>%
  arrange(desc(n)) %>%
  head(n = 1)
```

# 2. Show the 12 Counties that have Population over 2 million

> Hint: use `filter()`

```{r}
cov_cnty %>%
  filter(Population > 2000000) %>%
  select(County_name, Population) %>%
  head(n = 12)
```

# 3. Which States have Counties with Population over 2 million?

```{r}
# distinct(State) keeps only the State column, so no select() is needed first
cov_cnty %>%
  filter(Population > 2000000) %>%
  distinct(State)
```

# 4.

## 4A. Create a plot showing Population on X and Confirmed_90d on Y.

```{r}
cov_cnty %>%
  ggplot(aes(x = Population, y = Confirmed_90d)) +
  geom_point(aes(color = State))
```

## 4B. Describe what you see.

```{r}
# I see a scatter plot where each point represents a county, with its population
# on x and its confirmed COVID cases in the past 90 days on y. Most of the points
# are concentrated near the origin.
```

# 5.

## 5A. Create the same plot as #4, but using log(Population) as X and log(Confirmed_90d + 1) as Y.

```{r}
cov_cnty %>%
  ggplot(aes(x = log(Population), y = log(Confirmed_90d + 1))) +
  geom_point(aes(color = State))
```

## 5B. Describe what you see. What was the effect of taking the log?

```{r}
# I see a scatter plot of the same counties as in #4, but taking the log spreads
# out the points that were bunched near the origin and makes the relationship
# look roughly linear.
```

## 5C. Why did we +1 in log(Confirmed_90d + 1)?

```{r}
# log(0) is undefined (R returns -Inf), so we add 1 to keep counties with zero
# confirmed cases on the plot: log(0 + 1) = 0.
```
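
The effect of the `+ 1` can be checked directly in base R. This is a minimal sketch; the `cases` vector is made up for illustration.

```r
# Hypothetical case counts, including a county with zero cases
cases <- c(0, 9, 99)

log(cases)       # first element is -Inf because log(0) has no finite value
log(cases + 1)   # zero now maps to log(1) = 0

# log1p(x) is base R's built-in for log(1 + x)
shifted <- log(cases + 1)
all.equal(shifted, log1p(cases))
```

Infinite values cannot be drawn on a plot, which is why the zero-case counties need the offset.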

# 6. Create two new variables that show:

## 6A. New confirmed covid cases in the past 90 days per 1000 people (cases_1k) and

```{r}
cov_cnty %>%
  mutate(cases_1k = (Confirmed_90d / Population) * 1000,
         log_cases_1k = log(cases_1k + 0.1))

covid <- cov_cnty %>%
  mutate(cases_1k = (Confirmed_90d / Population) * 1000,
         log_cases_1k = log(cases_1k + 0.1))
```

## 6B. Log-transformed cases_1k

> In case we did not get to this in class, the code is here for you. Run it to see what it does.

```{r}
covid %>%
  mutate(cases_1k = (Confirmed_90d / Population) * 1000,
         log_cases_1k = log(cases_1k + 0.1))

# save the new variables onto the dataset
covid <- covid %>%
  mutate(cases_1k = (Confirmed_90d / Population) * 1000,
         log_cases_1k = log(cases_1k + 0.1))
```
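
The arithmetic behind `cases_1k` and `log_cases_1k` can be checked by hand for a single county. The numbers below are made up for illustration and do not come from the dataset.

```r
# Hypothetical county: 250 confirmed cases among 50,000 residents
confirmed_90d <- 250
population    <- 50000

# Cases per 1000 people: (250 / 50000) * 1000 = 5
cases_1k <- (confirmed_90d / population) * 1000

# The 0.1 offset keeps the log finite when a county has zero cases per 1000
log_cases_1k <- log(cases_1k + 0.1)

cases_1k
log_cases_1k
```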

# 7.

## 7A. The RUCC_2013 variable is the Rural-Urban Continuum Code, in which more urban counties have lower numbers and more rural counties have higher numbers. Plot a boxplot for each RUCC_2013 category on x and the log-transformed new cases of covid per 1000 people on y (`log_cases_1k`).

> Hint: use `aes(group = RUCC_2013)` inside `geom_boxplot()` to plot RUCC_2013 as groups.

```{r}
covid %>%
  ggplot(aes(x = RUCC_2013, y = log_cases_1k)) +
  geom_boxplot(aes(group = RUCC_2013))
```
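
The hint matters when the category is stored as a number: ggplot2 then treats x as continuous and needs to be told how to split it into boxes. The sketch below, on a made-up `toy` data frame with hypothetical column names, shows two ways to get one box per code, assuming the code column is numeric as `RUCC_2013` appears to be.

```r
library(ggplot2)

# Hypothetical toy data: a numeric code with three levels, four values each
toy <- data.frame(
  code  = rep(c(1, 2, 3), each = 4),
  value = c(1, 2, 3, 4, 2, 3, 4, 5, 3, 4, 5, 6)
)

# Option 1 (the hint): keep x numeric and tell geom_boxplot how to group
p_group <- ggplot(toy, aes(x = code, y = value)) +
  geom_boxplot(aes(group = code))

# Option 2: convert the code to a factor so each level gets its own box
p_factor <- ggplot(toy, aes(x = factor(code), y = value)) +
  geom_boxplot()

# layer_data() returns one row per drawn box
n_boxes_group  <- nrow(layer_data(p_group))
n_boxes_factor <- nrow(layer_data(p_factor))
```

Option 1 keeps a numeric axis, while Option 2 labels the axis with the discrete categories. Which one reads better is a judgment call.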

## 7B. Describe what you see

```{r}
# I see 9 box plots, one per RUCC_2013 category. Every median stays below about
# 2.5 on the log_cases_1k scale.
```
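
# 8. When we divided the new cases by the population, that was a form of normalization. What are the benefits and the detriments of this step?

```{r}
# Benefit: cases per 1000 people puts counties of very different sizes on the
# same scale, so they can be compared directly.
# Detriment: rates from small counties are unstable (a handful of cases moves
# the rate a lot), and the rate hides the absolute number of cases.
```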
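
A small worked example may help make the trade-off in question 8 concrete. The two counties below are hypothetical, with made-up numbers.

```r
# Hypothetical counties: one small, one large (numbers are made up)
population <- c(small = 2000, large = 2000000)
cases      <- c(small = 10,   large = 4000)

# Benefit: per-1000 rates are comparable across very different sizes
rate <- cases / population * 1000   # small = 5, large = 2

# Detriment: one extra case barely moves the large county's rate
# but shifts the small county's rate noticeably
rate_plus_one <- (cases + 1) / population * 1000
change <- rate_plus_one - rate      # small = 0.5, large = 0.0005

rate
change
```

The small county has the higher rate even though the large county has 400 times as many cases, and a single additional case changes the small county's rate a thousand times more.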