Infection with human immunodeficiency virus (HIV) can lead to acquired immunodeficiency syndrome (AIDS) and remains a global public health problem, with an estimated 37 million people living with HIV worldwide. New York City has the oldest and largest HIV epidemic in the United States, and it still leads the nation in the number of new HIV cases.
In this project, we study the relationships between patient characteristics and HIV/AIDS diagnosis outcomes in New York City from 2011 to 2015.
Data Sources
Main data
HIV/AIDS Diagnoses by Neighborhood, Age Group, and Race/Ethnicity from NYC open data. [website]
Data description:
"Data reported to the HIV Epidemiology and Field Services Program by June 30, 2016. All data shown are for people ages 13 and older. Borough-wide and citywide totals may include cases assigned to a borough with an unknown UHF or assigned to NYC with an unknown borough, respectively. Therefore, UHF totals may not sum to borough totals and borough totals may not sum to citywide totals."
Other data
Shapefile of NYC Zip Codes - tabulation areas provided by NYC Department of Information Technology & Telecommunications (DOITT) [website]
Zip codes of United Hospital Fund (UHF) neighborhoods [website]
2011-2015 American Community Survey (ACS) Public Use Microdata Sample (PUMS). [website]
PUMS data wrangling:
The raw dataset is very large, with hundreds of variables, so we keep only the variables we need: location and total person income. The reduced dataset is saved as “selected_pums.csv” in the data folder of our R project. Location in the 2011 ACS data is coded with the 2000 Public Use Microdata Area (PUMA) definitions, which we were unable to locate, so we exclude the 2011 income data.
For the 2012-2015 PUMS data, we convert the 2010 PUMA codes to zip codes so that the data can be displayed on the NYC map. The conversion is based on the following two datasets.
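The variable selection described above can be sketched as a streaming filter. Our project does this step in R; the Python version below is only an illustration. The column names (YEAR, PUMA, PINCP, AGEP, SEX) and the sample rows are assumptions made up for the example and are not taken from our actual files.

```python
import csv
import io


def select_columns(src, dst, keep, year_col="YEAR", drop_years=("2011",)):
    """Copy only the `keep` columns from src to dst, skipping rows from drop_years.

    Rows are streamed one at a time, so the full file never has to fit in memory.
    Returns the number of data rows written.
    """
    reader = csv.DictReader(src)
    # extrasaction="ignore" silently drops every column that is not in `keep`.
    writer = csv.DictWriter(
        dst, fieldnames=keep, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    written = 0
    for row in reader:
        # Skip survey years we cannot use (here: 2011, see the text above).
        if row.get(year_col) in drop_years:
            continue
        writer.writerow(row)
        written += 1
    return written


# Tiny made-up sample standing in for the raw PUMS file (hypothetical values).
raw = io.StringIO(
    "YEAR,PUMA,PINCP,AGEP,SEX\n"
    "2011,03801,52000,34,1\n"
    "2012,03801,41000,51,2\n"
    "2015,04101,67000,29,1\n"
)
out = io.StringIO()
n_rows = select_columns(raw, out, keep=["YEAR", "PUMA", "PINCP"])
selected_csv = out.getvalue()
print(selected_csv)
```

With real files, the two in-memory buffers would be replaced by open file handles, for example the raw PUMS file for reading and “selected_pums.csv” for writing.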
After obtaining our main data, “HIV/AIDS Diagnoses by Neighborhood, Age Group, and Race/Ethnicity”, from NYC open data, we tidied it using R. We fitted multiple linear regression models for two outcomes: the number of HIV diagnoses and the HIV-related death rate. We created three geographic maps to show how HIV diagnoses, HIV rate, and income are distributed and differ across United Hospital Fund (UHF) neighborhoods.