With evictions prominent in recent news due to Coronavirus, exploring previous patterns of evictions, and especially locating hotspots, is worthwhile: it provides context for the eviction moratoriums and for which communities will likely need the most help going forward. By analyzing NYC eviction data alongside census data, neighborhood eviction rates become more informative and potential explanations (income, race, ownership) can be addressed. The following visualizations show three types of census tracts with particularly high levels of evictions per capita: parks where homeless populations congregate, certain trendy high-end neighborhoods, and certain poor minority neighborhoods, particularly in the Bronx and Brooklyn.
For a better, more interactive look at the visualizations, click the hyperlink below each chart to open it in Tableau.
Before exploring the results, a quick overview of the data generation is in order. The eviction data comes from NYC Open Data. While it covers 2017 through the end of 2020, it is important to remember that NYC saw no evictions for most of 2020 once Coronavirus hit. The addresses in this dataset were cleaned using the usaddress module before using GeoPandas to generate geolocations. More than 99% of evictions were successfully cleaned and geolocated.
This code uses the usaddress module to assign labels to each part of the address and returns only the parts (good_vars) that make up the core address needed for geolocating. If it fails, it returns a null value. While not shown here, many errors were manually adjusted to reduce the number of null values returned.
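new_address = ''
for var in good_vars:
    if var in raw_dict:
        # NaN never equals anything, so `!= np.nan` is always True; pd.notna (pandas as pd) tests it properly
        if pd.notna(raw_dict[var]):
            ...  # rest of the loop body is not shown in the original excerpt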
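To make the idea behind the fragment above concrete, here is a minimal, self-contained sketch of the rebuild step. It assumes raw_dict is the label-to-value mapping that usaddress.tag(address) returns as its first element. The contents of good_vars and the helper names is_missing and build_core_address are illustrative assumptions, not the project's actual code.

```python
import math

# Assumed subset of usaddress labels that make up a "core" address (illustrative only).
good_vars = ["AddressNumber", "StreetNamePreDirectional", "StreetName",
             "StreetNamePostType", "PlaceName", "StateName", "ZipCode"]


def is_missing(value):
    """True for None, NaN, or an empty/whitespace-only string."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return isinstance(value, str) and not value.strip()


def build_core_address(raw_dict, keep=good_vars):
    """Join the wanted address parts in order; return None if nothing usable remains."""
    parts = [str(raw_dict[var]).strip() for var in keep
             if var in raw_dict and not is_missing(raw_dict[var])]
    return " ".join(parts) if parts else None


# Hypothetical tagged address: the unit is dropped because its labels are not in good_vars,
# and the NaN zip code is skipped.
tagged = {"AddressNumber": "123", "StreetName": "Main", "StreetNamePostType": "St",
          "OccupancyType": "Apt", "OccupancyIdentifier": "4B",
          "PlaceName": "Bronx", "StateName": "NY", "ZipCode": float("nan")}
core = build_core_address(tagged)
```

Returning None when no usable part remains mirrors the null value the text describes for failed addresses.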
While the code for geolocating was simple, it took a long time to process with free resources. The main code is wrapped in a try clause to avoid errors cancelling the lengthy process. Null values were tested again with ArcGIS and then with Nominatim.
raw = geocoder.arcgis(address)
if raw.status != 'OK':
    location = None  # assumption: a failed lookup yields a null that is retried later
else:
    location = (raw.lat, raw.lng)
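The retry strategy described above (wrap each lookup in a try clause, then fall back to another service when the result is null) can be sketched as follows. The services are passed in as plain callables so the sketch runs without network access; in the project they would wrap calls such as geocoder.arcgis. The names geocode_with_fallback, fake_arcgis and fake_nominatim are hypothetical.

```python
def geocode_with_fallback(address, geocoders):
    """Try each (name, function) pair in order and return (name, (lat, lng)).

    Each function takes an address string and returns a (lat, lng) tuple or None.
    Exceptions are swallowed so that one bad address cannot stop a long batch run.
    """
    for name, lookup in geocoders:
        try:
            location = lookup(address)
        except Exception:
            location = None  # treat any service error as a miss and move on
        if location is not None:
            return name, location
    return None, None


# Stand-in services for illustration only; real ones would make network calls.
def fake_arcgis(address):
    raise TimeoutError("service unavailable")


def fake_nominatim(address):
    return (40.8448, -73.8648) if "Bronx" in address else None


services = [("arcgis", fake_arcgis), ("nominatim", fake_nominatim)]
hit = geocode_with_fallback("123 Main St Bronx NY", services)
miss = geocode_with_fallback("unknown place", services)
```

Recording which service produced each result makes it easier to audit the addresses that only resolved on a later pass.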
The Census data came from a mix of census datasets with an impressive list of variables and were narrowed down to key, traditional metrics like ethnicity, income, ownership, age and gender. This project used 2018 census estimates as the most recently available year for census tract data. While 2017 data was available, the value added by including values from different years is offset by the difficulties of generating values for 2019 and 2020.
The two datasets were combined using GeoPandas by spatially matching each eviction location to the geometry of the NYC census tracts, which already had the census data appended. I opted for two forms: one aggregated by census tract and one by census tract month, which allows selecting a specific time frame in the Tableau dashboard.
Joining based on geolocation is easy with GeoPandas. In this case, geo_eviction_data includes the data from the original evictions dataset while tract_map_merged includes census data joined onto the census tract map for NYC.
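# `predicate` replaces the `op` keyword, which was deprecated in GeoPandas 0.10 and later removed
eviction_geo_tracts = geopandas.sjoin(geo_eviction_data, tract_map_merged, how="left", predicate='within', rsuffix='_tract')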
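Once each eviction carries a tract identifier, the two output forms mentioned earlier are simple counts: one per tract and one per tract month. A minimal standard-library sketch follows. The field names tract and executed_date and the identifiers are assumptions about the joined data; with a GeoDataFrame the same result would come from a groupby.

```python
from collections import Counter
from datetime import date

# Hypothetical joined records: one row per eviction with the tract it fell in.
evictions = [
    {"tract": "36005043800", "executed_date": date(2018, 3, 14)},
    {"tract": "36005043800", "executed_date": date(2018, 3, 29)},
    {"tract": "36005043800", "executed_date": date(2019, 7, 2)},
    {"tract": "36061009400", "executed_date": date(2018, 3, 5)},
    {"tract": None, "executed_date": date(2018, 5, 1)},  # did not fall inside any tract
]

# A left join keeps unmatched evictions, so drop them before counting.
matched = [e for e in evictions if e["tract"] is not None]

# Form 1: total evictions per census tract.
per_tract = Counter(e["tract"] for e in matched)

# Form 2: evictions per (tract, year-month), for filtering by time frame.
per_tract_month = Counter(
    (e["tract"], e["executed_date"].strftime("%Y-%m")) for e in matched
)
```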
With regard to both evictions and evictions per capita, most census tracts in NYC have very few evictions. The evictions variable roughly resembles a Poisson distribution because it is a count of occurrences within a fixed time period, even if it does not meet assumptions such as a constant event rate and independent occurrences. The evictions per capita variable drops off even more dramatically. While there are no visible bars for higher values, there are still cases with values that high. What is clear, though, is that a small set of census tracts has unusually high rates of evictions.
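One quick way to see how far a count variable departs from a textbook Poisson distribution is the dispersion index (variance divided by mean), which equals 1 for a true Poisson variable. The sketch below uses made-up counts, not the eviction data, and dispersion_index is a hypothetical helper.

```python
from statistics import mean, pvariance


def dispersion_index(counts):
    """Variance-to-mean ratio; about 1 for Poisson data, above 1 when overdispersed."""
    m = mean(counts)
    if m == 0:
        raise ValueError("mean is zero, index is undefined")
    return pvariance(counts) / m


# Made-up tract counts: mostly small values plus one hotspot.
skewed_counts = [0, 0, 1, 1, 2, 2, 3, 40]
index = dispersion_index(skewed_counts)
```

A value well above 1, as a single hotspot produces here, signals that a few tracts account for far more of the variance than a Poisson process would generate.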
Evictions of homeless residents are particularly dramatic: census tract 228 in Richmond County (Staten Island) has 2 evictions for every 5 people the census estimates live there. This census tract and some other hotspots, such as census tract 438 in the Bronx (1 eviction for every 13 people), are parks where residents are generally squatters with few means to avoid eviction. Rates this high are easily attributed to homeless squatters because the tracts are park land with next to no formal population. For example, census tract 228 in Richmond County has about 12 residents according to the census.
Many other parks do not show high eviction per capita rates, but that is due to the absence of a formal population value. Census tracts with missing data are not random, and the lack of a population value results in their omission from the per capita analysis. These parks and other unusual census tracts with high eviction rates do not have an especially large number of evictions; they have incredibly small populations.
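The per capita calculation, and the way tracts without a population value drop out of it, can be sketched as below. The numbers are illustrative rather than taken from the dataset, and evictions_per_capita is a hypothetical helper.

```python
def evictions_per_capita(evictions, population):
    """Return evictions / population, or None when population is missing or zero."""
    if population is None or population <= 0:
        return None
    return evictions / population


# Illustrative tracts: (label, eviction count, census population estimate).
tracts = [
    ("tiny park tract", 4, 12),          # few evictions, almost no residents
    ("typical tract", 30, 4000),         # more evictions, far lower rate
    ("park with no estimate", 3, None),  # omitted from the per capita analysis
]

rates = {}
for label, count, pop in tracts:
    rate = evictions_per_capita(count, pop)
    if rate is not None:
        rates[label] = rate
```

The tiny tract ends up with by far the highest rate despite having the fewest evictions among the tracts that have a population value, which is the small-denominator effect described above.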
Upscale Neighborhood Evictions:
The most surprising result and most difficult to explain from the data is why there is a cluster of high eviction rates in census tracts in the Garment District, the Diamond District, and parts of Midtown East (census tracts 94, 96, 102, 109 and 113). These are neighborhoods with high median incomes but are not that fundamentally different from other Manhattan census tracts with low rates of evictions.
Several dynamics could possibly explain the anomalies. Because the neighborhoods are trendy, rent is probably higher even if incomes are similar to other Manhattan neighborhoods. This can encourage landlords to evict lower paying tenants to make room for new, higher paying ones. Renters may also be living beyond their means for the status of living there. The high eviction rate could also be attributed to commercial evictions from businesses that pop up to try to cater to the trendy neighborhoods. Again, it would be difficult to isolate any particular reason, but these 5–8 census tracts have uniquely high eviction rates for Manhattan, let alone the entire NYC area. As shown in the graphs in the dashboard, census tract characteristics poorly explain why these census tracts in particular have high eviction rates.
Poor Minority Eviction Clusters:
The last subset of high evictions per capita is in particular areas of the Bronx, Brooklyn and, to a much lesser extent, Queens, which have a low white population percentage and a low median income. While these census tracts are not particularly poorer or less white than other census tracts, the clustering certainly suggests that something is driving evictions that is not captured in the broad census tract features used here. These clusters are somewhat anomalous, and while the number of evictions per capita is not as high as in parks and particular census tracts in Manhattan, these clusters of eviction-prone areas are likely the most important finding of this project. Without a better understanding of more nuanced neighborhood characteristics, it is difficult to suggest a particular explanation.
While a look at evictions per capita does identify these three types of census tracts as having particularly high eviction rates, they are not inherently different from other census tracts. As will be explained in more detail in the following section, the particularly high rates of evictions per capita are not well explained by the census features used, even if there are general trends. Including these outliers severely reduced model performance values such as the coefficient of determination (R²).
To gauge the extent to which variables like ownership, white population percentage and median income can be used to predict evictions per capita, an ordinary linear regression is sufficient. While fancier models would likely make better predictions, the readability of linear regression models is valuable even if formal assumptions are not met.
One caveat to modeling this data is that it is heavily skewed toward lower values. The data is naturally unbalanced, and while there are sampling methods to minimize the impact, some form of bias will remain. I find it better to acknowledge the bias of overpredicting higher and rarer eviction per capita rates than to model on a sample that weights those values more heavily. Even when applying random forest or gradient boosting models, there are chronic prediction errors for higher values.
While there are alternative variables that could be used such as using poverty metrics instead of median income or other ethnic or racial categories instead of just white, the parsimony of using just one variable for each category is worthwhile for understanding the general impact of each category. For example, poverty rates and median income are highly correlated (-.74) and white population percentage is highly correlated with black population percentage (-.73). Capturing the majority of the variance in the fewest variables is worthwhile for readability and avoids multicollinearity.
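The correlations quoted above are presumably Pearson coefficients. As a reminder of what that statistic measures, here is a minimal sketch that computes one from scratch. The sample values are made up to show a strong negative relationship and are not from the census data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up tract values: poverty rate (%) and median income (thousands of dollars).
poverty_rate = [35, 28, 22, 15, 10, 6]
median_income = [28, 40, 45, 70, 82, 120]
r = pearson(poverty_rate, median_income)
```

When two candidate features correlate this strongly, keeping only one of them loses little information and keeps the coefficients easier to read.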
In addition to income, ownership and white population percentage, it is good practice to include regional dummy variables for each of the boroughs in order to capture the quirks beyond differences in the three main variables. Within the model, Bronx is the default borough to which others are compared.
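Dummy coding with a reference category works as follows: each borough except the baseline gets its own 0/1 column, and a Bronx tract is the row where all of those columns are 0. A small sketch, using a hypothetical borough_dummies helper:

```python
BOROUGHS = ["Bronx", "Brooklyn", "Manhattan", "Queens", "Staten Island"]


def borough_dummies(borough, baseline="Bronx"):
    """Return 0/1 indicators for every borough except the baseline."""
    if borough not in BOROUGHS:
        raise ValueError(f"unknown borough: {borough}")
    return {b: int(b == borough) for b in BOROUGHS if b != baseline}


bronx_row = borough_dummies("Bronx")    # all zeros: the reference case
queens_row = borough_dummies("Queens")  # only the Queens indicator is 1
```

Each borough coefficient in the model is then read as the difference from the Bronx with the other variables held equal.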
There is more modeling and data preparation available on the GitHub page, but the main steps worth noting are that a VIF test is performed to show that multicollinearity is not too high, that features are slightly winsorized and, most importantly, that the high eviction rates discussed above are removed as outliers (any value above .05). While these outliers are important, they are statistical anomalies that would only weaken a model aimed at explaining the larger population of census tracts. Even then, the model poorly predicts cases of higher evictions per capita because there are still very few such cases, even when the highest value is no more than .05 instead of .34.
While I prefer Scikit-Learn pipelines for my modeling, Statsmodels has the convenience of displaying p-values to demonstrate statistical significance. The model shows that all variables are statistically significant except for median household income. All boroughs are statistically significant, though Staten Island is just short of 99% confidence. While the coefficients are small, a value of .004 is still substantial when the mean eviction per capita rate is ~.008.
Regarding the main variables, a 100% white census tract would likely see .01 fewer evictions per capita than a 0% white census tract. A census tract with only owner housing would see .008 fewer evictions per capita than a census tract with no owner housing. Median income is not statistically significant. Again, while these are very small changes, especially considering most census tracts are not at the extremes of white population percentage and owner housing percentage, the direction is clear and unsurprising. Census tracts that are less white and have fewer owned dwellings see higher rates of evictions. It is surprising that median income is not statistically significant, but that is likely because both poor and rich neighborhoods generally have low eviction per capita rates.
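Two of the preparation steps above, winsorizing features and dropping tracts above the .05 cutoff, can be sketched in plain Python. The 5th/95th percentile limits and the simple rank-based percentile rule are assumptions for illustration; the text only says the features were slightly winsorized.

```python
def winsorize(values, lower=0.05, upper=0.95):
    """Clip values to the rank-based percentiles given by lower and upper."""
    ordered = sorted(values)
    n = len(ordered)
    lo = ordered[round(lower * (n - 1))]
    hi = ordered[round(upper * (n - 1))]
    return [min(max(v, lo), hi) for v in values]


def drop_outliers(rows, key, cutoff=0.05):
    """Keep only rows whose value under `key` is at or below the cutoff."""
    return [row for row in rows if row[key] <= cutoff]


# Made-up median incomes (thousands of dollars) with one extreme value.
incomes = [30, 35, 40, 42, 45, 50, 55, 60, 65, 70, 75,
           80, 85, 90, 95, 100, 110, 120, 130, 140, 900]
clipped = winsorize(incomes)

# Made-up tracts; the last one is a hotspot excluded from the regression.
tracts = [{"id": "a", "rate": 0.004}, {"id": "b", "rate": 0.012}, {"id": "c", "rate": 0.34}]
kept = drop_outliers(tracts, "rate")
```

Winsorizing keeps every row but pulls extreme feature values inward, while the cutoff removes whole tracts based on the target variable, so the two steps affect the model in different ways.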
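Because the model is linear, the effect of a coefficient scales with the size of the change in the variable. Using the approximate coefficients quoted above (-.01 for white population share and -.008 for owner housing share, both on a 0 to 1 scale), a worked example for a more realistic difference between two tracts looks like this. The two example tracts are hypothetical.

```python
# Approximate coefficients as quoted in the text (shares expressed on a 0-1 scale).
COEF_WHITE_SHARE = -0.01
COEF_OWNER_SHARE = -0.008
MEAN_RATE = 0.008  # approximate mean evictions per capita quoted in the text


def predicted_difference(tract_a, tract_b):
    """Predicted evictions-per-capita gap (a minus b), other variables held equal."""
    return (COEF_WHITE_SHARE * (tract_a["white"] - tract_b["white"])
            + COEF_OWNER_SHARE * (tract_a["owner"] - tract_b["owner"]))


# Hypothetical tracts in the same borough.
tract_a = {"white": 0.70, "owner": 0.50}
tract_b = {"white": 0.30, "owner": 0.25}

gap = predicted_difference(tract_a, tract_b)  # -0.01 * 0.40 + -0.008 * 0.25
relative_gap = gap / MEAN_RATE                # gap expressed as a share of the mean rate
```

Under these assumptions the predicted gap is small in absolute terms but sizeable relative to the mean rate, which is the sense in which small coefficients still matter here.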
Again, this trend is largely applicable for census tracts that do not see high rates of evictions. The anomalies that were discussed earlier were generally not included in this model. While the anomalies are important as hotspots, it is still important to examine the general trend for the larger population of the approximately 2200 census tracts in NYC.
While this project poorly predicts for neighborhoods with high eviction rates, the visualizations and the model emphasize that neighborhoods with high rates are generally rare, especially when they are not sparsely inhabited parks and other census tracts with next to no formal population. There are clearly other factors, likely unique local characteristics, driving particular populated census tracts to have high rates of evictions that traditional census metrics are not capturing. A formal eviction is inherently rare under normal circumstances due to the long, complex process and the series of decisions made by multiple parties to reach that point.
This is a good first attempt at exploring where evictions occur, and by using census data it provides more context than the eviction data alone. However, there are likely other data sources that can better capture eviction vulnerability, such as voluntary evictions. For those who are interested in further exploring which locations are at risk for evictions, I would encourage experimenting with alternative census features, utilizing new data sources, and redefining the problem to better capture the macro trends creating conditions for evictions, including finding an alternative measure to formal evictions that captures less random outcomes.