
What Are the Commonalities of Clustering in Polluted Cities?

Nov 20, 2024

5 min read




Introduction to the Problem

Pollution is a major global concern. Since the first gasoline-powered automobiles appeared in 1886, motor vehicles have become a major source of air pollution, and emissions today are far above pre-automobile levels. I wanted to investigate what the most polluted cities in America have in common, and for this purpose I use clustering analysis.


What is Clustering?

Clustering is a type of unsupervised learning where patterns in data are identified without relying on predefined labels. It involves grouping data points into clusters, or "categories," based on shared characteristics.

There are two main approaches to clustering:


  1. Hard clustering: Each data point is assigned to a single cluster.

  2. Soft (fuzzy) clustering: Data points can belong to multiple clusters with varying degrees of membership, represented as probabilities.
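To make the difference concrete, here is a minimal sketch in plain Python. It assumes two fixed cluster centers and, purely for illustration, turns distances into membership weights with a softmax; real soft-clustering methods (such as Gaussian mixtures or fuzzy c-means) compute their memberships differently. The names `hard_assign`, `soft_assign`, and `temperature` are hypothetical.

```python
import math

def hard_assign(point, centroids):
    """Hard clustering: return the index of the single nearest centroid."""
    distances = [math.dist(point, c) for c in centroids]
    return distances.index(min(distances))

def soft_assign(point, centroids, temperature=1.0):
    """Soft clustering: return one membership weight per centroid (weights sum to 1).

    Assumption for illustration only: weights are a softmax over negative distances.
    """
    scores = [-math.dist(point, c) / temperature for c in centroids]
    top = max(scores)  # subtract the max before exponentiating, for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

centroids = [(0.0, 0.0), (4.0, 0.0)]
point = (1.0, 0.0)  # closer to the first centroid

hard_label = hard_assign(point, centroids)
memberships = soft_assign(point, centroids)
print(hard_label, [round(m, 3) for m in memberships])
```

The hard assignment returns one label, while the soft assignment returns a weight for every cluster, with the nearest cluster receiving the largest share.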

Two popular clustering algorithms are:

1. K-Means

  • Process: The algorithm begins by selecting random cluster centers (centroids). It then iteratively refines the centroids’ positions by assigning points to the nearest centroid and updating them until they stabilize (or a set number of iterations is reached).

2. Agglomerative Clustering

  • Process: Each data point starts as its own cluster. The algorithm then repeatedly merges the two most similar clusters, continuing until everything is combined into a single cluster or a chosen number of clusters remains.
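Both processes can be sketched in a few lines of plain Python. This is an illustrative sketch, not the code from the project notebook: the function names and the toy points are made up, and the agglomerative version assumes single linkage (distance between the closest pair of members), whereas libraries often default to other linkage rules.

```python
import math
import random

def k_means(points, k, max_iter=100, seed=0):
    """K-Means (Lloyd's algorithm). Returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random initial centers drawn from the data
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                new_centroids.append(centroids[j])  # an empty cluster keeps its old center
        if new_centroids == centroids:  # centroids have stabilized
            break
        centroids = new_centroids
    return centroids, labels

def agglomerative(points, n_clusters):
    """Single-linkage agglomerative clustering. Returns clusters as lists of point indices."""
    clusters = [[i] for i in range(len(points))]  # every point starts as its own cluster

    def linkage(a, b):
        # Single linkage: distance between the closest pair of members.
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > n_clusters:
        # Find the two closest clusters and merge them.
        a, b = min(
            ((x, y) for x in range(len(clusters)) for y in range(x + 1, len(clusters))),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # safe because b > a
    return clusters

# Toy data: two well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, labels = k_means(points, k=2)
clusters = agglomerative(points, n_clusters=2)
print(labels)
print(clusters)
```

Stopping the merge loop at `n_clusters` instead of running down to one cluster is the usual way to get a flat grouping out of the hierarchy.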


Pre-Processing / Data Visualization-Understanding

Thankfully, the data was already well pre-processed. It contained several fields that were not relevant to my study, so I kept only the most informative ones. I dropped every sample containing NaN values to avoid complications later on. I applied the Standard Scaler to standardize the features and converted the population size from a categorical label into an integer: 0 for small, 1 for medium, and 2 for large. This encoding makes it easy to pick out the most populous cities, which are likely to experience higher levels of pollution than the others. Finally, I used PCA and kernel PCA to reduce the dimensionality of the data; kernel PCA can capture structure that is not linear.
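The cleaning steps can be sketched as follows. This is a plain-Python approximation rather than the notebook's code: the column names (`avg_income`, `avg_rent`, `population_size`) and the sample rows are hypothetical, and the sketch assumes the size category is encoded before scaling. The z-score uses the population standard deviation, which is what scikit-learn's StandardScaler computes.

```python
import math
import statistics

SIZE_ORDER = {"small": 0, "medium": 1, "large": 2}

def preprocess(rows, numeric_cols, size_col="population_size"):
    """Drop incomplete rows, encode the size category, then z-score every feature."""
    def is_missing(value):
        return value is None or (isinstance(value, float) and math.isnan(value))

    cols = numeric_cols + [size_col]
    complete = [r for r in rows if not any(is_missing(r.get(c)) for c in cols)]

    # Ordinal encoding: small -> 0, medium -> 1, large -> 2.
    table = [
        [float(r[c]) for c in numeric_cols] + [float(SIZE_ORDER[r[size_col]])]
        for r in complete
    ]

    # Standardize each column to mean 0 and (population) standard deviation 1.
    scaled_cols = []
    for col in zip(*table):
        mean = statistics.fmean(col)
        std = statistics.pstdev(col) or 1.0  # constant columns are left unscaled
        scaled_cols.append([(v - mean) / std for v in col])
    return [list(row) for row in zip(*scaled_cols)]

# Hypothetical rows, for illustration only.
rows = [
    {"avg_income": 50000, "avg_rent": 1000, "population_size": "small"},
    {"avg_income": 70000, "avg_rent": None, "population_size": "large"},  # dropped
    {"avg_income": 60000, "avg_rent": 1500, "population_size": "medium"},
    {"avg_income": 90000, "avg_rent": 2500, "population_size": "large"},
]
features = preprocess(rows, ["avg_income", "avg_rent"])
print(len(features), "rows kept")
```

Scaling matters here because distance-based methods such as K-Means would otherwise be dominated by whichever column has the largest raw values.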

Hypothesis:

  • Larger populations often correlate with higher pollution levels due to increased industrial activity, vehicle usage, and energy consumption.

  • Pollution, in turn, can influence the cost of living, health outcomes, and quality of life, which might impact average income indirectly.

Potential Insights (If Pollution Data Were Available):

  1. Population-Pollution Relationship:

    • Large cities with higher populations might experience elevated pollution levels. This could be verified by plotting the population against pollution levels (e.g., air quality index, CO2 emissions).

  2. Pollution and Average Income:

    • Pollution could negatively impact average income by lowering productivity or increasing living costs.

  3. Regional Analysis:

    • Pollution levels might vary significantly across regions. Comparing regions might highlight where population and income are most influenced by pollution.
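If pollution data were added, a first check of these relationships could be a simple correlation coefficient alongside the scatter plot. The sketch below computes Pearson's r in plain Python; the function name `pearson_r` is hypothetical and the two input lists are placeholder numbers, not measurements, so they would need to be replaced with real columns (for example, population and an air quality index).

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Placeholder values only; replace with real population and pollution columns.
x_placeholder = [1, 2, 3, 4, 5]
y_placeholder = [2, 1, 4, 3, 5]
r = pearson_r(x_placeholder, y_placeholder)
print(round(r, 3))
```

A value near +1 or -1 would indicate a strong linear relationship, while a value near 0 would suggest no linear relationship; correlation alone would not show that population causes pollution.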

Models and Their Evaluations

Clustering of U.S. Cities and its Potential Correlation to Pollution

Overview of the Visualization:

  • The scatter plot visualizes clustering results for U.S. cities using PCA (Principal Component Analysis) to reduce the dataset's dimensionality to two components (PCA Component 1 and PCA Component 2).

  • Different clusters are represented by distinct colors (purple, blue, green, yellow).

  • The clusters were derived from features in the dataset, which may indirectly or directly correlate with pollution levels.
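For readers curious about what the two PCA components are, here is a small standard-library sketch of linear PCA. It is an assumption-laden illustration, not the notebook's implementation: it finds the top two directions of variance by power iteration with deflation (library implementations typically use an SVD instead), it does not cover kernel PCA, and the function name `pca_2d` and the toy rows are made up.

```python
import math
import random

def pca_2d(rows, n_iter=500, seed=0):
    """Project rows onto their top two principal components."""
    rng = random.Random(seed)
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the centered data.
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)] for a in range(d)]

    def mat_vec(matrix, v):
        return [sum(matrix[a][b] * v[b] for b in range(d)) for a in range(d)]

    def top_eigenvector(matrix):
        # Power iteration from a random start converges to the dominant eigenvector.
        v = [rng.uniform(-1.0, 1.0) for _ in range(d)]
        for _ in range(n_iter):
            w = mat_vec(matrix, v)
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0:
                break
            v = [x / norm for x in w]
        eigenvalue = sum(a * b for a, b in zip(v, mat_vec(matrix, v)))
        return v, eigenvalue

    components = []
    for _ in range(2):
        v, lam = top_eigenvector(cov)
        components.append(v)
        # Deflation: remove the found direction so the next pass finds the runner-up.
        cov = [[cov[a][b] - lam * v[a] * v[b] for b in range(d)] for a in range(d)]

    # Each row becomes (component 1 score, component 2 score).
    return [[sum(r[j] * c[j] for j in range(d)) for c in components] for r in centered]

# Toy rows: most variance along the first axis, less along the second, none along the third.
toy_rows = [[2, 0, 0], [-2, 0, 0], [0, 1, 0], [0, -1, 0]]
projected = pca_2d(toy_rows)
print([[round(v, 3) for v in row] for row in projected])
```

The sign of each component is arbitrary, so a plot may appear mirrored between runs or libraries without changing which cities end up close together.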

Interpretation of Clusters:

  1. Cluster 0 (Purple):

    • Densely packed around the origin.

    • Represents smaller cities with similar socioeconomic and environmental characteristics.

    • These are likely cities with lower pollution, given their lower population density and reduced industrial activity.

  2. Cluster 1 (Blue):

    • More dispersed and located further along PCA Component 1.

    • Includes larger cities or urban centers with higher variation in features like population and transit scores.

    • These cities are likely to exhibit higher pollution levels because of higher population density and industrialization.

  3. Cluster 2 (Green):

    • Spreads across moderate values of PCA Component 1 and PCA Component 2.

    • Represents cities with mid-range population sizes and mixed characteristics.

    • Pollution levels may vary widely within this cluster depending on regional factors.

  4. Cluster 3 (Yellow):

    • Concentrated on the positive side of PCA Component 1 but also spreads vertically.

    • Could represent affluent or highly urbanized cities with higher average incomes, rent, or transit scores.

    • These cities may experience elevated pollution levels due to economic activity and population.

Pollution Context:

Pollution could help explain the separation between clusters. Here's how:

  1. Correlation with Population and Industry:

    • Clusters with higher population sizes and transit scores (e.g., Cluster 1) might correlate with increased pollution levels.

  2. Urban vs. Rural Divide:

    • Cluster 0 (purple) likely represents smaller, less urbanized cities with less pollution, whereas Clusters 2 and 3 could include cities where urbanization impacts air quality.

  3. Economic Influence:

    • Cities in Cluster 3 (yellow) might balance pollution with policies aimed at mitigating environmental impact (e.g., public transportation), resulting in intermediate pollution levels despite economic activity.


What did I Find?

This section synthesizes the insights gathered from the data analysis, clustering visualization, and potential correlations to pollution in U.S. cities. By analyzing socioeconomic factors such as population, average income, rent, commute time, and transit scores, we aimed to uncover patterns and relationships that might explain urban dynamics and their connection to pollution levels.

Key Findings from the Analysis

  1. Socioeconomic Diversity Across Cities:

    • The dataset revealed significant variability in factors like population, income, and living conditions across U.S. cities. Smaller cities tended to have less variation in income and other socioeconomic metrics, while larger metropolitan areas exhibited greater diversity.

  2. Clustering Patterns:

    • Using clustering and dimensionality reduction (PCA), we identified four distinct clusters of cities:

      • Cluster 0 (Purple): Smaller cities with consistent socioeconomic characteristics, likely lower pollution levels due to reduced population density and industrial activity.

      • Cluster 1 (Blue): Larger cities or urban centers with higher variability in features, potentially reflecting elevated pollution levels from industrialization and traffic.

      • Cluster 2 (Green): Medium-sized cities with moderate values in population, income, and commute time, indicating mixed environmental and economic conditions.

      • Cluster 3 (Yellow): Highly urbanized or affluent cities, balancing higher economic activity with potentially greater pollution mitigation strategies.

  3. Population and Average Income Relationship:

    • Our scatter plot showed no clear correlation between city population and average income. Despite larger cities having higher economic opportunities, income levels were inconsistent, likely influenced by cost of living, job market diversity, and regional economic factors.

  4. Potential Pollution Correlations:

    • While pollution data was not explicitly included, it is well documented that urbanization and industrial activity increase pollution levels. Cities in Clusters 1 and 3 (larger, more economically active areas) are likely to face higher pollution challenges.

Insights on Pollution and Urban Dynamics

  1. Population Growth and Pollution:

    • Higher populations often correlate with increased pollution due to vehicle emissions, energy demand, and industrial activity. This trend aligns with findings in urban studies, where cities like Los Angeles and New York face significant air quality issues despite economic affluence.

  2. Economic Inequality and Pollution Impact:

    • Cities with high income inequality often experience uneven pollution exposure. For example, lower-income communities tend to reside closer to industrial areas or highways, amplifying their exposure to air and noise pollution (consistent with the environmental inequality hypothesis).

  3. Transit Scores and Sustainability:

    • Cities with higher transit scores may mitigate pollution levels by reducing vehicle dependency. Clusters representing urbanized regions could benefit from expanding public transportation networks to balance economic growth with environmental sustainability.

  4. Outliers and Regional Differences:

    • Outlier cities with extreme values in population or income merit further investigation. Factors such as regional industry specialization (e.g., tech hubs like San Francisco and manufacturing-heavy cities like Detroit) and geographical features (such as mountain valleys that trap smog) influence pollution patterns.

Policy and Research Implications

  1. Targeted Pollution Mitigation:

    • Clusters with higher economic activity and population density (e.g., Clusters 1 and 3) should prioritize pollution mitigation strategies such as green energy, emissions controls, and urban green spaces.

  2. Data-Driven Urban Planning:

    • Cities in Cluster 0 (smaller, consistent features) can use their lower pollution levels as benchmarks for sustainability, while Cluster 1 cities could address their higher variability through tailored policies.

  3. Further Research Needs:

    • Future studies should include explicit pollution data (e.g., particulate matter levels, air quality indices) to quantify correlations with socioeconomic factors. Satellite data and localized air quality sensors could refine these analyses.

Concluding Remarks

This analysis highlights the complex interplay between socioeconomic factors, urbanization, and potential pollution outcomes in U.S. cities. By clustering cities based on key features and considering their potential environmental impact, we underscore the need for data-driven policies to promote sustainable urban development.

Resources:

Work: https://www.mediafire.com/file/ckk3j12gxaxnwwl/Project4_Clustering(FINALIZED_JULIO).ipynb/file

Source1: https://www.stateofglobalair.org/resources/health-in-cities

Source2: https://www.epa.gov/air-trends/air-quality-cities-and-counties
