
Clustering

Clustering is an unsupervised machine learning technique with applications in pattern recognition, image analysis, customer analytics, market segmentation, social network analysis, and more. A broad range of industries use clustering, from airlines to healthcare and beyond.


Partitional clustering: A clustering technique that divides the data points into non-overlapping clusters such that each data point belongs to exactly one cluster. It learns a partition of the input data that optimizes a chosen clustering criterion. Examples include k-means, mixture models, and spectral clustering.
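The partitional idea can be sketched with scikit-learn's KMeans. The data here is a small synthetic 2-D set invented for illustration, not the parks data:

```python
# Illustrative sketch (not the project's actual pipeline): partitional
# clustering with k-means on synthetic 2-D points.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Three well-separated blobs standing in for park feature vectors.
points = np.vstack([
    rng.normal(loc=(0, 0), scale=0.3, size=(20, 2)),
    rng.normal(loc=(5, 5), scale=0.3, size=(20, 2)),
    rng.normal(loc=(0, 5), scale=0.3, size=(20, 2)),
])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)
labels = kmeans.labels_            # one cluster id per point
centers = kmeans.cluster_centers_  # learned cluster centroids
```

Note that k (here 3) must be chosen up front, which is the defining trait of partitional methods.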


Hierarchical clustering: A clustering technique that builds a hierarchy of clusters in a tree-like structure called a dendrogram, either by successively merging similar clusters or successively splitting clusters based on similarity. It does not require specifying the number of clusters a priori. Examples include agglomerative and divisive hierarchical clustering.
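A minimal sketch of the agglomerative (bottom-up) variant using SciPy, again on invented synthetic data. The linkage matrix encodes the dendrogram, and the tree is cut after the fact rather than fixing k up front:

```python
# Illustrative sketch: agglomerative hierarchical clustering with SciPy.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(0.0, 0.3, size=(15, 2)),
    rng.normal(4.0, 0.3, size=(15, 2)),
])

# Ward linkage successively merges the pair of clusters whose merge
# increases within-cluster variance the least; Z is the dendrogram.
Z = linkage(points, method="ward")

# No k was fixed in advance; cut the tree afterwards to get 2 clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
```

Passing `Z` to `scipy.cluster.hierarchy.dendrogram` would draw the tree shown in the screenshots.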

Clustering national parks based on their characteristics can provide useful insights to help forecast future park visitation in several ways:


  1. Identify similar groups of parks: Using clustering techniques like k-means and hierarchical clustering, we can identify parks that are similar to one another based on features like popularity, elevation, activities, and reviews. Knowing which parks group together simplifies the forecasting task.

  2. Forecast at cluster level: Rather than building separate forecasting models for each individual park, we can build models to forecast visitation for a cluster of similar parks. This is more efficient and improves model accuracy since more training data is available.

  3. Infer park seasonality patterns: Parks in the same cluster likely share similar seasonal visitation patterns. By analyzing a cluster's overall seasonality, we can infer seasonality for individual parks which helps forecasting.

  4. Model new park traffic: When new parks open up, forecasting is difficult due to lack of historical data. By identifying the most similar cluster, the new park can inherit the cluster's forecasting model which provides reasonable traffic estimates.
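Point 4 can be sketched as follows. The feature values and the two-cluster setup are hypothetical; the idea is that a new park with no visit history is assigned to its nearest existing cluster, whose forecasting model it then inherits:

```python
# Hypothetical sketch: assign a new park (no visit history) to the
# nearest existing cluster so it can inherit that cluster's forecast.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Made-up feature matrix: [popularity, elevation_gain] for known parks.
known_parks = np.vstack([
    rng.normal((2, 8), 0.5, size=(10, 2)),  # remote high-elevation parks
    rng.normal((9, 1), 0.5, size=(10, 2)),  # accessible popular parks
])
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(known_parks)

# The new park's features place it near the accessible, popular group,
# so that cluster's forecasting model becomes its starting estimate.
new_park = np.array([[8.5, 1.2]])
cluster_id = int(kmeans.predict(new_park)[0])
```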

How will clustering help forecast national park visitation?


Clustering gives a data-driven way to segment parks and provides an efficient framework to model and forecast at the cluster level while still allowing inferences about individual parks. The key benefit is needing fewer forecasting models rather than a separate model for every single park.


Distance Metric Used - The Euclidean distance metric was used to measure dissimilarity between data points during hierarchical clustering.
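For concreteness, the Euclidean distance between two feature vectors is the square root of the summed squared differences; the example values below are arbitrary:

```python
# Euclidean distance between two feature vectors, computed by hand
# and via SciPy for comparison.
import numpy as np
from scipy.spatial.distance import euclidean

a = np.array([3.0, 4.0])
b = np.array([0.0, 0.0])

by_hand = np.sqrt(np.sum((a - b) ** 2))  # sqrt(9 + 16) = 5.0
via_scipy = euclidean(a, b)
```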


Data Format

The data is available as a CSV file (comma separated values) with each row representing details for a specific national park.

It contains both numeric and categorical data across different attributes:

  • Numeric columns: viscosity_rating, popularity, elevation_gain, visitor_usage, avg_rating etc.

  • Categorical columns: state_name, country_name, region, park_type, plus features and activities stored as list values
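A miniature, made-up version of the parks table illustrates the mix of numeric and categorical columns described above, along with one common way (one-hot encoding) to make the categoricals numeric. Only `popularity`, `elevation_gain`, `avg_rating`, and `region` from the listed columns are used here; the values are invented:

```python
# Made-up miniature of the parks table: numeric + categorical columns,
# then one-hot encoding so every feature is numeric.
import pandas as pd

parks = pd.DataFrame({
    "popularity":     [24.8, 18.7, 31.2],
    "elevation_gain": [1161.0, 3901.0, 523.0],
    "avg_rating":     [4.5, 5.0, 4.0],
    "region":         ["West", "West", "Southeast"],  # categorical
})

# One-hot encode the categorical column for the clustering algorithms.
encoded = pd.get_dummies(parks, columns=["region"])
```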


Columns Used for Analysis

The core columns provide park visitation stats plus characteristics that can drive popularity and traffic:

  • Visitor trends from past years show seasonality and support forecasting

  • Location, climate, elevation, etc. as potential visitation drivers

  • Reviews and ratings to indicate visitor satisfaction

  • Variety of visitor activities offered at each park


National parks data before and after transformation


Original Raw Data

  • The raw data as loaded into a dataframe from the CSV file

  • A mixture of numeric (visitation stats, elevation) and categorical (park names, regions) data types


Standardized Data

  • Transformed into a format consumable by machine-learning algorithms, which expect purely numeric input

  • Numeric columns like visitors and popularity are standardized using z-score scaling

  • Scaling gives every feature mean 0 and standard deviation 1
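Z-score standardization can be sketched with scikit-learn's StandardScaler; the feature values below are made up but sit on very different scales, as visitor counts and ratings do:

```python
# Z-score scaling: each column is rescaled to mean 0, std dev 1, so
# large-scale features (visitor counts) don't dominate the distances.
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([
    [120000.0, 4.5],
    [ 35000.0, 3.9],
    [ 80000.0, 4.8],
    [  5000.0, 4.1],
])

X_scaled = StandardScaler().fit_transform(X)
```

Without this step, Euclidean distances would be driven almost entirely by the visitor-count column.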

Partitional vs. Hierarchical


Key Results

The k-means model grouped parks into 3 clusters based on features like popularity and elevation gain.

Some observations:

  1. Cluster 1 seems to capture remote and challenging mountain parks - high elevation, low popularity, and few reviews

  2. Cluster 2 holds accessible, high-traffic parks - higher popularity scores and reviews, which may indicate overcrowding

  3. Cluster 3 contains niche activity parks - a narrower set of available activities versus the wider options in other clusters

Hierarchical clustering revealed multi-tiered park groupings, from granular niche segments up to broader categories, based on visitation drivers. The interconnected clusters aid transfer learning for new parks while limiting overgeneralization. Together they provide a park partitioning that balances specialization and generalization, both for interpreting visitor behavior and for enabling robust forecasting.

In summary:

  • Multi-tiered clusters from niche to general groups

  • Interconnections enabling transfer learning

  • Customized granularity balance

  • Link visitor behaviors and patterns

  • Enhance forecasting robustness

Inferences

  • There are segments ranging from niche, unfrequented parks all the way to highly popular, congested parks

  • Activity offerings correlate with traffic, but clusters uncover more subtle differences in visitation behavior

  • Location and terrain drive some differences, but attributes like available activities and ratings also help distinguish groupings

  • Reviews, popularity, and difficulty/accessibility seem tightly coupled, driving cluster patterns

  • Clustering uncovers unexpected groups, e.g. a cluster of relatively unpopular but highly rated niche parks

  • Parks competing most directly for similar visitor segments become apparent

Relevance to Forecasting

Key linkage to prediction/forecasting:

  • Segmenting parks via clustering provides more homogeneous groups to train visitation models

  • Transfer learning is possible, i.e. a new park's traffic can be estimated using the matching cluster's forecasting model even when its own historical data is unavailable.

The clustering offers strategic advantages regarding predicting future park traffic:

  1. The hierarchy acts as a referral structure for park substitution - if a park becomes overcrowded, the closest alternatives in its cluster can be recommended

  2. Transfer learning for new upcoming parks even without historical visit data

  3. Shared seasonality & events across cluster parks provides collective signals for joint traffic forecasts
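Point 3 can be sketched as follows. The monthly visit shares are invented, but the mechanism is real: averaging the seasonal profiles of parks in one cluster yields a collective signal usable for any park in that cluster, including a new one:

```python
# Hypothetical sketch: a cluster-level seasonal profile as the average
# of member parks' monthly visit shares. All numbers are invented.
import numpy as np

# Monthly visit shares (Jan..Dec) for three parks in the same cluster.
cluster_parks = np.array([
    [2, 2, 4, 6, 10, 14, 18, 16, 12, 8, 5, 3],
    [1, 2, 5, 7, 11, 15, 17, 15, 11, 9, 4, 3],
    [2, 3, 4, 6,  9, 13, 19, 17, 12, 8, 4, 3],
], dtype=float)

# Cluster-level seasonality: average share per month across members.
profile = cluster_parks.mean(axis=0)
peak_month = int(profile.argmax())  # 0-indexed month of peak traffic
```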

Code

k-means clustering with three different values of k

k = 3 Clusters

  • Cluster 1: High elevation parks with low visitor traffic

  • Cluster 2: Accessible high volume popular parks

  • Cluster 3: Niche parks focused on specific activities

This captures some broad differences like accessibility and popularity, but silhouette scores indicate sub-optimal cohesion.

k = 5 Clusters

  • Niche sub-groups emerge around wildlife, hiking-dominant, and winter-sports parks

  • Accessible parks divide into positively and negatively rated groups based on crowding

  • The high-elevation, low-traffic cluster is retained

Provides more specialized segments but lowers the silhouette score for coherence.

k = 7 Clusters

  • Further fragmentation dilutes cluster sizes

  • Isolates multi-activity parks from focused niche parks

  • Spurious clusters form around single-park outliers


Metrics indicate this over-partitions the data with diminishing added insight. Three clusters remains a reasonable optimum, balancing uniqueness with statistical validity.

Silhouette analysis

Insights from Silhouette Scores

  • The silhouette score peaks at k=2, indicating a clean separation of parks along some key attribute like accessibility.

  • Additional clusters, however, bring out finer-grained niches, like parks rated better for activities versus views.

Relevance to Visitation Forecasting

  1. The segmented clusters provide more homogeneous groups to train park-specific visitation models rather than individual models for every park. This improves model accuracy while needing fewer models.

  2. For new upcoming parks with limited data, finding similar clusters facilitates reasonable initialization of traffic forecasts by transferring learnings from an entire group's patterns.

  3. Analysis of cluster-wide seasonal effects provides useful signals regarding peak visits timing that may apply to individual parks.


  • k = 2: average silhouette score 0.8482

  • k = 3: average silhouette score 0.7155

  • k = 4: average silhouette score 0.6766

  • k = 5: average silhouette score 0.6445
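The silhouette sweep above can be reproduced in shape with scikit-learn. The synthetic data here is invented, so the scores will differ from those reported for the parks features, but the pattern (a peak at the true cluster count, declining afterwards) is the same:

```python
# Silhouette analysis: fit k-means for several k and score each
# clustering; higher silhouette means tighter, better-separated clusters.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(7)
# Two well-separated synthetic blobs, so the score should peak at k=2.
X = np.vstack([
    rng.normal(0.0, 0.4, size=(30, 2)),
    rng.normal(6.0, 0.4, size=(30, 2)),
])

scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
```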

K-means directly partitions parks by specifying the number of clusters up front and optimizing intra-cluster coherence, which suits generalizable segmentation. Hierarchical techniques, in contrast, build fine-grained dendrograms that reflect nuanced links between similar parks' visitation drivers without requiring a pre-set cluster count, enabling detection of more specialized profiles. Using both achieves a balanced generalization-specialization spectrum: broader categories avoid premature simplification, while behaviors still transfer between interrelated niche park clusters. This combined view of park groupings helps uncover periodicity signals and interdependencies, improving predictive traffic-modeling accuracy while minimizing assumption bias. In effect, the data-driven park segments support targeted growth planning under constraints through better visitation analytics.
