Vineel Rayapati
Data Scientist
Clustering
Clustering is an unsupervised machine learning technique with many applications in pattern recognition, image analysis, customer analytics, market segmentation, social network analysis, and more. A broad range of industries use clustering, from airlines to healthcare.
Partitional clustering: A clustering technique that divides the data points into non-overlapping clusters so that each data point belongs to exactly one cluster. It learns from the input data a partition that optimizes a chosen clustering criterion. Examples include k-means, mixture models, and spectral clustering.
Hierarchical clustering: A clustering technique that builds a hierarchy of clusters in a tree-like structure called a dendrogram. It successively merges similar clusters or successively splits clusters based on similarity. It does not require specifying the number of clusters a priori. Examples include agglomerative and divisive hierarchical clustering.
Clustering national parks by their characteristics can provide useful insights that help forecast future park visitation in several ways:
-
Identify similar groups of parks: By using clustering techniques like k-means and hierarchical clustering, we can identify parks that are similar to one another based on features like popularity, elevation, activities, and reviews. Knowing which parks are similar simplifies the forecasting task.
-
Forecast at cluster level: Rather than building a separate forecasting model for each individual park, we can build models that forecast visitation for a cluster of similar parks. This is more efficient and can improve model accuracy, since each model has more training data available.
-
Infer park seasonality patterns: Parks in the same cluster likely share similar seasonal visitation patterns. By analyzing a cluster's overall seasonality, we can infer seasonality for individual parks, which helps forecasting.
-
Model new park traffic: When new parks open, forecasting is difficult because there is no historical data. By identifying the most similar cluster, the new park can inherit that cluster's forecasting model, which can provide a reasonable first estimate of traffic.
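The cluster-level seasonality idea above can be sketched with pandas. This is only an illustration: the park names, cluster labels, and visit counts are invented, and the seasonal index (monthly total divided by the cluster's monthly mean) is one simple choice among many.

```python
import pandas as pd

# Hypothetical monthly visit counts; parks, cluster labels, and numbers are invented.
visits = pd.DataFrame({
    "park":    ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
    "cluster": [0, 0, 0, 0, 0, 0, 1, 1, 1],
    "month":   [1, 7, 10, 1, 7, 10, 1, 7, 10],
    "visits":  [100, 400, 200, 50, 250, 100, 300, 300, 300],
})

# Total visits per cluster and month (rows: cluster, columns: month).
monthly = visits.groupby(["cluster", "month"])["visits"].sum().unstack("month")

# Seasonal index: each month relative to the cluster's average month.
seasonal_index = monthly.div(monthly.mean(axis=1), axis=0)
print(seasonal_index.round(2))
```

A value above 1 marks a month that is busier than the cluster's average month. A park in the same cluster could borrow this profile when it has too little history of its own.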
How will clustering help in forecasting national park visitation?
Clustering gives a data-driven way to segment parks and provides an efficient framework for modeling and forecasting at the cluster level while still allowing inferences about individual parks. The key benefit is that fewer forecasting models are needed than with a separate model for every park.
​
Distance metric used: Euclidean distance was used to measure dissimilarity between data points during hierarchical clustering.
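As a minimal sketch of hierarchical clustering with Euclidean distance, the snippet below uses SciPy on five made-up points. The linkage method is not stated in the text, so Ward linkage here is an assumption, and the feature values are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist, squareform

# Hypothetical standardized features for five parks (rows) and two attributes (columns).
X = np.array([
    [0.0, 0.0],
    [0.1, 0.0],
    [5.0, 5.0],
    [5.1, 5.0],
    [10.0, 0.0],
])

# Pairwise Euclidean distances between parks, as a square matrix.
D = squareform(pdist(X, metric="euclidean"))

# Agglomerative clustering. Ward linkage is an assumption; it requires Euclidean distance.
Z = linkage(X, method="ward", metric="euclidean")

# Cut the tree into at most three flat clusters.
labels = fcluster(Z, t=3, criterion="maxclust")
print(np.round(D[0], 2), labels)
```

The linkage matrix `Z` can also be passed to `scipy.cluster.hierarchy.dendrogram` to draw the tree.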
​
Data Format
The data is available as a CSV (comma-separated values) file, with each row holding the details of a specific national park.
It contains both numeric and categorical data across different attributes:
-
Numeric columns: viscosity_rating, popularity, elevation_gain, visitor_usage, avg_rating etc.
-
Categorical columns: state_name, country_name, region, park_type, plus features and activities stored as lists of values
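A short sketch of loading such a file with pandas follows. The two sample rows are invented, and it is an assumption that the list-valued columns are stored as Python-style list literals inside quoted strings; the real file may encode them differently.

```python
import ast
import io

import pandas as pd

# Hypothetical two-row sample; the real file name and values are not given in the text.
csv_text = '''name,state_name,popularity,elevation_gain,avg_rating,activities
Park A,Alaska,24.9,1161.9,5.0,"['hiking', 'birding']"
Park B,Utah,18.0,2500.0,4.5,"['hiking']"
'''

df = pd.read_csv(io.StringIO(csv_text))

# Assumed encoding: list-valued columns arrive as strings, so parse them into lists.
df["activities"] = df["activities"].apply(ast.literal_eval)

# Separate numeric columns from the rest for later scaling.
numeric_cols = df.select_dtypes(include="number").columns.tolist()
print(numeric_cols)
print(df["activities"].iloc[0])
```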
​
Used for Analysis
The core columns provide park visitation stats plus characteristics that can drive popularity and traffic:
-
Visitor trends from past years show seasonality and help with forecasting
-
Location, climate, elevation, etc. as potential drivers
-
Reviews and ratings to indicate satisfaction
-
Variety of visitor activities offered at each park
​
National parks data before and after transformation
Original Raw Data
-
The raw data as loaded into a dataframe from the CSV file
-
Mixture of numeric (visitation stats, elevation) and categorical (park names, regions) data types
Standardized Data
-
Transformed into a format suitable for machine learning algorithms, which expect purely numeric input
-
Numeric columns such as visitors and popularity are standardized using z-score scaling
-
This scales every numeric feature to mean 0 and standard deviation 1
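The transformation can be sketched as below, assuming scikit-learn's `StandardScaler` for the z-scores. The values are invented, and one-hot encoding the categorical column is an assumption, since the text does not say how categorical data was made numeric.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical values; column names follow the columns listed under Data Format.
parks = pd.DataFrame({
    "popularity":     [24.9, 18.0, 17.8, 16.3],
    "elevation_gain": [1161.9, 2500.0, 420.0, 150.0],
    "avg_rating":     [5.0, 4.5, 4.5, 4.0],
    "region":         ["west", "west", "south", "east"],
})

numeric_cols = ["popularity", "elevation_gain", "avg_rating"]

# z = (x - mean) / std, computed per column (StandardScaler uses the population std).
scaled = pd.DataFrame(
    StandardScaler().fit_transform(parks[numeric_cols]),
    columns=numeric_cols,
    index=parks.index,
)

# Assumption: one-hot encode the categorical column so every input is numeric.
encoded = pd.get_dummies(parks[["region"]], dtype=float)
model_input = pd.concat([scaled, encoded], axis=1)
print(model_input.round(2))
```

Standardizing matters for distance-based methods: without it, a column measured in large units such as elevation_gain would dominate the Euclidean distance.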
Partitional vs Hierarchical
Key Results
The k-means model grouped parks into 3 clusters based on features such as popularity and elevation gain.
Some observations:
-
Cluster 1 seems to capture remote and challenging mountain parks: high elevation, low popularity and reviews
-
Cluster 2 has accessible, high-traffic parks with higher popularity scores and reviews. This may indicate overcrowding.
-
Cluster 3 contains niche-activity parks, with a narrower set of available activities than the wider options in the other clusters.
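A minimal sketch of how such a 3-cluster k-means model and its cluster profiles could be produced with scikit-learn is shown here. The data is synthetic, drawn around three invented centers, so the resulting numbers say nothing about the real parks.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic parks drawn around three invented centers (popularity, elevation_gain).
centers = np.array([[5.0, 3000.0], [20.0, 300.0], [10.0, 1000.0]])
X = np.vstack([c + rng.normal(scale=[1.0, 50.0], size=(20, 2)) for c in centers])
parks = pd.DataFrame(X, columns=["popularity", "elevation_gain"])

# Standardize first so both features contribute comparably to the distance.
X_scaled = StandardScaler().fit_transform(parks)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_scaled)
parks["cluster"] = kmeans.labels_

# Per-cluster means in original units make the segments easier to describe.
profile = parks.groupby("cluster").mean()
print(profile.round(1))
```

Reading the profile table row by row is what leads to descriptions like "high elevation, low popularity" for a cluster.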
Hierarchical clustering revealed multi-tiered park groupings, from granular niche segments up to broader categories based on visitation drivers. Because the clusters are nested, a new park can borrow patterns from its closest group (a form of transfer learning) without being lumped into an overly general category. The granularity can therefore be chosen to balance specialization against generalization, both for interpreting visitor behavior and for robust forecasting.
The main elements:
-
Multi-tiered clusters from niche to general groups
-
Interconnections enabling transfer learning
-
Customized granularity balance
-
Link visitor behaviors and patterns
-
Enhance forecasting robustness
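The multi-tiered structure can be illustrated by cutting one dendrogram at two levels. This is a sketch on made-up points, and average linkage is an assumption rather than the method used in the analysis.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Hypothetical standardized features: two broad groups, each made of two tight pairs.
X = np.array([
    [0.0, 0.0], [0.1, 0.0], [1.0, 0.0], [1.1, 0.0],
    [10.0, 0.0], [10.1, 0.0], [11.0, 0.0], [11.1, 0.0],
])

# Average linkage is an assumption; the distance metric is Euclidean.
Z = linkage(X, method="average", metric="euclidean")

# Cut the same tree at two granularities.
broad = fcluster(Z, t=2, criterion="maxclust")
fine = fcluster(Z, t=4, criterion="maxclust")

# Each fine cluster sits inside exactly one broad cluster.
nested = all(len(set(broad[fine == f].tolist())) == 1 for f in set(fine.tolist()))
print(broad, fine, nested)
```

Because both cuts come from the same tree, the niche groups always nest inside the broad ones, which is what lets a forecast fall back from a niche segment to its parent category.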
Inferences
-
Segments range from niche, rarely visited parks to highly popular, congested parks
-
The set of activities is correlated with traffic, but the clusters uncover more subtle differences in visitation behavior
-
Location and terrain drive some differences, but attributes like available activities and ratings also play a role in distinguishing the groupings
-
Reviews, popularity, and difficulty/accessibility seem tightly coupled, and together they drive the cluster patterns
-
The clustering uncovers unexpected groups, e.g. a cluster of relatively unpopular but highly rated niche parks
-
Parks that compete most directly for similar visitor segments become apparent
Relevance to Forecasting
Key linkage to prediction/forecasting:
-
Segmenting parks via clustering provides more homogeneous groups on which to train visitation models
-
Transfer learning is possible, i.e. a new park's traffic can be estimated using the matching cluster's forecasting model even if the park has no historical data of its own.
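The transfer idea can be sketched by assigning a new park to its nearest k-means cluster and reusing that cluster's estimate. All numbers are invented, and the "forecast" here is simply the cluster's mean visits, a deliberately crude stand-in for a real cluster-level forecasting model.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical existing parks: (popularity, elevation_gain).
X = np.array([
    [5.0, 3000.0], [6.0, 2800.0],
    [20.0, 300.0], [22.0, 250.0],
    [10.0, 1000.0], [11.0, 1100.0],
])
scaler = StandardScaler().fit(X)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(scaler.transform(X))

# Stand-in for a cluster-level model: mean annual visits per cluster (invented numbers).
visits = np.array([1000, 1200, 9000, 11000, 4000, 5000])
cluster_forecast = {c: visits[kmeans.labels_ == c].mean() for c in range(3)}

# A new park has attributes but no visit history; reuse its nearest cluster's estimate.
new_park = np.array([[21.0, 280.0]])
new_cluster = int(kmeans.predict(scaler.transform(new_park))[0])
print(new_cluster, cluster_forecast[new_cluster])
```

The new park must be scaled with the scaler fitted on the existing parks, otherwise its distances to the cluster centers are not comparable.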
The clustering offers strategic advantages for predicting future park traffic:
-
The hierarchy acts as a referral structure for park substitution: if a park becomes overcrowded, the closest alternatives in its cluster can be recommended
-
Transfer learning for upcoming parks, even without historical visit data
-
Shared seasonality and events across the parks in a cluster provide collective signals for joint traffic forecasts
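Park substitution can be sketched as a nearest-neighbor lookup restricted to the overcrowded park's own cluster. The helper name `closest_alternatives`, the park names, features, and labels are all hypothetical.

```python
import numpy as np

# Hypothetical standardized features and cluster labels for six parks.
names = np.array(["A", "B", "C", "D", "E", "F"])
X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 0.0], [5.0, 5.0], [5.1, 5.2], [6.0, 5.0]])
labels = np.array([0, 0, 0, 1, 1, 1])


def closest_alternatives(park, k=2):
    """Return up to k nearest parks (Euclidean distance) within the same cluster."""
    i = int(np.where(names == park)[0][0])
    # Candidates: same cluster label, excluding the park itself.
    same = np.where((labels == labels[i]) & (names != park))[0]
    dist = np.linalg.norm(X[same] - X[i], axis=1)
    return names[same[np.argsort(dist)][:k]].tolist()


print(closest_alternatives("A"))
print(closest_alternatives("D", k=1))
```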
Code
k-means clustering with three different values of k
k = 3 Clusters
-
Cluster 1: High elevation parks with low visitor traffic
-
Cluster 2: Accessible high volume popular parks
-
Cluster 3: Niche parks focused on specific activities
This captures some broad differences, such as accessibility and popularity. However, the silhouette score at k = 3 (0.715) is below the peak at k = 2 (0.848), which indicates sub-optimal cohesion.
k = 5 Clusters
-
Niche sub-groups emerge around wildlife, hiking-dominant, and winter-sports parks
-
Accessible parks divide into positively and negatively rated groups, based on crowds
-
Retains the same high-elevation, low-traffic cluster
This provides more specialized segments but lowers the silhouette score (0.645, versus 0.715 at k = 3)
k = 7 Clusters
-
Further fragmentation shrinks cluster sizes
-
Isolates multi-activity parks from focused niche parks
-
Spurious clusters form around single-park outliers
Metrics indicate that this over-partitions the data, with diminishing added insight. Three clusters remain a reasonable choice, balancing distinct segments against statistical validity.
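One quick way to spot over-partitioning is to count single-member clusters as k grows. The sketch below uses synthetic data (three tight groups plus one mild outlier), so it only illustrates the check, not the actual park results.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic standardized data: three tight groups plus one mild outlier park.
groups = [rng.normal(loc=c, scale=0.2, size=(20, 2)) for c in ([0, 0], [4, 0], [0, 4])]
X = np.vstack(groups + [np.array([[4.0, 4.0]])])

singletons = {}
for k in (3, 5, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    sizes = np.bincount(labels, minlength=k)
    # Clusters with exactly one member are a hint that k is too large.
    singletons[k] = int((sizes == 1).sum())
    print(f"k={k}: sizes={sorted(sizes.tolist(), reverse=True)}, singletons={singletons[k]}")
```

In this synthetic setup the outlier is absorbed into a neighboring group at k = 3 and tends to be split off into its own cluster at larger k.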
Silhouette analysis
Insights from Silhouette Scores
-
The silhouette score peaks at k = 2 (0.848), which indicates a clean split that separates parks on some key attribute such as accessibility.
-
Additional clusters lower the score but bring out finer-grained niches, such as parks rated better for activities versus views.
Relevance to Visitation Forecasting
-
The segmented clusters provide more homogeneous groups, so visitation models can be trained per cluster rather than per park. This can improve model accuracy while requiring fewer models.
-
For upcoming parks with limited data, finding the most similar cluster gives a reasonable starting point for traffic forecasts by transferring what was learned from the whole group's patterns.
-
Analysis of cluster-wide seasonal effects provides useful signals about the timing of peak visits that may apply to individual parks.
For k = 2 The average silhouette_score is : 0.848223231655088
For k = 3 The average silhouette_score is : 0.7154798239475918
For k = 4 The average silhouette_score is : 0.6766139024160427
For k = 5 The average silhouette_score is : 0.6445136617255823
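Scores like the ones above can be computed with scikit-learn's `silhouette_score`. The snippet below is a sketch on synthetic two-blob data, so its values will differ from the park results; only the procedure is the point.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# Synthetic standardized features; the real park data is not reproduced here.
X = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(30, 2)),
    rng.normal(loc=[6, 6], scale=0.5, size=(30, 2)),
])

scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # Mean silhouette over all points: near 1 is well separated, near 0 is overlapping.
    scores[k] = silhouette_score(X, labels)
    print(f"For k = {k} The average silhouette_score is : {scores[k]}")

best_k = max(scores, key=scores.get)
print("best k:", best_k)
```

The highest average silhouette is a useful guide but not the only criterion: as discussed above, a slightly lower score can be acceptable when the extra clusters are interpretable.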
K-means partitions parks directly: the number of clusters is specified upfront and the algorithm optimizes intra-cluster coherence, which suits a generalizable segmentation. Hierarchical techniques instead build fine-grained dendrograms that reflect the links between similar parks, without requiring a preset number of clusters, which makes more specialized profiles easier to detect. Using both covers the spectrum from generalization to specialization: it avoids prematurely collapsing parks into broad categories while still allowing behaviors to be transferred between related niche clusters. Together, the two views of the park groupings help uncover periodicity signals and interdependencies, with the aim of improving traffic forecasts while limiting assumptions and bias. In effect, the data-driven park segments support targeted growth planning under constraints through better visitation analytics.