Vineel Rayapati
Data Scientist
ARM
Association rule mining is a technique used to uncover hidden relationships between variables in large datasets. It is a popular method in data mining and machine learning, with applications such as market basket analysis, customer segmentation, and fraud detection. Here we apply it to national parks data.
Three key statistics (support, confidence, and lift) measure the strength of a rule. For an example rule {Fishing} -> {Camping}:
- Support: fraction of records in which both activities occur together
- Confidence: probability that camping is present given that fishing is present
- Lift: ratio of the observed support to the support expected if the two activities were independent; values > 1 indicate a positive association
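To make the three statistics concrete, here is a minimal sketch that computes them for the rule {fishing} -> {camping}. The five transactions are made-up toy data, not values from the national parks dataset, and the helper names are illustrative.

```python
# Toy transactions (made up for illustration): each set lists the
# activities recorded for one park.
transactions = [
    {"hiking", "camping", "fishing"},
    {"hiking", "camping"},
    {"fishing", "camping"},
    {"hiking", "birding"},
    {"fishing", "boating"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimated P(consequent | antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


def lift(antecedent, consequent, transactions):
    """Confidence divided by the baseline support of the consequent."""
    return confidence(antecedent, consequent, transactions) / support(
        consequent, transactions
    )


rule_support = support({"fishing", "camping"}, transactions)
rule_confidence = confidence({"fishing"}, {"camping"}, transactions)
rule_lift = lift({"fishing"}, {"camping"}, transactions)

print(f"support={rule_support:.2f} confidence={rule_confidence:.2f} lift={rule_lift:.2f}")
```

On this toy data the lift is slightly above 1, so camping is a little more likely when fishing is present than it is overall.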
Apriori Algorithm
The Apriori algorithm is one of the most widely used algorithms for frequent itemset mining and association rule learning. It has been applied to tasks including market basket analysis, recommendation systems, and fraud detection, and has inspired many other algorithms for similar problems.
Example of the Apriori algorithm
How does it work?
The Apriori algorithm identifies frequently occurring patterns between items in a dataset in two stages:
- Frequent Itemset Generation
  - Finds all itemsets that meet a minimum support threshold
  - Starts with individual items (1-itemsets) and iterates to build larger candidate itemsets
  - Prunes infrequent itemsets in each iteration using the downward closure property (every subset of a frequent itemset must itself be frequent) to improve efficiency
- Rule Generation
  - Uses the generated frequent itemsets to create if-then association rules
  - Filters rules by a minimum confidence threshold, which measures how predictive each rule is
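The two stages can be sketched in plain Python as follows. This is a simplified teaching version under stated assumptions, not the implementation used in the project: the function names, thresholds, and toy transactions are all illustrative, and a real analysis would more likely rely on a library.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Stage 1: return {itemset: support} for every frequent itemset."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def keep_frequent(candidates):
        # Count each candidate and keep those meeting the support threshold.
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        return {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}

    # Start with 1-itemsets.
    items = {item for t in transactions for item in t}
    current = keep_frequent({frozenset([i]) for i in items})
    frequent = dict(current)

    size = 2
    while current:
        previous = set(current)
        # Join step: combine frequent (size-1)-itemsets into size-itemsets.
        candidates = {a | b for a in previous for b in previous if len(a | b) == size}
        # Prune step (downward closure): every (size-1)-subset must be frequent.
        candidates = {
            c
            for c in candidates
            if all(frozenset(s) in previous for s in combinations(c, size - 1))
        }
        current = keep_frequent(candidates)
        frequent.update(current)
        size += 1
    return frequent


def generate_rules(frequent, min_confidence):
    """Stage 2: return (antecedent, consequent, confidence) tuples."""
    rules = []
    for itemset, itemset_support in frequent.items():
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                conf = itemset_support / frequent[antecedent]
                if conf >= min_confidence:
                    rules.append((antecedent, itemset - antecedent, conf))
    return rules


# Toy transactions, made up for illustration.
toy_transactions = [
    {"hiking", "camping", "fishing"},
    {"hiking", "camping"},
    {"fishing", "camping"},
    {"hiking", "birding"},
    {"fishing", "boating"},
]
frequent_itemsets = apriori(toy_transactions, min_support=0.4)
rules = generate_rules(frequent_itemsets, min_confidence=0.6)
for antecedent, consequent, conf in rules:
    print(sorted(antecedent), "->", sorted(consequent), round(conf, 2))
```

Note how the candidate {hiking, camping, fishing} is never counted: its subset {hiking, fishing} is infrequent, so the prune step discards it before any pass over the data.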
Association rule mining (ARM) on park activities and attributes data can provide useful signals to help forecast future national park visitation in the following ways:
- Identify Activity Associations
  - ARM uncovers activities that commonly co-occur across different parks
  - This information helps better estimate attendance and traffic for bundled activities
- Park Recommendation Systems
  - The association rules can identify new activities to suggest for a park based on its current offerings
  - Recommending suitable new activities can help drive incremental attendance
- Enhance Predictive Models
  - The association rules quantify subtle relationships between park attributes and visitation
  - These derived features can be incorporated into time series and regression models to improve visitation forecasts
- Transfer Learning
  - For newer parks with limited historical data, ARM rules from similar parks facilitate transfer learning
  - Park similarity can be defined using ARM lift/confidence measures
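As one illustration of the park recommendation idea above, the sketch below suggests activities a park does not yet offer, based on rules whose antecedent is already covered by its current offerings. The rules and confidence values here are hypothetical placeholders, not results from the project, and `recommend_activities` is an illustrative helper name.

```python
def recommend_activities(current_activities, rules, top_n=3):
    """Rank activities not yet offered, using rules that apply to the park.

    `rules` is a list of (antecedent, consequent, confidence) tuples.
    Each candidate activity is scored by the highest confidence among the
    rules that suggest it.
    """
    current = set(current_activities)
    scores = {}
    for antecedent, consequent, confidence in rules:
        if antecedent <= current:  # the rule applies to this park
            for activity in consequent - current:
                scores[activity] = max(scores.get(activity, 0.0), confidence)
    # Sort by descending confidence, then alphabetically for stable output.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [activity for activity, _ in ranked[:top_n]]


# Hypothetical rules with made-up confidence values.
example_rules = [
    (frozenset({"hiking"}), frozenset({"birding"}), 0.80),
    (frozenset({"fishing"}), frozenset({"boating"}), 0.70),
    (frozenset({"hiking", "camping"}), frozenset({"fishing"}), 0.60),
]

suggestions = recommend_activities({"hiking", "camping"}, example_rules)
print(suggestions)
```

Scoring by the maximum confidence is only one possible choice; lift or a combination of metrics could be used instead.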
How is ARM used in the project?
Applying association rule mining on the national parks dataset
Data Requirements
Association rule mining algorithms require transactional data in the following format:
- An ID field identifying each record (transaction); for national parks, this could be the park ID or name
- A set of columns representing the items (here, activities) as Boolean/binary indicators
Preprocessing Steps
We will take the following steps for data preprocessing:
- Parse Activities: Split the activity string into separate dummy columns indicating presence, using get_dummies()
- Encode Categories: Convert location and rating to numeric categories using LabelEncoder
- Filter Columns: Remove free-text descriptions not needed for the analysis
- Aggregate by Parks: Group by park ID and list the relevant activity and attribute columns
- Transaction Lists: Unnest the groups into a visitor activity list for each park
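The parsing and aggregation steps can be sketched with the standard library alone. This mirrors the kind of one-hot encoding that get_dummies() produces but is not the project's actual pipeline; the input rows, the comma separator, and the function names are assumptions made for illustration.

```python
# Hypothetical raw rows: (park_id, comma-separated activity string).
raw_rows = [
    ("park_a", "hiking, camping"),
    ("park_a", "hiking, fishing"),
    ("park_b", "birding, hiking"),
]


def build_transactions(rows, sep=","):
    """Group rows by park ID and collect the set of activities per park."""
    transactions = {}
    for park_id, activity_string in rows:
        activities = {
            part.strip().lower()
            for part in activity_string.split(sep)
            if part.strip()
        }
        transactions.setdefault(park_id, set()).update(activities)
    return transactions


def to_indicator_table(transactions):
    """Turn per-park activity sets into binary indicator rows."""
    columns = sorted({a for acts in transactions.values() for a in acts})
    table = {
        park_id: [int(col in acts) for col in columns]
        for park_id, acts in transactions.items()
    }
    return columns, table


park_transactions = build_transactions(raw_rows)
columns, table = to_indicator_table(park_transactions)

print(columns)
for park_id, row in table.items():
    print(park_id, row)
```

Each park becomes one transaction, and each activity becomes one 0/1 column, which is the shape the mining step expects.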
Output Data Format
The output is a transaction dataset containing:
- Park ID: transaction ID
- Activity indicators: binary columns
Sample Data Before Transformation
Sample Data After Transformation
Association rule mining analysis
Key results from the association rule mining analysis on activities:
Most Frequent Individual Activities
The most commonly offered trail activities based on support include:
- Nature trips
- Hiking
- Birding
- Walking
This indicates most trails cater to general nature-based activities rather than specialized sports.
Top Activity Associations
The strongest association rules in terms of confidence and lift are:
- {Hiking} -> {Birding} and vice versa, showing that the two commonly co-occur. Hiking and nature trips, and hiking and walking, show similar associations.
- {Birding, Walking} -> {Nature Trips}: trails supporting birding and walking also offer general nature trips with high probability.
The analysis generated 15 association rules, evaluated by support, confidence, and lift.
Distribution by Location
- Segmenting rules by location reveals differences: for example, coastal parks disproportionately offer sea-kayaking, while mountain parks offer a variety of camping and hiking.
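One simple way to segment by location is to compute the support of an itemset separately within each location group. The sketch below does this on made-up records; the segment labels and activities are illustrative only and do not come from the dataset.

```python
from collections import defaultdict

# Made-up (location_type, activities) records for illustration.
records = [
    ("coastal", {"sea-kayaking", "hiking"}),
    ("coastal", {"sea-kayaking", "birding"}),
    ("mountain", {"camping", "hiking"}),
    ("mountain", {"hiking", "birding"}),
]


def support_by_segment(records, itemset):
    """Support of `itemset` computed separately within each segment."""
    itemset = set(itemset)
    totals = defaultdict(int)
    hits = defaultdict(int)
    for segment, activities in records:
        totals[segment] += 1
        hits[segment] += int(itemset <= activities)
    return {segment: hits[segment] / totals[segment] for segment in totals}


kayak_support = support_by_segment(records, {"sea-kayaking"})
hiking_support = support_by_segment(records, {"hiking"})
print(kayak_support)
print(hiking_support)
```

Comparing per-segment support against overall support shows which activities are disproportionately concentrated in one type of park.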
Temporal Changes
- Analyzing time-series data shows that the combination of mountain biking and trail running has strengthened over the years, reflecting changing tastes.
Useful insights pertaining to the ARM analysis
Key Activity Associations
ARM revealed the strongest trail activity associations in terms of likelihood of co-occurrence. Top paired relationships uncovered were:
- Hiking and camping
- Fishing and boating
- Birding and hiking
Temporal Trends
The data indicates strengthening momentum for combinations of trail running and mountain biking, reflecting evolving visitor preferences for specialized running and cycling trails over the past decade.
Location Differences
Marked differences emerged across geographic regions. Coastal parks had stronger kayaking associations, while wildlife and birding combinations dominated in certain states.
Adventure Profiles
Clustering rules by confidence bands reveals adventure seekers focusing exclusively on intense hiking/running trails versus more relaxed family-oriented combinations around camping/fishing.
These activity insights, trends, and behavioral segments can help planners and administrators tailor trail infrastructure and marketing decisions to diverse tastes while keeping attractions balanced.
This plot shows how often people choose specific activities for their nature trips. Hiking, camping, and nature trips themselves are the most popular. Interestingly, some entries combine activities such as hiking and trail running, suggesting that users often plan multi-activity trips. Fishing is less frequent but is still chosen.
Conclusion
Association rule mining extracted useful insights from trail activity data. It revealed core relationships around hiking and camping combinations, along with emerging momentum for trail running paired with mountain biking, reflecting evolving tastes. Subtle but meaningful regional differences also emerged, with coastal parks specializing in kayaking and inland parks placing greater emphasis on wildlife activities. The rules also uncovered adventure profiles spanning the intensity spectrum, from adrenaline-focused hiking and running enthusiasts to more relaxed, family-centered fishing and boating bundles. These patterns offer a data-driven basis for infrastructure planning and marketing investments aimed at growing participation. Rule mining highlights high-potential specialty niches, and cross-regional comparisons can help manage crowding risks, opening possibilities that conventional domain heuristics may overlook. The findings range from specialized niches to broad generalizations. Carefully interpreted, they can inform public policies that balance each site's unique qualities with lessons learned across the full range of sites.