DATA PREP & EDA

About the Data and Data Sources

The hiking trails dataset from AllTrails provides comprehensive information on trails across different national parks from 2019 to 2022. It captures multiple aspects of each trail, including identity, characteristics, usage patterns, and location attributes. This makes it a valuable resource for studying trends in outdoor recreation across the United States.

The trail dataset originates from a National Park Service API and combines detailed trail metadata for over 100 unique trails across major national parks with supplemental visitor usage data covering 1980 to 2022. Converted from XML to CSV, it contains 35+ descriptive attributes per trail, including name, geo-coordinates, length, elevation gain, route type, supported activities, and visitor usage by year. The trail data is sourced from AllTrails, a leading platform for outdoor activity planning. The dataset focuses on annually updated visitor usage patterns, trail ratings, and geographical distribution across prominent national parks. This reliability and breadth make it useful for data science projects on site planning, capacity management, recreational activity modeling, and park tourism.

Potential analyses include studying the popularity growth of trail types over time, benchmarking visitor density against acreage to determine saturation, optimizing trail infrastructure based on projected visitor footfall, and conducting sentiment analysis on reviews to identify site enhancements. The dataset can inform strategic, operational, and infrastructure decisions for national parks.

Data Source: API (https://www.nps.gov/subjects/developer/api-documentation.htm)

Other Datasets: https://data.world/datasets/national-parks

Cleaned Dataset: https://raw.githubusercontent.com/VINEEL8055/Forecasting-national-park-visitation/main/Final_dataset.csv

Data Cleaning and Preparation

The raw trail data was imported into a pandas DataFrame for analysis in Python. The dataset has 82 records and 35 feature columns capturing parameters such as trail length, rating, usage, and location attributes.

As part of the initial data quality checks, the distributions of key quantitative metrics such as trail length and elevation gain were analyzed using histograms and summary statistics. This revealed outliers: for instance, the maximum trail length was 120 miles.
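A check of this kind might look like the sketch below. The sample values and the column names (`length_miles`, `elevation_gain_ft`) are assumptions for illustration, not the real schema, and the 1.5 × IQR fence is a common rule of thumb rather than the method used in this project.

```python
import pandas as pd

# Hypothetical sample; column names are assumptions, not the real schema.
trails = pd.DataFrame({
    "length_miles": [2.5, 4.0, 7.2, 11.0, 120.0],
    "elevation_gain_ft": [300, 850, 1200, 2400, 5100],
})

# Summary statistics expose extreme values through the "max" row.
summary = trails.describe()
print(summary.loc[["mean", "50%", "max"]])

# Flag lengths above Q3 + 1.5 * IQR (rule of thumb, assumed here).
q1, q3 = trails["length_miles"].quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)
long_trails = trails[trails["length_miles"] > upper_fence]
print(long_trails)
```

Comparing the median with the maximum is often enough to spot a value like the 120-mile trail before plotting anything.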

Handling Outliers

Thresholds based on domain knowledge were used to limit outliers and avoid distorting the analysis. Trail length was capped at 30 miles. Elevation gain values above 10,000 feet were replaced with the mean. This smoothing preserved the overall shape of the distributions while reducing the influence of anomalous values.
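The two rules above can be sketched as follows. The column names and sample rows are hypothetical, and taking the mean over in-range values only is an assumption, since the text does not say which mean was used.

```python
import pandas as pd

# Hypothetical sample; column names are assumptions.
trails = pd.DataFrame({
    "length_miles": [3.0, 12.5, 120.0, 28.0],
    "elevation_gain_ft": [500.0, 2000.0, 14000.0, 1500.0],
})

# Cap trail length at the 30-mile threshold.
trails["length_miles"] = trails["length_miles"].clip(upper=30)

# Replace elevation gains above 10,000 ft with the mean.
# Assumption: the mean is computed over the in-range values only.
too_high = trails["elevation_gain_ft"] > 10_000
in_range_mean = trails.loc[~too_high, "elevation_gain_ft"].mean()
trails.loc[too_high, "elevation_gain_ft"] = in_range_mean
```

Computing the mean before the replacement keeps the extreme value from inflating its own substitute.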

Temporal Aggregation

To simplify analysis, visitor use statistics from 1980 to 2022 were consolidated into a total visitor count per trail. This reduced the number of columns and allowed usage to be compared across trails on an equal footing over the full period.
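A minimal sketch of this consolidation, assuming the yearly counts live in columns named by year (for example `"1980"`); the real column naming is not shown in this report, and only three years are included to keep the example short.

```python
import pandas as pd

# Hypothetical wide table: one column per year (assumed naming scheme).
usage = pd.DataFrame({
    "trail_name": ["Trail A", "Trail B"],
    "1980": [1000, 400],
    "1981": [1200, 450],
    "2022": [3000, 900],
})

# Treat every all-digit column name as a year column.
year_cols = [c for c in usage.columns if c.isdigit()]

# Sum across years for each trail, then drop the per-year columns.
usage["total_visitors"] = usage[year_cols].sum(axis=1)
compact = usage.drop(columns=year_cols)
print(compact)
```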

Filling Gaps

Records without an average trail rating were filled with the mean rating of 3.0, a choice made after experimenting with several imputation procedures. This kept every trail record in the dataset.
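Mean imputation of this kind can be sketched as below. The `avg_rating` column name and the sample values are assumptions chosen so that the mean comes out to 3.0, matching the figure quoted above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with two missing ratings.
ratings = pd.DataFrame({
    "trail_name": ["Trail A", "Trail B", "Trail C", "Trail D"],
    "avg_rating": [2.0, np.nan, 4.0, np.nan],
})

# Series.mean() skips NaN values by default.
mean_rating = ratings["avg_rating"].mean()

# Fill the gaps so that no record has to be dropped.
ratings["avg_rating"] = ratings["avg_rating"].fillna(mean_rating)
print(ratings)
```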

Outlier detection, smoothing, time series aggregation, and missing-data imputation have been completed. Together, these steps make the dataset reliable for analysis.

Raw Data

Cleaned Data

Here is the XML file data, which is unorganized.

Here is the CSV file, where the data is organized and rearranged.
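The conversion from XML to CSV could be done along the lines of the sketch below, which uses only the Python standard library. The XML layout and tag names (`trails`, `trail`, `name`, `length`, `route_type`) are hypothetical, since the structure of the real feed is not shown here.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical XML layout; the real feed's tags are not shown in this report.
xml_text = """
<trails>
  <trail><name>Trail A</name><length>3.2</length></trail>
  <trail><name>Trail B</name><length>7.5</length><route_type>loop</route_type></trail>
</trails>
"""

root = ET.fromstring(xml_text)

# Flatten each <trail> element into a dict of child tag -> text.
rows = [{child.tag: child.text for child in trail} for trail in root.findall("trail")]

# Union of field names across records, in first-seen order.
fieldnames = list(dict.fromkeys(key for row in rows for key in row))

# Write to an in-memory buffer; missing fields become empty cells.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="")
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Collecting the union of field names first means records with missing tags still line up under the same header.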

Data Visualization

Sunburst visualization

This interactive sunburst visualization depicts the hierarchical breakdown of 2022 national park attendance by region, state, and individual area name. Each ring segment represents a node in the hierarchy, with regions on the innermost ring and finer levels plotted outward. Segment size is proportional to the node's share of the 2022 visitor count. Tooltips display the category label and percent share. The layout encodes the hierarchy radially, while visitation is encoded as angular sweep. Color distinguishes branches and levels. Patterns such as the West region commanding the largest attendance share, followed by the Southeast, are easy to see, and drill-downs reveal the contributions of specific states and parks. Overall, the chart shows how park visitors are distributed across geographic levels.
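The shares behind each ring can be reproduced with a grouped sum, as in this sketch. The column names (`region`, `state`, `area_name`, `visitors_2022`) and the numbers are hypothetical.

```python
import pandas as pd

# Hypothetical sample; column names and counts are assumptions.
parks = pd.DataFrame({
    "region": ["West", "West", "Southeast"],
    "state": ["CA", "UT", "TN"],
    "area_name": ["Park A", "Park B", "Park C"],
    "visitors_2022": [500, 300, 200],
})

total = parks["visitors_2022"].sum()

# Innermost ring: one segment per region, sized by percent share.
region_share = parks.groupby("region")["visitors_2022"].sum() / total * 100

# Next ring: one segment per (region, state) pair, nested under its region.
state_share = parks.groupby(["region", "state"])["visitors_2022"].sum() / total * 100

print(region_share)
print(state_share)
```

If Plotly Express is the plotting library (an assumption; the report does not name it), a call such as `px.sunburst(parks, path=["region", "state", "area_name"], values="visitors_2022")` would draw the same hierarchy directly from the table.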

Bar graph of different climbing routes

This bar graph compares the average popularity of different climbing route types. The height of each bar represents the mean popularity score for routes of that type. Route types are arranged along the x-axis in descending order of frequency in the dataset, which places the most common types on the left. A color gradient from yellow to red encodes low to high values to highlight trends.
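The bar heights and their left-to-right order could be computed as in the sketch below. The `route_type` and `popularity` column names and the sample rows are assumptions.

```python
import pandas as pd

# Hypothetical sample; column names are assumptions.
routes = pd.DataFrame({
    "route_type": ["loop", "out and back", "loop",
                   "point to point", "loop", "out and back"],
    "popularity": [10.0, 20.0, 30.0, 5.0, 20.0, 40.0],
})

# Mean popularity (bar height) and frequency (bar order) per route type.
stats = routes.groupby("route_type")["popularity"].agg(["mean", "count"])

# Most common route types first, so they land on the left of the x-axis.
stats = stats.sort_values("count", ascending=False)
print(stats)
```

Note that ordering by frequency rather than by mean is what lets a less common type have a taller bar than its left-hand neighbor.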

Bar chart of national park visitor count from 2022 vs 1990

This side-by-side bar chart compares national park visitor counts in 2022 and 1990, showing how attendance changed over three decades. The x-axis shows the park area name, and the y-axis shows visitor count. Two sets of bars are placed side by side: orange for 2022 and green for 1990. Many parks show substantial increases in visitors from 1990 to 2022, visible as orange bars taller than their green counterparts. A shared legend distinguishes the two years, and rotated x-tick labels improve readability.

Scatter plot between climbing route difficulty ratings and elevation gain

This scatter plot shows the relationship between climbing route difficulty ratings and elevation gain using dark blue dots. The x-axis shows difficulty rating, with higher values indicating more demanding climbs. The y-axis shows elevation gain in feet. Each dot represents one route. A linear regression trendline summarizes the positive correlation between the two metrics: more difficult climbs tend to have greater elevation gain. The correlation appears moderate. Additional encodings such as color or size could show further variables.
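The trendline and the strength of the relationship can be computed with NumPy, as in this sketch. The five data points are invented for illustration and are more tightly correlated than the moderate relationship described above.

```python
import numpy as np

# Hypothetical sample: difficulty rating and elevation gain per route.
difficulty = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
elevation_gain_ft = np.array([400.0, 900.0, 1100.0, 2100.0, 2500.0])

# Pearson correlation coefficient between the two variables.
r = np.corrcoef(difficulty, elevation_gain_ft)[0, 1]

# Least-squares line: elevation_gain is approximately slope * difficulty + intercept.
slope, intercept = np.polyfit(difficulty, elevation_gain_ft, deg=1)

print(f"r = {r:.3f}, slope = {slope:.1f} ft per rating point, intercept = {intercept:.1f} ft")
```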

Histogram of percentage changes in visitors from 2021 to 2022

This histogram shows the distribution of percentage changes in visitors from 2021 to 2022 across national parks. The x-axis shows percent-change buckets, and the y-axis shows the number of parks in each bucket. The shape is right-skewed: most parks saw moderate increases in visitors of up to 25%, while fewer experienced higher growth rates. The histogram summarizes variation in attendance trends, showing that most parks were relatively stable while a smaller number saw larger gains or declines.
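The percent change and its bucketing can be sketched as follows. The column names, sample counts, and the 25-point bin width are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; column names are assumptions.
visits = pd.DataFrame({
    "visitors_2021": [1000, 2000, 500, 800],
    "visitors_2022": [1100, 2100, 750, 760],
})

# Year-over-year percent change per park.
pct_change = (
    (visits["visitors_2022"] - visits["visitors_2021"])
    / visits["visitors_2021"] * 100
)

# Bucket the changes into 25-point-wide bins from -25% to +75%.
bin_edges = np.arange(-25, 76, 25)
counts, _ = np.histogram(pct_change, bins=bin_edges)
print(dict(zip(bin_edges[:-1].tolist(), counts.tolist())))
```

Each key in the printed mapping is the left edge of a bucket, and each value is the number of parks that fall in it.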

Time series of visitors from 1980 to 2022

This time series line chart shows national park visitation from 1980 to 2022, with years on the x-axis. The y-axis shows total annual visitors summed across all parks, drawn as a green line. The chart summarizes overall attendance trends, highlighting peaks and drops across decades. For example, attendance grew from the 1980s into the late 1990s and fluctuated afterward. Rotated x-tick labels improve readability. In summary, the line shows changes in aggregate national park attendance over a 42-year span.
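The yearly totals plotted on the y-axis amount to a column-wise sum over parks. A minimal sketch, assuming one column per year named by the year itself (a hypothetical layout) and only three years for brevity:

```python
import pandas as pd

# Hypothetical wide table: one row per park, one column per year.
visits = pd.DataFrame({
    "area_name": ["Park A", "Park B"],
    "1980": [100, 50],
    "1990": [150, 70],
    "2022": [300, 90],
})

year_cols = [c for c in visits.columns if c.isdigit()]

# Sum down each year column to get total visitors across all parks.
annual_total = visits[year_cols].sum(axis=0)

# Use integer years as the index so the series sorts and plots chronologically.
annual_total.index = annual_total.index.astype(int)
annual_total = annual_total.sort_index()
print(annual_total)
```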

Interactive Folium map of hiking trail locations

This interactive Folium map shows hiking trail locations and attributes from the route dataset. The map is centered on the mean latitude and longitude across all routes. Each trail is represented by a marker plotted at its GPS coordinates. On click, a popup displays additional information, including the trail name and length in miles. Because position and length are attached to each marker, patterns across locations and trail lengths can be explored visually. The zoomable map provides geographical context for the spatial distribution of routes. Additional styling could encode difficulty or ratings.
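The map center and popup text described above can be prepared with pandas before any map is drawn. The column names and coordinates in this sketch are hypothetical.

```python
import pandas as pd

# Hypothetical sample; column names and coordinates are assumptions.
trails = pd.DataFrame({
    "trail_name": ["Trail A", "Trail B"],
    "latitude": [36.0, 38.0],
    "longitude": [-112.0, -110.0],
    "length_miles": [3.2, 7.5],
})

# Center the map on the mean coordinate of all trails.
center = [trails["latitude"].mean(), trails["longitude"].mean()]

# One marker per trail: its position plus the popup text shown on click.
markers = [
    {
        "location": [row.latitude, row.longitude],
        "popup": f"{row.trail_name}: {row.length_miles} miles",
    }
    for row in trails.itertuples(index=False)
]
print(center, markers[0]["popup"])
```

With Folium installed, `center` would typically be passed as `folium.Map(location=center)`, and each entry added with `folium.Marker(location=..., popup=...).add_to(m)`, where `m` is the map object.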

Correlation heatmap of trail attributes and visitation metrics

This correlation heatmap shows the pairwise relationships between trail attributes and visitation metrics, with the same features along the rows and columns. Tile color encodes the correlation coefficient, which ranges from -1 to +1, and annotations display the exact values. A diverging cool-to-warm color scheme separates negative from positive correlations. Scanning the matrix reveals patterns: for instance, elevation gain is negatively correlated with many visitor metrics, while difficulty level shows positive relationships. Overall, the heatmap condenses the multivariate correlation analysis into a single matrix view, making it easy to identify feature associations tied to visitation patterns. Interactive sorting and filtering could further aid investigation.
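The matrix behind a heatmap like this is a pairwise Pearson correlation table. A minimal sketch with hypothetical column names and values:

```python
import pandas as pd

# Hypothetical sample; column names and values are assumptions.
trails = pd.DataFrame({
    "elevation_gain_ft": [500, 1000, 1500, 2000],
    "difficulty": [1, 2, 3, 4],
    "total_visitors": [900, 700, 400, 200],
})

# Pairwise Pearson correlation between all numeric columns.
corr = trails.corr()
print(corr.round(2))
```

If seaborn is used for plotting (an assumption), `seaborn.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)` would render the annotated tiles with a diverging palette centered on zero.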

GitHub