
Data Collection

About the Data and Data Sources

This project relies on two key datasets: financial news headlines and historical stock price data, both carefully gathered and processed to ensure relevance and accuracy. The news dataset was scraped from the Markets Insider platform, focusing on articles related to Nvidia stock over a 90-day period. Using a custom Python script powered by BeautifulSoup, the scraper retrieved information such as article titles, publication dates, sources, and links. A sentiment analysis tool, TextBlob, was then employed to assign sentiment scores to each headline, categorizing them as positive, negative, or neutral. This enriched the dataset with valuable qualitative insights into market sentiment surrounding Nvidia.
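The categorization step can be sketched as follows. TextBlob reports a polarity value between -1 and 1 for a piece of text; how that value was mapped to the three labels is not stated here, so the `label_polarity` helper and its 0.05 threshold are illustrative assumptions rather than the project's actual rule.

```python
def label_polarity(polarity: float, threshold: float = 0.05) -> str:
    """Map a polarity in [-1, 1] to a sentiment label.

    The threshold is an assumed cutoff, not one taken from the project.
    """
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"


# Hypothetical polarity values, standing in for what
# TextBlob(headline).sentiment.polarity would return per headline.
example_polarities = [0.6, -0.3, 0.0]
labels = [label_polarity(p) for p in example_polarities]
print(labels)
```

A small dead zone around zero keeps nearly neutral headlines from being labeled positive or negative on the strength of a single weak word.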

 

In parallel, historical stock price data was sourced from Yahoo Finance, providing granular hourly data on Nvidia’s stock performance. Key attributes like opening price, closing price, high, low, and trading volume were included, offering a quantitative perspective on market movements. By aligning the timestamps of the news articles with the closest corresponding stock data, a comprehensive dataset was created. This integration allowed the project to explore how news sentiment impacts specific market metrics.

 

The final dataset combines over 90 days of detailed financial news with Nvidia’s hourly stock performance, ensuring a rich mix of qualitative and quantitative data for predictive modeling. This blend provides a foundation for building advanced machine learning models capable of uncovering relationships between news sentiment and stock price fluctuations. The datasets are robust and reliable, setting the stage for insightful analysis and actionable predictions. 

 

Data Sources:

  • News Data: Markets Insider (via web scraping)

  • Stock Data: Yahoo Finance API

Data Cleaning and Preparation


The raw financial news and stock data were imported into Pandas dataframes for processing in Python. The news dataset contains headlines, publication dates, and sources, while the stock dataset includes hourly Nvidia stock prices with attributes like open, high, low, close, and volume.

 

Sentiment Analysis  

Each news headline was analyzed using the FinBERT pre-trained model to capture sentiment (Positive, Neutral, or Negative) and a corresponding sentiment score. These scores provided a quantitative measure of the tone of the news articles, forming a critical feature for analysis.

 

Data Alignment  

To align news articles with corresponding stock prices, timestamps were used to match news data with the nearest stock data point. This ensured that sentiment features reflected the stock’s context at the time.
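One way to do this kind of nearest-timestamp matching in Pandas is `merge_asof` with `direction="nearest"`. The sketch below uses synthetic rows, and the column names (`published`, `timestamp`, `close`) are assumptions for illustration; the project's own matching code may differ.

```python
import pandas as pd

# Synthetic rows for illustration only; column names are assumptions.
news = pd.DataFrame({
    "published": pd.to_datetime(["2024-11-01 10:20", "2024-11-01 12:50"]),
    "headline": ["Headline A", "Headline B"],
})
stock = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2024-11-01 10:00", "2024-11-01 11:00",
        "2024-11-01 12:00", "2024-11-01 13:00",
    ]),
    "close": [100.0, 101.0, 102.0, 103.0],
})

# merge_asof requires both frames to be sorted on their key columns.
aligned = pd.merge_asof(
    news.sort_values("published"),
    stock.sort_values("timestamp"),
    left_on="published",
    right_on="timestamp",
    direction="nearest",  # pick the closest bar, earlier or later
)
print(aligned[["headline", "published", "timestamp", "close"]])
```

Note that `direction="nearest"` can match a headline to a bar that comes after it. If the features must only use information available at publication time, `direction="backward"` is the safer choice, and the optional `tolerance` argument can reject matches that are too far apart.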

 

Handling Missing Data  

Records with missing key values, such as stock prices or sentiment scores, were dropped to maintain data integrity. This reduced potential bias or errors during modeling.
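A minimal sketch of this step with `DataFrame.dropna`, assuming the key columns are named `sentiment_score` and `close` (both names, and the rows, are illustrative):

```python
import pandas as pd

# Synthetic rows; None becomes NaN in the numeric columns.
df = pd.DataFrame({
    "headline": ["A", "B", "C"],
    "sentiment_score": [0.9, None, 0.7],
    "close": [100.0, 101.0, None],
})

# Drop a row only when one of the key columns is missing.
key_columns = ["sentiment_score", "close"]
cleaned = df.dropna(subset=key_columns)
print(f"dropped {len(df) - len(cleaned)} of {len(df)} rows")
```

Restricting `subset` to the key columns avoids discarding rows that are only missing a non-essential field.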

 

Outlier Detection and Conversion  

Stock price columns were inspected for formatting inconsistencies. Bracketed or improperly formatted numeric values were cleaned and converted into usable numeric types.
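The exact form of the malformed values is not shown here. Assuming they looked like `"[135.2]"` (a number stored as a bracketed string), the conversion could be sketched as:

```python
import pandas as pd

# Synthetic examples of badly formatted price strings (assumed shapes).
raw = pd.DataFrame({"close": ["[135.2]", " 136.05 ", "[n/a]", "137"]})

# Trim whitespace, strip surrounding brackets, then convert to numbers.
# errors="coerce" turns anything still unparseable into NaN.
cleaned_close = pd.to_numeric(
    raw["close"].str.strip().str.strip("[]"),
    errors="coerce",
)
print(cleaned_close)
```

Values that cannot be parsed become NaN instead of raising an error, so they can be handled together with the other missing data.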

 

Removing Duplicates  

Duplicate entries, identified during the cleaning process, were removed to ensure the dataset remained concise and free from redundancy.

 

Index Resetting and Organization  

After cleaning, the dataset’s index was reset, and all changes were saved into a cleaned dataset file: `nvda_news_with_finbert_and_stock_data_cleaned.csv`. This ensured the dataset was ready for reliable analysis and modeling.  
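The deduplication, index reset, and save steps can be sketched together. The rows below are synthetic, and the example writes to an in-memory buffer rather than to disk; this illustrates the approach and is not the project's own script.

```python
import io

import pandas as pd

df = pd.DataFrame(
    {"headline": ["A", "B", "B", "C"], "close": [100.0, 101.0, 101.0, 102.0]},
    index=[0, 3, 4, 7],  # gaps like those left behind by earlier row drops
)

# Remove exact duplicate rows, then renumber the index from 0.
deduped = df.drop_duplicates().reset_index(drop=True)

# Written to a buffer here; passing a path such as
# "nvda_news_with_finbert_and_stock_data_cleaned.csv" would save to disk.
buffer = io.StringIO()
deduped.to_csv(buffer, index=False)
print(buffer.getvalue())
```

Passing `drop=True` discards the old index instead of adding it as a column, and `index=False` keeps the row numbers out of the saved file.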

 

Together, these steps transformed the raw data into a well-structured, comprehensive dataset that combines qualitative sentiment analysis with quantitative stock price data, forming a strong foundation for predictive modeling.

Raw Data

[Screenshot: the raw data as an unorganized XML file]

Cleaned Data

[Screenshot: the cleaned CSV file, with the data organized and rearranged]

Data Visualization

Histogram for Sentiment Scores

This histogram visualizes the distribution of sentiment scores derived from news headlines about Nvidia stock. The x-axis represents sentiment scores, ranging from -1 (negative) to +1 (positive), while the y-axis indicates the frequency of each score range. The majority of scores cluster near 1, showing that most news articles carried a positive tone. A KDE curve overlays the histogram to smooth the distribution and make its overall shape easier to read. The visualization points to predominantly favorable sentiment around Nvidia during the analyzed period, which is useful context for assessing how market perception may relate to stock price movements.


Line Graph for Stock Close Price

This line plot visualizes the trend in Nvidia’s stock close price over the analyzed period. The x-axis represents the timeline, while the y-axis shows the closing price of the stock. The downward trajectory indicates a general decline in Nvidia’s stock price over time, with noticeable fluctuations at certain intervals. Peaks and troughs in the graph suggest periods of volatility, likely influenced by market events or news. The plot summarizes the stock's overall performance and provides context when paired with sentiment data to analyze how news relates to price trends.

Scatter Plot of Sentiment Score vs Stock Close Price

This scatter plot illustrates the relationship between sentiment scores from news headlines and Nvidia’s stock closing prices. The x-axis represents sentiment scores ranging from 0.4 (negative/neutral) to 1 (highly positive), while the y-axis shows stock close prices. Each point pairs one headline's sentiment score with the closing price matched to it. The clustering of points at higher sentiment scores (near 1) indicates a prevalence of positive sentiment during the analyzed period. However, the scatter does not reveal a strong direct correlation between sentiment and stock prices, suggesting that other factors also drive price movements. The plot is a useful check on how much of the stock's behavior sentiment alone can explain.
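A visual impression like this can be backed by a number. The sketch below computes the Pearson correlation coefficient with Pandas; the values are made up for illustration and are not the project's data, and the column names are assumptions.

```python
import pandas as pd

# Made-up values for illustration only.
df = pd.DataFrame({
    "sentiment_score": [0.95, 0.60, 0.88, 0.99, 0.72, 0.91],
    "close": [141.2, 139.8, 136.5, 138.1, 140.4, 137.3],
})

# Series.corr uses the Pearson method by default.
r = df["sentiment_score"].corr(df["close"])
print(f"Pearson r = {r:.3f}")
```

A coefficient close to 0 would support the reading that sentiment and closing price are only weakly linearly related; values near -1 or +1 would indicate a strong relationship.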
