Introduction
In recent years, artificial intelligence (AI) has changed how investors make trading decisions. With the emergence of Large Language Models (LLMs) such as GPT-3 and GPT-4, complex market analyses and insights have become more accessible to individual investors and traders. These models draw on large amounts of data to provide a depth of market understanding that was once the exclusive domain of institutional investors. This article focuses on developing a personalized AI trading consultant using LLMs, designed to match individual investment profiles based on risk appetite, investment timeframe, budget, and desired returns, so that retail investors can receive personalized, strategic investment advice.
Stock trading consultants powered by Large Language Models (LLMs) like GPT-3 and GPT-4 have revolutionized financial advisory services. They can leverage AI to analyze historical stock data and current financial news, providing personalized investment advice that aligns with an investor’s unique portfolio and financial objectives. We will attempt to build a consultant to predict market behavior and trends, offering tailored recommendations based on individual risk tolerance, investment duration, available capital, and desired returns.
Learning Objectives
By the end of this article, readers will be able to:
- Gain insight into how AI and LLMs like GPT-3 transform stock market analysis and trading.
- Recognize the ability of AI-driven tools to provide personalized investment advice based on individual risk profiles and investment goals.
- Learn how AI utilizes historical and real-time data to formulate investment strategies and predictions.
- Understand how AI in stock trading makes sophisticated investment strategies accessible to a broader audience, including retail investors.
- Discover how to leverage AI-driven tools for informed decision-making in personal investment and stock trading.
This article was published as a part of the Data Science Blogathon.
About the Dataset
The dataset for this project, sourced from the New York Stock Exchange and available on Kaggle, comprises four CSV files spanning seven years. It includes ‘fundamentals.csv’ with essential financial metrics, ‘prices.csv’ and ‘prices-split-adjusted.csv’ providing historical stock prices and adjustments for stock splits, and ‘securities.csv’ offering additional company information like sector classification and headquarters. Collectively, these files provide a comprehensive view of company performance and stock market dynamics.
Data Preparation
Implementing the stock trading consultant with Large Language Models (LLMs) like GPT-4 starts with data preparation. This process has three steps: data cleaning, price normalization, and data integration, using the provided datasets: fundamentals.csv, prices.csv, prices-split-adjusted.csv, and securities.csv.
Step 1: Data Cleaning
- In the fundamentals dataset, ‘For Year,’ ‘Earnings Per Share,’ and ‘Estimated Shares Outstanding’ have 173, 219, and 219 missing values, respectively. The code below applies median imputation to ‘Earnings Per Share’ and ‘Estimated Shares Outstanding.’
- We convert the ‘Period Ending’ column, and the ‘date’ columns of the two price datasets, to datetime format so that dates can be compared and merged reliably.
import pandas as pd
# Loading the datasets
fundamentals = pd.read_csv('/content/fundamentals.csv')
prices = pd.read_csv('/content/prices.csv')
prices_split_adjusted = pd.read_csv('/content/prices-split-adjusted.csv')
securities = pd.read_csv('/content/securities.csv')
# Handling missing values and data type conversions in the Fundamentals dataset
fundamentals.info()  # prints the column summary; info() itself returns None
fundamentals_missing_values = fundamentals.isnull().sum()
# Formatting date columns in all datasets
fundamentals['Period Ending'] = pd.to_datetime(fundamentals['Period Ending'])
prices['date'] = pd.to_datetime(prices['date'])
prices_split_adjusted['date'] = pd.to_datetime(prices_split_adjusted['date'])
# Displaying information about missing values in the Fundamentals dataset
fundamentals_missing_values
# Dropping the unnecessary 'Unnamed: 0' column
fundamentals.drop(columns=['Unnamed: 0'], inplace=True)
# Imputing missing values in 'Earnings Per Share' and 'Estimated Shares Outstanding' with the median
for column in ['Earnings Per Share', 'Estimated Shares Outstanding']:
    median_value = fundamentals[column].median()
    # Assigning the result back instead of filling in place on a column selection
    fundamentals[column] = fundamentals[column].fillna(median_value)
# Re-checking for missing values after imputation
fundamentals_missing_values_post_imputation = fundamentals.isnull().sum()
fundamentals_missing_values_post_imputation
- The ‘date’ columns are already consistent in datetime format for the prices and prices-split-adjusted datasets. We verify data consistency, especially regarding stock splits.
# Checking for consistency between the Prices and Prices Split Adjusted datasets
# by comparing a sample of data for the same ticker symbols and dates in both
# Selecting a sample of ticker symbols
sample_tickers = prices['symbol'].unique()[:5]
# Creating a comparison DataFrame for each ticker symbol in the sample
comparison_data = {}
for ticker in sample_tickers:
    prices_data = prices[prices['symbol'] == ticker]
    prices_split_data = prices_split_adjusted[prices_split_adjusted['symbol'] == ticker]
    merged_data = pd.merge(prices_data, prices_split_data, on='date',
                           how='inner', suffixes=('_raw', '_split'))
    comparison_data[ticker] = merged_data
# Displaying the comparison for the first ticker symbol as an example
comparison_data[sample_tickers[0]].head()
A comparison of prices.csv and prices-split-adjusted.csv for a sample ticker symbol (WLTW) reveals differences in open, close, low, and high stock prices due to stock split adjustments. Volume columns are consistent, indicating accurate trading volume data.
Step 2: Normalization of Prices
We use the prices-split-adjusted.csv dataset for the stock trading consultant as it offers a consistent view of stock prices over time, accounting for stock splits.
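To make the idea of split adjustment concrete, here is a minimal sketch on made-up numbers. The dates, prices, and the 2-for-1 split are illustrative assumptions, not values taken from the Kaggle files. Prices before the split date are divided by the split ratio so that the series stays continuous across the split.

```python
import pandas as pd

# Hypothetical raw closing prices around a 2-for-1 split on 2016-01-04
raw = pd.DataFrame({
    "date": pd.to_datetime(["2015-12-30", "2015-12-31", "2016-01-04", "2016-01-05"]),
    "close": [100.0, 102.0, 51.5, 52.0],
})
split_date = pd.Timestamp("2016-01-04")
split_ratio = 2.0  # assumed: each old share becomes two new shares

# Divide pre-split prices by the ratio; leave post-split prices unchanged
adjusted = raw.copy()
before_split = adjusted["date"] < split_date
adjusted.loc[before_split, "close"] = adjusted.loc[before_split, "close"] / split_ratio

print(adjusted)
```

Without the adjustment, the drop from 102.0 to 51.5 would look like a large one-day loss, which is why the adjusted file is the safer input for a model.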
Step 3: Data Integration
The final data preparation step involves integrating these datasets. We merge fundamentals.csv, prices-split-adjusted.csv, and securities.csv, creating a comprehensive data frame for analysis. Given their large size, we integrate the most relevant columns based on the ticker symbol and date fields to match financials with stock prices and company information.
# Selecting relevant columns from each dataset for integration
fundamentals_columns = ['Ticker Symbol', 'Period Ending',
                        'Earnings Per Share', 'Total Revenue']
prices_columns = ['symbol', 'date', 'open', 'close', 'low', 'high', 'volume']
securities_columns = ['Ticker symbol', 'GICS Sector', 'GICS Sub Industry']
# Renaming columns for consistency
fundamentals_renamed = fundamentals[fundamentals_columns].rename(
    columns={'Ticker Symbol': 'symbol', 'Period Ending': 'date'})
prices_split_adjusted_renamed = prices_split_adjusted[prices_columns].rename(
    columns={'open': 'open_price', 'close': 'close_price', 'low': 'low_price',
             'high': 'high_price', 'volume': 'trade_volume'})
securities_renamed = securities[securities_columns].rename(
    columns={'Ticker symbol': 'symbol'})
# Merging datasets
merged_data = pd.merge(
    pd.merge(fundamentals_renamed, prices_split_adjusted_renamed,
             on=['symbol', 'date']),
    securities_renamed, on='symbol')
# Displaying the first few rows of the integrated dataset
merged_data.head()
The resultant dataset includes key metrics: earnings per share, total revenue, open/close/low/high stock prices, trading volume, and sector information for each ticker symbol.
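Because the merge above is an inner join on exact ‘symbol’ and ‘date’ values, a fundamentals row is kept only if its period-ending date is also present in the price data. The following sketch, on toy frames with made-up values and hypothetical names, shows one way to count how many rows fail to match by using a left merge with pandas' `indicator=True` option.

```python
import pandas as pd

# Toy stand-ins for the renamed fundamentals and price tables (made-up values)
fundamentals_toy = pd.DataFrame({
    "symbol": ["AAA", "AAA", "BBB"],
    "date": pd.to_datetime(["2015-12-31", "2016-12-31", "2015-12-31"]),
    "Earnings Per Share": [1.2, 1.5, 0.4],
})
prices_toy = pd.DataFrame({
    "symbol": ["AAA", "BBB"],
    "date": pd.to_datetime(["2015-12-31", "2015-12-31"]),
    "close_price": [30.0, 12.0],
})

# A left merge with indicator=True marks fundamentals rows without a price match
checked = pd.merge(fundamentals_toy, prices_toy, on=["symbol", "date"],
                   how="left", indicator=True)
match_counts = checked["_merge"].value_counts()
unmatched = checked[checked["_merge"] == "left_only"]

print(match_counts)
print(unmatched[["symbol", "date"]])
```

In this toy data the unmatched row falls on a weekend, when no prices are recorded. Running a similar check on the real tables would show how much of the fundamentals data survives the join.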
Exploratory Data Analysis (EDA)
Next, we will conduct EDA to understand the distribution and relationships in the dataset, which is crucial for feature selection and model training.
import matplotlib.pyplot as plt
import seaborn as sns
# Exploratory Data Analysis (EDA)
# Summary statistics for numerical columns
numerical_summary = merged_data.describe()
# Correlation matrix to understand relationships between
# different numerical features (non-numeric columns are skipped)
correlation_matrix = merged_data.corr(numeric_only=True)
# Visualizing the correlation matrix using a heatmap
plt.figure(figsize=(12, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Matrix of Numerical Features')
plt.show()
correlation_matrix
The EDA provides valuable insights into our integrated dataset:
- We observed a wide range of corporate financial health. Earnings Per Share span from negative to positive extremes, and Total Revenue figures reflect a vast range of company sizes.
- Stock prices vary widely across companies, and trading volumes differ substantially, reflecting different levels of market activity.
- Our correlation study reveals a strong correlation among the open, close, low, and high prices, a moderate association between a company’s earnings and its stock price, and a weak relationship between revenue and trading volume.
- Trading volume and stock price are inversely related in this dataset, so increased trading activity does not necessarily go together with higher stock prices.
Feature Engineering
With these analytical insights, we move forward to enhance our dataset through feature engineering:
- We introduce three derived features:
  - PE_Ratio: The Price to Earnings ratio, derived by dividing the closing stock price by the Earnings Per Share.
  - Price_Change: The intraday change in stock price, calculated by subtracting the opening price from the closing price.
  - Average_Price: The average of the day’s opening, closing, low, and high stock prices.
- To address anomalies in the data, the Interquartile Range (IQR) method will identify and mitigate outliers within our numerical fields.
- Normalization of pivotal numerical features, including Earnings Per Share and Total Revenue, will be executed using the MinMaxScaler, ensuring a standardized scale for model input.
- The ‘GICS Sector’ category will undergo one-hot encoding to convert sector classifications into a binary format compatible with algorithmic learning processes.
- The culmination of this process yields a dataset enriched with 103 columns, amalgamating the original data, the newly engineered features, and the one-hot encoded sectors.
from sklearn.preprocessing import MinMaxScaler
# Renaming columns for consistency
fundamentals_renamed = fundamentals.rename(
    columns={'Ticker Symbol': 'symbol', 'Period Ending': 'date'})
prices_split_adjusted_renamed = prices_split_adjusted.rename(
    columns={'open': 'open_price', 'close': 'close_price', 'low': 'low_price',
             'high': 'high_price', 'volume': 'trade_volume'})
securities_renamed = securities.rename(columns={'Ticker symbol': 'symbol'})
# Merging datasets
merged_data = pd.merge(
    pd.merge(fundamentals_renamed, prices_split_adjusted_renamed,
             on=['symbol', 'date']),
    securities_renamed, on='symbol')
# Creating new features
merged_data['PE_Ratio'] = merged_data['close_price'] / merged_data['Earnings Per Share']
merged_data['Price_Change'] = merged_data['close_price'] - merged_data['open_price']
merged_data['Average_Price'] = (merged_data['open_price'] + merged_data['close_price']
                                + merged_data['low_price'] + merged_data['high_price']) / 4
# Numerical features that are checked for outliers and then scaled
numerical_features = ['Earnings Per Share', 'Total Revenue',
                      'open_price', 'close_price', 'low_price', 'high_price',
                      'trade_volume', 'PE_Ratio', 'Price_Change', 'Average_Price']
# Handling Outliers: Using the IQR method to drop rows where any of these
# features lies outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]
Q1 = merged_data[numerical_features].quantile(0.25)
Q3 = merged_data[numerical_features].quantile(0.75)
IQR = Q3 - Q1
outlier_mask = ((merged_data[numerical_features] < (Q1 - 1.5 * IQR)) |
                (merged_data[numerical_features] > (Q3 + 1.5 * IQR))).any(axis=1)
merged_data = merged_data[~outlier_mask].copy()
# Feature Scaling: Normalizing the numerical features
scaler = MinMaxScaler()
merged_data[numerical_features] = scaler.fit_transform(merged_data[numerical_features])
# Encoding Categorical Variables: One-hot encoding for 'GICS Sector'
merged_data_encoded = pd.get_dummies(merged_data, columns=['GICS Sector'])
# Displaying a sample of the preprocessed dataset
merged_data_encoded.head()
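For readers who want to see the IQR rule and min-max scaling in isolation, the sketch below applies both to a small made-up series. The values are illustrative only; the formulas are the standard ones described in the list above.

```python
import pandas as pd

# Made-up values with one obvious outlier (100.0)
values = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 100.0])

# IQR rule: keep values within [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
kept = values[(values >= lower) & (values <= upper)]

# Min-max scaling: (x - min) / (max - min) maps the kept values onto [0, 1]
scaled = (kept - kept.min()) / (kept.max() - kept.min())

print(f"bounds: [{lower}, {upper}]")
print(scaled.tolist())
```

Removing the outlier before scaling matters here: with 100.0 still present, the remaining values would be squeezed into a narrow band near zero.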
Model Training & Testing
For our stock price prediction project, we must choose a machine learning model that excels in handling regression tasks, as we are dealing with predicting continuous stock price values. Given our dataset’s diverse and complex nature, our model needs to capture intricate patterns within the data adeptly.
- Model Selection: Chosen for its versatility and robustness, the Random Forest Regressor is well suited to our dataset’s complexity and variety of features. It handles regression tasks well, is less prone to overfitting than a single decision tree, and can capture non-linear relationships.
- Data Splitting: The dataset is split into an 80/20 ratio for training and testing. This ensures a comprehensive training phase while retaining a significant dataset for validation.
- Handling Missing Values: Missing values are addressed using the SimpleImputer’s median filling strategy from sklearn.impute, ensuring data completeness and consistency across the dataset.
- Training Process: The model is trained on the imputed training data, reflecting real-world scenarios with missing data points.
- Performance Evaluation: After training, the model’s predictive accuracy is assessed using the imputed testing set, giving insights into its real-world applicability.
The following code demonstrates the steps involved in this process:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.impute import SimpleImputer
# Assuming 'close_price' as the target variable for prediction
# Dropping the target variable and the identifier columns
X = merged_data_encoded.drop(['close_price', 'symbol', 'date'], axis=1)
y = merged_data_encoded['close_price']
# Checking for non-numeric columns in the dataset
non_numeric_columns = X.select_dtypes(include=['object']).columns
# If there are non-numeric columns, we'll remove them from the dataset
if len(non_numeric_columns) > 0:
    X = X.drop(non_numeric_columns, axis=1)
# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
# Initializing the Random Forest Regressor
random_forest_model = RandomForestRegressor(n_estimators=100, random_state=42)
# Creating an imputer object with a median filling strategy
imputer = SimpleImputer(strategy='median')
# Applying the imputer to the training and testing sets
X_train_imputed = imputer.fit_transform(X_train)
X_test_imputed = imputer.transform(X_test)
# Training the model
random_forest_model.fit(X_train_imputed, y_train)
# Predicting on the test set
y_pred = random_forest_model.predict(X_test_imputed)
# Evaluating the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
mse, r2
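The imputation step above fits the medians on the training split and reuses them on the test split. The sketch below mirrors that logic in plain pandas on made-up data, to show that the test rows never influence the fill values. The column names, numbers, and the positional split are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Made-up feature table with missing values
data = pd.DataFrame({
    "eps": [1.0, np.nan, 3.0, 5.0, np.nan, 100.0],
    "revenue": [10.0, 20.0, np.nan, 40.0, 50.0, np.nan],
})

# Simple 4/2 split by position (the article uses a random 80/20 split instead)
train, test = data.iloc[:4], data.iloc[4:]

# Medians come from the training rows only ...
train_medians = train.median()

# ... and are used to fill gaps in both splits
train_imputed = train.fillna(train_medians)
test_imputed = test.fillna(train_medians)

print(train_medians)
print(test_imputed)
```

Note that the extreme test value of 100.0 does not shift the fill value for ‘eps’, which is the behavior that `fit_transform` on the training set followed by `transform` on the test set is meant to guarantee.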
Model Performance
The output of our Random Forest Regressor model indicates the following:
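- Mean Squared Error (MSE): The MSE of 8.592×10⁻⁵ is measured on the min-max-scaled closing price, which lies between 0 and 1, so it indicates that the model’s predictions are very close to the actual scaled values.
- R-squared (R²): An R² value of approximately 0.96 implies that the model can explain about 96% of the variability in the closing price. The feature set includes the same day’s open, low, and high prices as well as features derived from the closing price itself (PE_Ratio, Price_Change, and Average_Price), so this score reflects how well the close can be reconstructed from same-day data rather than how well future prices can be forecast.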
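As a worked illustration of the two metrics, the sketch below computes MSE and R² by hand on a few made-up scaled values and then converts the error back into price units. The numbers and the $5 to $205 price range are assumptions chosen for easy arithmetic, not results from the model above.

```python
import numpy as np

# Made-up scaled targets and predictions (both on a 0-to-1 scale)
y_true = np.array([0.10, 0.20, 0.30, 0.40])
y_hat = np.array([0.12, 0.18, 0.33, 0.37])

# Mean squared error: average of squared residuals
mse = np.mean((y_true - y_hat) ** 2)

# R squared: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

# Converting the error back to price units, assuming the close price was
# min-max scaled over a hypothetical range of $5 to $205
price_min, price_max = 5.0, 205.0
rmse_dollars = np.sqrt(mse) * (price_max - price_min)

print(mse, r2, rmse_dollars)
```

Reporting the root mean squared error in dollars alongside the scaled MSE makes it easier to judge whether an error is small relative to actual share prices.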
Integration with GPT-4 API
After training the Random Forest Regressor model, we integrate it with the GPT-4 API. This integration lets the system both predict stock prices and communicate those predictions to users in plain language. The GPT-4 API, with its natural language processing capabilities, can interpret complex financial data and present it in a user-friendly way.
How does the Integration Work?
Here’s a detailed explanation of how the integration works:
- User Query Processing: The function get_model_predictions processes the user’s query to extract relevant information, such as the ticker symbol. Since we do not have the latest data, we generate a synthetic test row from the summary statistics (mean and standard deviation) of the stock in question.
- Model Prediction and Scaling: The Random Forest model predicts the stock price from the test data, and the prediction is scaled back to its original value using the previously fitted scaler.
- Preparing the Prompt for GPT-4: The query_gpt4_with_context function combines the user’s query, model predictions, and additional context, including price trends, fundamentals, and securities information for the specified stock. This prompt guides GPT-4 in delivering a tailored financial consultation based on the user’s query and the model’s analysis.
- GPT-4 Query and Response: GPT-4 generates a tailored response from the prompt, based on the data and the user’s financial profile.
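The query-processing step relies on finding the ticker symbol in free text. As a minimal sketch of one possible approach, the hypothetical helper below scans the query for all-caps tokens and returns the first one that appears in a set of known symbols. The function name and the symbol set are assumptions for illustration and are not part of the article's code.

```python
import re

def extract_ticker(query, known_symbols):
    """Return the first all-caps token in `query` that is a known symbol, else None."""
    for token in re.findall(r"\b[A-Z]{1,5}\b", query):
        if token in known_symbols:
            return token
    return None

# Hypothetical symbol list; in practice it could come from the securities table
known_symbols = {"ABBV", "ALXN", "WLTW"}

query = ("I bought 100 shares of ALXN at $100 per share. "
         "Should I sell my shares of company ALXN?")
ticker = extract_ticker(query, known_symbols)
print(ticker)
```

Matching against known symbols avoids two pitfalls of taking the last word of the query: trailing punctuation such as a question mark, and queries that do not end with the ticker.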
import os
import numpy as np
from openai import OpenAI
os.environ["OPENAI_API_KEY"] = 'YOUR API KEY'
client = OpenAI()
# Function to get model predictions based on user query
def get_model_predictions(user_query):
    # Extracting the ticker symbol (last word of the first query string)
    ticker_symbol = user_query[0].split()[-1].strip('?.!,').upper()
    # Aligning the generated row with the training columns, then applying
    # the imputer fitted on the training data before making predictions
    aligned_test_data = test_data.reindex(columns=X.columns).astype(float)
    imputed_test_data = imputer.transform(aligned_test_data)
    predicted_scaled_value = random_forest_model.predict(imputed_test_data)[0]
    confidence = 0.9  # Assuming 90% confidence in our predictions
    # Creating a placeholder array with the same shape as the
    # original scaled data
    placeholder_array = np.zeros((1, len(numerical_features)))
    # Inserting the predicted scaled value at the position of 'close_price'
    close_index = numerical_features.index('close_price')
    placeholder_array[0][close_index] = predicted_scaled_value
    # Performing the inverse transformation
    predicted_original_value = scaler.inverse_transform(placeholder_array)
    # Extracting the scaled-back value for 'close_price'
    predicted_stock_price = predicted_original_value[0][close_index]
    return {
        "predicted_stock_price": predicted_stock_price,
        "confidence": confidence
    }
# Function to query GPT-4 with model context
def query_gpt4_with_context(model_context, additional_data_context, user_query):
    prompt = (
        f"{additional_data_context}\n\n{model_context}\n\n{user_query}\n\n"
        "You are a financial advisor, an expert stock market consultant. "
        "Study the predictions, the data provided and the client's profile "
        "to provide consultation related to the stock, to the user based on "
        "the above information. Also, focus your advice on the given stock only."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "system", "content": prompt}]
    )
    return response.choices[0].message.content.strip()
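The scaling-back step in get_model_predictions inverts min-max scaling for a single column. The sketch below shows the same arithmetic directly on made-up numbers; the training prices and the model output of 0.25 are illustrative assumptions.

```python
import numpy as np

# Made-up training values for the close price (illustrative only)
close_prices = np.array([20.0, 35.0, 50.0, 80.0])

# Min-max scaling parameters learned from the training data
data_min = close_prices.min()
data_range = close_prices.max() - data_min

def scale(price):
    # Forward min-max scaling: (x - min) / (max - min)
    return (price - data_min) / data_range

def unscale(scaled_price):
    # Inverse of min-max scaling: x = x_scaled * (max - min) + min
    return scaled_price * data_range + data_min

predicted_scaled = 0.25  # hypothetical model output on the 0-to-1 scale
predicted_price = unscale(predicted_scaled)
print(predicted_price)
```

A fitted scikit-learn MinMaxScaler exposes the per-column minimum and range as its `data_min_` and `data_range_` attributes, so the same formula could be applied to the ‘close_price’ column alone instead of building a placeholder array.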
Now, let’s test the efficacy of our stock consultant with a couple of scenarios:
Test Case 1
I am a 25-year-old single male with a high-risk tolerance. I seek at least 15% annual growth in all my stock investments. Two years ago, I bought 100 shares of ABBV at $40 per share. Should I sell my ABBV shares? What is the likely profit in dollars and percentage from this sale?
Let’s feed this query into our model and see what output we get.
user_query = ["I am a 25-year-old single male. Given my status, "
              "I have a high risk tolerance and look for at least 15% growth "
              "per year in all my stock investments. I bought 100 shares "
              "of ABBV at $40 per share two years ago. "
              "Should I look to sell my shares of company ABBV",
              "What is my likely profit in $ and % from this sale?"]
# Generating a random row of data for the queried stock
# based on its summary statistics
ticker_symbol = user_query[0].split()[-1].strip('?.!,').upper()
df1 = merged_data_encoded[merged_data_encoded['symbol'] == ticker_symbol]
df1 = df1.drop(['close_price'], axis=1)
test_data = df1.describe().loc[['mean', 'std']].T
test_data['random_value'] = (np.random.randn(len(test_data))
                             * test_data['std'] + test_data['mean'])
# Selecting only the random values to form a DataFrame
test_data = pd.DataFrame(test_data['random_value']).transpose()
model_predictions = get_model_predictions(user_query)
# Generating model context
model_context = (f"The current predicted stock price of {ticker_symbol} is "
                 f"${model_predictions['predicted_stock_price']} with a "
                 f"confidence level of {model_predictions['confidence']*100}%.")
# Generating additional data context
additional_data_context = (prices[prices['symbol'] == ticker_symbol],
                           fundamentals[fundamentals['Ticker Symbol'] == ticker_symbol],
                           securities[securities['Ticker symbol'] == ticker_symbol])
gpt4_response = query_gpt4_with_context(model_context,
                                        additional_data_context, user_query)
print(f"GPT-4 Response: {gpt4_response}")
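In the code above, the additional context is a tuple of DataFrames that is interpolated into the prompt using its default string form, which pandas abbreviates for long tables. As a hedged alternative, the sketch below uses a hypothetical helper, frame_to_context, to render only the most recent rows as labeled plain text. The helper name and the toy data are assumptions, not part of the article's pipeline.

```python
import pandas as pd

def frame_to_context(title, frame, max_rows=5):
    """Render the last `max_rows` rows of a DataFrame as labeled plain text."""
    return f"{title}:\n{frame.tail(max_rows).to_string(index=False)}"

# Made-up price rows standing in for the filtered prices table
prices_toy = pd.DataFrame({
    "date": ["2016-12-28", "2016-12-29", "2016-12-30"],
    "close": [62.1, 62.6, 62.4],
})

context = frame_to_context("Recent prices", prices_toy, max_rows=2)
print(context)
```

Keeping the context short and explicitly labeled gives control over exactly which rows the language model sees and helps keep the prompt within its token limit.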
Test Case 2
I am a 40-year-old married female. Given my status, I have a low risk tolerance and look for at least 10% growth per year in all my stock investments. I bought 100 shares of ALXN at $100 per share two years ago. Should I look to sell my shares of company ALXN? What is my likely profit in $ and % from this sale?
user_query = ["I am a 40-year-old married female. "
              "Given my status, I have a low risk tolerance and "
              "look for at least 10% growth per year in all my stock "
              "investments. I bought 100 shares of ALXN at $100 per "
              "share two years ago. Should I look to sell my shares "
              "of company ALXN?",
              "What is my likely profit in $ and % from this sale?"]
# Generating a random row of data for the queried stock
# based on its summary statistics
# Trailing punctuation is stripped so that 'ALXN?' becomes 'ALXN'
ticker_symbol = user_query[0].split()[-1].strip('?.!,').upper()
df1 = merged_data_encoded[merged_data_encoded['symbol'] == ticker_symbol]
df1 = df1.drop(['close_price'], axis=1)
test_data = df1.describe().loc[['mean', 'std']].T
test_data['random_value'] = (np.random.randn(len(test_data))
                             * test_data['std'] + test_data['mean'])
# Selecting only the random values to form a DataFrame
test_data = pd.DataFrame(test_data['random_value']).transpose()
model_predictions = get_model_predictions(user_query)
# Generating model context
model_context = (f"The current predicted stock price of {ticker_symbol} is "
                 f"${model_predictions['predicted_stock_price']} with a "
                 f"confidence level of {model_predictions['confidence']*100}%.")
# Generating additional data context
additional_data_context = (prices[prices['symbol'] == ticker_symbol],
                           fundamentals[fundamentals['Ticker Symbol'] == ticker_symbol],
                           securities[securities['Ticker symbol'] == ticker_symbol])
gpt4_response = query_gpt4_with_context(model_context,
                                        additional_data_context, user_query)
print(f"GPT-4 Response: {gpt4_response}")
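Both test cases ask for the likely profit in dollars and as a percentage. The arithmetic behind such an answer can be sketched as follows. The share count, purchase price, and holding period come from Test Case 1, while the $60 sale price is a made-up figure and not a model prediction.

```python
def sale_profit(shares, buy_price, sell_price, years_held):
    """Return (profit in dollars, total return %, annualized return %)."""
    cost = shares * buy_price
    proceeds = shares * sell_price
    profit = proceeds - cost
    total_return_pct = profit / cost * 100
    # Compound annual growth rate over the holding period
    annualized_pct = ((sell_price / buy_price) ** (1 / years_held) - 1) * 100
    return profit, total_return_pct, annualized_pct

# 100 shares bought at $40 two years ago (from Test Case 1);
# the $60 sale price is a made-up figure, not a model prediction
profit, total_pct, annual_pct = sale_profit(100, 40.0, 60.0, 2)
print(f"Profit: ${profit:.2f}, total {total_pct:.1f}%, annualized {annual_pct:.1f}%")
```

Comparing the annualized figure with the investor's stated target (15% per year in Test Case 1) gives a simple, checkable baseline against which the language model's answer can be verified.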
Challenges
- One of the biggest challenges in implementing a project like this is ensuring the accuracy and timeliness of financial data. Inaccurate or outdated data can lead to misguided predictions and recommendations.
- Numerous unpredictable factors, including geopolitical events, economic changes, and company-specific news, influence stock markets. These elements can make AI predictions less reliable.
- AI models, despite their advanced capabilities, may struggle to fully grasp intricate financial terminologies and concepts, potentially impacting the quality of investment advice.
- Financial advising is heavily regulated. Ensuring that AI-driven recommendations comply with legal standards and ethical guidelines is a significant challenge.
Conclusion
Our exploration of AI in stock trading shows how models like GPT-3 and GPT-4 can combine large amounts of data with a predictive model’s output to offer personalized insights. Developing the stock trading consultant marks a step toward accessible, informed trading for everyone.
Key Takeaways
- The integration of AI into stock trading is not futuristic; it is already here, reshaping how we interact with the stock market.
- AI-driven models like GPT-3 and GPT-4 provide personalized strategies, aligning with individual risk profiles and financial goals.
- AI harnesses both historical and real-time data to predict market trends and inform investment strategies.
- Sophisticated investment strategies are no longer just for institutional investors; they are now accessible to retail investors, thanks to AI.
- AI empowers investors to make informed decisions, providing a strategic advantage in the volatile realm of stock investment.
Frequently Asked Questions
A. LLMs are AI models that process and generate text. They analyze financial reports, news, and market data in stock trading to provide investment insights and predictions.
A. While AI can augment and enhance the decision-making process, it is not likely to replace human advisors entirely. AI provides data-driven insights, but human advisors consider the emotional and personal aspects of financial planning.
A. AI predictions are based on data and trends, making them quite accurate. However, the stock market is influenced by unpredictable factors, so there’s always an inherent level of uncertainty.
A. AI offers a powerful tool for investment advice, but it should be one of several resources. Diversifying your decision-making tools, including seeking human advice, is recommended for a balanced investment strategy.
A. Begin by using AI-driven platforms or consult with financial advisors who employ AI in their services. Ensure that you understand the AI’s input data and reasoning before following its advice.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.