Machine Learning (ML) is a subset of Artificial Intelligence (AI) that focuses on developing algorithms that enable computers to learn from and make decisions based on data. Rather than being explicitly programmed to perform a task, ML algorithms build a model based on sample inputs to make predictions or decisions without human intervention. This learning process involves the use of statistical techniques to identify patterns and relationships within the data, thereby enabling the machine to improve its performance over time with more data.
Artificial Intelligence, a term most people are familiar with, encompasses a broader range of techniques, including rule-based systems, natural language processing, and robotics, with the goal of creating systems that can perform tasks typically requiring human intelligence. Machine Learning is a crucial part of AI as it provides the ability to adapt and improve autonomously. In essence, while AI aims to simulate intelligent behaviour, ML is the method by which this intelligence is achieved through data-driven learning, which makes it a natural fit for data-rich domains like trading and financial markets.
Random Forest Model in Trading Technical Analysis

I’ve written about many AI and ML models and techniques that can be used with trading and financial markets. My last article, “AI Reinforcement Learning with OpenAI’s Gym”, may be of interest. I also recommend checking out EODHD API’s Medium page. I use their APIs to provide the financial data to train my models. They are easy to use, and I also wrote a Python library for them that simplifies data retrieval.
In this article I want to introduce and demonstrate the Random Forest model. Random Forest is an ensemble learning method used for classification and regression tasks. It operates by constructing multiple decision trees during training and outputting the mode of the classes (for classification) or the mean prediction (for regression) of the individual trees. The ensemble of trees (the forest) mitigates the risk of overfitting to the training data, providing robust and accurate predictions.
In trading technical analysis, the Random Forest model can be particularly useful due to its ability to handle large amounts of data and complex patterns. For example, a trader might use Random Forest to predict stock price movements based on historical price data, volume, and other technical indicators such as moving averages and relative strength index (RSI). By training the model on historical data, it can learn the intricate relationships between these indicators and future price movements.
For instance, suppose a trader uses a dataset containing daily stock or cryptocurrency prices, volume, and technical indicators over the past five years. The Random Forest model can be trained to predict the likelihood of the price increasing or decreasing the next day. By inputting the current day’s data, the model provides a probability that can inform the trader’s decision to buy or sell, potentially improving trading outcomes by leveraging the model’s pattern recognition capabilities. This method not only enhances predictive accuracy but also helps in managing risks by providing a probabilistic assessment of future price movements.
Let’s look at a practical example…

The first step is to retrieve some data to work with. For interest’s sake, I’m going to use Bitcoin’s daily data. What I like about EODHD APIs is that they’re fast, with little to no retrieval limits. The code below retrieves 1999 days of data.
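As a rough sketch of what that retrieval can look like — the method name, its parameters, and the “BTC-USD.CC” symbol format are assumptions based on the eodhd library, so check its documentation for the version you have installed:

```python
import pandas as pd


def fetch_btc_daily(api_key: str, days: int = 1999) -> pd.DataFrame:
    # Third-party dependency: python3 -m pip install eodhd
    from eodhd import APIClient

    api = APIClient(api_key)
    # Hypothetical call: "BTC-USD.CC" is assumed to be EODHD's crypto
    # symbol format and "d" a daily interval; adjust to your library version.
    df = api.get_historical_data("BTC-USD.CC", "d", results=days)
    return df


# df = fetch_btc_daily("YOUR_API_KEY")
```

The function wrapper isn’t essential; it just keeps your API key out of the module’s top level.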
```python
from eodhd import APIClient
```

What I want to do now is add some technical indicators. This really is up to you and is part of the fun of experimenting. I’m going to add SMA50, SMA200, MACD, RSI14, and VROC. You can add whatever you prefer here.
```python
def calculate_sma(data, window):
    return data.rolling(window=window).mean()
```

This should be self-explanatory, but I want to point out something important. You will see that I drop rows with missing values at the end with “dropna”. This is really important, as ML models can only handle complete numeric values. I’m now left with 1800 days of interesting data to work with.
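Putting the indicator step together, here’s a sketch of how the columns might be built. These are common textbook formulas for the indicators, not necessarily identical to the ones I used, and the synthetic DataFrame just stands in for the real price data:

```python
import pandas as pd


def calculate_sma(data: pd.Series, window: int) -> pd.Series:
    return data.rolling(window=window).mean()


def calculate_rsi(data: pd.Series, window: int = 14) -> pd.Series:
    delta = data.diff()
    gain = delta.clip(lower=0).rolling(window=window).mean()
    loss = -delta.clip(upper=0).rolling(window=window).mean()
    rs = gain / loss
    return 100 - (100 / (1 + rs))


def calculate_macd(data: pd.Series, fast: int = 12, slow: int = 26) -> pd.Series:
    # MACD line: fast EMA minus slow EMA
    return data.ewm(span=fast, adjust=False).mean() - data.ewm(span=slow, adjust=False).mean()


def calculate_vroc(volume: pd.Series, window: int = 14) -> pd.Series:
    # Volume Rate of Change as a percentage
    return volume.pct_change(periods=window) * 100


# Synthetic stand-in for the real OHLCV DataFrame
df = pd.DataFrame({"close": range(1, 301), "volume": range(1000, 1300)}, dtype=float)
df["sma50"] = calculate_sma(df["close"], 50)
df["sma200"] = calculate_sma(df["close"], 200)
df["macd"] = calculate_macd(df["close"])
df["rsi14"] = calculate_rsi(df["close"])
df["vroc"] = calculate_vroc(df["volume"])

# Rolling windows leave NaN at the start of each column; drop those rows
df = df.dropna()
```

Note how the SMA200 column dictates how many rows survive the dropna — the first 199 rows have no 200-day average, which is why 1999 days of raw data shrank to roughly 1800 usable rows.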
Normalisation and Scaling

If you have read my other articles, you will notice that I almost always normalise and scale my data between 0 and 1. This is an exception to the rule. In general, scaling is not a strict requirement when using Random Forests because they are based on decision trees, which are not sensitive to the scale of the input features. However, scaling can still be beneficial in some scenarios, particularly when integrating Random Forests into a pipeline with other algorithms that do require scaling. Additionally, if you plan to interpret feature importances, having scaled data can sometimes make these interpretations more straightforward. For this example I’m not going to run the data through a scaler. You may want to do it, and if you do, I’ve explained how to do it in my previous articles. If you need help, just ask in the comments.
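If you do want to scale, a minimal sketch with scikit-learn’s MinMaxScaler follows — the column names and values here are placeholders, not the article’s actual data:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({"close": [100.0, 105.0, 110.0, 120.0],
                   "volume": [10.0, 40.0, 20.0, 30.0]})

# In a real pipeline, fit the scaler on the training portion only,
# otherwise information leaks from the test set into training.
scaler = MinMaxScaler(feature_range=(0, 1))
df_scaled = pd.DataFrame(scaler.fit_transform(df),
                         columns=df.columns, index=df.index)
print(df_scaled)
```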
Model Training

Training an ML model is actually very straightforward and requires very little code, thanks to some essential libraries. You will want to install “scikit-learn” using PIP.
```shell
% python3 -m pip install scikit-learn -U
```

What you will want to do is split your data into a train set and a test set. I almost always use a 70/30 or 80/20 split. I’ll use an 80/20 split here.
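A sketch of the split — the arrays are placeholders for the indicator features (X) and target (y). One detail worth flagging for time series: shuffle=False keeps the rows in chronological order, so the test set is the most recent 20%:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder feature matrix and target; in the article, X holds the
# indicator columns and y the value being predicted.
X = np.arange(100, dtype=float).reshape(50, 2)
y = np.arange(50, dtype=float)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=False
)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```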
```python
# include these library imports at the top of your file
from sklearn.model_selection import train_test_split
```

And you can see that the shape of the X_train, X_test, y_train, and y_test looks like this.
```python
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```

This is all you need to do to fit your model.
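In full, fitting and predicting looks something like this — synthetic data stands in for the real train/test split from the previous step:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Placeholder data; substitute the real X_train / X_test / y_train / y_test
rng = np.random.default_rng(42)
X_train, X_test = rng.normal(size=(80, 3)), rng.normal(size=(20, 3))
y_train = X_train.sum(axis=1)  # toy target so the fit has something to learn
y_test = X_test.sum(axis=1)

rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Predictions on both sets, used later for plotting and evaluation
y_train_pred = rf.predict(X_train)
y_test_pred = rf.predict(X_test)
```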
```python
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
```

Install the “matplotlib” and “seaborn” libraries using PIP.
```shell
% python3 -m pip install matplotlib seaborn -U
```

Include the libraries in your code.
```python
import matplotlib.pyplot as plt
import seaborn as sns
```

Scatter Plot of Actual vs. Predicted Values
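A sketch of that scatter plot, using placeholder predictions — swap in your own y_test and y_test_pred:

```python
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Placeholder values standing in for the model's test-set predictions
rng = np.random.default_rng(42)
y_test = np.linspace(100.0, 200.0, 50)
y_test_pred = y_test + rng.normal(0.0, 5.0, 50)

plt.figure(figsize=(14, 7))
sns.scatterplot(x=y_test, y=y_test_pred, alpha=0.6)
# Red dashed diagonal marks where predicted == actual
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], "r--")
plt.xlabel("Actual")
plt.ylabel("Predicted")
plt.title("Actual vs. Predicted Values")
plt.show()
```

The closer the points hug the diagonal, the better the model is tracking the actual values.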
```python
plt.figure(figsize=(14, 7))
```

Line Plot of Actual vs. Predicted Values Over Time
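And a sketch of the line plot, again with placeholder series — on the real data the x-axis would be the DataFrame’s date index:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(7)
y_test = np.cumsum(rng.normal(0.0, 1.0, 100)) + 100.0  # placeholder price path
y_test_pred = y_test + rng.normal(0.0, 2.0, 100)

plt.figure(figsize=(14, 7))
plt.plot(y_test, label="Actual")
plt.plot(y_test_pred, label="Predicted", alpha=0.7)
plt.xlabel("Time step")
plt.ylabel("Price")
plt.title("Actual vs. Predicted Values Over Time")
plt.legend()
plt.show()
```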
```python
plt.figure(figsize=(14, 7))
```

An important task when working with any AI/ML model is to evaluate its performance. This can be very useful when comparing models. There may be more, but the metrics I’ve always used are Mean Absolute Error (MAE), Mean Squared Error (MSE), and the R-squared score (R²). They seem to be the most common.
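Computing all three with scikit-learn takes a few lines — shown here on tiny placeholder arrays so the arithmetic is easy to check by hand:

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 5.0]

mae = mean_absolute_error(y_true, y_pred)  # average |error|
mse = mean_squared_error(y_true, y_pred)   # average squared error
r2 = r2_score(y_true, y_pred)              # share of variance explained

print(f"MAE: {mae}, MSE: {mse}, R²: {r2}")
```

In a real run you would call each of these twice, once with (y_train, y_train_pred) and once with (y_test, y_test_pred), so the training and testing metrics can be compared.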
```python
train_mae = mean_absolute_error(y_train, y_train_pred)
```

The result for my model looks like this:
```
Training MAE: 149.95774584583577
```

Mean Absolute Error (MAE):
MAE measures the average absolute error between the predicted and actual values. It provides a straightforward measure of how far off predictions are on average.
Mean Squared Error (MSE):
MSE measures the average squared error between the predicted and actual values. It penalises larger errors more than MAE, making it sensitive to outliers.
R-squared (R²):
R² measures the proportion of the variance in the dependent variable that is predictable from the independent variables. A value of 1 indicates perfect prediction, values near 0 mean the model explains little of the variance, and it can even go negative for models that fit worse than simply predicting the mean.
So what does this actually mean and why is it important?
The model performs exceptionally well on the training data, as indicated by the low Training MAE and MSE and the high Training R². This suggests that the model has learned the patterns in the training data very well.
The model also performs very well on the testing data, as indicated by the high Testing R². However, the Testing MAE and MSE are higher compared to the training metrics. This discrepancy suggests some degree of overfitting, where the model might be capturing noise in the training data that does not generalise well to the unseen testing data.
The significant difference between the training and testing errors (both MAE and MSE) suggests that the model may be slightly overfitting the training data. Overfitting occurs when a model learns the training data too well, including its noise and outliers, which negatively impacts its performance on new, unseen data.
What could we do to improve this?
Regularisation: We can consider using techniques to reduce overfitting, such as limiting the maximum depth of the trees, reducing the number of trees, or using other regularisation methods.
Cross-Validation: We can perform cross-validation to ensure that the model’s performance is consistent across different subsets of the data.
Feature Engineering: We can re-evaluate the selected features and possibly introduce new features or reduce the number of features to improve model generalisability. As I explained in the beginning of the article, I just selected some random technical indicators for my tutorial. There could be some interesting features that could be included or swapped out. Maybe percentage change could be one to look at.
Hyperparameter Tuning: We can optimise the hyperparameters of the Random Forest model to balance bias and variance, potentially improving performance on the testing data.
These steps can help in achieving a better balance between training and testing performance, leading to a more robust and generalisable model. I don’t necessarily think we have a huge problem and this is just a tutorial. I just wanted to give you some food for thought about what you can do when trying this out yourself.
Here is some code to help you get started…
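A sketch of what that tuning can look like with scikit-learn’s GridSearchCV. The grid below mirrors the parameter names in my output, but treat the exact values as a starting point rather than a recommendation, and the placeholder data as a stand-in for your real X_train / y_train:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Placeholder training data; substitute your X_train / y_train
rng = np.random.default_rng(42)
X_train = rng.normal(size=(60, 3))
y_train = X_train.sum(axis=1)

param_grid = {
    "n_estimators": [100, 200],
    "max_depth": [10, None],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 2],
    "bootstrap": [True, False],
}

grid_search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=5,                               # 5-fold cross-validation
    scoring="neg_mean_absolute_error",  # optimise for MAE
    n_jobs=-1,                          # use all CPU cores
)
grid_search.fit(X_train, y_train)

best_rf = grid_search.best_estimator_
print("Best parameters:", grid_search.best_params_)
```

The cv=5 argument also gives you cross-validation for free: every parameter combination is scored on five different train/validation splits, which is exactly the consistency check suggested above.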
```python
# update this import at the top
from sklearn.model_selection import train_test_split, GridSearchCV
```

You will notice the training takes a lot longer now. My iMac, which is fairly powerful, sounded like it was about to take off, it was working so hard :)
```
Best parameters: {'bootstrap': True, 'max_depth': 10, 'min_samples_leaf': 2, 'min_samples_split': 5, 'n_estimators': 200}
```

I just compared the previous results with the new ones. There was a very marginal improvement. While the changes did not lead to significant improvements in testing performance, they helped in reducing overfitting and stabilising the model’s performance. Further improvements might require additional feature engineering, more sophisticated hyperparameter tuning, or considering different models or techniques.
Feature Importance

The driver for exploring this model was to find out how it can be used to determine the importance of certain features in relation to the target.
Install the “pandas” library using PIP.
```shell
% python3 -m pip install pandas -U
```

Include the library in your code.
```python
import pandas as pd

feature_importances = best_rf.feature_importances_
```

Now, the way I interpret this is that the technical analysis rules aren’t really being applied, so the features on their own are pretty meaningless.
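To rank and display the importances, something like this works. The data here is synthetic, with one feature deliberately constructed to drive the target so the ranking is visible; the column names just echo the indicators used earlier:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
X = pd.DataFrame(rng.normal(size=(200, 3)), columns=["sma50", "rsi14", "vroc"])
y = 3.0 * X["rsi14"] + rng.normal(scale=0.1, size=200)  # rsi14 drives the target

best_rf = RandomForestRegressor(n_estimators=100, random_state=42).fit(X, y)

# Importances sum to 1; pairing them with column names makes them readable
feature_importances = pd.Series(
    best_rf.feature_importances_, index=X.columns
).sort_values(ascending=False)
print(feature_importances)
```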
I’ll give you some examples…
I would say that with some feature engineering, taking the technical indicators and creating features from their buy and sell signals, you would get a much better response.
I will leave that up to you to experiment with :)
Hint: I’ve done this feature engineering in my other articles if you feel like sneaking a peek.
I hope you found this article interesting and useful. If you would like to be kept informed, please don’t forget to follow me and sign up to my email notifications.
If you liked this article, I recommend checking out EODHD APIs on Medium. They have some interesting articles.

Michael Whittle

Artificial Intelligence (AI) models for Trading was originally published in Coinmonks on Medium.