Crypto Strategy — Linear regression with Binance data — Machine learning. Part 1

John Jairo
Trading Data Analysis
6 min read · Jan 16, 2023


Linear regression is a statistical method for analyzing the relationship between two variables, where one variable is dependent on the other. It can be used to predict future values of a dependent variable based on past values of the independent variable.
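As a minimal sketch of the idea, the snippet below fits a straight line to a handful of made-up points with the closed-form least-squares formulas and uses it to predict the next value. The data and the helper name `fit_line` are assumptions for illustration only and are not part of the strategy.

```python
# Minimal sketch: ordinary least squares for one independent variable.
# The data points and the helper name `fit_line` are made up for illustration.

def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]   # exactly y = 2x + 1

slope, intercept = fit_line(xs, ys)
next_y = slope * 6.0 + intercept   # predict the dependent value at x = 6
print(slope, intercept, next_y)
```

Real price data never falls exactly on a line, so the fitted line is only the best straight-line approximation of the relationship.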

Machine learning is a subset of artificial intelligence (AI) that involves the use of algorithms and statistical models to analyze and learn from data. It is increasingly being used in the stock market to make predictions and identify patterns in financial data.

In the context of Bitcoin, linear regression could be used to predict future prices of Bitcoin based on historical price data. However, it’s important to note that Bitcoin is a highly volatile and speculative asset, and its price is influenced by many factors that may not be easily quantifiable or predictable.

One of the main ways that machine learning is used in the stock market is through the creation of predictive models. These models use historical financial data to make predictions about future stock prices and market trends. This can be useful for traders and investors looking to make informed decisions about when to buy or sell stocks.

1. Data Preparation

Importing data from Binance makes it easier to retrieve financial information such as trading history, market prices, and account information. This is typically done through the Binance API, whose responses we can load into our own DataFrame.

!pip install python-binance pandas mplfinance

import pandas as pd
from binance import Client

# apikey and secret are your own Binance API credentials (placeholders here)
client = Client(apikey, secret)

# Latest price for every symbol, loaded into a DataFrame
tickers = client.get_all_tickers()
tickers[1]['price']
ticker_df = pd.DataFrame(tickers)
ticker_df.head()

I am using python-binance, an unofficial Python wrapper for the Binance exchange REST API, to access data and perform actions on the Binance cryptocurrency exchange platform.

historical = client.get_historical_klines('BTCUSDT', Client.KLINE_INTERVAL_1DAY, '1 Jan 2011')
hist_df = pd.DataFrame(historical)
hist_df.columns = ['Open Time', 'Open', 'High', 'Low', 'Close', 'Volume', 'Close Time', 'Quote Asset Volume',
'Number of Trades', 'TB Base Volume', 'TB Quote Volume', 'Ignore']
hist_df.tail()

This library is commonly used by traders and developers to automate their trading strategies and to access Binance data for analysis and research. It helped me build our DataFrame of Bitcoin price information.

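numeric_columns = ['Open', 'High', 'Low', 'Close', 'Volume', 'Quote Asset Volume', 'TB Base Volume', 'TB Quote Volume']
# The API returns prices and volumes as strings, so convert them to numbers first
hist_df[numeric_columns] = hist_df[numeric_columns].apply(pd.to_numeric)
cp = hist_df.drop(columns=['Quote Asset Volume', 'TB Base Volume', 'TB Quote Volume', 'Ignore', 'Number of Trades', 'Close Time'])
# Set the index on cp (not hist_df) so the dropped columns stay dropped
cp = cp.set_index('Open Time')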

Once the two time columns are defined, we call set_index() on 'Open Time' so that we can start processing the DataFrame and develop the price prediction.
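To make the shape of the raw data concrete, here is a small sketch that turns one kline row into typed values using only the standard library. It assumes the timestamps are milliseconds since the Unix epoch and that prices and volumes arrive as strings; the helper `parse_kline` and the sample row are hypothetical and not real market data. In pandas, the equivalent conversions would be `pd.to_datetime(..., unit='ms')` and `pd.to_numeric`.

```python
from datetime import datetime, timezone

# Column order of one kline row, matching the column names used in this post.
KLINE_FIELDS = ['Open Time', 'Open', 'High', 'Low', 'Close', 'Volume',
                'Close Time', 'Quote Asset Volume', 'Number of Trades',
                'TB Base Volume', 'TB Quote Volume', 'Ignore']

def parse_kline(row):
    """Turn one raw kline row into a dict with typed values (hypothetical helper)."""
    record = dict(zip(KLINE_FIELDS, row))
    # Assumption: timestamps are milliseconds since the Unix epoch (UTC).
    for key in ('Open Time', 'Close Time'):
        record[key] = datetime.fromtimestamp(record[key] / 1000, tz=timezone.utc)
    # Assumption: prices and volumes arrive as strings.
    for key in ('Open', 'High', 'Low', 'Close', 'Volume'):
        record[key] = float(record[key])
    return record

# Made-up sample row, not real market data.
sample_row = [1672531200000, '16500.0', '16700.5', '16400.0', '16600.0', '1234.5',
              1672617599999, '0', 100, '0', '0', '0']

candle = parse_kline(sample_row)
print(candle['Open Time'], candle['High'] - candle['Open'])
```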

2. Data preprocessing and prediction.

Machine Learning can be used to identify trends and patterns in stock prices, volume, and other indicators. This can be useful for traders and investors looking to identify potential opportunities in the market.

Data Preprocessing.

Feeding a machine-learning model preprocessed data is essential. Raw data can contain errors and missing values, and using it as-is leads to inconsistent and erroneous results. Here we drop the rows with missing values and then confirm that none remain.

cp.dropna(inplace=True)
cp.isna().sum()

Scaling

Suppose a feature has a variance that is orders of magnitude larger than that of the other features. In that case, it might dominate the objective function and prevent the estimator from learning from the other features correctly. To avoid this, we use StandardScaler, which rescales each feature to zero mean and unit variance.

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

# First we put scaling and then linear regression in the pipeline.
steps = [('scaler', StandardScaler()),
         ('linear', LinearRegression())]
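To see what standardization does to features on very different scales, the sketch below applies the same transformation by hand: subtract the mean and divide by the population standard deviation. The helper `standardize` and the numbers are made up for illustration; this is not the scikit-learn implementation, only the arithmetic behind it.

```python
from statistics import mean, pstdev

def standardize(values):
    """Rescale values to zero mean and unit variance (population std)."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]

# Two made-up features on very different scales.
prices = [16500.0, 16700.0, 16900.0, 17100.0]
volumes = [1.2, 0.8, 1.0, 1.4]

scaled_prices = standardize(prices)
scaled_volumes = standardize(volumes)
print(scaled_prices)
print(scaled_volumes)
```

After scaling, both features have the same spread, so neither dominates the fit merely because of its units.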

Pipeline

We define a list of tuples that specify the machine learning tasks in their order of execution. Each entry in steps is a (name, transform) tuple, and the last entry is the estimator.
We are using the following two steps in our pipeline:

  1. Scaling the data.
  2. Fitting the data using the linear regression model.

from sklearn.pipeline import Pipeline

# Defining pipeline
pipeline = Pipeline(steps)

Hyperparameters

There are some parameters that the model itself cannot estimate from the data, yet they play a crucial role in the performance of the system. Such parameters are called hyperparameters. Here we tune only fit_intercept, but you can add more hyperparameters to the grid.

# Here we are using the intercept as a hyperparameter.
# fit_intercept is a boolean; recent scikit-learn versions reject 0/1 here.
parameters = {'linear__fit_intercept': [False, True]}
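A rough sketch of what this hyperparameter changes: with the intercept the fitted line can sit at any height, and without it the line is forced through the origin. The helper names and data below are assumptions for illustration. For brevity the two options are scored on the training points, whereas GridSearchCV scores them on held-out folds.

```python
# Minimal sketch: what fit_intercept changes for a one-feature linear model.
# `fit_line`, `squared_error` and the data are illustrative, not scikit-learn code.

def fit_line(xs, ys, fit_intercept):
    """Least-squares (slope, intercept); the intercept is 0 when not fitted."""
    if fit_intercept:
        mean_x = sum(xs) / len(xs)
        mean_y = sum(ys) / len(ys)
        slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
                 / sum((x - mean_x) ** 2 for x in xs))
        return slope, mean_y - slope * mean_x
    # Without an intercept the line is forced through the origin.
    slope = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
    return slope, 0.0

def squared_error(xs, ys, slope, intercept):
    """Sum of squared residuals of the line on the given points."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))

xs = [1.0, 2.0, 3.0, 4.0]
ys = [13.0, 16.0, 19.0, 22.0]   # exactly y = 3x + 10

# Try both values of the hyperparameter and keep the one with the lower error.
results = {}
for fit_intercept in (False, True):
    slope, intercept = fit_line(xs, ys, fit_intercept)
    results[fit_intercept] = squared_error(xs, ys, slope, intercept)

best = min(results, key=results.get)
print(results, best)
```

Because the pipeline standardizes the features to zero mean first, a model without an intercept cannot shift its predictions up to the average of the target, so fitting the intercept usually matters here.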

Grid Search Cross-Validation

Cross-validation indicates the model’s performance in a practical situation. It is used to tackle the overfitting of a model. We will use the GridSearchCV function, an inbuilt function for cross-validation.

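We pass a TimeSeriesSplit with n_splits=5 as the cv argument, so the grid search runs five rounds of cross-validation and averages the performance results. TimeSeriesSplit splits the training data into consecutive segments, so each round trains on earlier observations and validates on later ones, which keeps the time order intact. We are using GridSearchCV instead of RandomizedSearchCV because the parameter grid is small enough to search exhaustively.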

from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# Using TimeSeriesSplit for cross validation
my_cv = TimeSeriesSplit(n_splits=5)

# Defining reg as variable for GridSearch function containing pipeline, hyperparameters
reg = GridSearchCV(pipeline, parameters, cv=my_cv)
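To illustrate how a time-series split differs from a shuffled one, the sketch below generates expanding-window folds by hand. It mirrors, as far as I understand it, the default behaviour of TimeSeriesSplit (no gap, no maximum training size); the helper `time_series_splits` is hypothetical and is not the scikit-learn implementation.

```python
def time_series_splits(n_samples, n_splits):
    """Yield (train_indices, test_indices) with an expanding training window."""
    test_size = n_samples // (n_splits + 1)
    first_test_start = n_samples - n_splits * test_size
    for test_start in range(first_test_start, n_samples, test_size):
        train = list(range(0, test_start))                       # everything before the fold
        test = list(range(test_start, test_start + test_size))   # the next block in time
        yield train, test

splits = list(time_series_splits(n_samples=12, n_splits=5))
for train, test in splits:
    print(train, test)
```

In every fold the validation block comes strictly after the training block, so the model is never evaluated on data older than what it was trained on.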

Split Train and Test Data

Now, we will split the data into train and test sets.

  1. The first 70% of the data is used for training, and the remaining 30% for testing.
  2. The training data is then fitted with the grid search function.

splitting_ratio = .70

# Splitting the data into two parts
# Using int to make sure an integer index comes out.
split = int(splitting_ratio*len(cp))

# X (the features) and yU, yD (the upward and downward deviations) are assumed
# to have been built from cp beforehand; this post does not show that step.

# Defining train dataset
X_train = X[:split]
yU_train = yU[:split]
yD_train = yD[:split]

# Defining test data (copy so that new columns can be added to it later)
X_test = X[split:].copy()

Prediction

We will fit the linear regression model on the training dataset and predict the upward deviation in the test dataset.

# Fit the model
reg.fit(X_train, yU_train)

#Result
GridSearchCV(cv=TimeSeriesSplit(gap=0, max_train_size=None, n_splits=5, test_size=None),
estimator=Pipeline(steps=[('scaler', StandardScaler()),
('linear', LinearRegression())]),
param_grid={'linear__fit_intercept': [0, 1]})

Inspecting reg.best_params_ shows that the best model uses linear__fit_intercept equal to 1, that is, with the intercept fitted.

Linear Regression

Linear regression uses independent variables to predict a dependent variable through a linear equation. Here we use X as the independent variables and yU and yD as the dependent variables, each fitted separately.
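This post does not show how X, yU and yD are built, so the following is only a sketch under assumptions: the targets follow the deviation formulas given later in this post (High - Open and Open - Low), and X is assumed to contain at least the Open price, since the predicted high and low are later computed from it. The candles are made up. With pandas, the targets would be `cp['High'] - cp['Open']` and `cp['Open'] - cp['Low']`.

```python
# Made-up candles (open, high, low); not real market data.
candles = [
    {'Open': 100.0, 'High': 105.0, 'Low': 98.0},
    {'Open': 104.0, 'High': 104.0, 'Low': 101.5},
    {'Open': 102.0, 'High': 107.5, 'Low': 102.0},
]

# Dependent variables: how far price moved above and below the open.
yU = [c['High'] - c['Open'] for c in candles]   # upward deviation
yD = [c['Open'] - c['Low'] for c in candles]    # downward deviation

# Independent variables: assumption, only the open price is used here.
X = [[c['Open']] for c in candles]

print(X, yU, yD)
```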

Here we predict upward deviation using the reg model on the test dataset. We define yU_predict for upward prediction.

# Predict the upward deviation (reg is currently fitted on yU_train)
yU_predict = reg.predict(X_test)

# Refit the same grid search on the downward deviation
reg.fit(X_train, yD_train)

# Predict the downward deviation
yD_predict = reg.predict(X_test)

Now we will create yU_predict and yD_predict columns in the X_test. Formulas for upward deviation and downward deviation are given:

Upward deviation = High - Open
Downward deviation = Open - Low

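Since the high is never below the open and the low is never above it, these two formulas show that upward and downward deviations cannot be negative. The model's predictions can be, however, so we replace negative predicted values with zero.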

# Create new column in X_test
X_test['yU_predict'] = yU_predict
X_test['yD_predict'] = yD_predict

# Assign zero to all the negative predicted values to take into account real life conditions
X_test.loc[X_test['yU_predict'] < 0, 'yU_predict'] = 0
X_test.loc[X_test['yD_predict'] < 0, 'yD_predict'] = 0

We will use the predicted upside deviation values to calculate the high price and the predicted downside deviation values to calculate the low price.

# Add the predicted upward deviation to the open to get the predicted high column
X_test['P_H'] = X_test['Open']+X_test['yU_predict']

# Subtract the predicted downward deviation from the open to get the predicted low column
X_test['P_L'] = X_test['Open']-X_test['yD_predict']

# Print tail of the X_test dataframe
X_test.tail()
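The two steps above, clipping negative predictions to zero and then adding or subtracting them from the open, can be sketched on a couple of made-up rows. The helper `predicted_range` and the numbers are hypothetical and only illustrate the arithmetic.

```python
def predicted_range(open_price, up_pred, down_pred):
    """Return (predicted high, predicted low) for one candle."""
    # Deviations cannot be negative, so negative predictions are set to zero.
    up = max(up_pred, 0.0)
    down = max(down_pred, 0.0)
    return open_price + up, open_price - down

# (open, predicted upward deviation, predicted downward deviation), made up
rows = [(16500.0, 120.0, 80.0), (16600.0, -15.0, 60.0)]

ranges = [predicted_range(*row) for row in rows]
print(ranges)
```

In the second row the negative upward prediction is clipped, so the predicted high equals the open.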

Here we add the Close, High, and Low columns from cp because we will need them to calculate strategy returns in the following notebook.

We slice cp with the split index to keep only its test part.

# Copy columns from cp to X_test
X_test[['Close', 'High', 'Low']] = cp[['Close', 'High', 'Low']][split:]
X_test.tail()

Conclusion

In this blog, we present a system capable of taking values from Binance in order to create variable sets that our system will use to learn regression rules.

In this first part, I explained how to get the data from Binance and derive the new variables during data processing. In the next post, I will explain the market conditions under which this strategy can operate.

The system integrates the task of developing a regression model from data with the technique of price prediction to find the logical conditions for opening long or short positions.

A simple linear regression model is therefore unlikely to provide very accurate predictions of future Bitcoin prices, but we will examine that hypothesis in the next post using the outcome of the strategy.

Happy trading.

If you enjoyed this article, make sure you check out our Twitter & LinkedIn.

All the content is educational and is not investment advice. Do your own research. Thank you so much for reading.


1/3 Crypto Investor + 1/3 Trading Programmer + 1/3 DeFi Researcher = Research Engineer