
William Gann's Automating Feature Selection in Trading Models with Scikit-Learn

From TradingHabits, the trading encyclopedia · 5 min read · February 28, 2026

In the realm of algorithmic trading, the adage "less is more" often holds true. A model with too many features can be prone to overfitting, leading to poor performance on out-of-sample data. Feature selection is the process of selecting a subset of relevant features for use in model construction. Scikit-Learn provides a variety of tools to automate this process.
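The examples below assume a feature matrix `X` and a target vector `y` are already defined. For a self-contained run-through, a synthetic stand-in can be generated with scikit-learn (the sample counts and feature counts here are arbitrary illustrative choices, not values from the article):

python
import numpy as np
from sklearn.datasets import make_classification

# Synthetic stand-in for a trading dataset: 500 observations,
# 20 candidate features (e.g., indicator values), of which only
# 5 carry real signal and 5 are redundant combinations of them.
X, y = make_classification(
    n_samples=500,
    n_features=20,
    n_informative=5,
    n_redundant=5,
    random_state=42,
)
print(X.shape)  # (500, 20)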

Univariate Feature Selection

Univariate feature selection methods score each feature individually against the target variable. The most common transformers are SelectKBest, which keeps the k highest-scoring features, and SelectPercentile, which keeps the features whose scores fall in a given top percentile.

python
from sklearn.feature_selection import SelectKBest, f_classif

# Assume X is our feature matrix and y is our target vector
selector = SelectKBest(f_classif, k=10)
selector.fit(X, y)
X_new = selector.transform(X)

Recursive Feature Elimination (RFE)

RFE is a more advanced feature selection method that recursively removes the least important features from the model. It works by training a model on the entire set of features and then removing the feature with the lowest importance score. This process is repeated until the desired number of features is reached.

python
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

estimator = LogisticRegression(max_iter=1000)  # raise max_iter to avoid convergence warnings
selector = RFE(estimator, n_features_to_select=5, step=1)
selector = selector.fit(X, y)
X_new = selector.transform(X)

Tree-Based Feature Selection

Tree-based models, such as Random Forest and Gradient Boosting, can also be used for feature selection. After fitting, these models expose a feature_importances_ attribute that ranks the features. The SelectFromModel transformer keeps the features whose importance exceeds a threshold.

python
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier

estimator = RandomForestClassifier(n_estimators=100)
selector = SelectFromModel(estimator, threshold='median')
selector.fit(X, y)
X_new = selector.transform(X)

Comparison of Feature Selection Methods

Method                        | Pros                              | Cons
----------------------------- | --------------------------------- | ----------------------------------------------
Univariate Selection          | Fast, easy to interpret           | Ignores feature interactions
Recursive Feature Elimination | Can capture feature interactions  | Computationally expensive
Tree-Based Selection          | Captures non-linear relationships | Can be biased toward high-cardinality features
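A quick way to see how the three methods differ in practice is to run them side by side on the same data and compare which feature indices each one keeps. This sketch assumes synthetic data from make_classification; all parameter values are illustrative:

python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif, RFE, SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=15,
                           n_informative=4, random_state=0)

# Univariate: top 5 features by ANOVA F-score
kbest = SelectKBest(f_classif, k=5).fit(X, y)

# RFE: recursively prune down to 5 features with logistic regression
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5).fit(X, y)

# Tree-based: keep features with above-median importance
sfm = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=0),
    threshold='median',
).fit(X, y)

# get_support() returns a boolean mask over the original columns
for name, sel in [('SelectKBest', kbest), ('RFE', rfe), ('SelectFromModel', sfm)]:
    print(name, np.flatnonzero(sel.get_support()))

The selected sets typically overlap on the strongest features but disagree at the margin, which is why the table's trade-offs matter when choosing a method.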

Mathematical Formulation: Information Gain

One common metric used for univariate feature selection is Information Gain, which is used in decision trees. The formula for Information Gain is:

IG(S, A) = H(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} H(S_v)

Where:

  • $H(S)$ is the entropy of the set S
  • $A$ is an attribute
  • $Values(A)$ is the set of all possible values of attribute A
  • $S_v$ is the subset of S for which attribute A has value v
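The formula can be implemented directly in a few lines of NumPy. This is a minimal sketch for discrete-valued features; the helper names (`entropy`, `information_gain`) and the toy arrays are illustrative, not from the article:

python
import numpy as np

def entropy(labels):
    """Shannon entropy H(S) of a label array, in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature, labels):
    """IG(S, A): entropy of S minus the weighted entropy of the
    subsets S_v induced by each value v of attribute A."""
    total = entropy(labels)
    weighted = 0.0
    for v in np.unique(feature):
        subset = labels[feature == v]
        weighted += len(subset) / len(labels) * entropy(subset)
    return total - weighted

# Toy example: a binary feature that perfectly separates the classes
feature = np.array([0, 0, 1, 1])
labels = np.array([0, 0, 1, 1])
print(information_gain(feature, labels))  # 1.0 (one full bit gained)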

By automating the feature selection process, you can build more robust and parsimonious trading models that are less prone to overfitting.
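To keep the automated workflow honest, the selector should be fit only on training data; otherwise information from the evaluation set leaks into the feature choice. One common pattern, sketched here with assumed synthetic data and illustrative parameters, is to place the selector inside a scikit-learn Pipeline so it is re-fit on each cross-validation fold:

python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=300, n_features=20,
                           n_informative=5, random_state=0)

# Because the selector sits inside the Pipeline, cross_val_score
# re-fits it on each training fold, so the held-out fold never
# influences which features are kept.
pipe = Pipeline([
    ('select', SelectKBest(f_classif, k=8)),
    ('model', LogisticRegression(max_iter=1000)),
])
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())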