
William Gann's Automating Feature Selection in Trading Models with Scikit-Learn

From TradingHabits, the trading encyclopedia · 5 min read · February 28, 2026

In the realm of algorithmic trading, the adage "less is more" often holds true. A model with too many features can be prone to overfitting, leading to poor performance on out-of-sample data. Feature selection is the process of selecting a subset of relevant features for use in model construction. Scikit-Learn provides a variety of tools to automate this process.
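The examples below assume a feature matrix `X` and a target vector `y` are already defined. For a self-contained run-through, a synthetic stand-in can be generated with scikit-learn (the sample counts and feature counts here are arbitrary illustrative choices, not values from the article):

python
import numpy as np
from sklearn.datasets import make_classification

# Synthetic stand-in for a trading dataset: 500 observations,
# 20 candidate features (e.g., indicator values), of which only
# 5 carry real signal and 5 are redundant combinations of them.
X, y = make_classification(
    n_samples=500,
    n_features=20,
    n_informative=5,
    n_redundant=5,
    random_state=42,
)
print(X.shape)  # (500, 20)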

Univariate Feature Selection

Univariate feature selection methods score each feature individually against the target variable. The most common transformers are SelectKBest, which keeps the k highest-scoring features, and SelectPercentile, which keeps the features whose scores fall in a given top percentile.

python
from sklearn.feature_selection import SelectKBest, f_classif

# Assume X is our feature matrix and y is our target vector
selector = SelectKBest(f_classif, k=10)
selector.fit(X, y)
X_new = selector.transform(X)

Recursive Feature Elimination (RFE)

RFE is a more advanced feature selection method that recursively removes the least important features from the model. It works by training a model on the entire set of features and then removing the feature with the lowest importance score. This process is repeated until the desired number of features is reached.

python
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

estimator = LogisticRegression(max_iter=1000)  # raise max_iter to avoid convergence warnings
selector = RFE(estimator, n_features_to_select=5, step=1)
selector = selector.fit(X, y)
X_new = selector.transform(X)

Tree-Based Feature Selection

Tree-based models, such as Random Forest and Gradient Boosting, can also be used for feature selection. After fitting, these models expose a feature_importances_ attribute that ranks the features. The SelectFromModel transformer keeps the features whose importance exceeds a threshold.

python
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier

estimator = RandomForestClassifier(n_estimators=100)
selector = SelectFromModel(estimator, threshold='median')
selector.fit(X, y)
X_new = selector.transform(X)

Comparison of Feature Selection Methods

Method                        | Pros                              | Cons
----------------------------- | --------------------------------- | ----------------------------------------------
Univariate Selection          | Fast, easy to interpret           | Ignores feature interactions
Recursive Feature Elimination | Can capture feature interactions  | Computationally expensive
Tree-Based Selection          | Captures non-linear relationships | Can be biased toward high-cardinality features
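A quick way to see how the three methods differ in practice is to run them side by side on the same data and compare which feature indices each one keeps. This sketch assumes synthetic data from make_classification; all parameter values are illustrative:

python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif, RFE, SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=15,
                           n_informative=4, random_state=0)

# Univariate: top 5 features by ANOVA F-score
kbest = SelectKBest(f_classif, k=5).fit(X, y)

# RFE: recursively prune down to 5 features with logistic regression
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5).fit(X, y)

# Tree-based: keep features with above-median importance
sfm = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=0),
    threshold='median',
).fit(X, y)

# get_support() returns a boolean mask over the original columns
for name, sel in [('SelectKBest', kbest), ('RFE', rfe), ('SelectFromModel', sfm)]:
    print(name, np.flatnonzero(sel.get_support()))

The selected sets typically overlap on the strongest features but disagree at the margin, which is why the table's trade-offs matter when choosing a method.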

Mathematical Formulation: Information Gain

One common metric used for univariate feature selection is Information Gain, which is used in decision trees. The formula for Information Gain is:

IG(S, A) = H(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} H(S_v)

Where:

  • $H(S)$ is the entropy of the set S
  • $A$ is an attribute
  • $Values(A)$ is the set of all possible values of attribute A
  • $S_v$ is the subset of S for which attribute A has value v
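The formula can be implemented directly in a few lines of NumPy. This is a minimal sketch for discrete-valued features; the helper names (`entropy`, `information_gain`) and the toy arrays are illustrative, not from the article:

python
import numpy as np

def entropy(labels):
    """Shannon entropy H(S) of a label array, in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature, labels):
    """IG(S, A): entropy of S minus the weighted entropy of the
    subsets S_v induced by each value v of attribute A."""
    total = entropy(labels)
    weighted = 0.0
    for v in np.unique(feature):
        subset = labels[feature == v]
        weighted += len(subset) / len(labels) * entropy(subset)
    return total - weighted

# Toy example: a binary feature that perfectly separates the classes
feature = np.array([0, 0, 1, 1])
labels = np.array([0, 0, 1, 1])
print(information_gain(feature, labels))  # 1.0 (one full bit gained)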

By automating the feature selection process, you can build more robust and parsimonious trading models that are less prone to overfitting.
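To keep the automated workflow honest, the selector should be fit only on training data; otherwise information from the evaluation set leaks into the feature choice. One common pattern, sketched here with assumed synthetic data and illustrative parameters, is to place the selector inside a scikit-learn Pipeline so it is re-fit on each cross-validation fold:

python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=300, n_features=20,
                           n_informative=5, random_state=0)

# Because the selector sits inside the Pipeline, cross_val_score
# re-fits it on each training fold, so the held-out fold never
# influences which features are kept.
pipe = Pipeline([
    ('select', SelectKBest(f_classif, k=8)),
    ('model', LogisticRegression(max_iter=1000)),
])
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())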