Feature selection for ML – what to keep and what not to
- Shashank Shekhar
- Jan 15, 2021
- 4 min read

Photo by Edu Grande on Unsplash
Feature selection is a major step in the machine learning pipeline. A feature is an individual measurable property of the process being observed; using a set of features, a machine learning algorithm performs classification or regression. Many techniques have been developed to remove irrelevant and redundant variables, which are a burden on challenging tasks. Feature selection (variable elimination) helps in understanding the data, reduces the computational requirement, lessens the effect of the curse of dimensionality, and improves predictor performance.
What you learn from the data, and how good it is, depends on how well the selected features represent reality. Some ways to select the important features from a dataset are:
Remove correlated features: Do this prior to selecting the relevant features, because correlated features reinforce one another's significance during training while reducing the importance of other features, so the model cannot capture those other features correctly.
# Use the dataframe to extract the correlation matrix
cor = your_dataframe.corr()
sns.heatmap(cor, annot=True)  # assumes seaborn is imported as sns
After plotting the heatmap, discard one feature of each highly correlated pair, keeping the one with the larger influence on the outcome.
Chi-square test: The Chi-square test is used for categorical features in a dataset. We calculate Chi-square between each feature and the target and select the desired number of features with the best Chi-square scores. To correctly apply the Chi-squared test to the relation between the features and the target variable, the following conditions have to be met: the variables have to be categorical and sampled independently, and every value should have an expected frequency greater than 5.
from sklearn.feature_selection import SelectKBest, chi2
# Convert categorical X to numeric codes
X_cat = X.astype(int)
chi2_features = SelectKBest(chi2, k=4)
# This returns the 4 features with the highest chi2 scores
X_4best = chi2_features.fit_transform(X_cat, Y)
Fisher score: Fisher score is one of the most widely used supervised feature selection methods. The algorithm we will use returns the scores of the variables based on Fisher's criterion; we can then select the variables as the case requires.
from skfeature.function.similarity_based import fisher_score
imp = fisher_score.fisher_score(X, Y)
imp_features = pd.Series(imp, index=dataframe.columns[:-1])
Information Gain: Use Information Gain (the amount of information gained about the target by knowing the value of a feature) to select the top "k" important features. The features with the higher Information Gain are more important and would be selected for training.
from sklearn.feature_selection import mutual_info_classif
imp = mutual_info_classif(X, Y)
imp_features = pd.Series(imp, index=dataframe.columns[:-1])
Forward Feature Selection: This is an iterative method in which we start with the single best-performing variable against the target. Next, we add the variable that gives the best performance in combination with the first selected variable. The process continues until a preset criterion is met.
There are a few others, including PCA (Principal Component Analysis). A fairly smart feature selection can also be done with a Random Forest, which applies to both classification and regression use cases. Because each tree is grown by greedily choosing the split with the largest impurity reduction, the important features tend to sit near the top of the trees, and the forest's aggregated importances give a useful ranking. One should use this only on a fairly rich dataset, and when the final model would be computationally expensive to build.
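The random-forest ranking above can be sketched as follows; the synthetic dataset and hyperparameters are assumptions for illustration:

```python
# Sketch: ranking features by random-forest impurity importance
# (synthetic dataset and hyperparameters are assumptions).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# 10 features, only 3 of which carry signal
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=3,
    n_redundant=0, random_state=0,
)
rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(X, y)

# Higher importance = the feature contributed more impurity reduction
ranked = sorted(enumerate(rf.feature_importances_),
                key=lambda p: p[1], reverse=True)
top3 = [idx for idx, _ in ranked[:3]]
print(top3)
```

The importances sum to 1, so they can be read as relative shares of the forest's total impurity reduction.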
What if the number of features is greater than the number of data instances (observations)?
In general, you want more observations than features in order to learn the model accurately. If your dataset is high-dimensional (more features than observations), it is very easy for your model to overfit.
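One common remedy in this regime is L1-penalized regression, which drives the coefficients of irrelevant features to exactly zero. A minimal sketch (the synthetic data and the alpha value are assumptions for illustration):

```python
# Sketch: Lasso zeroing out irrelevant features
# (synthetic data and alpha are illustrative assumptions).
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 20))  # 20 features, only 2 informative
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=50)

lasso = Lasso(alpha=0.1)
lasso.fit(X, y)

# The L1 penalty shrinks most coefficients exactly to zero;
# the surviving (nonzero) columns are the selected features
kept = np.flatnonzero(lasso.coef_)
print(kept)
```

The alpha parameter controls how aggressively coefficients are shrunk: larger values keep fewer features.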
This problem of having high dimensionality is known as the Curse of Dimensionality. Regression methods such as LARS, Lasso, or Ridge work well under these circumstances: they shrink the coefficients, and L1-based methods like Lasso remove irrelevant features from the dataset by driving their coefficients to zero. You can also try Principal Component Analysis (PCA) for dimensionality reduction.
What if the dataset is skewed (a large number of one or a few types and very few of the other types)?
Having skewed classes in a dataset is not an uncommon problem, and it occurs when one class is over-represented. For instance, in detecting fraudulent credit card transactions, a large percentage of the dataset will be authentic transactions made by the cardholder and a very small part will be fraudulent. In such a scenario it is very easy for your model to almost always predict "genuine" for each transaction, which is not correct. Hence it is essential to check whether your dataset suffers from the skewed-classes problem and to take measures to overcome it. Some ways to mitigate this issue are:
- Collect more data to even out the classes, or add synthetic (but not copied) data for the under-represented class.
- Perform under-sampling, or bucketize the dataset by picking a balanced lot across all the classes, to correct the imbalance.
- Use a one-class learning algorithm instead of a traditional classifier: one-class learning tries to identify the data belonging to a specific class by learning from a training set containing only observations of that class, in effect learning to de-noise the data.
- Use an asymmetric cost function (unequal weighting for different classes) to artificially balance the training process.
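The asymmetric-cost idea can be sketched with scikit-learn's class_weight option; the imbalanced synthetic dataset here is an assumption for illustration:

```python
# Sketch of an asymmetric cost function via class weights
# (the imbalanced synthetic dataset is an assumption).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# ~95% "genuine" majority class vs ~5% "fraud"-like minority class
X, y = make_classification(
    n_samples=2000, weights=[0.95, 0.05], random_state=0,
)

# class_weight="balanced" reweights errors inversely to class frequency,
# so mistakes on the rare class cost more during training
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X, y)
print(clf.classes_)
```

Compared with an unweighted fit, this typically trades some majority-class accuracy for better recall on the rare class, which is usually the right trade in fraud-style problems.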


