Classification by the number of variables considered:
- Univariate methods (variable ranking): consider the input variables (features, attributes) one by one.
- Multivariate methods (variable subset selection): consider whole groups of variables together.
Classification by how the machine learning model is used during selection:
- Filter: selects a subset of variables independently of the model that will subsequently use them.
- Wrapper: selects a subset of variables taking into account the model that will use them.
- Embedded: the feature selection method is built into the ML model (or rather its training algorithm) itself (e.g. decision trees).
Common filter methods
Filter methods rank features with an evaluation criterion such as distance, information, dependency, or consistency, and select variables from that ranking, hence the name "filter". They are typically used as a data preprocessing step: the selection is independent of any machine learning algorithm, and features are ranked by statistical scores that favor features correlated with the outcome variable.
1. F-test (ANOVA)
Scikit-learn provides SelectKBest for selecting the K best features according to an F-test:
Use `sklearn.feature_selection.f_regression` for regression problems and `sklearn.feature_selection.f_classif` for classification problems.
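A minimal sketch of the classification variant; the iris dataset and k=2 are illustrative choices, not from the original post:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

# Illustrative data: 4 features, 3 classes.
X, y = load_iris(return_X_y=True)

# f_classif computes a one-way ANOVA F-statistic between each feature and
# the class labels; swap in f_regression when the target is continuous.
selector = SelectKBest(score_func=f_classif, k=2)
X_new = selector.fit_transform(X, y)

print(selector.scores_)        # one F-statistic per feature
print(selector.get_support())  # boolean mask of the k kept features
print(X_new.shape)             # (150, 2)
```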
2. Mutual Information
The F-test only captures linear relationships between a feature and the target; mutual information also handles nonlinear relationships well.
Use `sklearn.feature_selection.mutual_info_regression` for regression problems.
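A minimal sketch on a synthetic regression problem; the dataset shape and k=3 are illustrative assumptions:

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, mutual_info_regression

# Synthetic regression data with 3 informative features out of 10.
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       random_state=0)

# Each score estimates the mutual information between one feature and the
# target: 0 means independence, and nonlinear dependence is also picked up.
mi = mutual_info_regression(X, y, random_state=0)
print(mi)

# The scorer plugs into SelectKBest exactly like the F-test variants;
# mutual_info_classif is the counterpart for classification targets.
X_new = SelectKBest(score_func=mutual_info_regression, k=3).fit_transform(X, y)
print(X_new.shape)  # (200, 3)
```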
Common wrapper methods
Wrapper methods select a feature subset and then evaluate its modeling performance.
1. Forward Search
2. Recursive Feature Elimination
Wrapper methods use a greedy search to find the best feature set. The drawback is that they require training a large number of models, which is computationally expensive.
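A sketch of both strategies using scikit-learn's built-in RFE and SequentialFeatureSelector (the latter implements greedy forward search); the logistic-regression estimator and the target of 5 features are illustrative assumptions:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE, SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)  # scale so logistic regression converges
estimator = LogisticRegression(max_iter=1000)

# Recursive Feature Elimination: fit, drop the weakest feature by
# coefficient magnitude, refit, and repeat until 5 features remain.
rfe = RFE(estimator, n_features_to_select=5).fit(X, y)
print(rfe.get_support())

# Forward search: greedily add the single feature that most improves the
# cross-validated score. Both loops train many models, hence the cost.
sfs = SequentialFeatureSelector(estimator, n_features_to_select=5,
                                direction="forward").fit(X, y)
print(sfs.get_support())
```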
Common embedded methods
1. LASSO Linear Regression
2. Tree-based models
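A minimal sketch of both embedded approaches; the diabetes dataset, alpha=1.0, and the median threshold are illustrative assumptions, not values from the original post:

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

X, y = load_diabetes(return_X_y=True)

# LASSO: the L1 penalty drives some coefficients exactly to zero during
# training, so selection is a by-product of fitting the model itself.
lasso = Lasso(alpha=1.0).fit(X, y)
print(lasso.coef_)  # zeroed coefficients mark the discarded features

# Tree ensembles expose impurity-based importances; SelectFromModel keeps
# the features whose importance clears the chosen (here: median) threshold.
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
mask = SelectFromModel(forest, threshold="median", prefit=True).get_support()
print(mask)
```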