Classification by the number of variables considered:
- Univariate methods (variable ranking): consider the input variables (features, attributes) one by one.
- Multivariate methods (variable subset selection): consider whole groups of variables together.
Classification by how the machine learning model is used during selection:
- Filter: selects a subset of variables independently of the model that will subsequently use them.
- Wrapper: selects a subset of variables taking into account the model that will use them.
- Embedded: the feature selection method is built into the ML model (or rather its training algorithm) itself (e.g. decision trees).
Common filter methods
Filter methods rank features with an evaluation criterion such as distance, information, dependency, or consistency, and select variables from that ranking, hence the name "filter". They are typically used as a data preprocessing step: the selection is independent of any machine learning algorithm, and features are ranked by statistical scores that favor features correlated with the outcome variable.
1. F-test (ANOVA)
Scikit-learn provides SelectKBest for selecting the K best features according to an F-test:
Use `sklearn.feature_selection.f_regression` for regression problems and `sklearn.feature_selection.f_classif` for classification problems.
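A minimal sketch of the classification variant; the iris dataset and k=2 are illustrative choices, not from the original post:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

# Illustrative data: 4 features, 3 classes.
X, y = load_iris(return_X_y=True)

# f_classif computes a one-way ANOVA F-statistic between each feature and
# the class labels; swap in f_regression when the target is continuous.
selector = SelectKBest(score_func=f_classif, k=2)
X_new = selector.fit_transform(X, y)

print(selector.scores_)        # one F-statistic per feature
print(selector.get_support())  # boolean mask of the k kept features
print(X_new.shape)             # (150, 2)
```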
2. Mutual Information
The F-test only captures linear relationships between a feature and the target; mutual information also handles nonlinear relationships well.
Use `sklearn.feature_selection.mutual_info_regression` for regression problems.
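A minimal sketch on a synthetic regression problem; the dataset shape and k=3 are illustrative assumptions:

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, mutual_info_regression

# Synthetic regression data with 3 informative features out of 10.
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       random_state=0)

# Each score estimates the mutual information between one feature and the
# target: 0 means independence, and nonlinear dependence is also picked up.
mi = mutual_info_regression(X, y, random_state=0)
print(mi)

# The scorer plugs into SelectKBest exactly like the F-test variants;
# mutual_info_classif is the counterpart for classification targets.
X_new = SelectKBest(score_func=mutual_info_regression, k=3).fit_transform(X, y)
print(X_new.shape)  # (200, 3)
```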
Common wrapper methods
Wrapper methods select a feature subset and then evaluate its modeling performance.
1. Forward Search
2. Recursive Feature Elimination
Wrapper methods use a greedy search to find the best feature set. The drawback is that they require training a large number of models, which is computationally expensive.
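A sketch of both strategies using scikit-learn's built-in RFE and SequentialFeatureSelector (the latter implements greedy forward search); the logistic-regression estimator and the target of 5 features are illustrative assumptions:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE, SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)  # scale so logistic regression converges
estimator = LogisticRegression(max_iter=1000)

# Recursive Feature Elimination: fit, drop the weakest feature by
# coefficient magnitude, refit, and repeat until 5 features remain.
rfe = RFE(estimator, n_features_to_select=5).fit(X, y)
print(rfe.get_support())

# Forward search: greedily add the single feature that most improves the
# cross-validated score. Both loops train many models, hence the cost.
sfs = SequentialFeatureSelector(estimator, n_features_to_select=5,
                                direction="forward").fit(X, y)
print(sfs.get_support())
```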
Common embedded methods
1. LASSO Linear Regression
2. Tree-based models
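A minimal sketch of both embedded approaches; the diabetes dataset, alpha=1.0, and the median threshold are illustrative assumptions, not values from the original post:

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

X, y = load_diabetes(return_X_y=True)

# LASSO: the L1 penalty drives some coefficients exactly to zero during
# training, so selection is a by-product of fitting the model itself.
lasso = Lasso(alpha=1.0).fit(X, y)
print(lasso.coef_)  # zeroed coefficients mark the discarded features

# Tree ensembles expose impurity-based importances; SelectFromModel keeps
# the features whose importance clears the chosen (here: median) threshold.
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
mask = SelectFromModel(forest, threshold="median", prefit=True).get_support()
print(mask)
```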