CN101334843B - Pattern recognition characteristic extraction method and apparatus
- Publication number: CN101334843B
- Authority: CN (China)
- Legal status: Expired - Fee Related
Abstract
The invention discloses a method and an apparatus for extracting features in pattern recognition that effectively avoid the subjectivity of manually pre-specifying the number of features, as required by previous feature extraction approaches. The feature extraction method comprises the steps of: determining discrete feature variables and class variables according to the original pattern information of a sample and preprocessing them; setting a joint contribution threshold; determining the joint contribution of combinations of the feature variables with the class variables; and obtaining the combinations of feature variables whose joint contribution is greater than or equal to the set threshold. The feature extraction apparatus comprises a numerical preprocessing module, a threshold setting module, a joint contribution determining module and a feature extraction module. The method and apparatus can be widely applied to feature extraction from discrete digital image information, fingerprint information, faceprint information, voice information, handwritten/printed character information, and the like.
Description
Technical Field
The present invention relates to the field of pattern recognition, and in particular, to a method and an apparatus for extracting features in pattern recognition.
Background
A pattern is information with a temporal and spatial distribution obtained by observing a specific individual thing; the class to which a pattern belongs, or the totality of patterns within the same class, is referred to as the pattern class (or simply the class). Pattern recognition classifies patterns to be recognized into their respective pattern classes based on some measure or observation.
The study of pattern recognition has focused mainly on two aspects: how organisms (including humans) perceive objects, and how pattern recognition can be implemented by a computer for a given task.
A computer pattern recognition system basically consists of three interrelated but distinct processes: data generation, pattern analysis and pattern classification. Data generation quantizes the original information of an input pattern and converts it into a vector that a computer can easily process. Pattern analysis processes the data, including feature selection, feature extraction, data dimension compression, and determination of the possible categories. Pattern classification trains the computer with the information obtained by pattern analysis, so as to formulate a discrimination standard for classifying the patterns to be recognized.
Feature extraction in pattern analysis is very important for efficient pattern classification. Pattern classification touches many fields, such as image classification, speech recognition, biotechnology and medicine. Classification efficiency has always been an important topic of pattern classification research: in many practical problems there are very many candidate feature variables, and if all available feature variables were taken into account, classification would be far too inefficient for practical use. It is therefore necessary to extract feature variables, use the feature subsets obtained by feature extraction as the input of the target classifier, train the classifier, and classify with the feature subsets, thereby improving classification efficiency.
Feature extraction searches for a feature subspace that minimizes the loss of information, where the amount of information is measured by the mutual information between the feature subspace and the class variables; the feature extraction method considers not only the correlation between the feature variables and the class variables, but also the correlation among the feature variables themselves.
Feature extraction can be applied to traditional Chinese medicine (TCM). Syndrome differentiation is the core of TCM: it is a method of understanding and diagnosing diseases using TCM theory, where a syndrome is a complex of symptoms of unknown etiology and a characterization of an abnormal state of the organism. Generalized symptoms include not only the information from the four diagnostic methods, but also factors such as sex, constitution, mood, stress, diet and living habits. During syndrome differentiation it is difficult for the physician to take all observed symptoms into account, because there are too many symptoms and signs. Different symptoms and signs play different roles in syndrome differentiation, and finding the symptom-and-sign set with the largest amount of information to serve as the differentiation standard for a given syndrome is a very important problem in the TCM field.
Feature extraction applies equally to pattern recognition of digital images. Pattern recognition of digital images is based on the gray values of the image pixels, and the number of pixels in an image is large: common resolutions are 1280 × 960, 640 × 480, 320 × 240 and 160 × 120 pixels. If all pixels were used as input to the pattern classifier, classification would be very inefficient. Feature extraction is therefore also a very important research topic for pattern classification of images. In the feature extraction of an image, each pixel is regarded as a feature variable, and the pixels most useful for pattern classification are selected as the input of the target classifier.
Correlation analysis is the basis for selecting a feature set with a large amount of information: feature variables can be selected according to the values of their correlation with the class variables.
At present there are many statistical methods for analyzing correlation. The simplest is the correlation coefficient method, but it is only suitable for linear correlation problems, while many practical problems involve nonlinear relations. The commonly used nonlinear statistical analysis method is logistic regression, which requires the feature variables to be mutually independent, a condition many practical problems cannot satisfy. More importantly, the regression coefficient of logistic regression does not directly reflect the correlation between a feature variable and the class variable; it must be judged through the odds ratio (OR), which has no direct physical meaning. Principal component analysis and factor analysis can also be used for correlation analysis, but both can only analyze linear relations between variables and cannot measure arbitrary correlations between them.
The entropy-based mutual information method can analyze the correlation between numerical variables (discrete and continuous) and measure arbitrary correlations between variables. Mutual information is one of the core concepts of entropy theory and an important measure of the adaptability of nonlinear complex systems; in essence it captures the information transfer between things and the statistical dependence between random variables, and it has been applied in many fields, particularly pattern recognition.
Compared with traditional methods, entropy-based mutual information has mainly the following advantages:
1) it can measure both the linear correlation and the nonlinear correlation between variables;
2) unlike the logistic regression nonlinear analysis method, the entropy-based mutual information method imposes no independence requirement on the analyzed variables;
3) the entropy-based mutual information method can analyze not only the correlation between numerical variables (discrete and continuous), but also the correlation between ordinal variables and symbolic variables.
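As a quick illustration of the first advantage, the following minimal sketch (not part of the patent; the data and function names are illustrative) compares the Pearson correlation coefficient with an empirical mutual information estimate on a purely nonlinear relation:

```python
import numpy as np

def mutual_information(x, y):
    """Empirical mutual information (in nats) between two discrete variables."""
    mi = 0.0
    for a in np.unique(x):
        for b in np.unique(y):
            p_ab = np.mean((x == a) & (y == b))      # joint probability p(a, b)
            p_a, p_b = np.mean(x == a), np.mean(y == b)
            if p_ab > 0:
                mi += p_ab * np.log(p_ab / (p_a * p_b))
    return mi

# A purely nonlinear dependence: y = x**2 with x symmetric around 0.
x = np.array([-2, -1, 0, 1, 2] * 200)
y = x ** 2

print(np.corrcoef(x, y)[0, 1])   # ~0: the linear correlation sees nothing
print(mutual_information(x, y))  # > 0: mutual information detects the dependence
```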
Optimal feature selection would evaluate every feature combination once, which usually causes a combinatorial explosion, so researching effective feature extraction methods is very important. Many scholars have been engaged in this research, and several effective feature extraction methods have been proposed to mitigate the combinatorial problem. In these methods, however, the number of selected features is usually specified manually in advance, which introduces individual subjectivity and is therefore not a good truncation criterion.
Disclosure of Invention
One of the objectives of the present invention is to provide a feature extraction method in pattern recognition, which can effectively avoid the subjectivity of pre-specifying the number of selected features.
To achieve this object, the invention adopts the following technical scheme:
the feature extraction method in pattern recognition comprises the following steps:
determining discrete characteristic variables and class variables according to the original pattern information of the sample, and preprocessing the characteristic variables and the class variables;
setting a joint contribution threshold;
determining the joint contribution degree of the combination of the characteristic variables and the class variables;
and acquiring the combination of the characteristic variables of which the joint contribution degree is greater than or equal to the set joint contribution degree threshold.
In existing feature extraction methods, the number of selected features is generally specified manually in advance, which introduces individual subjectivity. To address this problem, the invention proposes a new definition of contribution based on mutual information and uses a specified joint contribution threshold, instead of a specified number of features, as the truncation criterion for feature extraction. According to the specified joint contribution threshold, the combinations of feature variables whose joint contribution is greater than or equal to the threshold are extracted, yielding a feature subspace that minimizes the loss of information; the subjectivity of conventional feature extraction is thus effectively avoided.
Another object of the present invention is to provide a feature extraction device in pattern recognition, which can effectively avoid subjectivity of pre-specifying the number of selected features.
To achieve this object, the adopted technical scheme is as follows:
the feature extraction device in pattern recognition includes:
the numerical value preprocessing module is used for determining discrete feature variables and class variables according to the original pattern information of the sample and preprocessing the feature variables and the class variables; determining the possible values of each feature variable, determining the possible values of the class variables, setting a feature subset, and initializing the feature subset to an empty set;
the threshold setting module is used for setting a joint contribution degree threshold;
the joint contribution degree determining module is used for determining the joint contribution degree of the feature subset and the class variable;
and the feature extraction module is used for acquiring a feature subset of which the joint contribution degree is greater than or equal to a set joint contribution degree threshold according to the joint contribution degree.
In existing feature extraction, the number of selected features is generally specified manually in advance, which introduces individual subjectivity. To address this problem, the invention proposes a new definition of contribution based on mutual information, and the joint contribution threshold preset by the setting module replaces a specified number of features as the truncation criterion for feature extraction. The joint contribution of the feature subset and the class variable is determined by the joint contribution determining module, and the feature extraction module extracts the feature subset whose joint contribution is greater than or equal to the preset threshold, yielding a feature subspace that minimizes the loss of information; the subjectivity of conventional feature extraction is thus effectively avoided.
Drawings
FIG. 1 is a flow chart of the feature extraction method in pattern recognition of the present invention;
FIG. 2 is a system block diagram of the feature extraction apparatus in pattern recognition of the present invention;
FIG. 3 is a schematic diagram of mutual information between each symptom and syndrome according to an embodiment of the present invention;
FIG. 4 is a graph showing the contribution of each symptom in the example of the present invention;
FIG. 5 is a graph illustrating the joint contribution of selected symptoms in an embodiment of the present invention.
Detailed Description
For a better understanding of the present invention, reference is made to the following detailed description taken in conjunction with the accompanying drawings.
Feature extraction selects the most important combination of features so as to minimize the loss of information; from a practical viewpoint, this saves a great deal of classification processing time.
The invention provides a feature extraction method and apparatus based on a new truncation criterion, aimed mainly at feature extraction for discrete variables. A new mutual-information-based form of joint contribution is defined, and a specified joint contribution threshold replaces a specified number of features as the truncation criterion; the combinations of feature variables whose joint contribution is greater than or equal to the set threshold are extracted, yielding a feature subspace that minimizes the loss of information, so the subjectivity of conventional feature extraction is effectively avoided.
A new mutual-information-based contribution is defined as follows.

Definition: let I(X_i; Y), i = 1, 2, …, n, denote the mutual information between each feature variable and the class variable, and let I(X; Y) denote the total joint mutual information. The mutual-information-based contribution of each feature variable is defined as:

r_i = I(X_i; Y) / I(X; Y), i = 1, 2, …, n

The joint contribution between a subset S of the feature variable set X and the class variable Y is:

r_s = I(S; Y) / I(X; Y)

Note: according to the properties of Shannon-entropy-based mutual information, the more feature variables there are, the larger their mutual information with the class variable; the contribution and the joint contribution therefore take values in [0, 1].
The specific procedure of feature extraction based on the joint contribution is as follows:
given a selected feature subset S, the algorithm selects from the feature set X the next feature variable to satisfy the new feature subset S ← { S, X ← n, X }, generated by the addition of this feature variable to SiThe mutual information between the } and the class variable is maximum. A feature variable is to be selected and the information provided by the feature variable should not be included in the selected feature subset S. For example, if two characteristic variables XiAnd XjIs highly correlated, then I: (Xi;Xj) The value of (a) is large, and when one of the variables is selected, the chance of the other variable being selected is greatly reduced.
The feature extraction method in pattern recognition of the invention comprises the following steps: determining discrete feature variables and class variables according to the original pattern information of the sample, and preprocessing them; setting a joint contribution threshold; determining the joint contribution of combinations of the feature variables with the class variables; and acquiring the combinations of feature variables whose joint contribution is greater than or equal to the set threshold.
Referring to FIG. 1, and taking the traditional Chinese medicine problem of treatment based on syndrome differentiation as an example, the feature extraction method in pattern recognition of the invention is applied to process the symptom information observed from the human body, with the following specific steps:
Step one: determine discrete feature variables and class variables according to the original pattern information of the sample, and preprocess the feature variables and the class variables; combine all feature variables into a feature variable set and determine the possible values of each feature variable; determine the possible values of the class variable; set a feature subset and initialize it to an empty set.
1022 clinical records of blood stasis were analyzed. The data record 71 human body symptoms, and the values corresponding to the symptoms constitute the original pattern information. All symptoms are represented by discrete feature variables: some symptoms (feature variables) have two values, represented by 0 and 1, and some have four values, represented by 0, 1, 2 and 3. The traditional Chinese medicine syndrome is represented by a class variable with five values, corresponding to five TCM syndromes: qi deficiency with blood stasis, qi stagnation with blood stasis, yang deficiency with blood stasis, phlegm with blood stasis, and blood stasis blocking the collaterals.
Step two: set a joint contribution threshold.
The threshold takes values in [0, 1]; the specific value is usually determined by practical requirements, and the larger the threshold, the more symptoms are extracted. Empirically, the threshold usually lies in the range [0.9, 0.98]. In this embodiment the joint contribution threshold is set to 0.95.
Step three: determine the joint contribution between combinations of symptoms and the syndrome, specifically comprising the following steps:
S300, determine the mutual information between each symptom and the syndrome;
S301, determine the symptom that maximizes the mutual information with the syndrome, remove it from the symptom set, and add it to the feature subset;
S302, determine the joint contribution of the feature subset and the syndrome.
In step S300, the mutual information between each symptom and the syndrome is computed by the mutual information formula derived below.

Let the set of n feature variables be X = {X_1, X_2, …, X_n}, with probability density functions p(x_1), p(x_2), …, p(x_n), where x^i ∈ {a_j^i}, j = 1, 2, …, m_i, denotes all possible values of the variable X_i (symptom). The class variable (syndrome) is denoted by Y with probability distribution p(y), and Y has k possible values, y ∈ {c_l}, l = 1, 2, …, k, meaning that all patterns are mapped into k classes. The joint probability density function of X_i and Y is p(x_i, y). The Shannon entropy of the feature variable X_i can be expressed as:

H(X_i) = -Σ_{j=1}^{m_i} p(a_j^i) log p(a_j^i)

The Shannon entropy of the class variable Y can be expressed as:

H(Y) = -Σ_{l=1}^{k} p(c_l) log p(c_l)

The joint entropy between the feature variable X_i and the class variable Y can be expressed as:

H(X_i, Y) = -Σ_{j=1}^{m_i} Σ_{l=1}^{k} p(a_j^i, c_l) log p(a_j^i, c_l)

where X_i may be replaced by a subset of the feature variable set X, i.e. the joint entropy generalizes to the case of n feature variables. The mutual information between the class variable Y and the feature variable X_i can then be expressed as:

I(X_i; Y) = H(X_i) + H(Y) - H(X_i, Y) = Σ_{j=1}^{m_i} Σ_{l=1}^{k} p(a_j^i, c_l) log [ p(a_j^i, c_l) / ( p(a_j^i) p(c_l) ) ]

where X_i may likewise be replaced by a subset of the feature variable set X.
The feature variables, the class variable and their joint probability distribution are obtained by a statistical method, as follows.

Let the set of n feature variables be X = {X_1, X_2, …, X_n}, where the variable X_i has m_i possible values, x^i ∈ {a_j^i}, j = 1, 2, …, m_i, and the class variable Y has k possible values, y ∈ {c_l}, l = 1, 2, …, k. Suppose there are N random samples T = {(x_i, y_i)} ⊂ A × C, where x_i = (x_i^1, x_i^2, …, x_i^n) ∈ A = {a_{j_1}^1} × {a_{j_2}^2} × … × {a_{j_n}^n}, j_i = 1, 2, …, m_i, i = 1, 2, …, n, and y_i ∈ C = {c_j}, j = 1, 2, …, k. Let N_j^i, j = 1, 2, …, m_i, denote the number of samples in which the feature variable X_i equals a_j^i; let N_l, l = 1, 2, …, k, denote the number of samples in which the class variable Y equals c_l; and let N_{jl}^i denote the number of samples in which X_i equals a_j^i while Y equals c_l.

The probabilities are then estimated statistically as p(a_j^i) = N_j^i / N, p(c_l) = N_l / N and p(a_j^i, c_l) = N_{jl}^i / N, for i = 1, 2, …, n; j = 1, 2, …, m_i; l = 1, 2, …, k. In the same way, the joint probability distribution between the feature subset S and the class variable Y is obtained.
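The count-based estimation can be sketched as follows for a single symptom X_i and the syndrome Y: the counts N_j^i, N_l and N_jl^i are accumulated and turned into the probability estimates above, from which I(X_i; Y) follows. The sample data here are random placeholders, not the clinical data of the embodiment:

```python
import numpy as np
from collections import Counter

def symptom_syndrome_mi(xi, y):
    """I(X_i; Y) from raw samples via the estimates
    p(a_j^i) = N_j^i/N, p(c_l) = N_l/N, p(a_j^i, c_l) = N_jl^i/N."""
    N = len(xi)
    N_j = Counter(xi)              # N_j^i: occurrences of each symptom value
    N_l = Counter(y)               # N_l:   occurrences of each syndrome value
    N_jl = Counter(zip(xi, y))     # N_jl^i: joint occurrences
    mi = 0.0
    for (a, c), n_jl in N_jl.items():
        p_jl = n_jl / N
        mi += p_jl * np.log(p_jl / ((N_j[a] / N) * (N_l[c] / N)))
    return mi

# Illustrative only: a 4-valued symptom and a 5-valued syndrome over 1022 samples.
rng = np.random.default_rng(0)
print(symptom_syndrome_mi(rng.integers(0, 4, 1022), rng.integers(0, 5, 1022)))
```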
The mutual information between each symptom and the syndrome is calculated as shown in FIG. 3.

Between step S300 and step S301 there is a further step: removing from the symptom set the symptoms whose mutual information with the syndrome is smaller than a preset value.

After the mutual information of each symptom with the syndrome is obtained from the mutual information formula, some symptoms turn out to carry very little mutual information. These symptoms can be ignored and feature extraction performed on the retained symptom set; this has little effect on correct classification and greatly reduces the time needed for feature extraction.
In step S302, the joint contribution of the feature subset and the syndrome is given by:

r_s = I(S; Y) / I(X; Y)

where r_s denotes the joint contribution; I(S; Y) denotes the joint mutual information of the feature subset S and the syndrome, computed by the generalized mutual information formula above; and I(X; Y) denotes the total joint mutual information.
The determination method regarding the overall joint mutual information is described below.
According to the definition of the contribution, the total joint mutual information between the symptom set and the syndrome must be calculated. With the conventional mutual information calculation the computational cost is enormous, and a combinatorial explosion occurs when there are many symptoms. For example, with 30 symptoms, each taking 4 values and mapped into 2 classes, about 1.15 × 10^18 combination values would have to be computed, which is infeasible in practice. Statistics show, however, that with a limited number of samples the probability of most combinations is 0, so the total joint mutual information can be computed from the samples without enumerating specific symptom combinations. The calculation method is described below.
Let B = (B_1, B_2, …, B_N)^T be a frequency vector representing the numbers of samples in which the values of the feature variables (symptoms) are all equal; its calculation is described below. Let D = (D_ij), i = 1, 2, …, N; j = 1, 2, …, k, be a frequency matrix representing the numbers of samples in which the values of the feature variables (symptoms) are all equal and the values of the class variable (syndrome) are also equal, and let E = (E_1, E_2, …, E_k)^T be a frequency vector representing the numbers of samples with equal class variable (syndrome) values. The algorithm is implemented by the following steps:
step S3031: assuming that the training sample T is known, initializing parameters: let all the element values of vector B be 1 and let all the element values of matrix D and vector E be 0.
Step S3032: the following procedure was used to obtain the frequencies used in calculating the probabilities.
Let i be 1, 2, …, N, j be i +1, i +2, …, N
If B is presentiIf 0, then the next cycle is executed;
otherwise
If y isi=clThen El=El+1,l=1,2,…,k;
If xi=xjThen Bi=Bi+1,Bj=0;
If xi=xjAnd yi=clThen Dil=Dil+1, l ═ 1, 2, …, k. Step S3033: computing total joint mutual information
Description of the drawings: when D is presentij×Bi×EjWhen equal to 0, log (D)ij/BiEj)=0。
With this algorithm the total joint mutual information I(X; Y) is easily calculated, and when the sample size is not large the amount of computation is greatly reduced. For example, when N = 2000, n = 30 and k = 2, a single pass of the above double loop over the samples suffices to calculate the joint probabilities, and the cost of the algorithm is independent of the number of feature variables (symptoms) and of the number of possible values of each feature variable (symptom).
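A sketch of steps S3031 to S3033 follows. Instead of transcribing the loop literally, it pools identical feature rows, which is exactly what the B, D and E frequencies achieve, and then evaluates the formula reconstructed above; treat it as an interpretation of the patent's algorithm under that reading, not a verbatim implementation:

```python
import numpy as np

def total_joint_mi(X, y):
    """Total joint mutual information I(X; Y) computed directly from samples:
    only feature-value combinations that actually occur are visited."""
    N = len(y)
    classes, y_codes = np.unique(y, return_inverse=True)
    k = len(classes)

    # B: number of samples equal to each distinct feature-value combination.
    _, row_codes, B = np.unique(X, axis=0, return_inverse=True, return_counts=True)

    # E: number of samples taking each class value.
    E = np.bincount(y_codes, minlength=k)

    # D: samples equal on all features AND equal on the class value.
    D = np.zeros((len(B), k))
    np.add.at(D, (row_codes, y_codes), 1)

    # I(X; Y) = sum (D/N) * log(N*D / (B*E)); terms with D = 0 contribute 0.
    mi = 0.0
    for i in range(len(B)):
        for l in range(k):
            if D[i, l] > 0:
                mi += (D[i, l] / N) * np.log(N * D[i, l] / (B[i] * E[l]))
    return mi
```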
The total joint mutual information between the 71 symptoms and syndromes in this embodiment is calculated to be 1.7342.
From the definition of the mutual-information-based contribution of each feature variable, the contribution of each symptom is easily calculated; the individual contributions of all symptoms are shown in FIG. 4.
Step four: acquire the combination of symptoms whose joint contribution is greater than or equal to the set joint contribution threshold, specifically comprising the steps of:

comparing the determined joint contribution with the set joint contribution threshold;

if the determined joint contribution is greater than or equal to the set joint contribution threshold, acquiring the feature subset;

if the determined joint contribution is smaller than the set joint contribution threshold, determining, among the combinations of each symptom in the symptom set with the feature subset, the symptom that maximizes the mutual information between the combination and the syndrome, removing that symptom from the symptom set, and adding it to the feature subset; then returning to step three. A sketch of this loop follows.
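A minimal sketch of the greedy loop, reusing total_joint_mi from the previous sketch; the stopping threshold of 0.95 matches this embodiment, and all other names are illustrative:

```python
def extract_features(X, y, threshold=0.95):
    """Greedy feature extraction truncated by the joint contribution threshold.
    Returns the column indices of the selected feature subset S."""
    total = total_joint_mi(X, y)          # I(X; Y), sketch above
    remaining = list(range(X.shape[1]))
    S = []                                # feature subset, initialized empty
    while remaining:
        # Add the feature whose inclusion maximizes I({S, X_i}; Y).
        best = max(remaining, key=lambda i: total_joint_mi(X[:, S + [i]], y))
        S.append(best)
        remaining.remove(best)
        if total_joint_mi(X[:, S], y) / total >= threshold:   # r_s test
            break
    return S
```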
By feature extraction, 9 symptoms were selected, with a joint contribution of 0.9711; the results are shown in FIG. 5. The selected symptoms are, in order: urgent noise, irritability, hemianesthesia, chest distress, insomnia, fatigue, weakness, occupation, tongue varicosity, dark purple tongue and black complexion. This means that the joint contribution of these 9 symptoms is the largest, i.e. they carry the most information for differentiating the five syndromes.
To verify that the selected symptom combination carries the largest amount of information, an effective test is to differentiate the syndromes using these symptoms. A multi-class support vector machine was chosen for classification, set up as follows: penalty parameter C = 20, radial basis kernel function, and kernel width σ² = 0.1. 863 samples were used as training samples and the remaining 159 samples as test samples. When all symptoms were used as input to the support vector machine, 107 test samples were correctly classified after training, a classification accuracy of 0.6729. When the feature-extracted symptom combination was used as input, 123 test samples were correctly classified, a classification accuracy of 0.7736. The accuracy is higher than with all symptoms as input because the full symptom set contains noise, which feature extraction reduces; the feature-extracted symptom combination is therefore the combination with the largest amount of information.
In this feature extraction example, using the conventional mutual information calculation would lead to a combinatorial explosion and be infeasible in practice, whereas with the fast algorithm for mutual information of discrete variables above, the feature extraction can be completed in about 2 hours.
The invention also discloses an embodiment for real-time recognition of the digits printed on an integrated circuit (IC) card.
This embodiment realizes rapid recognition of the card number printed on a produced IC card, to check whether the printed card number matches the entered card number. Each card is printed with 32 digits, a combination of the Arabic numerals 0-9.
First, the digits printed on the IC card are captured by an image acquisition card to generate a digital image; second, the printed digits are segmented into 32 digit regions by an image processing method, each digit region measuring 8 × 10 pixels; then each digit region is recognized to determine the corresponding digit. Six such IC cards must be processed per second.
The feature extraction method in pattern recognition of the invention is applied to extract features from each digit region, comprising the following steps:

S01, determine discrete feature variables and class variables according to the original pattern information of the sample, and preprocess them; combine all feature variables into a feature variable set and determine the possible values of each feature variable; determine the possible values of the class variable; set a feature subset and initialize it to an empty set.

Here, the original pattern information is the gray values of the pixels in the digital image on the IC card; the feature variables are the pixels of the digital image; and the class variable is the digit value. Each feature variable (pixel) has 2 gray values, 0 and 1, and the feature variable set comprises the 80 pixels. A digit region falls into one of 10 classes, the digits 0-9.
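As a sketch of the preprocessing in step S01, each binarized 8 × 10 digit region can be flattened into 80 binary feature variables, with the digit value (0-9) as the class variable. The binarization threshold of 128 is an assumption made for illustration; the patent only states that each pixel takes the two gray values 0 and 1:

```python
import numpy as np

def digit_region_to_features(region_gray):
    """Flatten an 8x10 digit region into 80 binary feature variables.
    `region_gray` is an (8, 10) array of gray values; the threshold 128
    is illustrative, not specified by the patent."""
    binary = (region_gray >= 128).astype(np.uint8)   # 2 gray values: 0 and 1
    return binary.reshape(-1)                        # 80 feature variables

# Building the sample matrix (regions and labels are assumed to exist):
# X = np.stack([digit_region_to_features(r) for r in regions])
# y = np.array(labels)   # class variable: the digit value 0-9
```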
S02, set a joint contribution threshold. The joint contribution threshold in this embodiment is set to 0.95.

S03, determine the joint contribution between combinations of pixels and the digit, specifically comprising the following steps:

S031, determine the mutual information between each pixel and the digit;

S032, determine the pixel that maximizes the mutual information with the digit, remove it from the pixel set, and add it to the feature subset;

S033, determine the joint contribution between the feature subset and the digit.
In step S031, the mutual information between each pixel and the digit is computed by the same mutual information formula given in the previous embodiment.
Between step S031 and step S032 there is a further step: removing from the pixel set the pixels whose mutual information with the digit is smaller than a preset value.

After the mutual information of each pixel with the digit is obtained from the mutual information formula, some pixels turn out to carry very little mutual information. These pixels can be ignored and feature extraction performed on the retained pixel set; this has little effect on correct classification and greatly reduces the time needed for feature extraction.
In step S033, the joint contribution between the feature subset and the digit is determined by the formula:

r_s = I(S; Y) / I(X; Y)

where r_s denotes the joint contribution, I(S; Y) the joint mutual information, and I(X; Y) the total joint mutual information.

S04, acquire the combination of pixels whose joint contribution is greater than or equal to the set joint contribution threshold, specifically comprising the steps of:

comparing the determined joint contribution with the set joint contribution threshold;

if the determined joint contribution is greater than or equal to the set joint contribution threshold, acquiring the feature subset;

if the determined joint contribution is smaller than the set joint contribution threshold, determining, among the combinations of each pixel in the pixel set with the feature subset, the pixel that maximizes the mutual information between the combination and the digit, removing that pixel from the pixel set, adding it to the feature subset, and then returning to step S033.
With this feature extraction method, only 21 pixels are needed to achieve the desired recognition performance, which greatly improves the efficiency of recognizing the card number printed on the IC card.
As shown in FIG. 2, the present invention also provides a feature extraction device in pattern recognition, including:
the numerical value preprocessing module 10, used for determining discrete feature variables and class variables according to the original pattern information of the sample and preprocessing the feature variables and the class variables;
a threshold setting module 20, configured to set a joint contribution threshold;
a joint contribution degree determining module 30, configured to determine a joint contribution degree between the feature subset set by the numerical preprocessing module and the class variable;
and the feature extraction module 40 is configured to obtain a feature subset of which the joint contribution degree is greater than or equal to the set joint contribution degree threshold according to the joint contribution degree.
Wherein the joint contribution degree determining module 30 includes:
a mutual information determining unit 301, configured to determine mutual information between each feature variable and each class variable;
a maximum value determining unit 303, configured to determine, according to the mutual information, the feature variable that maximizes the mutual information with the class variable, remove it from the feature variable set, and add it to the feature subset; and, for the combination of each feature variable of the feature variable set with the feature subset, to determine the feature variable that maximizes the mutual information between the combination and the class variable, remove it from the feature variable set, and add it to the feature subset;
a joint contribution degree determining unit 304, configured to determine a joint contribution degree of the feature subset and the class variable.
To save feature extraction time, a filter unit 302 is further arranged between the mutual information determining unit and the maximum value determining unit, for removing from the feature variable set the feature variables whose mutual information with the class variable is smaller than a preset value. After the mutual information of each symptom with the syndrome is obtained from the mutual information formula, some symptoms carry very little mutual information; these can be ignored and feature extraction performed on the retained symptom set, with little effect on correct classification, so the time for feature extraction is greatly reduced.
The feature extraction module 40 includes:
a comparing unit 401, configured to compare the determined joint contribution degree with a set joint contribution degree threshold;
an extracting unit 402, configured to extract a feature subset with a joint contribution degree greater than or equal to the set joint contribution degree threshold.
If the joint contribution determined by the comparing unit 401 is greater than or equal to the set joint contribution threshold, the extracting unit 402 extracts the feature subset. If the joint contribution determined by the comparing unit 401 is smaller than the set joint contribution threshold, the mutual information determining unit 301 determines the mutual information between the class variable and the combination of each feature variable of the feature variable set with the feature subset, and the maximum value determining unit 303 determines the feature variable that maximizes this mutual information, removes it from the feature variable set, and adds it to the feature subset; the joint contribution of the feature subset is then determined by the joint contribution determining unit 304.
The joint contribution threshold set by the threshold setting module generally lies in the range [0.9, 0.98].
The feature extraction method and apparatus in pattern recognition of the invention are mainly aimed at feature extraction for discrete variables. They define a new form of joint contribution; feature extraction based on the joint contribution effectively avoids the subjectivity of pre-specifying the number of selected features found in conventional feature extraction methods, improves extraction speed, and can be widely applied to feature extraction from discrete digital image information, fingerprint information, faceprint information, voice information, handwritten/printed character information, and the like.
Claims (8)
1. A feature extraction method in pattern recognition is characterized by comprising the following steps:
determining discrete characteristic variables and class variables according to original pattern information of the sample, combining all the characteristic variables into a characteristic variable set, and determining possible values of each characteristic variable; determining possible values of the class variables; setting a feature subset, and initializing the feature subset into an empty set;
setting a joint contribution threshold;
determining the joint contribution degree of the feature subset and the class variable;
acquiring a feature subset of which the joint contribution degree is greater than or equal to a set joint contribution degree threshold;
the step of determining the joint contribution degree of the feature subset and the class variable comprises the following steps:
a. determining mutual information between each characteristic variable and each class variable;
b. determining a characteristic variable which enables mutual information between the characteristic variable and the class variable to be maximum, removing the characteristic variable from the characteristic variable set, and adding the characteristic variable into the characteristic subset;
c. determining the joint contribution degree of the feature subset and the class variable;
wherein the joint contribution degree r_s of the feature subset and the class variable is determined as:

r_s = I(S; Y) / I(X; Y)

wherein I(S; Y) represents the joint mutual information of the feature subset and the class variable, and I(X; Y) represents the joint mutual information of all characteristic variables and the class variables.
2. The method of extracting features in pattern recognition according to claim 1,
the original pattern information is the values corresponding to human body symptoms, the characteristic variables are the human body symptoms, and the class variable is the disease type of the patient; or,
the original pattern information is the gray values of pixels in a digital image on the surface of the integrated circuit card, the characteristic variables are the pixels of the digital image, and the class variable is the digit value.
3. The method of extracting features in pattern recognition according to claim 1, further comprising, between step a and step b, the step of: removing from the characteristic variable set the characteristic variables whose mutual information with the class variables is less than a preset value.
4. The method for extracting features in pattern recognition according to claim 3, wherein the joint mutual information of all characteristic variables and class variables is obtained by sample calculation, and the specific process is as follows:

step 1:

using a frequency vector B = (B_1, B_2, …, B_N)^T to represent the numbers of samples in which the values of the characteristic variables are all equal, where N represents the total number of samples;

using a frequency matrix D = (D_ij), i = 1, 2, …, N; j = 1, 2, …, k, to represent the numbers of samples in which the values of the characteristic variables are all equal and the values of the class variables are also equal, where k represents the number of values of the class variables;

using a frequency vector E = (E_1, E_2, …, E_k)^T to represent the numbers of samples with equal class variable values;

step 2:

initializing parameters: setting all element values of the vector B to 1, and setting all element values of the matrix D and the vector E to 0;

step 3:

obtaining the frequencies used in calculating the probabilities:

letting i = 1, 2, …, N and j = i+1, i+2, …, N, where y_i represents the class variable value of the i-th sample, x_i represents the feature vector value of the i-th sample, and c_l represents the l-th value of the class variable:

if B_i = 0, executing the next i loop;

otherwise:

if y_i = c_l, then E_l = E_l + 1, l = 1, 2, …, k;

if x_i = x_j, then B_i = B_i + 1 and B_j = 0;

if x_i = x_j and y_i = c_l, then D_il = D_il + 1, l = 1, 2, …, k;

step 4:

calculating the joint mutual information of all the characteristic variables and the class variables:

I(X; Y) = Σ_{i=1}^{N} Σ_{j=1}^{k} (D_ij / N) log [ N · D_ij / (B_i · E_j) ]

wherein, when D_ij × B_i × E_j = 0, the corresponding log term is set to 0.
5. The method according to claim 1, wherein the step of obtaining the combination of the feature variables whose joint contribution degree is greater than or equal to the set joint contribution degree threshold value includes:
comparing the determined joint contribution degree with the set joint contribution degree threshold,
if the determined joint contribution degree is larger than or equal to the set joint contribution degree threshold value, acquiring the feature subset;
and if the determined joint contribution degree is smaller than the set joint contribution degree threshold, determining, among the combinations of each characteristic variable of the characteristic variable set with the characteristic subset, the characteristic variable that maximizes the mutual information between the combination and the class variable, removing the characteristic variable from the characteristic variable set, and adding it to the characteristic subset.
6. A feature extraction device in pattern recognition, characterized by comprising:
the numerical value preprocessing module is used for determining discrete characteristic variables and class variables according to original pattern information of the sample, combining all the characteristic variables into a characteristic variable set and determining the possible values of each characteristic variable; determining possible values of the class variables; setting a feature subset, and initializing the feature subset into an empty set;
the threshold setting module is used for setting a joint contribution degree threshold;
the joint contribution degree determining module is used for determining the joint contribution degree of the feature subset and the class variable;
the characteristic extraction module is used for acquiring a characteristic subset of which the joint contribution degree is greater than or equal to a set joint contribution degree threshold value according to the joint contribution degree;
wherein the joint contribution degree determination module comprises:
the mutual information determining unit is used for determining the mutual information between each characteristic variable and each class variable;
a maximum value determining unit, configured to determine, according to the mutual information, a feature variable that maximizes the mutual information between the feature variable and the class variable, remove the feature variable from the feature variable set, and add the feature variable to the subset of the feature variable set;
the joint contribution degree determining unit is used for determining the joint contribution degree of the feature subset and the class variable, wherein the joint contribution degree r_s of the feature subset and the class variable is determined as:

r_s = I(S; Y) / I(X; Y)

wherein I(S; Y) represents the joint mutual information of the feature subset and the class variable, and I(X; Y) represents the joint mutual information of all characteristic variables and the class variables.
7. The apparatus according to claim 6, wherein a filter unit is further provided between the mutual information determining unit and the maximum value determining unit, for removing from the feature variable set the feature variables whose mutual information with the class variable is smaller than a predetermined value.
8. The apparatus of claim 6, wherein the feature extraction module comprises:
a comparison unit for comparing the determined joint contribution degree with a set joint contribution degree threshold;
and the extracting unit is used for extracting the feature subset of which the joint contribution degree is greater than or equal to the set joint contribution degree threshold.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN200710118156XA CN101334843B (en) | 2007-06-29 | 2007-06-29 | Pattern recognition characteristic extraction method and apparatus |
Publications (2)
Publication Number | Publication Date |
---|---|
CN101334843A CN101334843A (en) | 2008-12-31 |
CN101334843B true CN101334843B (en) | 2010-08-25 |
Family
ID=40197432
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN200710118156XA Expired - Fee Related CN101334843B (en) | 2007-06-29 | 2007-06-29 | Pattern recognition characteristic extraction method and apparatus |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN101334843B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105574351B (en) * | 2015-12-31 | 2017-02-15 | 北京千安哲信息技术有限公司 | Medical data processing method |
CN112559591B (en) * | 2020-12-08 | 2023-06-13 | 晋中学院 | Outlier detection system and detection method for cold roll manufacturing process |
CN113780481B (en) * | 2021-11-11 | 2022-04-08 | 中国南方电网有限责任公司超高压输电公司广州局 | Monitoring method and device for power equipment, computer equipment and storage medium |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1617161A (en) * | 2003-11-10 | 2005-05-18 | 北京握奇数据系统有限公司 | Finger print characteristic matching method based on inter information |
CN1631321A (en) * | 2003-12-23 | 2005-06-29 | 中国科学院自动化研究所 | Multiple modality medical image registration method based on mutual information sensitive range |
Non-Patent Citations (1)
Title |
---|
Sun Zhancheng, Xi Guangcheng, Yi Jianqiang, Li Haixia. An intelligent system model for TCM syndrome differentiation. Journal of System Simulation, 2007, 19(10): 2318-2320, 2391. |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| C06 | Publication | |
| PB01 | Publication | |
| C10 | Entry into substantive examination | |
| SE01 | Entry into force of request for substantive examination | |
| C14 | Grant of patent or utility model | |
| GR01 | Patent grant | |
| CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20100825; Termination date: 20170629 |