CN101334843B - Pattern recognition characteristic extraction method and apparatus
- Publication number: CN101334843B
- Authority: CN (China)
- Legal status: Expired - Fee Related
Abstract
The invention discloses a method and an apparatus for extracting features in pattern recognition that effectively avoid the subjectivity of manually pre-specifying the number of features, as required by previous feature extraction approaches. The feature extraction method comprises the steps of: determining discrete feature variables and class variables according to the original pattern information of a sample and preprocessing them; setting a joint contribution threshold; determining the joint contribution of combinations of the feature variables with the class variables; and obtaining the combinations of feature variables whose joint contribution is greater than or equal to the set threshold. The feature extraction apparatus comprises a numerical preprocessing module, a threshold setting module, a joint contribution determining module and a feature extraction module. The method and apparatus can be widely applied to feature extraction from discrete digital image information, fingerprint information, faceprint information, voice information, handwritten/printed character information, and the like.
Description
Technical Field
The present invention relates to the field of pattern recognition, and in particular, to a method and an apparatus for extracting features in pattern recognition.
Background
A pattern is information with a temporal and spatial distribution obtained by observing a specific individual thing; the class to which a pattern belongs, or the totality of patterns within the same class, is referred to as the pattern class (or simply the class). Pattern recognition classifies patterns to be recognized into their respective pattern classes based on some measure or observation.
The study of pattern recognition has focused mainly on two aspects: how organisms (including humans) perceive objects, and how pattern recognition can be implemented by a computer for a given task.
A computer pattern recognition system basically consists of three interrelated but distinct processes: data generation, pattern analysis and pattern classification. Data generation quantizes the original information of an input pattern and converts it into a vector that a computer can easily process. Pattern analysis processes the data, including feature selection, feature extraction, data dimension compression, and determination of the possible categories. Pattern classification trains the computer with the information obtained by pattern analysis, so as to formulate a discrimination standard for classifying the patterns to be recognized.
Feature extraction in pattern analysis is very important for efficient pattern classification. Pattern classification touches many fields, such as image classification, speech recognition, biotechnology and medicine. Classification efficiency has always been an important topic of pattern classification research: in many practical problems there are very many candidate feature variables, and if all available feature variables were taken into account, classification would be far too inefficient for practical use. It is therefore necessary to extract feature variables, use the feature subsets obtained by feature extraction as the input of the target classifier, train the classifier, and classify with the feature subsets, thereby improving classification efficiency.
Feature extraction searches for a feature subspace that minimizes the loss of information, where the amount of information is measured by the mutual information between the feature subspace and the class variables; the feature extraction method considers not only the correlation between the feature variables and the class variables, but also the correlation among the feature variables themselves.
Feature extraction can be applied to traditional Chinese medicine (TCM). Syndrome differentiation is the core of TCM: it is a method of understanding and diagnosing diseases using TCM theory, where a syndrome is a complex of symptoms of unknown etiology and a characterization of an abnormal state of the organism. Generalized symptoms include not only the information from the four diagnostic methods, but also factors such as sex, constitution, mood, stress, diet and living habits. During syndrome differentiation it is difficult for the physician to take all observed symptoms into account, because there are too many symptoms and signs. Different symptoms and signs play different roles in syndrome differentiation, and finding the symptom-and-sign set with the largest amount of information to serve as the differentiation standard for a given syndrome is a very important problem in the TCM field.
Feature extraction applies equally to pattern recognition of digital images. Pattern recognition of digital images is based on the gray values of the image pixels, and the number of pixels in an image is large: common resolutions are 1280 × 960, 640 × 480, 320 × 240 and 160 × 120 pixels. If all pixels were used as input to the pattern classifier, classification would be very inefficient. Feature extraction is therefore also a very important research topic for pattern classification of images. In the feature extraction of an image, each pixel is regarded as a feature variable, and the pixels most useful for pattern classification are selected as the input of the target classifier.
Correlation analysis is the basis for selecting a feature set with a large amount of information: feature variables can be selected according to the values of their correlation with the class variables.
At present there are many statistical methods for analyzing correlation. The simplest is the correlation coefficient method, but it is only suitable for linear correlation problems, while many practical problems involve nonlinear relations. The commonly used nonlinear statistical analysis method is logistic regression, which requires the feature variables to be mutually independent, a condition many practical problems cannot satisfy. More importantly, the regression coefficient of logistic regression does not directly reflect the correlation between a feature variable and the class variable; it must be judged through the odds ratio (OR), which has no direct physical meaning. Principal component analysis and factor analysis can also be used for correlation analysis, but both can only analyze linear relations between variables and cannot measure arbitrary correlations between them.
The entropy-based mutual information method can analyze the correlation between numerical variables (discrete and continuous) and measure arbitrary correlations between variables. Mutual information is one of the core concepts of entropy theory and an important measure of the adaptability of nonlinear complex systems; in essence it captures the information transfer between things and the statistical dependence between random variables, and it has been applied in many fields, particularly pattern recognition.
Compared with traditional methods, entropy-based mutual information has mainly the following advantages:
1) it can measure both the linear correlation and the nonlinear correlation between variables;
2) unlike the logistic regression nonlinear analysis method, the entropy-based mutual information method imposes no independence requirement on the analyzed variables;
3) the entropy-based mutual information method can analyze not only the correlation between numerical variables (discrete and continuous), but also the correlation between ordinal variables and symbolic variables.
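As a quick illustration of the first advantage, the following minimal sketch (not part of the patent; the data and function names are illustrative) compares the Pearson correlation coefficient with an empirical mutual information estimate on a purely nonlinear relation:

```python
import numpy as np

def mutual_information(x, y):
    """Empirical mutual information (in nats) between two discrete variables."""
    mi = 0.0
    for a in np.unique(x):
        for b in np.unique(y):
            p_ab = np.mean((x == a) & (y == b))      # joint probability p(a, b)
            p_a, p_b = np.mean(x == a), np.mean(y == b)
            if p_ab > 0:
                mi += p_ab * np.log(p_ab / (p_a * p_b))
    return mi

# A purely nonlinear dependence: y = x**2 with x symmetric around 0.
x = np.array([-2, -1, 0, 1, 2] * 200)
y = x ** 2

print(np.corrcoef(x, y)[0, 1])   # ~0: the linear correlation sees nothing
print(mutual_information(x, y))  # > 0: mutual information detects the dependence
```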
Optimal feature selection would evaluate every feature combination once, which usually causes a combinatorial explosion, so researching effective feature extraction methods is very important. Many scholars have been engaged in this research, and several effective feature extraction methods have been proposed to mitigate the combinatorial problem. In these methods, however, the number of selected features is usually specified manually in advance, which introduces individual subjectivity and is therefore not a good truncation criterion.
Disclosure of Invention
One of the objectives of the present invention is to provide a feature extraction method in pattern recognition, which can effectively avoid the subjectivity of pre-specifying the number of selected features.
To achieve this object, the invention adopts the following technical scheme:
the feature extraction method in pattern recognition comprises the following steps:
determining discrete characteristic variables and class variables according to the original pattern information of the sample, and preprocessing the characteristic variables and the class variables;
setting a joint contribution threshold;
determining the joint contribution degree of the combination of the characteristic variables and the class variables;
and acquiring the combination of the characteristic variables of which the joint contribution degree is greater than or equal to the set joint contribution degree threshold.
In existing feature extraction methods, the number of selected features is generally specified manually in advance, which introduces individual subjectivity. To address this problem, the invention proposes a new definition of contribution based on mutual information and uses a specified joint contribution threshold, instead of a specified number of features, as the truncation criterion for feature extraction. According to the specified joint contribution threshold, the combinations of feature variables whose joint contribution is greater than or equal to the threshold are extracted, yielding a feature subspace that minimizes the loss of information; the subjectivity of conventional feature extraction is thus effectively avoided.
Another object of the present invention is to provide a feature extraction device in pattern recognition, which can effectively avoid subjectivity of pre-specifying the number of selected features.
To achieve this object, the adopted technical scheme is as follows:
the feature extraction device in pattern recognition includes:
the numerical value preprocessing module is used for determining discrete feature variables and class variables according to the original pattern information of the sample and preprocessing the feature variables and the class variables; determining the possible values of each feature variable, determining the possible values of the class variables, setting a feature subset, and initializing the feature subset to an empty set;
the threshold setting module is used for setting a joint contribution degree threshold;
the joint contribution degree determining module is used for determining the joint contribution degree of the feature subset and the class variable;
and the feature extraction module is used for acquiring a feature subset of which the joint contribution degree is greater than or equal to a set joint contribution degree threshold according to the joint contribution degree.
In existing feature extraction, the number of selected features is generally specified manually in advance, which introduces individual subjectivity. To address this problem, the invention proposes a new definition of contribution based on mutual information, and the joint contribution threshold preset by the setting module replaces a specified number of features as the truncation criterion for feature extraction. The joint contribution of the feature subset and the class variable is determined by the joint contribution determining module, and the feature extraction module extracts the feature subset whose joint contribution is greater than or equal to the preset threshold, yielding a feature subspace that minimizes the loss of information; the subjectivity of conventional feature extraction is thus effectively avoided.
Drawings
FIG. 1 is a flow chart of the feature extraction method in pattern recognition of the present invention;
FIG. 2 is a system block diagram of the feature extraction apparatus in pattern recognition of the present invention;
FIG. 3 is a schematic diagram of mutual information between each symptom and syndrome according to an embodiment of the present invention;
FIG. 4 is a graph showing the contribution of each symptom in the example of the present invention;
FIG. 5 is a graph illustrating the joint contribution of selected symptoms in an embodiment of the present invention.
Detailed Description
For a better understanding of the present invention, reference is made to the following detailed description taken in conjunction with the accompanying drawings.
Feature extraction selects the most important combination of features so as to minimize the loss of information; from a practical viewpoint, this saves a great deal of classification processing time.
The invention provides a feature extraction method and apparatus based on a new truncation criterion, aimed mainly at feature extraction for discrete variables. A new mutual-information-based form of joint contribution is defined, and a specified joint contribution threshold replaces a specified number of features as the truncation criterion; the combinations of feature variables whose joint contribution is greater than or equal to the set threshold are extracted, yielding a feature subspace that minimizes the loss of information, so the subjectivity of conventional feature extraction is effectively avoided.
A new mutual-information-based contribution is defined as follows.

Definition: let I(X_i; Y), i = 1, 2, …, n, denote the mutual information between each feature variable and the class variable, and let I(X; Y) denote the total joint mutual information. The mutual-information-based contribution of each feature variable is defined as:

r_i = I(X_i; Y) / I(X; Y), i = 1, 2, …, n

The joint contribution between a subset S of the feature variable set X and the class variable Y is:

r_s = I(S; Y) / I(X; Y)

Note: according to the properties of Shannon-entropy-based mutual information, the more feature variables there are, the larger their mutual information with the class variable; the contribution and the joint contribution therefore take values in [0, 1].
The specific procedure of feature extraction based on the joint contribution is as follows:
given a selected feature subset S, the algorithm selects from the feature set X the next feature variable to satisfy the new feature subset S ← { S, X ← n, X }, generated by the addition of this feature variable to SiThe mutual information between the } and the class variable is maximum. A feature variable is to be selected and the information provided by the feature variable should not be included in the selected feature subset S. For example, if two characteristic variables XiAnd XjIs highly correlated, then I: (Xi;Xj) The value of (a) is large, and when one of the variables is selected, the chance of the other variable being selected is greatly reduced.
The feature extraction method in pattern recognition of the invention comprises the following steps: determining discrete feature variables and class variables according to the original pattern information of the sample, and preprocessing them; setting a joint contribution threshold; determining the joint contribution of combinations of the feature variables with the class variables; and acquiring the combinations of feature variables whose joint contribution is greater than or equal to the set threshold.
Referring to FIG. 1, and taking the traditional Chinese medicine problem of treatment based on syndrome differentiation as an example, the feature extraction method in pattern recognition of the invention is applied to process the symptom information observed from the human body, with the following specific steps:
Step one: determine discrete feature variables and class variables according to the original pattern information of the sample, and preprocess the feature variables and the class variables; combine all feature variables into a feature variable set and determine the possible values of each feature variable; determine the possible values of the class variable; set a feature subset and initialize it to an empty set.
1022 clinical records of blood stasis were analyzed. The data record 71 human body symptoms, and the values corresponding to the symptoms constitute the original pattern information. All symptoms are represented by discrete feature variables: some symptoms (feature variables) have two values, represented by 0 and 1, and some have four values, represented by 0, 1, 2 and 3. The traditional Chinese medicine syndrome is represented by a class variable with five values, corresponding to five TCM syndromes: qi deficiency with blood stasis, qi stagnation with blood stasis, yang deficiency with blood stasis, phlegm with blood stasis, and blood stasis blocking the collaterals.
Step two: set a joint contribution threshold.
The threshold takes values in [0, 1]; the specific value is usually determined by practical requirements, and the larger the threshold, the more symptoms are extracted. Empirically, the threshold usually lies in the range [0.9, 0.98]. In this embodiment the joint contribution threshold is set to 0.95.
Step three: determine the joint contribution between combinations of symptoms and the syndrome, specifically comprising the following steps:
S300, determine the mutual information between each symptom and the syndrome;
S301, determine the symptom that maximizes the mutual information with the syndrome, remove it from the symptom set, and add it to the feature subset;
S302, determine the joint contribution of the feature subset and the syndrome.
In step S300, the mutual information between each symptom and the syndrome is computed by the mutual information formula derived below.

Let the set of n feature variables be X = {X_1, X_2, …, X_n}, with probability density functions p(x_1), p(x_2), …, p(x_n), where x^i ∈ {a_j^i}, j = 1, 2, …, m_i, denotes all possible values of the variable X_i (symptom). The class variable (syndrome) is denoted by Y with probability distribution p(y), and Y has k possible values, y ∈ {c_l}, l = 1, 2, …, k, meaning that all patterns are mapped into k classes. The joint probability density function of X_i and Y is p(x_i, y). The Shannon entropy of the feature variable X_i can be expressed as:

H(X_i) = -Σ_{j=1}^{m_i} p(a_j^i) log p(a_j^i)

The Shannon entropy of the class variable Y can be expressed as:

H(Y) = -Σ_{l=1}^{k} p(c_l) log p(c_l)

The joint entropy between the feature variable X_i and the class variable Y can be expressed as:

H(X_i, Y) = -Σ_{j=1}^{m_i} Σ_{l=1}^{k} p(a_j^i, c_l) log p(a_j^i, c_l)

where X_i may be replaced by a subset of the feature variable set X, i.e. the joint entropy generalizes to the case of n feature variables. The mutual information between the class variable Y and the feature variable X_i can then be expressed as:

I(X_i; Y) = H(X_i) + H(Y) - H(X_i, Y) = Σ_{j=1}^{m_i} Σ_{l=1}^{k} p(a_j^i, c_l) log [ p(a_j^i, c_l) / ( p(a_j^i) p(c_l) ) ]

where X_i may likewise be replaced by a subset of the feature variable set X.
The feature variables, the class variable and their joint probability distribution are obtained by a statistical method, as follows.

Let the set of n feature variables be X = {X_1, X_2, …, X_n}, where the variable X_i has m_i possible values, x^i ∈ {a_j^i}, j = 1, 2, …, m_i, and the class variable Y has k possible values, y ∈ {c_l}, l = 1, 2, …, k. Suppose there are N random samples T = {(x_i, y_i)} ⊂ A × C, where x_i = (x_i^1, x_i^2, …, x_i^n) ∈ A = {a_{j_1}^1} × {a_{j_2}^2} × … × {a_{j_n}^n}, j_i = 1, 2, …, m_i, i = 1, 2, …, n, and y_i ∈ C = {c_j}, j = 1, 2, …, k. Let N_j^i, j = 1, 2, …, m_i, denote the number of samples in which the feature variable X_i equals a_j^i; let N_l, l = 1, 2, …, k, denote the number of samples in which the class variable Y equals c_l; and let N_{jl}^i denote the number of samples in which X_i equals a_j^i while Y equals c_l.

The probabilities are then estimated statistically as p(a_j^i) = N_j^i / N, p(c_l) = N_l / N and p(a_j^i, c_l) = N_{jl}^i / N, for i = 1, 2, …, n; j = 1, 2, …, m_i; l = 1, 2, …, k. In the same way, the joint probability distribution between the feature subset S and the class variable Y is obtained.
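The count-based estimation can be sketched as follows for a single symptom X_i and the syndrome Y: the counts N_j^i, N_l and N_jl^i are accumulated and turned into the probability estimates above, from which I(X_i; Y) follows. The sample data here are random placeholders, not the clinical data of the embodiment:

```python
import numpy as np
from collections import Counter

def symptom_syndrome_mi(xi, y):
    """I(X_i; Y) from raw samples via the estimates
    p(a_j^i) = N_j^i/N, p(c_l) = N_l/N, p(a_j^i, c_l) = N_jl^i/N."""
    N = len(xi)
    N_j = Counter(xi)              # N_j^i: occurrences of each symptom value
    N_l = Counter(y)               # N_l:   occurrences of each syndrome value
    N_jl = Counter(zip(xi, y))     # N_jl^i: joint occurrences
    mi = 0.0
    for (a, c), n_jl in N_jl.items():
        p_jl = n_jl / N
        mi += p_jl * np.log(p_jl / ((N_j[a] / N) * (N_l[c] / N)))
    return mi

# Illustrative only: a 4-valued symptom and a 5-valued syndrome over 1022 samples.
rng = np.random.default_rng(0)
print(symptom_syndrome_mi(rng.integers(0, 4, 1022), rng.integers(0, 5, 1022)))
```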
The mutual information between each symptom and the syndrome is calculated as shown in FIG. 3.

Between step S300 and step S301 there is a further step: removing from the symptom set the symptoms whose mutual information with the syndrome is smaller than a preset value.

After the mutual information of each symptom with the syndrome is obtained from the mutual information formula, some symptoms turn out to carry very little mutual information. These symptoms can be ignored and feature extraction performed on the retained symptom set; this has little effect on correct classification and greatly reduces the time needed for feature extraction.
In step S302, the joint contribution of the feature subset and the syndrome is given by:

r_s = I(S; Y) / I(X; Y)

where r_s denotes the joint contribution; I(S; Y) denotes the joint mutual information of the feature subset S and the syndrome, computed by the generalized mutual information formula above; and I(X; Y) denotes the total joint mutual information.
The determination method regarding the overall joint mutual information is described below.
According to the definition of the contribution, the total joint mutual information between the symptom set and the syndrome must be calculated. With the conventional mutual information calculation the computational cost is enormous, and a combinatorial explosion occurs when there are many symptoms. For example, with 30 symptoms, each taking 4 values and mapped into 2 classes, about 1.15 × 10^18 combination values would have to be computed, which is infeasible in practice. Statistics show, however, that with a limited number of samples the probability of most combinations is 0, so the total joint mutual information can be computed from the samples without enumerating specific symptom combinations. The calculation method is described below.
Let B = (B_1, B_2, …, B_N)^T be a frequency vector representing the numbers of samples in which the values of the feature variables (symptoms) are all equal; its calculation is described below. Let D = (D_ij), i = 1, 2, …, N; j = 1, 2, …, k, be a frequency matrix representing the numbers of samples in which the values of the feature variables (symptoms) are all equal and the values of the class variable (syndrome) are also equal, and let E = (E_1, E_2, …, E_k)^T be a frequency vector representing the numbers of samples with equal class variable (syndrome) values. The algorithm is implemented by the following steps:
step S3031: assuming that the training sample T is known, initializing parameters: let all the element values of vector B be 1 and let all the element values of matrix D and vector E be 0.
Step S3032: the following procedure was used to obtain the frequencies used in calculating the probabilities.
Let i be 1, 2, …, N, j be i +1, i +2, …, N
If B is presentiIf 0, then the next cycle is executed;
otherwise
If y isi=clThen El=El+1,l=1,2,…,k;
If xi=xjThen Bi=Bi+1,Bj=0;
If xi=xjAnd yi=clThen Dil=Dil+1, l ═ 1, 2, …, k. Step S3033: computing total joint mutual information
Description of the drawings: when D is presentij×Bi×EjWhen equal to 0, log (D)ij/BiEj)=0。
With this algorithm the total joint mutual information I(X; Y) is easily calculated, and when the sample size is not large the amount of computation is greatly reduced. For example, when N = 2000, n = 30 and k = 2, a single pass of the above double loop over the samples suffices to calculate the joint probabilities, and the cost of the algorithm is independent of the number of feature variables (symptoms) and of the number of possible values of each feature variable (symptom).
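A sketch of steps S3031 to S3033 follows. Instead of transcribing the loop literally, it pools identical feature rows, which is exactly what the B, D and E frequencies achieve, and then evaluates the formula reconstructed above; treat it as an interpretation of the patent's algorithm under that reading, not a verbatim implementation:

```python
import numpy as np

def total_joint_mi(X, y):
    """Total joint mutual information I(X; Y) computed directly from samples:
    only feature-value combinations that actually occur are visited."""
    N = len(y)
    classes, y_codes = np.unique(y, return_inverse=True)
    k = len(classes)

    # B: number of samples equal to each distinct feature-value combination.
    _, row_codes, B = np.unique(X, axis=0, return_inverse=True, return_counts=True)

    # E: number of samples taking each class value.
    E = np.bincount(y_codes, minlength=k)

    # D: samples equal on all features AND equal on the class value.
    D = np.zeros((len(B), k))
    np.add.at(D, (row_codes, y_codes), 1)

    # I(X; Y) = sum (D/N) * log(N*D / (B*E)); terms with D = 0 contribute 0.
    mi = 0.0
    for i in range(len(B)):
        for l in range(k):
            if D[i, l] > 0:
                mi += (D[i, l] / N) * np.log(N * D[i, l] / (B[i] * E[l]))
    return mi
```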
The total joint mutual information between the 71 symptoms and syndromes in this embodiment is calculated to be 1.7342.
From the definition of the mutual-information-based contribution of each feature variable, the contribution of each symptom is easily calculated; the individual contributions of all symptoms are shown in FIG. 4.
Step four: acquire the combination of symptoms whose joint contribution is greater than or equal to the set joint contribution threshold, specifically comprising the steps of:

comparing the determined joint contribution with the set joint contribution threshold;

if the determined joint contribution is greater than or equal to the set joint contribution threshold, acquiring the feature subset;

if the determined joint contribution is smaller than the set joint contribution threshold, determining, among the combinations of each symptom in the symptom set with the feature subset, the symptom that maximizes the mutual information between the combination and the syndrome, removing that symptom from the symptom set, and adding it to the feature subset; then returning to step three. A sketch of this loop follows.
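A minimal sketch of the greedy loop, reusing total_joint_mi from the previous sketch; the stopping threshold of 0.95 matches this embodiment, and all other names are illustrative:

```python
def extract_features(X, y, threshold=0.95):
    """Greedy feature extraction truncated by the joint contribution threshold.
    Returns the column indices of the selected feature subset S."""
    total = total_joint_mi(X, y)          # I(X; Y), sketch above
    remaining = list(range(X.shape[1]))
    S = []                                # feature subset, initialized empty
    while remaining:
        # Add the feature whose inclusion maximizes I({S, X_i}; Y).
        best = max(remaining, key=lambda i: total_joint_mi(X[:, S + [i]], y))
        S.append(best)
        remaining.remove(best)
        if total_joint_mi(X[:, S], y) / total >= threshold:   # r_s test
            break
    return S
```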
By feature extraction, 9 symptoms were selected, with a joint contribution of 0.9711; the results are shown in FIG. 5. The selected symptoms are, in order: urgent noise, irritability, hemianesthesia, chest distress, insomnia, fatigue, weakness, occupation, tongue varicosity, dark purple tongue and black complexion. This means that the joint contribution of these 9 symptoms is the largest, i.e. they carry the most information for differentiating the five syndromes.
To verify that the selected symptom combination carries the largest amount of information, an effective test is to differentiate the syndromes using these symptoms. A multi-class support vector machine was chosen for classification, set up as follows: penalty parameter C = 20, radial basis kernel function, and kernel width σ² = 0.1. 863 samples were used as training samples and the remaining 159 samples as test samples. When all symptoms were used as input to the support vector machine, 107 test samples were correctly classified after training, a classification accuracy of 0.6729. When the feature-extracted symptom combination was used as input, 123 test samples were correctly classified, a classification accuracy of 0.7736. The accuracy is higher than with all symptoms as input because the full symptom set contains noise, which feature extraction reduces; the feature-extracted symptom combination is therefore the combination with the largest amount of information.
In this feature extraction example, using the conventional mutual information calculation would lead to a combinatorial explosion and be infeasible in practice, whereas with the fast algorithm for mutual information of discrete variables above, the feature extraction can be completed in about 2 hours.
The invention also discloses an embodiment for real-time recognition of the digits printed on an integrated circuit (IC) card.
This embodiment realizes rapid recognition of the card number printed on a produced IC card, to check whether the printed card number matches the entered card number. Each card is printed with 32 digits, a combination of the Arabic numerals 0-9.
First, the digits printed on the IC card are captured by an image acquisition card to generate a digital image; second, the printed digits are segmented into 32 digit regions by an image processing method, each digit region measuring 8 × 10 pixels; then each digit region is recognized to determine the corresponding digit. Six such IC cards must be processed per second.
The feature extraction method in pattern recognition of the invention is applied to extract features from each digit region, comprising the following steps:

S01, determine discrete feature variables and class variables according to the original pattern information of the sample, and preprocess them; combine all feature variables into a feature variable set and determine the possible values of each feature variable; determine the possible values of the class variable; set a feature subset and initialize it to an empty set.

Here, the original pattern information is the gray values of the pixels in the digital image on the IC card; the feature variables are the pixels of the digital image; and the class variable is the digit value. Each feature variable (pixel) has 2 gray values, 0 and 1, and the feature variable set comprises the 80 pixels. A digit region falls into one of 10 classes, the digits 0-9.
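As a sketch of the preprocessing in step S01, each binarized 8 × 10 digit region can be flattened into 80 binary feature variables, with the digit value (0-9) as the class variable. The binarization threshold of 128 is an assumption made for illustration; the patent only states that each pixel takes the two gray values 0 and 1:

```python
import numpy as np

def digit_region_to_features(region_gray):
    """Flatten an 8x10 digit region into 80 binary feature variables.
    `region_gray` is an (8, 10) array of gray values; the threshold 128
    is illustrative, not specified by the patent."""
    binary = (region_gray >= 128).astype(np.uint8)   # 2 gray values: 0 and 1
    return binary.reshape(-1)                        # 80 feature variables

# Building the sample matrix (regions and labels are assumed to exist):
# X = np.stack([digit_region_to_features(r) for r in regions])
# y = np.array(labels)   # class variable: the digit value 0-9
```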
S02, set a joint contribution threshold. The joint contribution threshold in this embodiment is set to 0.95.

S03, determine the joint contribution between combinations of pixels and the digit, specifically comprising the following steps:

S031, determine the mutual information between each pixel and the digit;

S032, determine the pixel that maximizes the mutual information with the digit, remove it from the pixel set, and add it to the feature subset;

S033, determine the joint contribution between the feature subset and the digit.
In step S031, the mutual information between each pixel and the digit is computed by the same mutual information formula given in the previous embodiment.
Between step S031 and step S032 there is a further step: removing from the pixel set the pixels whose mutual information with the digit is smaller than a preset value.

After the mutual information of each pixel with the digit is obtained from the mutual information formula, some pixels turn out to carry very little mutual information. These pixels can be ignored and feature extraction performed on the retained pixel set; this has little effect on correct classification and greatly reduces the time needed for feature extraction.
In step S033, the joint contribution between the feature subset and the digit is determined by the formula:

r_s = I(S; Y) / I(X; Y)

where r_s denotes the joint contribution, I(S; Y) the joint mutual information, and I(X; Y) the total joint mutual information.

S04, acquire the combination of pixels whose joint contribution is greater than or equal to the set joint contribution threshold, specifically comprising the steps of:

comparing the determined joint contribution with the set joint contribution threshold;

if the determined joint contribution is greater than or equal to the set joint contribution threshold, acquiring the feature subset;

if the determined joint contribution is smaller than the set joint contribution threshold, determining, among the combinations of each pixel in the pixel set with the feature subset, the pixel that maximizes the mutual information between the combination and the digit, removing that pixel from the pixel set, adding it to the feature subset, and then returning to step S033.
With this feature extraction method, only 21 pixels are needed to achieve the desired recognition performance, which greatly improves the efficiency of recognizing the card number printed on the IC card.
As shown in FIG. 2, the present invention also provides a feature extraction device in pattern recognition, including:
the numerical value preprocessing module 10, used for determining discrete feature variables and class variables according to the original pattern information of the sample and preprocessing the feature variables and the class variables;
a threshold setting module 20, configured to set a joint contribution threshold;
a joint contribution degree determining module 30, configured to determine a joint contribution degree between the feature subset set by the numerical preprocessing module and the class variable;
and the feature extraction module 40 is configured to obtain a feature subset of which the joint contribution degree is greater than or equal to the set joint contribution degree threshold according to the joint contribution degree.
Wherein the joint contribution degree determining module 30 includes:
a mutual information determining unit 301, configured to determine mutual information between each feature variable and each class variable;
a maximum value determining unit 303, configured to determine, according to the mutual information, the feature variable that maximizes the mutual information with the class variable, remove it from the feature variable set, and add it to the feature subset; and, for the combination of each feature variable of the feature variable set with the feature subset, to determine the feature variable that maximizes the mutual information between the combination and the class variable, remove it from the feature variable set, and add it to the feature subset;
a joint contribution degree determining unit 304, configured to determine a joint contribution degree of the feature subset and the class variable.
To save feature extraction time, a filter unit 302 is further arranged between the mutual information determining unit and the maximum value determining unit, for removing from the feature variable set the feature variables whose mutual information with the class variable is smaller than a preset value. After the mutual information of each symptom with the syndrome is obtained from the mutual information formula, some symptoms carry very little mutual information; these can be ignored and feature extraction performed on the retained symptom set, with little effect on correct classification, so the time for feature extraction is greatly reduced.
The feature extraction module 40 includes:
a comparing unit 401, configured to compare the determined joint contribution degree with a set joint contribution degree threshold;
an extracting unit 402, configured to extract a feature subset with a joint contribution degree greater than or equal to the set joint contribution degree threshold.
If the joint contribution determined by the comparing unit 401 is greater than or equal to the set joint contribution threshold, the extracting unit 402 extracts the feature subset. If the joint contribution determined by the comparing unit 401 is smaller than the set joint contribution threshold, the mutual information determining unit 301 determines the mutual information between the class variable and the combination of each feature variable of the feature variable set with the feature subset, and the maximum value determining unit 303 determines the feature variable that maximizes this mutual information, removes it from the feature variable set, and adds it to the feature subset; the joint contribution of the feature subset is then determined by the joint contribution determining unit 304.
The joint contribution threshold set by the threshold setting module generally lies in the range [0.9, 0.98].
The feature extraction method and apparatus in pattern recognition of the invention are mainly aimed at feature extraction for discrete variables. They define a new form of joint contribution; feature extraction based on the joint contribution effectively avoids the subjectivity of pre-specifying the number of selected features found in conventional feature extraction methods, improves extraction speed, and can be widely applied to feature extraction from discrete digital image information, fingerprint information, faceprint information, voice information, handwritten/printed character information, and the like.
Claims (8)
1. A feature extraction method in pattern recognition is characterized by comprising the following steps:
determining discrete characteristic variables and class variables according to original pattern information of the sample, combining all the characteristic variables into a characteristic variable set, and determining possible values of each characteristic variable; determining possible values of the class variables; setting a feature subset, and initializing the feature subset into an empty set;
setting a joint contribution threshold;
determining the joint contribution degree of the feature subset and the class variable;
acquiring a feature subset of which the joint contribution degree is greater than or equal to a set joint contribution degree threshold;
the step of determining the joint contribution degree of the feature subset and the class variable comprises the following steps:
a. determining mutual information between each characteristic variable and each class variable;
b. determining a characteristic variable which enables mutual information between the characteristic variable and the class variable to be maximum, removing the characteristic variable from the characteristic variable set, and adding the characteristic variable into the characteristic subset;
c. determining the joint contribution degree of the feature subset and the class variable;
wherein the joint contribution degree r_s of the feature subset and the class variable is determined as:

r_s = I(S; Y) / I(X; Y)

wherein I(S; Y) represents the joint mutual information of the feature subset and the class variable, and I(X; Y) represents the joint mutual information of all characteristic variables and the class variables.
2. The method of extracting features in pattern recognition according to claim 1,
the original pattern information is the values corresponding to human body symptoms, the characteristic variables are the human body symptoms, and the class variable is the disease type of the patient; or,
the original pattern information is the gray values of pixels in a digital image on the surface of the integrated circuit card, the characteristic variables are the pixels of the digital image, and the class variable is the digit value.
3. The method of extracting features in pattern recognition according to claim 1, further comprising, between step a and step b, the step of: removing from the characteristic variable set the characteristic variables whose mutual information with the class variables is less than a preset value.
4. The method for extracting features in pattern recognition according to claim 3, wherein the joint mutual information of all characteristic variables and class variables is obtained by sample calculation, and the specific process is as follows:

step 1:

using a frequency vector B = (B_1, B_2, …, B_N)^T to represent the numbers of samples in which the values of the characteristic variables are all equal, where N represents the total number of samples;

using a frequency matrix D = (D_ij), i = 1, 2, …, N; j = 1, 2, …, k, to represent the numbers of samples in which the values of the characteristic variables are all equal and the values of the class variables are also equal, where k represents the number of values of the class variables;

using a frequency vector E = (E_1, E_2, …, E_k)^T to represent the numbers of samples with equal class variable values;

step 2:

initializing parameters: setting all element values of the vector B to 1, and setting all element values of the matrix D and the vector E to 0;

step 3:

obtaining the frequencies used in calculating the probabilities:

letting i = 1, 2, …, N and j = i+1, i+2, …, N, where y_i represents the class variable value of the i-th sample, x_i represents the feature vector value of the i-th sample, and c_l represents the l-th value of the class variable:

if B_i = 0, executing the next i loop;

otherwise:

if y_i = c_l, then E_l = E_l + 1, l = 1, 2, …, k;

if x_i = x_j, then B_i = B_i + 1 and B_j = 0;

if x_i = x_j and y_i = c_l, then D_il = D_il + 1, l = 1, 2, …, k;

step 4:

calculating the joint mutual information of all the characteristic variables and the class variables:

I(X; Y) = Σ_{i=1}^{N} Σ_{j=1}^{k} (D_ij / N) log [ N · D_ij / (B_i · E_j) ]

wherein, when D_ij × B_i × E_j = 0, the corresponding log term is set to 0.
5. The method according to claim 1, wherein the step of obtaining the combination of the feature variables whose joint contribution degree is greater than or equal to the set joint contribution degree threshold value includes:
comparing the determined joint contribution degree with the set joint contribution degree threshold,
if the determined joint contribution degree is larger than or equal to the set joint contribution degree threshold value, acquiring the feature subset;
and if the determined joint contribution degree is smaller than the set joint contribution degree threshold, determining, among the combinations of each characteristic variable of the characteristic variable set with the characteristic subset, the characteristic variable that maximizes the mutual information between the combination and the class variable, removing the characteristic variable from the characteristic variable set, and adding it to the characteristic subset.
6. A feature extraction device in pattern recognition, characterized by comprising:
the numerical value preprocessing module is used for determining discrete characteristic variables and class variables according to original pattern information of the sample, combining all the characteristic variables into a characteristic variable set and determining the possible values of each characteristic variable; determining possible values of the class variables; setting a feature subset, and initializing the feature subset into an empty set;
the threshold setting module is used for setting a joint contribution degree threshold;
the joint contribution degree determining module is used for determining the joint contribution degree of the feature subset and the class variable;
the characteristic extraction module is used for acquiring a characteristic subset of which the joint contribution degree is greater than or equal to a set joint contribution degree threshold value according to the joint contribution degree;
wherein the joint contribution degree determination module comprises:
the mutual information determining unit is used for determining the mutual information between each characteristic variable and each class variable;
a maximum value determining unit, configured to determine, according to the mutual information, a feature variable that maximizes the mutual information between the feature variable and the class variable, remove the feature variable from the feature variable set, and add the feature variable to the subset of the feature variable set;
the joint contribution degree determining unit is used for determining the joint contribution degree of the feature subset and the class variable, wherein the joint contribution degree r_s of the feature subset and the class variable is determined as:

r_s = I(S; Y) / I(X; Y)

wherein I(S; Y) represents the joint mutual information of the feature subset and the class variable, and I(X; Y) represents the joint mutual information of all characteristic variables and the class variables.
7. The apparatus according to claim 6, wherein a filter unit is further provided between the mutual information determining unit and the maximum value determining unit, for removing from the feature variable set the feature variables whose mutual information with the class variable is smaller than a predetermined value.
8. The apparatus of claim 6, wherein the feature extraction module comprises:
a comparison unit for comparing the determined joint contribution degree with a set joint contribution degree threshold;
and the extracting unit is used for extracting the feature subset of which the joint contribution degree is greater than or equal to the set joint contribution degree threshold.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN200710118156XA CN101334843B (en) | 2007-06-29 | 2007-06-29 | Pattern recognition characteristic extraction method and apparatus |
Publications (2)
Publication Number | Publication Date |
---|---|
CN101334843A CN101334843A (en) | 2008-12-31 |
CN101334843B true CN101334843B (en) | 2010-08-25 |
Family
ID=40197432
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN200710118156XA Expired - Fee Related CN101334843B (en) | 2007-06-29 | 2007-06-29 | Pattern recognition characteristic extraction method and apparatus |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN101334843B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105574351B (en) * | 2015-12-31 | 2017-02-15 | 北京千安哲信息技术有限公司 | Medical data processing method |
CN112559591B (en) * | 2020-12-08 | 2023-06-13 | 晋中学院 | Outlier detection system and detection method for cold roll manufacturing process |
CN113780481B (en) * | 2021-11-11 | 2022-04-08 | 中国南方电网有限责任公司超高压输电公司广州局 | Monitoring method and device for power equipment, computer equipment and storage medium |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1617161A (en) * | 2003-11-10 | 2005-05-18 | 北京握奇数据系统有限公司 | Finger print characteristic matching method based on inter information |
CN1631321A (en) * | 2003-12-23 | 2005-06-29 | 中国科学院自动化研究所 | Multiple modality medical image registration method based on mutual information sensitive range |
Non-Patent Citations (1)
Title |
---|
Sun Zhancheng, Xi Guangcheng, Yi Jianqiang, Li Haixia. An intelligent system model for TCM syndrome differentiation. Journal of System Simulation, 2007, 19(10): 2318-2320, 2391. |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| C06 | Publication | |
| PB01 | Publication | |
| C10 | Entry into substantive examination | |
| SE01 | Entry into force of request for substantive examination | |
| C14 | Grant of patent or utility model | |
| GR01 | Patent grant | |
| CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20100825; Termination date: 20170629 |