CN111553127B - Multi-label text data feature selection method and device - Google Patents


Info

Publication number
CN111553127B
CN111553127B (application CN202010261235.1A)
Authority
CN
China
Prior art keywords
feature
sample
neighborhood
label
mark
Prior art date
Legal status
Active
Application number
CN202010261235.1A
Other languages
Chinese (zh)
Other versions
CN111553127A
Inventor
Sun Lin
Wang Tianxiang
Li Wenfeng
Li Mengmeng
Current Assignee
Henan Normal University
Original Assignee
Henan Normal University
Priority date
Filing date
Publication date
Application filed by Henan Normal University filed Critical Henan Normal University
Priority to CN202010261235.1A priority Critical patent/CN111553127B/en
Publication of CN111553127A publication Critical patent/CN111553127A/en
Application granted granted Critical
Publication of CN111553127B publication Critical patent/CN111553127B/en


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/103 Formatting, i.e. changing of presentation of documents
    • G06F40/117 Tagging; Marking up; Designating a block; Setting of attributes


Abstract

The invention relates to a multi-label text data feature selection method and device, belonging to the technical field of text data processing. First, considering the second-order correlations between labels in a text dataset, the labels are grouped so that the method better suits multi-label datasets; the final score of each feature is determined from the scores it obtains in each label group, and a set number of the highest-scoring features are selected to form a feature set. Then, based on the obtained feature set, the neighborhood granularity of each sample is determined from its classification intervals for the labels in the text dataset, yielding a multi-label neighborhood decision system; the importance is calculated using the improved dependency of the neighborhood rough set, and the obtained feature set is screened again, thereby realizing feature selection for multi-label text data. Compared with the original neighborhood rough set feature selection method applied to all attributes, the method has lower time complexity and finds a more accurate optimal feature subset.

Description

Multi-label text data feature selection method and device
Technical Field
The invention relates to a multi-label text data feature selection method and device, and belongs to the technical field of text data processing.
Background
Multi-label learning is a research hotspot in fields such as pattern recognition, machine learning, data mining, and data analysis. In multi-label learning, each instance is described not only by a feature vector but also corresponds to several decision attributes. Many real-life problems fall into the category of multi-label learning, for example: a movie may belong to several categories at the same time, such as "action", "science fiction" and "war"; a document may have several topics at the same time, such as "medicine", "science" and "artificial intelligence"; an image may be annotated with several semantics at the same time, such as "street", "car" and "pedestrian". Such problems are difficult to classify accurately with single-label classification methods, and therefore, in recent years, scholars have paid increasing attention to multi-label learning.
Multi-label classification faces many challenges: on the one hand, each instance may carry several category labels at the same time, and there are also certain associations between the labels; on the other hand, multi-label data is usually high-dimensional, which may cause the curse of dimensionality and seriously affect the classification performance of the classifier. Dimensionality reduction is therefore important in data preprocessing. Feature extraction and feature selection are the main means of dimensionality reduction: the former converts the original high-dimensional features into a new low-dimensional space through a transformation or mapping, while the latter selects an optimal subset from the original feature space according to certain evaluation criteria. There are three main ways for feature selection to process multi-label data: filter, wrapper, and embedded methods. The filter method depends on the general characteristics of the training data and uses the feature selection process as a preprocessing step, so it has lower computational cost and better generalization ability; the wrapper method performs multiple rounds of training with a base model, removing the features with small weight coefficients after each round and then training the next round on the new feature set, so it is relatively expensive computationally; the embedded method integrates the feature selection process into the training process to reduce the overall time spent reclassifying different subsets.
The Fisher Score method, which evolved from Fisher discriminant analysis (FDA), is a relatively common feature selection method under supervised learning. In 2002, Guyon et al. proposed an F-score feature selection formula very similar to Fisher discriminant analysis; subsequently, Chen et al. derived an F-score expression based on a classification problem; in 2010, Salih et al. improved the F-Score for the first time so that it could be applied to multi-class problems; in 2011, Gu et al. considered the correlation and redundancy among features, further refined the F-Score, and proposed the generalized Fisher Score; in 2012, Xie Juanying et al. considered the dimensional problem among features and improved the multi-class Fisher Score; in 2013, Tao et al. added a weight coefficient to the conventional formula to account for overlap between categories and consistency of features. However, the conventional Fisher Score can typically only be calculated for single-label datasets.
Feature selection is an indispensable preprocessing step in multi-label learning, which is commonly used to handle many complex tasks. Among the various feature selection methods, the rough set, as a specific granular computing model, attracts much attention owing to the following advantages: it uses only the attributes contained in the dataset, without any other information, and it can discover data dependencies and reduce the number of features under the constraint of a limited information set. Zhang and Li proposed a rough-set-based fractal endpoint detection multi-label algorithm that maintains good performance and handles noise more irregular than speech; other researchers proposed, as the main strategy for the feature selection task, converting a multi-label feature selection task into several binary single-label feature selection tasks (known as problem transformation), but this severs the relationships between labels and easily produces imbalanced data. Conventional rough set models can only process discrete data; for real-valued or noisy data, discretization preprocessing is typically employed, which can result in poor classification accuracy. To overcome this drawback, many researchers have supplemented and improved the traditional rough set theory: Li et al. studied a feature reduction method based on the neighborhood rough set and the discernibility matrix; Zhang et al. proposed different fuzzy relations for different types of attributes to measure similarity between samples, and proposed several robust fuzzy rough set (FRS) models to enhance the robustness of classical FRS; Wang et al. constructed a local neighborhood rough set to process label data. In these improvements, the judgment conditions in the importance calculation are too strict, and features with similar importance cannot be further distinguished, so the finally selected features are not accurate enough.
Disclosure of Invention
The invention aims to provide a multi-label text data feature selection method and device, which are used for solving the problems of low accuracy and high algorithmic complexity of existing multi-label text data feature selection methods.
To solve the above technical problems, the invention provides a multi-label text data feature selection method comprising the following steps:
1) Acquiring a text dataset containing a plurality of markers;
2) Dividing the marks into three mark groups of positive correlation, negative correlation and uncorrelation according to the second-order correlation between the marks in the text data set;
3) Calculating the score of the feature in each marking group according to the category of the marking group, determining the final score of each feature according to the score calculated by the feature in each marking group, and selecting a set number of features with higher scores from the final score to form a feature set;
4) Determining the neighborhood granularity of each sample according to the classification interval of each sample in the text data set for the mark to obtain a multi-mark neighborhood rough set;
5) Constructing a multi-label neighborhood decision system according to the neighborhood granularity and the feature set, determining under this system the number of sets X_j (j = 1, 2, …, M) to which each sample's neighborhood belongs and the number of sets \bar{X}_j to which it belongs, and determining the dependency of the multi-label neighborhood rough set based thereon, where M is the number of decision attributes in the decision set, and X_j and \bar{X}_j are the partition of the sample set under the j-th label, representing respectively the set of samples that hit the j-th label and the set of samples that miss the j-th label;
6) Calculating the importance of the conditional attributes relative to the decision attributes in the multi-label neighborhood decision system according to the dependency of the multi-label neighborhood rough set, and screening the conditional attributes according to the importance, thereby realizing feature selection for the text data.
The invention also provides a multi-label text data feature selection device comprising a processor and a memory, the processor executing a computer program stored in the memory to realize the above multi-label text data feature selection method.
Firstly, considering the second-order correlations between labels in the text dataset, the labels are grouped and the score of each feature in each label group is calculated according to the category of the group, thereby improving the Fisher-Score method so that it better suits multi-label datasets; the final score of each feature is determined from the scores calculated over the label groups, and a set number of the highest-scoring features are selected to form a feature set. Then, based on the obtained feature set, the neighborhood granularity of each sample is determined from its classification intervals for the labels in the text dataset, yielding a multi-label neighborhood rough set; the importance is calculated using the dependency of the neighborhood rough set, and the obtained feature set is screened again, realizing feature selection for multi-label text data. Compared with the original neighborhood rough set feature selection algorithm applied to all attributes, the time complexity is lower and the optimal feature subset is found more accurately.
Further, to be better suited for multi-label text datasets, the calculation formula for each feature score is:
where C = {f_1, f_2, …, f_m} denotes the feature corpus, L = {l_1, l_2, …, l_t} denotes the label corpus, n_k denotes the number of samples in the k-th class, f_{j,i} denotes the value of the i-th feature in the j-th sample, μ_i denotes the mean of the i-th feature f_i over all samples, μ_i^k denotes the mean of the i-th feature f_i in the k-th class, c denotes the total number of classes, and R_g(l_a, l_b) denotes the correlation weight of labels l_a and l_b.
Further, in order to avoid interference from noisy data, the classification interval of a sample for a label is:
where margin_l(x) is the classification interval of sample x for label l, NM_l(x) is the set of heterogeneous-sample distances arranged in ascending order, NH_l(x) is the set of homogeneous-sample distances arranged in ascending order, |NH_l(x)| is the number of homogeneous samples, |NM_l(x)| is the number of heterogeneous samples, NM_l(x_i) and NH_l(x_i) denote respectively a heterogeneous sample near sample i and a homogeneous sample near sample i under category label l, and Δ(x, NM_l(x_i)) and Δ(x, NH_l(x_i)) denote the distances from sample point x to NM_l(x_i) and NH_l(x_i), respectively.
Further, in order to partition the neighborhood rough set more accurately, the calculation formula of the neighborhood granularity is as follows:
where margin_{l_i}(x) is the classification interval of sample x for label l_i, t is the number of labels, and M_L(x) is the neighborhood granularity of sample x.
Further, the multi-label neighborhood decision system is MDNS = <U, C ∪ D, δ>, where U = {x_1, x_2, …, x_n} is the set of text data samples, B = {f_1, f_2, …, f_N} ⊆ C is a feature subset describing the text data, C is the feature set describing the text data, N ≤ |C|, L = {l_1, l_2, …, l_M} is the corresponding label set, and D = {l_1, l_2, …, l_m} is the set of classification decision attributes.
Further, in order to effectively reduce the risk that important attributes are ignored, the calculation formula of the multi-label neighborhood rough set dependency is as follows:
where ρ_B(D) is the weight coefficient, |h(δ_B(x_i))| denotes the number of sets X_j (j = 1, 2, …, M) to which the neighborhood belongs under feature set B, |m(δ_B(x_i))| denotes the number of sets \bar{X}_j to which it belongs under feature set B, |U| is the number of samples in the training set, |L| is the number of labels in the label set, \underline{N}_B D is the lower approximation of the multi-label neighborhood rough set, δ_B(x_i) is the set of samples within the neighborhood radius of the i-th sample under feature subset B, D_j denotes the sample set having category label l_j, D_i denotes the set of labels possessed by sample x_i, U = {x_1, x_2, …, x_n} denotes the sample set, and B = {f_1, f_2, …, f_N} is the feature subset describing the data.
Further, the calculation formula of the importance is:
sig(a, B, D) = γ_{B∪{a}}(D) − γ_B(D),
where sig(a, B, D) is the importance of the conditional attribute a ∈ C − B relative to the decision attribute D, γ_{B∪{a}}(D) is the dependency of the decision attribute D on the conditional attribute set B ∪ {a}, and γ_B(D) is the dependency of the decision attribute D on the conditional attribute set B.
Drawings
FIG. 1 is a schematic diagram of classification intervals for a sample according to an embodiment of the invention;
FIG. 2 is a flow chart of a multi-labeled text class data feature selection method of the present invention;
FIG. 3-a is a diagram showing the comparison of the index AP of the present invention with the index AP of the prior art method in the Business data set in the experimental example;
FIG. 3-b is a graph showing the comparison of the index CV of the present invention with the index CV of the prior art method in the Business data set of the experimental example;
FIG. 3-c is a diagram showing the comparison of the index HL of the present invention with the index HL of the prior art method in the Business data set in the experimental example;
FIG. 3-d is a graphical representation of the comparison of the present invention with the RL index of the prior art method in the Business data set of the experimental example;
FIG. 3-e is a diagram showing the comparison of the index MicF1 of the present invention with the index MicF1 of the prior art method in the Business data set of the experimental example;
FIG. 4-a is a graph showing the comparison of the index AP of the present invention with the index AP of the prior art method in the Computer data set of the experimental example;
FIG. 4-b is a graph showing the comparison of the index CV of the present invention with that of the prior art method in the Computer data set of experimental examples;
FIG. 4-c is a graph showing the comparison of the index HL of the present invention with the index HL of the conventional method in the Computer dataset of the experimental example;
FIG. 4-d is a graph showing the comparison of the index RL of the present invention with the index RL of the prior art method in the Computer data set in the experimental example;
FIG. 4-e is a graph showing the comparison of the present invention with the index MicF1 of the prior art method in the Computer dataset of the experimental example;
fig. 5 is a block diagram showing the structure of a multi-label text-based data feature selection apparatus according to the present invention.
Detailed Description
The following describes the embodiments of the present invention further with reference to the drawings.
Method embodiment
Before describing the specific scheme of the invention, some relevant background is given: the related concepts of mutual information, the Fisher-Score algorithm, and the neighborhood rough set algorithm.
1) Related concepts of mutual information
Let A and B be two events with P(A) > 0. The conditional probability of event B occurring given that event A has occurred is:
For a discrete random variable X = {x_1, x_2, …, x_n}, the entropy of the random variable X can be expressed as:
where P(x_i) is the probability of event x_i occurring and n is the total number of possible events (states). Obviously, for a fully determined variable X, H(X) = 0; for a random variable X, H(X) > 0 (non-negativity), and the value of H(X) increases with the number of states n, i.e., the more values the random variable can take and the more states there are, the greater the information entropy and the greater the degree of disorder; the entropy is largest when the distribution is uniform.
For two different discrete random variables X = {x_1, x_2, …, x_n} and Y = {y_1, y_2, …, y_m}, the joint entropy of the random variables X and Y can be defined as:
where P(x_i, y_j) is the joint probability of x_i and y_j, i.e., the probability that events x_i and y_j occur simultaneously.
For two different discrete random variables X = {x_1, x_2, …, x_n} and Y = {y_1, y_2, …, y_m}, the conditional entropy of the random variable X given the random variable Y can be defined as:
where P(y_j) is the probability of event y_j occurring and P(x_i | y_j) is the conditional probability of event x_i occurring given that event y_j has occurred. Obviously, when X and Y are completely independent, H(X|Y) = H(X); when X and Y are fully correlated, H(X|Y) = 0; for related variables in general, H(X|Y) > 0. Similarly, the conditional entropy of the random variable Y given the random variable X can be defined as:
For the variable X, the reduction in its uncertainty due to the occurrence of the variable Y and the correlation between the two variables is called the mutual information, defined as follows:
I(X,Y)=H(X)-H(X|Y) (6)
where H(X) is the information entropy of the random variable X and H(X|Y) is the conditional entropy of X given Y. It can be shown that mutual information is non-negative, i.e., I(X, Y) ≥ 0, and symmetric, i.e.:
I(X,Y)=H(X)-H(X|Y)=H(Y)-H(Y|X)=I(Y,X) (7)
the joint entropy of the random variable X and the random variable Y is as follows:
Wherein H (X) and H (Y) represent information entropy of the random variable X and the random variable Y, respectively, and H (X, Y) represents joint entropy of the random variable X and the random variable Y.
The mutual information has the defect of no normalization, and can be normalized by using generalized correlation function in order to compare the mutual dependence degree between different variables
Wherein R is 0.ltoreq.R g (X, Y). Ltoreq.1, it can be seen that when the random variable X and the random variable Y are completely correlated, there is I (X, Y) =H (X) =H (Y), R g (X, Y) =1; when X and Y are completely independent, I (X, Y) =0, r g =0,R g The larger the value of (c) is, the stronger the correlation between the random variable X and the random variable Y is.
The mutual information measures the statistical independent relation between a certain feature and a class, and the feature f and the class l are aimed at i The mutual information formula of (2) can be defined as
Wherein P (f, l) i ) Representing that the training set contains both feature f and category l i P (f) represents the probability that the training set contains the feature f, P (l) i ) Indicating that the class in the training set belongs to l i Probability of P (f|l) i ) Expressed in category l i Including the probability of feature f. As can be seen from equation (10), when P (f|l i )>P(f),MI(f,l i )>0, the characteristic f and the class l are explained i Is positively correlated, while MI (f, l i ) The larger the value of the description feature f and class l i The stronger the positive correlation of (2); in contrast, when P (f|l i )<P(f),MI(f,l i )<0, the characteristic f and the class l are explained i Is inversely related, while MI (f, l i ) Smaller values illustrate the features f and classesPin l i The stronger the negative correlation of (c).
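The following is a minimal Python sketch of the quantities in formulas (1)-(10): empirical entropy, joint entropy, mutual information, and the normalized correlation R_g. The exact normalization of formula (9) is not reproduced in this text, so the symmetric form 2I(X,Y)/(H(X)+H(Y)) used below is an assumption consistent only with the stated boundary cases (R_g = 1 for fully correlated variables, R_g = 0 for independent ones); the function names and the base-2 logarithm are likewise illustrative.

import numpy as np
from collections import Counter

def entropy(values):
    # H(X) = -sum_i P(x_i) * log2 P(x_i), estimated from observed frequencies
    n = len(values)
    probs = np.array([c / n for c in Counter(values).values()])
    return float(-np.sum(probs * np.log2(probs)))

def joint_entropy(x, y):
    # H(X, Y) from the empirical joint distribution of the pairs (x_i, y_i)
    return entropy(list(zip(x, y)))

def mutual_information(x, y):
    # I(X, Y) = H(X) + H(Y) - H(X, Y), equivalent to H(X) - H(X|Y)
    return entropy(x) + entropy(y) - joint_entropy(x, y)

def normalized_correlation(x, y):
    # Assumed normalization: R_g(X, Y) = 2 * I(X, Y) / (H(X) + H(Y)) in [0, 1]
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0
    return 2.0 * mutual_information(x, y) / (hx + hy)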
2) Fisher-Score algorithm
The Fisher-Score is an effective criterion for evaluating sample characteristics, and the traditional Fisher-Score is derived from a Fisher linear discrimination method, and is essentially to select characteristics with small intra-class differences and large inter-class differences.
Given a feature set {f_1, f_2, …, f_m} and training samples x_j ∈ R^m (j = 1, 2, …, N) drawn from c (c ≥ 2) classes, the inter-class divergence S_b(f_i) of the i-th feature f_i of the training samples and the intra-class divergence S_w^k(f_i) of the k-th class under the i-th feature f_i are defined as:
where n_k is the number of samples of the k-th class, μ_i^k is the mean of the k-th class under the i-th feature, μ_i is the mean of all samples under the i-th feature, and f_{j,i}^k is the value of the j-th sample of the k-th class under the i-th feature f_i. Thus, the Fisher Score of the i-th feature f_i of the training samples is:
It can be seen from formula (13) that the larger the inter-class divergence S_b(f_i) of the i-th feature f_i and the smaller the sum of the intra-class divergences over the c classes, the larger the value of FS(f_i), indicating that the feature f_i is more discriminative and more important.
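A small sketch of the classical Fisher Score of formulas (11)-(13), computing FS(f_i) = S_b(f_i) / Σ_k S_w^k(f_i) for every feature of a single-label dataset; the array layout and the function name are illustrative assumptions.

import numpy as np

def fisher_score(X, y):
    # X: (n_samples, n_features) array, y: single-label class indices
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    overall_mean = X.mean(axis=0)                      # mu_i for every feature
    s_b = np.zeros(X.shape[1])                         # inter-class divergence S_b
    s_w = np.zeros(X.shape[1])                         # summed intra-class divergence
    for k in np.unique(y):
        X_k = X[y == k]
        n_k = X_k.shape[0]
        class_mean = X_k.mean(axis=0)                  # mu_i^k
        s_b += n_k * (class_mean - overall_mean) ** 2
        s_w += ((X_k - class_mean) ** 2).sum(axis=0)
    return s_b / np.maximum(s_w, 1e-12)                # larger => more discriminative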
3) Neighborhood rough set algorithm
Let U denote the sample space and x be a given sample, the classification interval for sample x be:
margin(x)=Δ(x-NM(x))-Δ(x-NH(x)) (14)
where NH(x) denotes the homogeneous sample closest to sample x in the sample space U, called the Nearest Hit (NH) of x, and NM(x) denotes the heterogeneous sample closest to sample x in U, called the Nearest Miss (NM) of x. Δ(x − NM(x)) and Δ(x − NH(x)) then represent the distances from the sample point x to NM(x) and NH(x), respectively (see FIG. 1).
Assume U is the sample space; for any x ∈ U, x may be associated with labels in the label set L = {l_1, l_2, …, l_t}. Given l ∈ L, the classification interval of sample x under the label l is defined as:
m l (x)=Δ(x,NM l (x))-Δ(x,NH l (x)) (15)
where NH_l(x) denotes the homogeneous sample closest to x in the sample space U under category label l, and NM_l(x) denotes the heterogeneous sample closest to x under category label l. Δ(x − NH_l(x)) and Δ(x − NM_l(x)) denote the distances from the sample point x to NH_l(x) and NM_l(x), respectively.
Assume the sample space is U and the label set is L = {l_1, l_2, …, l_t}. For x ∈ U and a given l ∈ L, with classification interval m_l(x) of sample x under label l, the neighborhood of x is:
δ(x)={y|Δ(x,y)≤m l (x),y∈U} (16)
When m_l(x) ≤ 0, let m_l(x) = 0.
Assume DS = <U, Δ> is a non-empty metric space, x ∈ U, and δ ≥ 0; the δ neighborhood of the point x is expressed as:
δ(x)={y|Δ(x,y)≤δ,y∈U}. (17)
Consider the set of all samples U = {x_1, x_2, …, x_n}, the set of conditional attributes A = {a_1, a_2, …, a_N} describing the samples, and the set of classification decision attributes D = {l_1, l_2, …, l_m}. Given <U, A, D>, if A generates a set of neighborhood relations, then <U, A, D> is called a neighborhood decision system.
Given a non-empty finite set Ω on the real space and a neighborhood relation N on it, i.e., a neighborhood approximation space NS = <U, N>, with equivalence classes {X_1, X_2, …, X_n}, the upper and lower approximations of X in NS = <U, N> are respectively:
the approximate boundaries of X are:
For single-label learning, the lower approximation of the neighborhood rough set embodies the ability of the attribute set to classify the samples by borrowing the concept of a neighborhood. In multi-label learning, the definition of the lower approximation is similar. The relevant concepts and properties of the multi-label neighborhood rough set model are given below.
In the multi-label neighborhood decision system MNDS = <U, C, D, f, Δ, δ>, with label set L = {l_1, l_2, …, l_m}, D_j denotes the set of samples having category label l_j and D_i denotes the set of labels possessed by sample x_i. Given B ⊆ C, the approximation space of the multi-label neighborhood rough set is defined as:
the approximate boundary of X is
In the multi-label neighborhood decision system MNDS = <U, C ∪ D>, the lower approximation \underline{N}_B D is called the positive region of the multi-label classification at the knowledge granularity given by attribute B, denoted POS_B(D). Thus, the dependency of the multi-label classification can be expressed as:
In a multi-label neighborhood decision system MNDS = <U, C ∪ D>, 0 ≤ r_B(D) ≤ 1. Then:
1) When r_B(D) = 1, D is strongly dependent on B.
2) When 0 < r_B(D) < 1, D is weakly dependent on B.
3) When r_B(D) = 0, D is completely independent of B.
The definition of the dependency reflects the importance of the conditional attributes with respect to the decision attributes, so it can not only examine the dependency of the resulting classification attributes on the conditional attributes, but also effectively discover the key attributes that play a decisive role in classification. Thus, the importance of a conditional attribute a ∈ C − B on the conditional attribute set B relative to the decision attribute set D can be expressed as:
sig(a,B,D)=γ B∪{a} (D)-γ B (D). (25)
From the definition of attribute importance, attribute a is redundant when sig(a, B, D) = 0. There are two cases: either attribute a is irrelevant to the current classification task, or the classification information contained in attribute a is already contained in other attributes; such an attribute is called redundant.
On this basis, the invention improves the Fisher-Score method by combining the theoretical basis of mutual information with second-order label correlation, so that it better suits multi-label datasets; then the score of each feature is calculated by the MLFisher-Score (modified Fisher-Score) method, and the results are arranged in descending order to obtain a feature sequence; the attributes with higher scores in the feature sequence computed by the MLFisher-Score method are selected; finally, under these attributes, feature selection is carried out with a neighborhood rough set improved according to the classification intervals of all samples in the text dataset for the labels, using the improved neighborhood rough set attribute dependency and importance calculation formulas, thereby realizing feature selection for multi-label text. The implementation flow of the method is shown in FIG. 2, and the specific process is as follows.
1. Multi-label text data is acquired.
2. The marks are divided into three mark groups of positive correlation, negative correlation and uncorrelated according to the second-order correlation between marks in the acquired text data set.
The label set of multi-label data is a binary distribution, i.e., each label is either present or absent. In order to better recognize whether the relationship between two labels is positive, negative or uncorrelated, the invention calculates the correlation between two labels according to equation (26) on the basis of equation (10).
For a given multi-label dataset MNDS = <X, C ∪ L>, where X = {x_1, x_2, …, x_n} denotes the sample corpus, C = {f_1, f_2, …, f_m} denotes the feature corpus, and L = {l_1, l_2, …, l_t} denotes the label corpus, the correlation of label l_i and label l_j is:
where P(l_j) is the probability that label l_j is hit, and P(l_j | l_i) is the probability that label l_j is hit given that label l_i is hit. Similarly, the correlation between the miss of label l_i and the miss of label l_j can be obtained:
Formula (26) and formula (27) calculate, respectively, the correlation of labels l_i and l_j and the correlation of their misses. A new way of calculating the correlation between labels is defined from these two formulas, i.e.:
where MI(l_i | l_j) is the correlation of labels l_i and l_j, MI(\bar{l}_i | \bar{l}_j) is the correlation of the misses of labels l_i and l_j, and \bar{l}_i and \bar{l}_j denote misses of labels l_i and l_j.
By analyzing the whole tag set, the tag set in most multi-tag data sets is found to be a sparse matrix, i.e. the number of missed tags is much larger than the number of hit tags, and obviously, for this reason, the knowledge importance when two tags hit at the same time is much larger than that when two tags miss at the same time. To address this, a corresponding improvement is made for equation (28), as follows:
wherein θ is an importance parameter, and the calculation method is as follows
where the numerator is the number of hits over all labels in the label set, l_i^j denotes the j-th label of the i-th sample, and nt denotes the total number of samples multiplied by the total number of labels. Obviously 0 ≤ θ ≤ 1, and the sparser the label matrix, the smaller the value of θ and the larger the correlation weight of labels that hit simultaneously. The positive or negative correlation between labels is calculated by formula (29): if the result ρ_ij is greater than 0, labels l_i and l_j are positively correlated; if ρ_ij is less than 0, labels l_i and l_j are negatively correlated; if ρ_ij equals 0, labels l_i and l_j are uncorrelated. However, when the value of ρ_ij is very close to 0, the two labels are clearly much closer to independence, so the values of ρ_ij are first normalized so that they are all mapped into the interval [−1, 1]; it is then specified that when |ρ_ij| ≤ 0.2, labels l_i and l_j are uncorrelated; when −1 ≤ ρ_ij < −0.2, labels l_i and l_j are negatively correlated; and when 0.2 < ρ_ij ≤ 1, labels l_i and l_j are positively correlated.
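The following sketch illustrates the label-pair grouping just described. The hit/miss correlations follow the structure of formulas (26)-(27) and the sparsity parameter θ follows formula (30); the way formula (29) combines them is not reproduced in this text, so the (1 − θ)/θ weighting below is only an assumption consistent with the surrounding discussion, as are the function names and the use of the maximum absolute value for normalization into [−1, 1].

import numpy as np

def pair_correlation(Y, i, j, eps=1e-12):
    # Y: (n_samples, n_labels) 0/1 matrix; returns an assumed rho_ij for labels i, j
    yi, yj = Y[:, i], Y[:, j]
    p_j = yj.mean()                                                    # P(l_j)
    p_j_given_i = yj[yi == 1].mean() if (yi == 1).any() else 0.0       # P(l_j | l_i)
    mi_hit = np.log((p_j_given_i + eps) / (p_j + eps))                 # cf. formula (26)
    q_j = (1 - yj).mean()
    q_j_given_i = (1 - yj)[yi == 0].mean() if (yi == 0).any() else 0.0
    mi_miss = np.log((q_j_given_i + eps) / (q_j + eps))                # cf. formula (27)
    theta = Y.mean()                                                   # formula (30)
    return (1.0 - theta) * mi_hit + theta * mi_miss                    # assumed combination

def group_label_pairs(Y):
    # Normalize rho into [-1, 1] and split the pairs by the +/-0.2 thresholds
    t = Y.shape[1]
    pairs = [(i, j) for i in range(t) for j in range(i + 1, t)]
    rho = np.array([pair_correlation(Y, i, j) for (i, j) in pairs])
    rho = rho / (np.max(np.abs(rho)) + 1e-12)
    groups = {"positive": [], "negative": [], "uncorrelated": []}
    for pair, r in zip(pairs, rho):
        if r > 0.2:
            groups["positive"].append(pair)
        elif r < -0.2:
            groups["negative"].append(pair)
        else:
            groups["uncorrelated"].append(pair)
    return groups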
If two labels are considered as a group, there are four possible cases for the group: (1) {label 1 hit, label 2 hit}, (2) {label 1 hit, label 2 miss}, (3) {label 1 miss, label 2 hit}, (4) {label 1 miss, label 2 miss}, denoted {1,1}, {1,0}, {0,1} and {0,0}, respectively. For a positively correlated label group, the two cases {1,0} and {0,1} are treated as the same category, and the question is which features best distinguish the resulting three cases, i.e., this category, {1,1}, and {0,0}; for a negatively correlated label group, the two cases {1,1} and {0,0} are treated as the same category, and the question is which features best distinguish this category from the cases {1,0} and {0,1}; finally, for an uncorrelated label group, no case is ignored, and the question is which features best distinguish {1,1}, {1,0}, {0,1} and {0,0}.
If there is a clear correlation between two labels, for example the two categories "finance" and "economy" in text classification, there is obviously a strong positive correlation between them, i.e., the two labels tend to appear together or be absent together. In this case the division is mainly assisted by these two cases: appearing together and being absent together can be regarded as two opposite topics, while the other cases cannot be ignored completely, since they may be decided by some other labels or some key features. Therefore the three cases {{1,0}, {0,1}}, {1,1} and {0,0} are considered; the negatively correlated case is handled in the same way as the uncorrelated case described above.
3. And calculating the scores of the features in each marker group according to the categories of the marker groups, determining the final score of each feature according to the scores calculated by the features in each marker group, and selecting a set number of features with higher scores from the final scores to form a feature set.
If a feature is discriminative, the variance between that feature and samples of the same class should be as small as possible, while the variance between the feature and samples of different classes should be as large as possible, which facilitates subsequent operations such as classification and prediction. However, since the correlation between label groups differs, the Fisher-Score method clearly gives more satisfactory results when features are selected under label groups with stronger correlation. The original Fisher-Score calculation formula can only consider single-label data, whereas text data mostly falls into the multi-label category; according to the second-order correlation between labels, the occurrence or non-occurrence of one label provides more information about the occurrence or non-occurrence of another, whether the correlation is positive, negative, or absent. Because the amount of data is too large, the correlation between labels is hard to characterize by a few fixed values; therefore the correlation between labels is analyzed according to the result of formula (29), i.e., the correlation differs as the value calculated by formula (29) differs, and accordingly the weight of the knowledge information when analyzing a label group also differs: the stronger the correlation of a label group, the greater the weight of the knowledge it provides and the greater its contribution to the feature score.
For a multi-label dataset MNDS = <X, C ∪ L>, where X = {x_1, x_2, …, x_n} denotes the sample corpus, C = {f_1, f_2, …, f_m} denotes the feature corpus, and L = {l_1, l_2, …, l_t} denotes the label corpus, the score of each feature in a label group is as follows:
where n_k denotes the number of samples in the k-th class, f_{j,i} denotes the value of the i-th feature in the j-th sample, μ_i denotes the mean of the i-th feature f_i over all samples, μ_i^k denotes the mean of the i-th feature f_i in the k-th class, and c denotes the total number of classes; c differs according to the degree of correlation: when l_a and l_b are positively or negatively correlated, c has the value 3, and when l_a and l_b are uncorrelated, c has the value 4. R_g(l_a, l_b) (formula (9)) denotes the correlation weight of labels l_a and l_b. As can be seen from the formula, the stronger the correlation of labels l_a and l_b, the higher the feature score of this calculation, i.e., the greater the importance of the score calculated for the more strongly correlated label group.
The feature scores calculated over the label groups are weighted-averaged and finally arranged in descending order to obtain the feature sequence of the preprocessed multi-label dataset, also called the feature set.
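The following sketch illustrates this step. The per-group score formula is not reproduced in this text; here each label group induces three or four classes over the samples as described above, the Fisher Score sketched earlier (fisher_score) is computed over those classes and weighted by the pair's correlation weight R_g, and the weighted scores are averaged and sorted in descending order. These aggregation details are assumptions made for illustration.

import numpy as np

def group_classes(Y, pair, relation):
    # Map each sample to a class id induced by the pair's hit pattern
    yi, yj = Y[:, pair[0]], Y[:, pair[1]]
    pattern = 2 * yi + yj                     # 0:{0,0} 1:{0,1} 2:{1,0} 3:{1,1}
    if relation == "positive":                # merge {1,0} and {0,1}
        pattern = np.where((pattern == 1) | (pattern == 2), 1, pattern)
    elif relation == "negative":              # merge {1,1} and {0,0}
        pattern = np.where((pattern == 3) | (pattern == 0), 0, pattern)
    return pattern

def ml_fisher_ranking(X, Y, groups, r_g, n_select):
    # groups: relation -> list of label pairs; r_g: pair -> correlation weight
    scores = np.zeros(X.shape[1])
    count = 0
    for relation, pairs in groups.items():
        for pair in pairs:
            y = group_classes(Y, pair, relation)
            scores += r_g.get(pair, 1.0) * fisher_score(X, y)   # fisher_score from the earlier sketch
            count += 1
    scores /= max(count, 1)                   # weighted average over label groups
    return np.argsort(-scores)[:n_select]     # descending feature sequence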
4. And determining the neighborhood granularity of each sample according to the classification interval of each sample in the text data set for the mark, so as to obtain a multi-mark neighborhood rough set.
In the original neighborhood calculation, only the distances between the target sample and its nearest homogeneous and nearest heterogeneous samples are considered, so the calculation is very sensitive to noise. When analyzing a text dataset, the original granularity calculation formula is easily disturbed by noisy data. By considering part of the samples, rather than all samples, through Euclidean distance, the interference of noisy data can be effectively avoided; at the same time, when the analyzed sample is itself a noise sample, the improved granularity calculation formula can remove it more accurately, avoiding the large deviation in the calculation result that occurs when the target sample is noise or is relatively close to noise samples.
For a given multi-label neighborhood decision system MDNS = <U, C ∪ D, δ>, where U = {x_1, x_2, …, x_n} denotes the sample set, B = {f_1, f_2, …, f_N} ⊆ C is the feature subset describing the data, N ≤ |C|, and L = {l_1, l_2, …, l_M} is the corresponding label set, the classification interval of target sample x for label l is:
where NM_l(x) is the set of heterogeneous-sample distances arranged in ascending order, NH_l(x) is the set of homogeneous-sample distances arranged in ascending order, |NH_l(x)| is the number of homogeneous samples, |NM_l(x)| is the number of heterogeneous samples, and NM_l(x_i) and NH_l(x_i) denote respectively the heterogeneous sample and the homogeneous sample near sample i under category label l. Δ(x, NM_l(x_i)) and Δ(x, NH_l(x_i)) denote the distances from sample point x to NM_l(x_i) and NH_l(x_i), respectively. If the computed result margin_l(x) is less than 0, the sample is very likely noise, and margin_l(x) is set to 0. The neighborhood radius of each sample under all labels is then defined as follows.
It can be seen that formula (34) gives the neighborhood granularity of each sample. With the neighborhood granularity of each sample calculated, the upper and lower approximations of the multi-label neighborhood rough set under the new neighborhood granularity are:
where δ_B(x_i) is the set of samples within the neighborhood radius of the i-th sample (computed by formula (34)) under feature subset B.
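A hedged sketch of this step follows. Formulas (33)-(36) are not reproduced in this text; the sketch simply averages Euclidean distances to the homogeneous and heterogeneous samples of each label (rather than only the nearer ones), truncates negative margins to zero as described, and averages the per-label margins to obtain a per-sample neighborhood radius. All of these choices are assumptions made for illustration.

import numpy as np

def classification_margin(X, y_l, idx):
    # margin_l(x): mean distance to heterogeneous samples minus mean distance to
    # homogeneous samples under one label column y_l (0/1); negative values -> 0
    x = X[idx]
    same = np.flatnonzero((y_l == y_l[idx]) & (np.arange(len(y_l)) != idx))
    diff = np.flatnonzero(y_l != y_l[idx])
    if len(same) == 0 or len(diff) == 0:
        return 0.0
    d_same = np.linalg.norm(X[same] - x, axis=1).mean()
    d_diff = np.linalg.norm(X[diff] - x, axis=1).mean()
    return max(d_diff - d_same, 0.0)

def neighborhood_granularity(X, Y, idx):
    # Assumed: average the per-label margins of sample idx over all labels
    margins = [classification_margin(X, Y[:, l], idx) for l in range(Y.shape[1])]
    return float(np.mean(margins))

def neighborhood(X, idx, radius):
    # delta_B(x_idx): samples whose distance to x_idx is within the radius
    dist = np.linalg.norm(X - X[idx], axis=1)
    return np.flatnonzero(dist <= radius)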
5. A multi-label neighborhood decision system is constructed from the neighborhood granularity and the feature set; under this system, the number of sets X_j (j = 1, 2, …, M) to which each sample's neighborhood belongs and the number of sets \bar{X}_j to which it belongs are determined, and from these the dependency of the multi-label neighborhood rough set is determined.
When the original multi-label neighborhood rough set is used for feature selection, its partitioning effect is found to be poor when calculating the dependency; in addition, the conventional multi-label neighborhood rough set feature selection algorithm only considers which labels the samples with certain characteristics are likely to have, while ignoring which labels the samples with certain characteristics are likely not to have. To address these issues, the invention improves the dependency function of the original neighborhood rough set.
The samples are partitioned as follows: U = {x_1, x_2, …, x_n} denotes the sample set, L = {l_1, l_2, …, l_M} the corresponding label set, and:
where X_j and \bar{X}_j are the partition of the sample set under the j-th label, representing respectively the set of samples that hit the j-th label and the set of samples that miss the j-th label.
Given a multi-label neighborhood decision system MDNS = <U, C ∪ D, δ>, where U = {x_1, x_2, …, x_n} denotes the sample set, B = {f_1, f_2, …, f_N} ⊆ C is the feature subset describing the data, N ≤ |C|, and L = {l_1, l_2, …, l_M} is the corresponding label set, for the two families of sets X = {X_1, X_2, …, X_M} and \bar{X} = {\bar{X}_1, \bar{X}_2, …, \bar{X}_M} partitioned in the above manner, the following definition holds:
For the two partitioned families X = {X_1, X_2, …, X_M} and \bar{X}, the dependency of the decision attribute D on the conditional attribute subset B ⊆ C can be expressed as:
where ρ_B(D) is the weight coefficient, |h(δ_B(x_i))| denotes the number of sets X_j (j = 1, 2, …, M) to which the neighborhood belongs under feature set B, |m(δ_B(x_i))| denotes the number of sets \bar{X}_j to which it belongs under feature set B, |U| is the number of samples in the training set, and |L| is the number of labels in the label set. From equation (41) it can be derived that 0 ≤ ρ_B(D) ≤ 1 and that it has the same monotonicity as the original dependency; the improved dependency formula γ_B(D) still satisfies the following properties:
1) When γ_B(D) = 1, D is strongly dependent on B;
2) When 0 < γ_B(D) < 1, D is dependent on B;
3) When γ_B(D) = 0, D is completely independent of B.
A text dataset is characterized by large data scale and high dimensionality. With the original dependency calculation formula, a newly added dimension or feature in the reduced subset usually contributes only a drop in the bucket to the determination of the overall granularity, i.e., the samples within a neighborhood change very little, so the number of samples in the upper and lower approximation sets changes very little as a feature is added. This causes the problem that the relative importance of a feature cannot be judged accurately; a feature in such a batch may contain critical information yet be ignored for this reason, which fundamentally lowers the quality of the feature subset. The invention adopts an improved dependency calculation; formula (42) can effectively enlarge the mapping range and effectively reduce the risk that important attributes are ignored.
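The sketch below only mirrors the described structure of the improved dependency: for each sample it counts how many of the label-wise sets X_j and \bar{X}_j contain the sample's neighborhood, normalizes by |U|·|L|, and scales by a weight ρ. The exact forms of formulas (37)-(42), including the definition of ρ_B(D), are not reproduced in this text and are therefore treated as assumptions (ρ is left as a parameter); the helper functions are the ones sketched in the previous step.

import numpy as np

def dependency(X_B, Y, granularity_fn, neighborhood_fn, rho=1.0):
    # X_B: samples restricted to feature subset B; Y: (n, M) 0/1 label matrix
    n, M = Y.shape
    consistent = 0
    for i in range(n):
        radius = granularity_fn(X_B, Y, i)
        nbr = neighborhood_fn(X_B, i, radius)
        for j in range(M):
            if np.all(Y[nbr, j] == 1):        # neighborhood contained in X_j
                consistent += 1
            elif np.all(Y[nbr, j] == 0):      # neighborhood contained in X_j-bar
                consistent += 1
    return rho * consistent / (n * M)         # normalized by |U| * |L|, scaled by rho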
In the multi-label neighborhood decision system, the definition of the dependency degree reflects the importance degree of the decision attribute to the condition attribute, so that the dependency degree of the result classification attribute to the condition attribute can be examined, key attributes which play a role in determining classification can be found, and the purposes of feature selection and finding of the minimum feature subset are achieved.
6. And calculating the importance degree of the conditional attribute relative to the decision attribute in the multi-label neighborhood decision system according to the dependency degree of the multi-label neighborhood rough set, and screening the conditional attribute according to the importance degree to realize the feature selection of the text data.
In a multi-label neighborhood decision system MNDS = <U, C ∪ D, δ>, for B ⊆ C and a ∈ B, if γ_B(D) ≠ γ_{B−{a}}(D), then a is said to be necessary in B with respect to the decision attribute D; otherwise it is unnecessary.
In the multi-label neighborhood decision system MNDS = <U, C ∪ D, δ>, for B ⊆ C, if:
(1) γ_B(D) = γ_C(D)
(2) for any a ∈ B, γ_{B−{a}}(D) ≠ γ_B(D),
then B is an attribute reduction of C; i.e., if the dependency calculated under the current feature subset B is equal to the dependency calculated under the feature corpus C, the procedure terminates, and the feature subset B at that point is the finally selected feature set. Here γ_B(D) denotes the dependency of the decision attribute D on the conditional attribute set B, based on the neighborhood rough set dependency calculation of formula (42). For any attribute subset B ⊆ C, the calculation formula of the importance of a conditional attribute a ∈ C − B relative to the decision attribute D is:
from the viewpoint of attribute dependence, the importance of the attribute may provide an effective feature selection method, and if sig (a, B, D) =0, it is explained that attribute a is a redundant attribute or an irrelevant attribute, i.e., attribute a is irrelevant to the current classification task or classification information contained in attribute a is already contained in other attributes. Therefore, each attribute in the feature set can be filtered according to the importance degree, and redundant attributes or irrelevant attributes can be removed.
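A minimal sketch of the importance-driven screening: attributes are added greedily by the largest sig(a, B, D) = γ_{B∪{a}}(D) − γ_B(D), stopping when the dependency of the current subset reaches that of the full feature set or when no attribute still increases it. dependency_fn stands for the dependency computation sketched above; the names and the tolerance are illustrative assumptions.

def forward_reduce(features, dependency_fn, eps=1e-6):
    # features: list of candidate attributes; dependency_fn: subset -> gamma value
    target = dependency_fn(list(features))      # gamma_C(D) over the full feature set
    selected = []
    current = dependency_fn(selected)
    while current + eps < target:
        best_gain, best_attr = 0.0, None
        for a in features:
            if a in selected:
                continue
            gain = dependency_fn(selected + [a]) - current   # sig(a, B, D)
            if gain > best_gain:
                best_gain, best_attr = gain, a
        if best_attr is None or best_gain <= eps:            # no informative attribute left
            break
        selected.append(best_attr)
        current += best_gain
    return selected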
Device embodiment
The apparatus proposed in this embodiment, as shown in fig. 5, includes a processor and a memory, where the memory stores a computer program that can be executed on the processor, and the processor implements the method of the foregoing method embodiment when executing the computer program.
That is, the method in the above method embodiments should be understood that the flow of the multi-label text data feature selection method may be implemented by computer program instructions. These computer program instructions may be provided to a processor such that execution of the instructions by the processor results in the implementation of the functions specified in the method flow described above.
The processor in this embodiment refers to a microprocessor MCU or a processing device such as a programmable logic device FPGA;
The memory referred to in this embodiment includes physical means for storing information, typically by digitizing the information and then storing the information in an electrical, magnetic, or optical medium. For example: various memories, RAM, ROM and the like for storing information by utilizing an electric energy mode; various memories for storing information by utilizing a magnetic energy mode, such as a hard disk, a floppy disk, a magnetic tape, a magnetic core memory, a bubble memory and a U disk; various memories, CDs or DVDs, which store information optically. Of course, there are other ways of storing, such as quantum storing, graphene storing, etc.
The device formed by the memory, the processor and the computer program is implemented by the processor executing the corresponding program instructions in the computer, and the processor can run various operating systems, such as Windows, Linux, Android, iOS and the like.
As other embodiments, the device may also include a display for presenting the diagnostic results for reference by the staff.
In order to comprehensively evaluate the invention, an experiment is carried out on the invention through a test data set, the effectiveness of the test data set is judged, and the test data set is compared with other existing algorithms on each index.
In this experiment, two multi-label text datasets were selected; the specific descriptions of the datasets are shown in Table 1. The datasets can be downloaded from http://mulan.sourceforge.net/data. In order to evaluate the effectiveness of the algorithm proposed by the invention, it is compared with four existing multi-label feature selection algorithms: MDFS, MDFS-O (manifold regularized discriminative feature selection), MSSL (multi-label feature selection via feature manifold learning and sparsity regularization), and GLOCAL (multi-label learning with global and local label correlation).
These experiments were run on the MATLAB 2016b platform under Windows 10, with a 3.00 GHz processor and 8.00 GB of memory. The experiments use the multi-label classification model ML-KNN for evaluation, where the smoothing parameter is set to 1 and the neighborhood parameter k is set to 10; to reduce error, the training set is split into 10 parts and ten-fold cross-validation with averaging is adopted.
TABLE 1
The first part selects, besides the number of features (N), five evaluation indices: Average Precision (AP), Coverage (CV), Hamming Loss (HL), Ranking Loss (RL), and Micro-average F1 (MicF1), to analyze and measure the experimental results.
Let the test set be given; according to the prediction function f_l(x), the ranking function can be defined as rank_f(x, l) ∈ {1, 2, …, l}.
N: the number of features selected after the dimension reduction.
Average Precision (AP): in the predictive marker ranking for examining all samples, the average of the probabilities that the markers belonging to the front of the sample marker still belong to the sample marker is defined as:
where R_i = {l | Y_il = +1} denotes the set of labels relevant to sample x_i, and \bar{R}_i = {l | Y_il = −1} denotes the set of labels irrelevant to sample x_i.
Coverage (CV): for measuring how many steps each sample needs to find to traverse all the markers associated with that sample, the definition is as follows:
hamming Loss (HL): the case used to measure misclassification of a sample on a single class label is defined as:
where the operator in the formula denotes the exclusive-or operation.
Ranking Loss (RL): the average of the probabilities of the ranking of the uncorrelated marks, used to examine all samples, being ranked in front of the correlated marks is defined as:
micro-averaging (MicF 1): the average value of the average of the corresponding elements of each confusion matrix is defined as:
where micp_{ij} and micr_{ij} denote the micro precision and the micro recall, respectively.
Among the above 5 evaluation indexes, the larger the values of the indexes AP and MicF1 are, the better the classification performance is; the smaller the values of the indexes CV, HL and RL are, the better the classification performance is, and the optimal value is 0.
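For reference, several of these indices have counterparts in scikit-learn; the sketch below assumes Y_true is an (n, t) 0/1 label matrix, Y_score holds predicted label scores, and Y_pred holds thresholded 0/1 predictions. Note that scikit-learn's conventions (e.g., coverage_error counting from 1) may differ slightly from the definitions given above, so values are comparable only up to such constants.

from sklearn.metrics import (hamming_loss, f1_score, coverage_error,
                             label_ranking_loss,
                             label_ranking_average_precision_score)

def evaluate(Y_true, Y_score, Y_pred):
    # Returns the five indices AP, CV, HL, RL and MicF1 for one prediction run
    return {
        "AP": label_ranking_average_precision_score(Y_true, Y_score),
        "CV": coverage_error(Y_true, Y_score),
        "HL": hamming_loss(Y_true, Y_pred),
        "RL": label_ranking_loss(Y_true, Y_score),
        "MicF1": f1_score(Y_true, Y_pred, average="micro"),
    }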
FIGS. 3-a, 3-b, 3-c, 3-d, and 3-e compare the evaluation indices of the invention with those of four other multi-label feature selection algorithms on the text dataset Business, and FIGS. 4-a, 4-b, 4-c, 4-d, and 4-e compare the average precision, coverage, Hamming loss, ranking loss, and micro-F1 of the invention with those of the four other multi-label feature selection algorithms on the text dataset Computer. In these figures, only the indices under the first 100 features are considered, and line graphs are drawn at intervals of 10.
FIGS. 3-a, 3-b, 3-c, 3-d, and 3-e show the comparison results for the indices Average Precision (AP), Coverage (CV), Hamming Loss (HL), Ranking Loss (RL), and Micro-F1 (MicF1), respectively, on the dataset Business. As can be seen from the figures, the invention outperforms the other four algorithms on the five indices, although when more features are selected its performance on the Hamming Loss (HL) index worsens. FIGS. 4-a, 4-b, 4-c, 4-d, and 4-e show the comparison results for the indices AP, CV, HL, RL, and Micro-F1, respectively, on the dataset Computer; as can be seen from the figures, when the number of features is less than 70 the invention performs almost the same as MDFS on the five indices and better than the other three algorithms, while when the number of features is greater than 70 the invention performs better than all four algorithms on all five indices.
For further analysis of experimental results, specific values of the present invention and the existing four algorithms under the indexes RL, HL, CV, AP and mic are shown in tables 2 to 6, respectively.
TABLE 2
TABLE 3 Table 3
TABLE 4 Table 4
TABLE 5
TABLE 6
The bold data in Tables 2-6 are the optimal values of each row; it is apparent from the tables that, compared with the other four multi-label feature selection algorithms, the invention achieves the optimal performance on each index.
The experimental results further show that, for text datasets, the invention can select a smaller feature subset with stronger classification ability, and has certain advantages over conventional multi-label feature selection algorithms.

Claims (8)

1. A method for selecting characteristics of multi-labeled text data, the method comprising the steps of:
1) Acquiring a text dataset containing a plurality of markers;
2) Dividing the marks into three mark groups of positive correlation, negative correlation and uncorrelation according to the second-order correlation between the marks in the text data set;
3) Calculating the score of the feature in each marking group according to the category of the marking group, determining the final score of each feature according to the score calculated by the feature in each marking group, and selecting a set number of features with higher scores from the final score to form a feature set;
4) Determining the neighborhood granularity of each sample according to the classification interval of each sample in the text data set for the mark to obtain a multi-mark neighborhood rough set;
5) Constructing a multi-label neighborhood decision system according to the neighborhood granularity and the feature set, and determining the attribute belonging to the set X under the multi-label neighborhood decision system j The sum of the aggregate numbers of j=1, 2, … M belongs to the aggregateAnd determining the dependence of the multi-label neighborhood rough set based thereon, wherein M is the number of decision attributes in the decision set, and X j And->Dividing the sample set under the jth mark, wherein the sample set hit the jth mark and the sample set miss the jth mark are respectively represented;
6) Calculating, according to the multi-label neighborhood rough set dependency formula, the importance of each conditional attribute relative to the decision attributes in the multi-label neighborhood decision system, and screening the conditional attributes according to the importance so as to realize the feature selection of the text data.
2. The method of claim 1, wherein the calculation formula for each feature score is:
wherein $C = \{f_1, f_2, \ldots, f_m\}$ represents the feature corpus, $L = \{l_1, l_2, \ldots, l_t\}$ represents the mark corpus, $n_k$ represents the number of samples in the k-th class, $f_{j,i}$ represents the value of the i-th feature in the j-th sample, $\mu_i$ represents the mean value of the i-th feature $f_i$ over the samples, $\mu_{i,k}$ represents the mean value of the i-th feature $f_i$ in the k-th class, $c$ represents the total number of classes, and $R_g(l_a, l_b)$ represents the correlation weight of mark $l_a$ and mark $l_b$.
3. The method for selecting a feature of multi-labeled text data according to claim 1, wherein the classification interval of the sample to the label is:
wherein $margin_l(x)$ is the classification interval of sample $x$ with respect to label $l$, $NM_l(x)$ is the sequence of heterogeneous-sample distances arranged in ascending order, $NH_l(x)$ is the sequence of homogeneous-sample distances arranged in ascending order, $|NH_l(x)|$ is the number of homogeneous samples, $|NM_l(x)|$ is the number of heterogeneous samples, $NM_l(x_i)$ and $NH_l(x_i)$ are respectively the i-th nearest heterogeneous sample and the i-th nearest homogeneous sample of sample $x$ under the category label $l$, and $\Delta(x, NM_l(x_i))$ and $\Delta(x, NH_l(x_i))$ respectively represent the distance from sample point $x$ to $NM_l(x_i)$ and to $NH_l(x_i)$.
4. A multi-labeled text class data feature selection method according to claim 3 wherein the neighborhood granularity is calculated as:
wherein $margin_{l_i}(x)$ is the classification interval of sample $x$ with respect to label $l_i$, $|L|$ is the number of marks, and $M_l(x)$ is the neighborhood granularity of sample $x$.
5. The method for selecting a feature of multi-labeled text data according to claim 1, wherein the multi-label neighborhood decision system is $MNDS = \langle U, C \cup D, \delta \rangle$, where $U = \{x_1, x_2, \ldots, x_n\}$ is the set of text data samples, $B = \{f_1, f_2, \ldots, f_N\}$ is a feature subset describing the text data, $C$ is the feature set describing the text data, $N \leq |C|$, $L = \{l_1, l_2, \ldots, l_M\}$ is the corresponding label set, $D = \{l_1, l_2, \ldots, l_m\}$ is the set of classification decision attributes, and $D \subseteq L$.
6. The method for selecting a feature of multi-labeled text data according to claim 1, wherein the improved calculation formula of the multi-labeled neighborhood rough set dependency is:
wherein $\rho_B(D)$ is the weight coefficient, $|H(\delta_B(x_i))|$ represents the number of sets $X_j$, j = 1, 2, …, M, to which the neighborhood belongs under feature subset B, $|M(\delta_B(x_i))|$ represents the number of sets $\overline{X_j}$ to which the neighborhood belongs under feature subset B, $|U|$ is the number of samples in the training set, $|L|$ is the number of marks in the mark set, $\underline{N_B}D$ is the lower approximation of the multi-label neighborhood rough set, $\delta_B(x_i)$ is the set of samples within the neighborhood radius of the i-th sample under feature subset B, $D_j$ represents the set of samples having category label $l_j$, $D_i$ represents the set of labels possessed by sample $x_i$, $U = \{x_1, x_2, \ldots, x_n\}$ represents the sample set, and $B = \{f_1, f_2, \ldots, f_N\}$ represents the describing feature subset.
7. The method for selecting a feature of multi-labeled text data according to claim 6, wherein the importance is calculated by the formula:
wherein $sig(a, B, D)$ is the importance of the conditional attribute $a \in C - B$ relative to the decision attribute D, computed as the dependency of the decision attribute D on the conditional attribute set $B \cup \{a\}$ minus the dependency of the decision attribute D on the conditional attribute set B.
8. A multi-labeled text data feature selection device comprising a processor and a memory, said processor executing a computer program stored in said memory to implement the multi-labeled text data feature selection method as claimed in any one of claims 1-7.
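The following sketches are editorial illustrations, not the patented implementation. This first one corresponds to the filter stage of claims 1-3: grouping labels by second-order correlation and scoring features per group. Because the score formula of claim 2 is not reproduced above, a Pearson-correlation grouping threshold tau and a generic Fisher-score-style criterion are used as stand-ins; group_labels, fisher_score and filter_stage are hypothetical names.

```python
# Sketch of label grouping by second-order correlation and per-group feature scoring.
# The correlation measure, threshold and score formula are assumptions, not the patent's.
import numpy as np

def group_labels(Y, tau=0.1):
    """Split label pairs into positively correlated, negatively correlated and
    uncorrelated groups using pairwise Pearson correlation (assumed measure)."""
    R = np.corrcoef(Y.T)                      # t x t label correlation matrix
    pos, neg, unc = [], [], []
    t = Y.shape[1]
    for a in range(t):
        for b in range(a + 1, t):
            if R[a, b] > tau:
                pos.append((a, b, R[a, b]))
            elif R[a, b] < -tau:
                neg.append((a, b, R[a, b]))
            else:
                unc.append((a, b, R[a, b]))
    return pos, neg, unc

def fisher_score(X, y):
    """Fisher-score-style criterion for one binary label (stand-in for the patent's score)."""
    mu = X.mean(axis=0)
    num = np.zeros(X.shape[1])
    den = np.zeros(X.shape[1])
    for k in np.unique(y):
        Xk = X[y == k]
        nk = Xk.shape[0]
        num += nk * (Xk.mean(axis=0) - mu) ** 2   # between-class scatter
        den += nk * Xk.var(axis=0)                # within-class scatter
    return num / (den + 1e-12)

def filter_stage(X, Y, n_select, tau=0.1):
    """Score every feature within each label group, weight by correlation strength,
    and keep the n_select features with the highest final scores."""
    scores = np.zeros(X.shape[1])
    for group in group_labels(Y, tau):
        for a, b, w in group:
            scores += abs(w) * (fisher_score(X, Y[:, a]) + fisher_score(X, Y[:, b]))
    return np.argsort(scores)[::-1][:n_select]
```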
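This second sketch corresponds to the classification interval and neighborhood granularity of claims 3 and 4. Since the exact formulas were lost in extraction, it assumes a common margin definition, the mean distance to the k nearest heterogeneous samples minus the mean distance to the k nearest homogeneous samples, averaged over labels and clipped at zero; the parameter k and the clipping are assumptions.

```python
# Sketch of a margin-based per-sample neighborhood radius; the details may differ
# from the patent's formula.
import numpy as np

def margins(X, Y, k=5):
    """Return an (n_samples, n_labels) matrix of classification intervals."""
    n, t = Y.shape
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    np.fill_diagonal(dist, np.inf)                                # exclude the sample itself
    M = np.zeros((n, t))
    for l in range(t):
        for i in range(n):
            same = Y[:, l] == Y[i, l]
            hits = np.sort(dist[i, same])
            hits = hits[np.isfinite(hits)][:k]       # NH_l(x): nearest homogeneous samples
            misses = np.sort(dist[i, ~same])
            misses = misses[np.isfinite(misses)][:k] # NM_l(x): nearest heterogeneous samples
            if hits.size == 0 or misses.size == 0:
                continue                             # degenerate label, leave margin at 0
            M[i, l] = misses.mean() - hits.mean()
    return M

def neighborhood_granularity(X, Y, k=5):
    """Per-sample neighborhood radius: the classification interval averaged over labels,
    clipped at zero so a negative margin does not give a negative radius."""
    return np.clip(margins(X, Y, k).mean(axis=1), 0.0, None)
```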
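This last sketch corresponds to the neighborhood rough set stage of claims 5-7. The dependency here simply counts, for every sample, the labels on which its neighborhood under feature subset B is consistent, and the greedy screening uses the importance sig(a, B, D) as the difference of two dependencies; the consistency test, the stopping threshold eps and the omission of the weight coefficient of claim 6 are simplifying assumptions.

```python
# Sketch of a simplified neighborhood rough set dependency and importance-driven
# forward selection; dependency and greedy_screen are hypothetical names.
import numpy as np

def dependency(X, Y, B, radius):
    """Simplified rho_B(D): fraction of (sample, label) pairs whose neighborhood under
    feature subset B lies entirely in X_j or entirely in its complement."""
    XB = X[:, B]
    n, t = Y.shape
    consistent = 0
    for i in range(n):
        d = np.linalg.norm(XB - XB[i], axis=1)
        nbr = d <= radius[i]                      # delta_B(x_i), always contains x_i
        for j in range(t):
            labels = Y[nbr, j]
            if labels.min() == labels.max():      # all hit or all miss label j
                consistent += 1
    return consistent / (n * t)

def greedy_screen(X, Y, candidates, radius, eps=1e-4):
    """Forward selection on the pre-filtered candidates using sig(a, B, D);
    stops when no attribute raises the dependency by more than eps."""
    B, rho_B = [], 0.0
    remaining = list(candidates)
    while remaining:
        gains = [(dependency(X, Y, B + [a], radius) - rho_B, a) for a in remaining]
        best_gain, best_a = max(gains)
        if best_gain <= eps:
            break
        B.append(best_a)
        remaining.remove(best_a)
        rho_B += best_gain
    return B
```

Under these assumptions, the whole pipeline would be: keep the top-scoring features from filter_stage, compute the per-sample radii with neighborhood_granularity, and pass both to greedy_screen to obtain the final feature subset.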
CN202010261235.1A 2020-04-03 2020-04-03 Multi-label text data feature selection method and device Active CN111553127B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010261235.1A CN111553127B (en) 2020-04-03 2020-04-03 Multi-label text data feature selection method and device

Publications (2)

Publication Number Publication Date
CN111553127A CN111553127A (en) 2020-08-18
CN111553127B true CN111553127B (en) 2023-11-24

Family

ID=72005587

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010261235.1A Active CN111553127B (en) 2020-04-03 2020-04-03 Multi-label text data feature selection method and device

Country Status (1)

Country Link
CN (1) CN111553127B (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112163133B (en) * 2020-09-25 2021-10-08 南通大学 Breast cancer data classification method based on multi-granularity evidence neighborhood rough set
CN112348168B (en) * 2020-10-27 2023-04-07 国网四川省电力公司经济技术研究院 Ultra-short-term load prediction method and system considering data loss and feature redundancy
CN112463894B (en) * 2020-11-26 2022-05-31 浙江工商大学 Multi-label feature selection method based on conditional mutual information and interactive information
CN112699924A (en) * 2020-12-22 2021-04-23 安徽卡思普智能科技有限公司 Method for identifying lateral stability of vehicle
CN113010884B (en) * 2021-02-23 2022-08-26 重庆邮电大学 Real-time feature filtering method in intrusion detection system
CN113378514B (en) * 2021-08-12 2021-11-05 华东交通大学 Multi-label data feature selection processing method and device
CN113780372A (en) * 2021-08-24 2021-12-10 岭南师范学院 Heterogeneous characteristic mixed extraction method
CN115392404B (en) * 2022-10-27 2023-03-24 清华大学 Outlier detection model training method, outlier detection method and outlier detection device
CN115718894B (en) * 2022-11-30 2023-11-17 江西农业大学 Online flow characteristic selection method for high-dimensional complex data
CN115757979B (en) * 2022-12-20 2023-12-01 海南达润丰企业管理合伙企业(有限合伙) User data recommendation method and system based on artificial intelligence
CN117454154A (en) * 2023-12-22 2024-01-26 江西农业大学 Robust feature selection method for bias marker data

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8086549B2 (en) * 2007-11-09 2011-12-27 Microsoft Corporation Multi-label active learning

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104715021A (en) * 2015-02-27 2015-06-17 南京邮电大学 Multi-label learning design method based on hashing method
KR20180083711A (en) * 2017-01-13 2018-07-23 재단법인대구경북과학기술원 Apparatus and method for selecting feature based on neighborhood granularity
CN109934278A (en) * 2019-03-06 2019-06-25 宁夏医科大学 A kind of high-dimensional feature selection method of information gain mixing neighborhood rough set
CN110766042A (en) * 2019-09-09 2020-02-07 河南师范大学 Multi-mark feature selection method and device based on maximum correlation minimum redundancy
CN110781295A (en) * 2019-09-09 2020-02-11 河南师范大学 Multi-label data feature selection method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
A label-specific feature selection method for multi-label data based on neighborhood rough sets; 孙林; 潘俊方; 张霄雨; 王伟; 徐久成; Computer Science (计算机科学), Issue 01; full text *
Feature selection algorithm for multi-label classification based on neighborhood rough sets; 段洁; 胡清华; 张灵均; 钱宇华; 李德玉; Journal of Computer Research and Development (计算机研究与发展), Issue 01; full text *

Also Published As

Publication number Publication date
CN111553127A (en) 2020-08-18

Similar Documents

Publication Publication Date Title
CN111553127B (en) Multi-label text data feature selection method and device
CN110443281B (en) Text classification self-adaptive oversampling method based on HDBSCAN (high-density binary-coded decimal) clustering
CN108108657B (en) Method for correcting locality sensitive Hash vehicle retrieval based on multitask deep learning
Azadi et al. Auxiliary image regularization for deep cnns with noisy labels
CN108446741B (en) Method, system and storage medium for evaluating importance of machine learning hyper-parameter
Tian et al. Learning subspace-based RBFNN using coevolutionary algorithm for complex classification tasks
Li et al. Adaptive metric learning for saliency detection
CN113408605A (en) Hyperspectral image semi-supervised classification method based on small sample learning
CN109815987B (en) Crowd classification method and system
CN108491430A (en) It is a kind of based on the unsupervised Hash search method clustered to characteristic direction
Puig et al. Application-independent feature selection for texture classification
CN104598774A (en) Feature gene selection method based on logistic and relevant information entropy
CN113807456A (en) Feature screening and association rule multi-label classification algorithm based on mutual information
Li et al. Environmental microbiology aided by content-based image analysis
CN110909792A (en) Clustering analysis method based on improved K-means algorithm and new clustering effectiveness index
CN113762151A (en) Fault data processing method and system and fault prediction method
CN117195027A (en) Cluster weighted clustering integration method based on member selection
Karlinsky et al. RepMet: Representative-based metric learning for classification and one-shot object detection
Cong et al. Exact and consistent interpretation of piecewise linear models hidden behind APIs: A closed form solution
CN116612307A (en) Solanaceae disease grade identification method based on transfer learning
CN116188445A (en) Product surface defect detection and positioning method and device and terminal equipment
CN111581467B (en) Partial mark learning method based on subspace representation and global disambiguation method
JP2022131443A (en) Inference program and inference method
CN112989994A (en) Fog visibility estimation method based on depth relative learning under discrete label
Altintakan et al. An improved BOW approach using fuzzy feature encoding and visual-word weighting

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant