CN109977227B

CN109977227B - Text feature extraction method, system and device based on feature coding

Info

Publication number: CN109977227B
Application number: CN201910205999.6A
Authority: CN
Inventors: 张旭; 熊彦钧; 何赛克; 刘春阳; 郑晓龙; 陈志鹏; 曾大军; 彭鑫
Original assignee: Institute of Automation of Chinese Academy of Science; National Computer Network and Information Security Management Center
Current assignee: Institute of Automation of Chinese Academy of Science; National Computer Network and Information Security Management Center
Priority date: 2019-03-19
Filing date: 2019-03-19
Publication date: 2021-06-22
Anticipated expiration: 2039-03-19
Also published as: CN109977227A

Abstract

The invention belongs to the field of information classification, and particularly relates to a text feature extraction method, system and device based on feature coding, aiming at solving the problems of high operation complexity, low classification efficiency and low precision in text feature extraction. The method comprises the following steps: preprocessing the acquired text to acquire a word candidate characteristic sequence; generating a plurality of binary codes based on the word candidate feature sequence; screening binary codes by adopting a genetic algorithm to obtain optimal binary codes; and decoding the optimal binary code to obtain and output an optimal word characteristic sequence. The invention converts a series of candidate characteristics into an easily-processed coding sequence, and uses the automatic screening function of the genetic algorithm to carry out the maximized global optimal selection on the characteristics, thereby effectively screening the minimum effective characteristic set.

Description

Text feature extraction method, system and device based on feature coding

Technical Field

The invention belongs to the field of information classification, and particularly relates to a text feature extraction method, system and device based on feature coding.

Background

With the rapid development and popularization of internet technology, it is urgent to make full use of mass data that is growing day by day for each large internet company and related scientific research institutions. Among these data, the text-type data is the most voluminous. In the use of text data, classification occupies the half-wall Jiangshan, which refers to a process of automatically determining text categories according to text contents under a given classification system. The text classification at present has a very wide application scene, for example, a large number of report articles contained in a news website are automatically classified according to the subject based on the article content; classifying the evaluation made on the commodities after the user transacts in the e-commerce website; identifying junk mails from a plurality of mails through a text classification technology and filtering junk advertisement information frequently received by an electronic mailbox; a large number of postings received by the media every day are automatically checked by means of a text classification technology, so that illegal contents such as junk advertisements, yellow-related contents, violence and the like in the postings are marked.

Until the 90 s of the 20 th century, the dominant text classification methods have been heuristic: with the help of professionals, a large number of inference rules are defined for each category, and if a document can satisfy these inference rules, it can be judged to belong to the category. However, this method has significant disadvantages: the quality of classification depends largely on the quality of the rules; a large number of professionals are required to make rules; the method has no popularization, and different fields need to construct completely different classification systems, thereby causing huge waste of development resources and capital resources.

The current machine learning techniques are well suited to solve the above problems. Machine learning is based on a statistical theory, an algorithm is utilized to enable a machine to have human-like automatic learning capacity, namely, known training data are subjected to statistical analysis to obtain a rule, and then the rule is utilized to perform predictive analysis on unknown data. The basic process of applying the machine learning method to text classification is as follows: labeling, namely accurately classifying a batch of documents by manual work to serve as a training set (material for machine learning); training, wherein the computer digs some rules capable of being effectively classified from the documents to generate a classifier; and classifying, namely applying the generated classifier to the document set to be classified to obtain the classification result of the document.

Feature extraction is an important ring when using machine learning for text classification. Most of the current Chinese text classification systems use words as feature items, called feature words. The characteristic words are used as an intermediate representation form of the document and are used for realizing similarity calculation between the document and between the document and a user target. If all words are used as feature items, the dimension of the feature vector is too high, which causes great stress on the operation performance of the classification system, and results in reduced timeliness of text classification. Therefore, an effective feature dimension reduction method is sought to reduce the operation complexity and improve the classification efficiency and precision, which is urgently needed in the field at present.

Disclosure of Invention

In order to solve the above problems in the prior art, that is, the problems of high operation complexity, low classification efficiency and low precision in text feature extraction, the present invention provides a text feature extraction method based on feature coding, which includes:

step S10, acquiring a word candidate characteristic sequence of an input text;

step S20, generating M binary codes based on the word candidate characteristic sequence, wherein M is a positive integer;

s30, screening the M binary codes by adopting a genetic algorithm to obtain optimal binary codes;

and step S40, decoding the optimal binary code to obtain a corresponding optimal word feature sequence as the extracted text feature and outputting the extracted text feature.

In some preferred embodiments, the step of "obtaining a word candidate feature sequence of the input text" in step S10 includes the steps of:

step S11, dividing the input text into words by text word segmentation algorithm to form a text word set;

step S12, carrying out weight calculation on each word in the text word set to obtain the weight corresponding to the text word set;

and step S13, selecting words with preset number as word candidate characteristic sequences according to the sequence of the weights from big to small.

In some preferred embodiments, step S20, "generate M binary codes based on the word candidate feature sequence", includes the steps of:

step S21, randomly arranging words in the word candidate characteristic sequence to obtain M random characteristic sequences;

and step S22, generating M binary codes with the length same as that of the word candidate characteristic sequence from the M random characteristic sequences.

In some preferred embodiments, the step S30 "screening the M binary codes by using genetic algorithm to obtain the optimal binary code" includes the steps of:

step S31, taking the M binary codes as an M group gene population, and calculating the fitness of each individual in the M group gene population;

and step S32, obtaining the optimal binary code by adopting a genetic algorithm method based on the fitness of each individual in the M groups of gene populations.

In some preferred embodiments, the step S32 "obtaining the optimal binary code by using genetic algorithm based on the fitness of each individual in the M-group gene population" comprises the following steps:

step S321, calculating the probability that each individual in the M groups of gene populations is inherited into the next generation population:

wherein, f (x)_i) Is as followsFitness function of i individuals of the gene population, f (x)_j) Is the fitness function of the jth gene population;

step S322, calculating the cumulative probability of each individual according to the probability of each individual being inherited to the next generation group:

step S323, at [0, 1 ]]Generating a uniformly distributed pseudo-random number r in the interval if r < q_iThen individual 1 is selected, otherwise individual k is selected such that: q. q.s_k-1＜r≤q_kIf true;

step S324, repeatedly executing step S333 for 2M times, selecting M groups of individuals, and triggering single-point cross exchange on each group of two individuals in the M groups at a cross rate alpha to obtain a filial binary code;

step S325, calculating the variance ratio beta_mTriggering a certain bit in the child binary code, and performing binary 0-1 replacement to obtain the optimal binary code.

In some preferred embodiments, after "calculating the fitness of each individual in the M groups of gene populations" in step S31, the gene variation rate can be calculated to improve the efficiency of genetic algorithm:

wherein, beta_mProviding dynamically changed gene variation rate for different distribution of fitness in population, wherein beta is individual fitness and beta_maxIs the maximum fitness in the population, beta_avgIs the mean fitness of the population, k₁、k₂Is a constant.

On the other hand, the invention provides a text feature extraction system based on feature coding, which comprises an acquisition module, a preprocessing module, a feature coding module, a feature screening module, a decoding module and an output module;

the acquisition module is configured to acquire and input a text;

the preprocessing module is configured to preprocess the acquired text to obtain a word candidate characteristic sequence;

the feature coding module is configured to generate M binary codes based on the word candidate feature sequence, wherein M is a positive integer;

the characteristic screening module is configured to screen the M binary codes by adopting a genetic algorithm to obtain optimal binary codes;

the decoding module is configured to decode the optimal binary code to obtain a corresponding optimal word feature sequence;

and the output module is configured to take the optimal word feature sequence as the extracted text feature and output the extracted text feature.

In a third aspect of the present invention, a storage device is provided, in which a plurality of programs are stored, the programs being suitable for being loaded and executed by a processor to implement the above-mentioned feature-coding-based text feature extraction method.

In a fourth aspect of the present invention, a processing apparatus is provided, which includes a processor, a storage device; the processor is suitable for executing various programs; the storage device is suitable for storing a plurality of programs; the program is suitable to be loaded and executed by a processor to implement the above-mentioned feature encoding-based text feature extraction method.

The invention has the beneficial effects that:

(1) the text feature extraction method based on feature coding combines the genetic algorithm to realize the selection of the text features, can effectively overcome the limitations in the traditional text feature selection, improves the accuracy of the text features as much as possible in a controllable range, simultaneously realizes feature dimension reduction to the maximum extent and effectively improves the feature use efficiency.

(2) The invention provides a feature screening method based on feature coding and genetic algorithm aiming at the defects of high redundancy and low precision of the features obtained by the existing text feature extraction method.

Drawings

Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:

FIG. 1 is a schematic flow chart of a feature-coding-based text feature extraction method according to the present invention;

FIG. 2 is a schematic flow chart of a candidate sequence obtained by text preprocessing of the feature-coding-based text feature extraction method according to the present invention;

FIG. 3 is a schematic diagram of a feature encoding flow of the text feature extraction method based on feature encoding according to the present invention;

FIG. 4 is a schematic diagram of a genetic algorithm of the feature-coding-based text feature extraction method according to the present invention;

FIG. 5 is a diagram illustrating an exemplary cross-exchange process of binary encoding according to an embodiment of the feature-coding-based text feature extraction method of the present invention;

fig. 6 is a diagram illustrating a binary coding mutation according to an embodiment of the method for extracting text features based on feature coding.

Detailed Description

The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.

It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.

The invention provides a text feature extraction method based on feature coding, which is a text feature coding method based on binary system and combines a genetic algorithm to realize the selection of text features, can effectively overcome the limitations in the traditional text feature selection, improves the accuracy of the text features as much as possible in a controllable range, simultaneously realizes feature dimension reduction to the maximum extent and effectively improves the feature use efficiency.

The invention relates to a text feature extraction method based on feature coding, which comprises the following steps:

step S10, acquiring a word candidate characteristic sequence of an input text;

In order to more clearly describe the feature-coding-based text feature extraction method of the present invention, each step in the embodiment of the method of the present invention is described in detail below with reference to fig. 1.

The text feature extraction method based on feature coding of an embodiment of the invention comprises steps S10-S40, and the steps are described in detail as follows:

step S10, a word candidate feature sequence of the input text is obtained. As shown in fig. 2, which is a schematic flow diagram of obtaining a candidate sequence by text preprocessing of the feature coding-based text feature extraction method of the present invention, the text is first segmented into words, then word weight calculation is performed, and finally a candidate feature sequence is generated, specifically as follows:

and step S11, dividing the input text into words by adopting a text word segmentation algorithm to form a text word set.

Text word segmentation is a basic step of text processing and is also a basic module of man-machine natural language interaction. The difference between the chinese text and the english text is that there is no word boundary in the chinese sentence, so when the chinese natural language processing is performed, word segmentation is usually performed first, and the word segmentation effect directly affects the effect of the parts of speech, the syntax tree and other modules. The word segmentation is only a tool, and the scene is different, and the requirements are different. In man-machine natural language interaction, a mature Chinese word segmentation algorithm can achieve a better natural language processing effect and help a computer to understand complex Chinese languages.

The text word segmentation algorithm comprises the following steps: dictionary-based word segmentation algorithms such as a forward maximum matching method, a reverse maximum matching method, a bidirectional matching word segmentation method and the like; statistical-based machine learning algorithms such as Hidden Markov Model (HMM), Conditional Random Field (CRF), deep learning algorithms, and the like; there are also word segmentation methods based on neural networks, which are not described one by one here.

And step S12, performing weight calculation on each word in the text word set to obtain the weight corresponding to the text word set.

The word weight calculation has a mature method, and the embodiment of the invention adopts a commonly used TF-IDF (Term Frequency-Inverse Document Frequency) method for weight calculation. TF-IDF is a statistical method to assess how important a word is for one of a set of documents or a corpus. The importance of a word increases in proportion to the number of times it appears in a document, but at the same time decreases in inverse proportion to the frequency with which it appears in the corpus.

And step S20, generating M binary codes based on the word candidate characteristic sequence, wherein M is a positive integer. As shown in fig. 3, which is a schematic diagram of a feature encoding flow of the feature encoding-based text feature extraction method of the present invention, a random feature sequence is first generated, and then a plurality of groups of random binary codes are generated according to the sequence, specifically as follows:

and step S21, randomly arranging the words in the word candidate characteristic sequence to obtain M random characteristic sequences.

Step S30, screening the M binary codes by using a genetic algorithm to obtain an optimal binary code, as shown in fig. 4, which is a schematic flow chart of the genetic algorithm of the feature-coding-based text feature extraction method of the present invention, and specifically includes the following steps:

In the preferred embodiment of the invention, the optimal binary code is selected using roulette selection.

Step S321, calculating the probability that each individual in the M groups of gene populations is inherited into the next generation population, as shown in formula (1):

wherein, f (x)_i) As a fitness function of the individual of the ith gene population, f (x)_j) Is the fitness function of the jth gene population;

step S322, calculating the cumulative probability of each individual according to the probability that each individual is inherited to the next generation group, as shown in formula (2):

step S324, repeatedly executing step S333 for 2M times, selecting M groups of individuals, and triggering single-point cross exchange on two individuals in each group in the M groups at a cross rate α to obtain a progeny binary code. As shown in fig. 5, which is an exemplary diagram of a binary code cross-exchange process according to an embodiment of the feature-code-based text feature extraction method of the present invention, a set of binary codes is first copied, then the obtained binary codes are cross-exchanged, and one exchanged binary code is randomly stored.

Step S325, calculating the variance ratio beta_mTriggering a certain bit in the child binary code, and performing binary 0-1 replacement to obtain the optimal binary code. As shown in fig. 6, which is an exemplary diagram of binary coding mutation according to an embodiment of the feature-coding-based text feature extraction method of the present invention, the binary codes before and after mutation are only represented in opposite directions at the compiling point, and the rest are the same.

After "calculating the fitness of each individual in the group M of gene populations" in step S31, the gene variation rate may also be calculated, so as to improve the efficiency of the genetic algorithm, as shown in formula (3):

And step S40, decoding the optimal binary code to obtain a corresponding optimal word feature sequence, and outputting the optimal word feature sequence as the extracted text feature.

The text feature extraction system based on feature coding of the second embodiment of the invention comprises an acquisition module, a preprocessing module, a feature coding module, a feature screening module, a decoding module and an output module;

the acquisition module is configured to acquire and input a text;

It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process and related description of the system described above may refer to the corresponding process in the foregoing method embodiments, and will not be described herein again.

It should be noted that, the text feature extraction system based on feature coding provided in the foregoing embodiment is only illustrated by the division of the foregoing functional modules, and in practical applications, the above functions may be allocated to different functional modules according to needs, that is, the modules or steps in the embodiment of the present invention are further decomposed or combined, for example, the modules in the foregoing embodiment may be combined into one module, or may be further split into multiple sub-modules, so as to complete all or part of the above described functions. The names of the modules and steps involved in the embodiments of the present invention are only for distinguishing the modules or steps, and are not to be construed as unduly limiting the present invention.

A storage device according to a third embodiment of the present invention stores a plurality of programs, and the programs are suitable for being loaded and executed by a processor to realize the above-mentioned feature-coding-based text feature extraction method.

A processing apparatus according to a fourth embodiment of the present invention includes a processor, a storage device; a processor adapted to execute various programs; a storage device adapted to store a plurality of programs; the program is suitable to be loaded and executed by a processor to implement the above-mentioned feature encoding-based text feature extraction method.

It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes and related descriptions of the storage device and the processing device described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

Those of skill in the art would appreciate that the various illustrative modules, method steps, and modules described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that programs corresponding to the software modules, method steps may be located in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. To clearly illustrate this interchangeability of electronic hardware and software, various illustrative components and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as electronic hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.

The terms "comprises," "comprising," or any other similar term are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.

So far, the technical solutions of the present invention have been described in connection with the preferred embodiments shown in the drawings, but it is easily understood by those skilled in the art that the scope of the present invention is obviously not limited to these specific embodiments. Equivalent changes or substitutions of related technical features can be made by those skilled in the art without departing from the principle of the invention, and the technical scheme after the changes or substitutions can fall into the protection scope of the invention.

Claims

1. A text feature extraction method based on feature coding is characterized by comprising the following steps:

step S10, acquiring a word candidate characteristic sequence of an input text;

step S30, taking the M binary codes as an M group gene population, calculating the fitness of each individual in the M group gene population, and obtaining the optimal binary codes by adopting a genetic algorithm based on the fitness of each individual in the M group gene population;

step S40, decoding the optimal binary code to obtain a corresponding optimal word feature sequence as an extracted text feature and outputting the extracted text feature;

after the fitness of each individual in the M groups of gene populations is calculated, the gene variation rate can be calculated:

2. The method for extracting text features based on feature coding according to claim 1, wherein the step of "obtaining word candidate feature sequences of the input text" in step S10 comprises:

3. The method for extracting text features based on feature coding according to claim 1, wherein "generating M binary codes based on the word candidate feature sequence" in step S20 includes the steps of:

4. The method for extracting text features based on feature coding according to claim 1, wherein in step S30, "based on the fitness of each individual in the M groups of gene populations, an optimal binary code is obtained by using a genetic algorithm", and the steps are as follows:

step S31, calculating the probability that each individual in the M group gene population is inherited into the next generation population:

step S32, calculating the cumulative probability of each individual according to the probability of each individual being inherited to the next generation group:

step S33, at [0, 1 ]]Generating a uniformly distributed pseudo-random number r in the interval if r < q_iThen individual 1 is selected, otherwise individual k is selected such that: q. q.s_k-1＜r≤q_kIf true;

step S34, repeating step S33 for 2M times, selecting M groups of individuals, and triggering single-point cross exchange on each group of two individuals in the M groups at a cross rate alpha to obtain a child binary code;

step S35, calculating the variation rate beta_mTriggering a certain bit in the child binary code, and performing binary 0-1 replacement to obtain the optimal binary code.

5. A text feature extraction system based on feature coding is characterized by comprising an acquisition module, a preprocessing module, a feature coding module, a feature screening module, a decoding module and an output module;

the acquisition module is configured to acquire and input a text;

the characteristic screening module is configured to take the M binary codes as an M group gene population, calculate the fitness of each individual in the M group gene population, and obtain the optimal binary codes by adopting a genetic algorithm based on the fitness of each individual in the M group gene population;

the output module is configured to take the optimal word feature sequence as the extracted text feature and output the extracted text feature;

6. A storage device having stored therein a plurality of programs, wherein the programs are adapted to be loaded and executed by a processor to implement the feature encoding-based text feature extraction method of any one of claims 1-4.

7. A treatment apparatus comprises

A processor adapted to execute various programs; and

a storage device adapted to store a plurality of programs;

wherein the program is adapted to be loaded and executed by a processor to perform:

the feature coding-based text feature extraction method of any one of claims 1 to 4.