CN109977227A

CN109977227A - Text feature extraction method, system, and device based on feature coding

Info

Publication number: CN109977227A
Application number: CN201910205999.6A
Authority: CN
Inventors: 张旭; 熊彦钧; 何赛克; 刘春阳; 郑晓龙; 陈志鹏; 曾大军; 彭鑫
Original assignee: Institute of Automation of Chinese Academy of Science; National Computer Network and Information Security Management Center
Current assignee: Institute of Automation of Chinese Academy of Science; National Computer Network and Information Security Management Center
Priority date: 2019-03-19
Filing date: 2019-03-19
Publication date: 2019-07-05
Anticipated expiration: 2039-03-19
Also published as: CN109977227B

Abstract

The invention belongs to the field of information classification, and in particular relates to a text feature extraction method, system, and device based on feature coding, intended to solve the problems of high computational complexity and low classification efficiency and precision in text feature extraction. The method comprises: preprocessing the acquired text to obtain a candidate word feature sequence; generating multiple binary codes based on the candidate word feature sequence; screening the binary codes with a genetic algorithm to obtain an optimal binary code; and decoding the optimal binary code to obtain and output the optimal word feature sequence. The invention converts a series of candidate features into an easily processed code sequence and uses the automatic screening capability of the genetic algorithm to perform a globally optimal selection over the features, so that a minimal set of effective features can be filtered out.

Description

Text feature extraction method, system, and device based on feature coding

Technical field

The invention belongs to the field of information classification, and in particular relates to a text feature extraction method, system, and device based on feature coding.

Background

With the rapid development and spread of Internet technology, how to make full and effective use of the ever-growing mass of data has become a top priority for major Internet companies and research institutions. Among these data, text is one of the largest categories by volume and accounts for a major share of what is actually used. Text classification is the process of automatically determining the category of a text from its content under a given classification scheme. Text classification now has an extremely wide range of application scenarios, for example: automatically classifying, by subject, the large number of news articles on a news website according to their content; classifying the product reviews that customers write after transactions on e-commerce websites; identifying and filtering spam among the many messages an e-mail account receives; and automatically reviewing the large number of submissions that media outlets receive every day, so as to flag junk advertising, pornography, violence, and other prohibited content.

Before the 1990s, the prevailing text classification approach was heuristic: with the help of domain experts, a large number of inference rules were defined for each category, and a document that satisfied those rules was judged to belong to that category. This approach has obvious drawbacks: classification quality depends heavily on the quality of the rules; a large number of experts are needed to write the rules; and the approach does not generalize, since each domain requires an entirely different classification system, which wastes substantial development and financial resources.

Machine learning techniques, now in wide use, address these problems well. Grounded in statistical theory, machine learning uses algorithms to give machines a human-like ability to "learn" automatically: rules are derived by statistical analysis of known training data and are then used to make predictions on unseen data. The basic process of applying machine learning to text classification is: labeling, in which a batch of documents is accurately classified by hand to serve as the training set (the material for machine learning); training, in which the computer mines rules for effective classification from these documents and produces a classifier; and classification, in which the resulting classifier is applied to the set of documents to be classified to obtain their categories.

Feature extraction is an important step in text classification with machine learning. Most current Chinese text classification systems use words as feature items, called feature words. These feature words serve as an intermediate representation of a document and are used to compute the similarity between documents and between a document and a user's target. If all words are used as feature items, the dimension of the feature vector becomes too high, which places great pressure on the computing performance of the classification system and reduces the timeliness of text classification. An effective feature dimensionality reduction method that lowers computational complexity and improves classification efficiency and precision is therefore urgently needed in this field.

Summary of the invention

To solve the above problem in the prior art, namely high computational complexity and low classification efficiency and precision in text feature extraction, the present invention provides a text feature extraction method based on feature coding, comprising:

Step S10: obtain the candidate word feature sequence of the input text;

Step S20: based on the candidate word feature sequence, generate M binary codes, where M is a positive integer;

Step S30: screen the M binary codes using a genetic algorithm to obtain an optimal binary code;

Step S40: decode the optimal binary code to obtain the corresponding optimal word feature sequence, and output it as the extracted text feature.

In some preferred embodiments, "obtaining the candidate word feature sequence of the input text" in step S10 comprises:

Step S11: split the input text into words using a text segmentation algorithm to form a text word set;

Step S12: compute a weight for each word in the text word set to obtain the weights corresponding to the text word set;

Step S13: select a preset number of words, in descending order of weight, as the candidate word feature sequence.

In some preferred embodiments, "generating M binary codes based on the candidate word feature sequence" in step S20 comprises:

Step S21: randomly arrange the words in the candidate word feature sequence to obtain M random feature sequences;

Step S22: from the M random feature sequences, generate M binary codes whose length equals that of the candidate word feature sequence.
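Steps S21 and S22 can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes that each bit of a code marks whether the word at the same position of the shuffled sequence is kept (1) or dropped (0), and the function name `generate_binary_codes` and its `seed` parameter are hypothetical.

```python
import random

def generate_binary_codes(candidates, m, seed=None):
    """Return m (ordering, code) pairs for a list of candidate words.

    Assumption: bit i of `code` says whether `ordering[i]` is kept (1) or dropped (0).
    """
    rng = random.Random(seed)
    population = []
    for _ in range(m):
        ordering = list(candidates)   # step S21: copy, then shuffle the words
        rng.shuffle(ordering)
        # Step S22: one random bit per word, so the code has the same length
        # as the candidate word feature sequence.
        code = [rng.randint(0, 1) for _ in ordering]
        population.append((ordering, code))
    return population

candidates = ["network", "data", "text", "feature"]
population = generate_binary_codes(candidates, m=3, seed=0)
```

Keeping the shuffled ordering next to each code is a design choice of this sketch; it lets a later decoding step map bits back to words.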

In some preferred embodiments, "screening the M binary codes using a genetic algorithm to obtain an optimal binary code" in step S30 comprises:

Step S31: take the M binary codes as a gene population of M individuals, and compute the fitness of each individual in the population;

Step S32: based on the fitness of each individual in the population, obtain the optimal binary code using the genetic algorithm.

In some preferred embodiments, "obtaining the optimal binary code using the genetic algorithm, based on the fitness of each individual in the population" in step S32 comprises:

Step S321: compute the probability that each individual in the population is inherited into the next-generation population:

where f(x_i) is the fitness function of the i-th individual in the population and f(x_j) is the fitness function of the j-th individual;

Step S322: from the probability that each individual is inherited into the next-generation population, compute the cumulative probability of each individual:

Step S323: generate a uniformly distributed pseudo-random number r in the interval [0,1]; if r < q_1, select individual 1; otherwise, select the individual k for which q_{k-1} < r ≤ q_k holds;

Step S324: repeat step S323 2M times in total to select M pairs of individuals; for the two individuals of each pair, trigger single-point crossover with crossover rate α to obtain one offspring binary code;

Step S325: with mutation rate β_m, trigger a mutation at one position of the offspring binary code, i.e. a binary 0-1 flip, to obtain the optimal binary code.
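Steps S321 and S322 refer to formulas that are not reproduced in this text. For roulette-wheel selection, the textbook definitions are sketched below as an assumption; the symbol p_i is introduced here for illustration, while q_i follows the notation of step S323.

```latex
p_i = \frac{f(x_i)}{\sum_{j=1}^{M} f(x_j)}, \qquad
q_i = \sum_{j=1}^{i} p_j, \qquad i = 1, \dots, M
```

Under these definitions q_M = 1, so every r in [0,1] falls into exactly one interval (q_{k-1}, q_k], and individuals with higher fitness own wider intervals.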

In some preferred embodiments, after "computing the fitness of each individual in the population" in step S31, a gene mutation rate can also be computed to improve the efficiency of the genetic algorithm:

where β_m is the gene mutation rate, which changes dynamically with the distribution of fitness in the population; β is the fitness of the individual; β_max is the maximum fitness in the population; β_avg is the average fitness of the population; and k₁ and k₂ are constants.
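The mutation-rate formula itself is not reproduced in this text. One common adaptive-mutation scheme from the genetic algorithm literature uses exactly these quantities and is sketched below as an assumption; the patent's actual formula may differ.

```latex
\beta_m =
\begin{cases}
k_1 \, \dfrac{\beta_{\max} - \beta}{\beta_{\max} - \beta_{avg}}, & \beta \ge \beta_{avg} \\[2ex]
k_2, & \beta < \beta_{avg}
\end{cases}
```

In this form, individuals close to the best fitness mutate rarely, while below-average individuals mutate at the constant rate k₂.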

Another aspect of the present invention provides a text feature extraction system based on feature coding, comprising an acquisition module, a preprocessing module, a feature coding module, a feature screening module, a decoding module, and an output module;

The acquisition module is configured to acquire and input text;

The preprocessing module is configured to preprocess the acquired text to obtain a candidate word feature sequence;

The feature coding module is configured to generate M binary codes based on the candidate word feature sequence, where M is a positive integer;

The feature screening module is configured to screen the M binary codes using a genetic algorithm to obtain an optimal binary code;

The decoding module is configured to decode the optimal binary code to obtain the corresponding optimal word feature sequence;

The output module is configured to output the optimal word feature sequence as the extracted text feature.

A third aspect of the present invention provides a storage device storing a plurality of programs, the programs being adapted to be loaded and executed by a processor to implement the above text feature extraction method based on feature coding.

A fourth aspect of the present invention provides a processing device comprising a processor and a storage device; the processor is adapted to execute programs; the storage device is adapted to store a plurality of programs; the programs are adapted to be loaded and executed by the processor to implement the above text feature extraction method based on feature coding.

Beneficial effects of the present invention:

(1) The text feature extraction method based on feature coding of the present invention, combined with a genetic algorithm, performs selection of text features. It can effectively overcome the limitations of traditional text feature selection, improving the accuracy of text features as far as possible within a controllable range while reducing feature dimensionality to the greatest extent, which effectively improves the efficiency of feature use.

(2) To address the high redundancy and low precision of the features produced by existing text feature extraction methods, the present invention proposes a feature screening method based on feature coding and a genetic algorithm. The method converts a series of candidate features into an easily processed code sequence, screens it automatically with the genetic algorithm, and performs a globally optimal selection over the features, so that a minimal set of effective features can be filtered out.

Brief description of the drawings

Other features, objects, and advantages of the application will become more apparent from the following detailed description of non-limiting embodiments, read with reference to the accompanying drawings:

Fig. 1 is a flow diagram of the text feature extraction method based on feature coding of the present invention;

Fig. 2 is a flow diagram of text preprocessing to obtain the candidate sequence in the text feature extraction method based on feature coding of the present invention;

Fig. 3 is a flow diagram of feature coding in the text feature extraction method based on feature coding of the present invention;

Fig. 4 is a flow diagram of the genetic algorithm in the text feature extraction method based on feature coding of the present invention;

Fig. 5 is an example of the binary code crossover process in one embodiment of the text feature extraction method based on feature coding of the present invention;

Fig. 6 is an example of binary code mutation in one embodiment of the text feature extraction method based on feature coding of the present invention.

Detailed description of the embodiments

The application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here only explain the invention and do not limit it. It should also be noted that, for ease of description, the drawings show only the parts relevant to the invention.

It should be noted that, where no conflict arises, the embodiments of the application and the features in those embodiments may be combined with one another. The application is described in detail below with reference to the drawings and in conjunction with the embodiments.

The present invention provides a text feature extraction method based on feature coding. Using a binary text feature coding method combined with a genetic algorithm, it performs selection of text features and can effectively overcome the limitations of traditional text feature selection, improving the accuracy of text features as far as possible within a controllable range while reducing feature dimensionality to the greatest extent, which effectively improves the efficiency of feature use.

A text feature extraction method based on feature coding of the present invention comprises:

Step S10: obtain the candidate word feature sequence of the input text;

To describe the text feature extraction method based on feature coding of the present invention more clearly, each step of the method embodiment is described in detail below with reference to Fig. 1.

The text feature extraction method based on feature coding of one embodiment of the present invention comprises steps S10 to S40, each described in detail as follows:

Step S10: obtain the candidate word feature sequence of the input text. Fig. 2 is a flow diagram of text preprocessing to obtain the candidate sequence in the text feature extraction method based on feature coding of the present invention: the text is first segmented, word weights are then computed, and the candidate feature sequence is finally generated, as follows:

Step S11: split the input text into words using a text segmentation algorithm to form a text word set.

Text segmentation is a basic step of text processing and a basic module of human-machine natural language interaction. Unlike English text, Chinese sentences have no word boundaries, so Chinese natural language processing usually has to segment the text first, and segmentation quality directly affects downstream modules such as part-of-speech tagging and syntax trees. Segmentation is of course only a tool, and requirements differ by scenario. In human-machine natural language interaction, a mature Chinese word segmentation algorithm achieves better natural language processing results and helps computers understand complex Chinese.

Text segmentation algorithms include: dictionary-based algorithms, such as forward maximum matching, reverse maximum matching, and bidirectional matching; statistics-based machine learning algorithms, such as the Hidden Markov Model (HMM) algorithm, the Conditional Random Field (CRF) algorithm, and deep learning algorithms; and neural-network-based segmentation methods, which are not described one by one here.
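Of the dictionary-based algorithms listed above, forward maximum matching is the simplest to illustrate. The sketch below is only an example of that technique under assumed inputs: the toy dictionary, the `max_len` window, and the single-character fallback for unknown text are choices made here, not details given in the patent.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy left-to-right segmentation: take the longest dictionary word at each position."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest window first and shrink it until a dictionary word is found.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted so the scan cannot stall.
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words

toy_dict = {"文本", "特征", "提取", "特征提取"}
segmented = forward_max_match("文本特征提取", toy_dict)
print(segmented)  # ['文本', '特征提取']
```

Because the longest match wins, "特征提取" is kept as one word even though "特征" and "提取" are also in the dictionary.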

Step S12: compute a weight for each word in the text word set to obtain the weights corresponding to the text word set.

Mature methods exist for word weight calculation; the embodiment of the present invention uses the common TF-IDF (Term Frequency-Inverse Document Frequency) method. TF-IDF is a statistical method for assessing how important a word is to one document in a document set or corpus. The importance of a word increases in proportion to the number of times it appears in the document, but decreases in inverse proportion to how frequently it appears in the corpus.
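The weighting of step S12 and the top-k selection of step S13 can be sketched as follows. The patent does not state which TF-IDF variant it uses; the smoothed IDF below, the tie-breaking rule, and the function names `tfidf_weights` and `top_k_candidates` are assumptions made for illustration.

```python
import math
from collections import Counter

def tfidf_weights(doc, corpus):
    """TF-IDF weight of each word in `doc` (a list of words) against `corpus` (a list of such lists)."""
    counts = Counter(doc)
    n_docs = len(corpus)
    weights = {}
    for word, count in counts.items():
        tf = count / len(doc)                        # term frequency in this document
        df = sum(1 for d in corpus if word in d)     # number of documents containing the word
        idf = math.log((1 + n_docs) / (1 + df)) + 1  # smoothed IDF (assumed variant)
        weights[word] = tf * idf
    return weights

def top_k_candidates(doc, corpus, k):
    """Step S13: keep the k highest-weighted words, ties broken alphabetically."""
    weights = tfidf_weights(doc, corpus)
    return sorted(weights, key=lambda w: (-weights[w], w))[:k]

corpus = [
    ["text", "feature", "text", "the"],
    ["the", "network", "data"],
    ["the", "data", "mining"],
]
print(top_k_candidates(corpus[0], corpus, k=2))  # ['text', 'feature']
```

In this toy corpus "the" appears in every document, so its IDF is the lowest and it is the first word to be dropped.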

Step S20: based on the candidate word feature sequence, generate M binary codes, where M is a positive integer. Fig. 3 is a flow diagram of feature coding in the text feature extraction method based on feature coding of the present invention: random feature orderings are generated first, and multiple groups of random binary codes are then generated from those orderings, as follows:

Step S21: randomly arrange the words in the candidate word feature sequence to obtain M random feature sequences.

Step S30: screen the M binary codes using a genetic algorithm to obtain an optimal binary code. Fig. 4 is a flow diagram of the genetic algorithm in the text feature extraction method based on feature coding of the present invention, as follows:

In the preferred embodiment of the invention, the optimal binary code is screened using roulette-wheel selection.

Step S321: compute the probability that each individual in the population is inherited into the next-generation population, as shown in formula (1):

Step S322: from the probability that each individual is inherited into the next-generation population, compute the cumulative probability of each individual, as shown in formula (2):

Step S324: repeat step S323 2M times in total to select M pairs of individuals; for the two individuals of each pair, trigger single-point crossover with crossover rate α to obtain one offspring binary code. Fig. 5 is an example of the binary code crossover process in one embodiment of the text feature extraction method based on feature coding of the present invention: a pair of binary codes is first copied, crossover is then performed on the copies, and one of the exchanged binary codes is randomly kept.

Step S325: with mutation rate β_m, trigger a mutation at one position of the offspring binary code, i.e. a binary 0-1 flip, to obtain the optimal binary code. Fig. 6 is an example of binary code mutation in one embodiment of the text feature extraction method based on feature coding of the present invention: the binary codes before and after mutation are opposite only at the mutation point and identical elsewhere.
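The selection, crossover, and mutation steps can be sketched together as one generation of a genetic algorithm. This is a minimal illustration under stated assumptions: fitness values are taken as given positive numbers (the text does not define the fitness function), codes are plain lists of 0/1, and all function names are hypothetical.

```python
import random
from itertools import accumulate

def roulette_select(fitness, rng):
    """Pick one index with probability proportional to fitness (roulette-wheel selection)."""
    total = sum(fitness)
    cumulative = list(accumulate(f / total for f in fitness))  # q_1 .. q_M
    r = rng.random()
    for k, q in enumerate(cumulative):
        if r <= q:
            return k
    return len(fitness) - 1  # guard against floating-point round-off in q_M

def single_point_crossover(a, b, alpha, rng):
    """With probability alpha, swap tails at a random cut point; keep one result at random."""
    if len(a) > 1 and rng.random() < alpha:
        point = rng.randint(1, len(a) - 1)
        a, b = a[:point] + b[point:], b[:point] + a[point:]
    return list(rng.choice([a, b]))

def mutate(code, beta_m, rng):
    """With probability beta_m, flip one randomly chosen bit (0 -> 1 or 1 -> 0)."""
    child = list(code)
    if rng.random() < beta_m:
        pos = rng.randrange(len(child))
        child[pos] = 1 - child[pos]
    return child

def next_generation(population, fitness, alpha, beta_m, rng):
    """Select M pairs (2M roulette draws) and produce one offspring per pair."""
    children = []
    for _ in range(len(population)):
        a = population[roulette_select(fitness, rng)]
        b = population[roulette_select(fitness, rng)]
        children.append(mutate(single_point_crossover(a, b, alpha, rng), beta_m, rng))
    return children
```

Running `next_generation` repeatedly and keeping the fittest code seen so far is one way to arrive at an "optimal" code; the stopping rule is not specified in this text and would have to be chosen separately.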

After "computing the fitness of each individual in the population" in step S31, a gene mutation rate can also be computed to improve the efficiency of the genetic algorithm, as shown in formula (3):
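Formula (3) is not reproduced in this text. As a hedged illustration only, the function below implements one common adaptive-mutation rule from the genetic algorithm literature that depends on the individual fitness, the maximum and average population fitness, and two constants; whether it matches the patent's formula is an assumption, and the name `adaptive_mutation_rate` is hypothetical.

```python
def adaptive_mutation_rate(beta, beta_max, beta_avg, k1, k2):
    """Assumed adaptive rule: fit individuals mutate less, below-average ones mutate at rate k2."""
    if beta < beta_avg or beta_max == beta_avg:
        # Below-average individual, or a population whose fitness values are all equal
        # (which would otherwise divide by zero): fall back to the constant rate k2.
        return k2
    return k1 * (beta_max - beta) / (beta_max - beta_avg)

# The best individual is protected from mutation; a weak one mutates at rate k2.
print(adaptive_mutation_rate(10.0, 10.0, 5.0, k1=0.1, k2=0.2))  # 0.0
print(adaptive_mutation_rate(2.0, 10.0, 5.0, k1=0.1, k2=0.2))   # 0.2
```

A rule of this shape keeps good codes stable while continuing to explore around poor ones, which fits the stated goal of improving the algorithm's efficiency.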

Step S40: decode the optimal binary code to obtain the corresponding optimal word feature sequence, and output it as the extracted text feature.
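Decoding in step S40 can be sketched as a simple mask, assuming (as an interpretation, not a statement from the patent) that bit i of the code indicates whether the i-th word of the associated word ordering is kept. The function name `decode` is hypothetical.

```python
def decode(code, ordering):
    """Return the words of `ordering` whose corresponding bit in `code` is 1."""
    if len(code) != len(ordering):
        raise ValueError("code and word ordering must have the same length")
    return [word for bit, word in zip(code, ordering) if bit == 1]

print(decode([1, 0, 1, 0], ["text", "the", "feature", "of"]))  # ['text', 'feature']
```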

The text feature extraction system based on feature coding of the second embodiment of the present invention comprises an acquisition module, a preprocessing module, a feature coding module, a feature screening module, a decoding module, and an output module;

The acquisition module is configured to acquire and input text;

Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working process of the system described above and the related explanations can be found in the corresponding processes of the foregoing method embodiment and are not repeated here.

It should be noted that the text feature extraction system based on feature coding provided by the above embodiment is illustrated only with the above division into functional modules. In practical applications, the above functions can be assigned to different functional modules as needed; that is, the modules or steps of the embodiments of the present invention can be further decomposed or combined. For example, the modules of the above embodiment can be merged into one module or further split into multiple sub-modules to perform all or part of the functions described above. The names of the modules and steps in the embodiments of the present invention serve only to distinguish the modules or steps and are not to be regarded as improper limitations of the present invention.

A storage device of the third embodiment of the present invention stores a plurality of programs, the programs being adapted to be loaded and executed by a processor to implement the above text feature extraction method based on feature coding.

A processing device of the fourth embodiment of the present invention comprises a processor and a storage device; the processor is adapted to execute programs; the storage device is adapted to store a plurality of programs; the programs are adapted to be loaded and executed by the processor to implement the above text feature extraction method based on feature coding.

Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the storage device and processing device described above and the related explanations can be found in the corresponding processes of the foregoing method embodiment and are not repeated here.

Those skilled in the art will recognize that the modules and method steps described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. The programs corresponding to software modules and method steps can be placed in random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, removable disk, CD-ROM, or any other form of storage medium known in the technical field. To illustrate the interchangeability of electronic hardware and software clearly, the composition and steps of each example have been described above in general terms of their functions. Whether these functions are performed in electronic hardware or in software depends on the specific application and design constraints of the technical solution. Those skilled in the art may implement the described functions in different ways for each specific application, but such implementations should not be considered beyond the scope of the present invention.

The term "comprising" and any similar term are intended to cover non-exclusive inclusion, so that a process, method, article, or device/apparatus comprising a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to the process, method, article, or device/apparatus.

The technical solution of the present invention has thus been described with reference to the preferred embodiments shown in the drawings. However, those skilled in the art will readily understand that the protection scope of the present invention is obviously not limited to these specific embodiments. Without departing from the principle of the present invention, those skilled in the art can make equivalent changes or replacements to the relevant technical features, and the technical solutions after such changes or replacements fall within the protection scope of the present invention.

Claims

1. A text feature extraction method based on feature coding, characterized by comprising:

Step S10: obtain the candidate word feature sequence of the input text;

Step S40: decode the optimal binary code to obtain the corresponding optimal word feature sequence, and output it as the extracted text feature.

2. The text feature extraction method based on feature coding according to claim 1, characterized in that "obtaining the candidate word feature sequence of the input text" in step S10 comprises:

3. The text feature extraction method based on feature coding according to claim 1, characterized in that "generating M binary codes based on the candidate word feature sequence" in step S20 comprises:

Step S22: from the M random feature sequences, generate M binary codes whose length equals that of the candidate word feature sequence.

4. The text feature extraction method based on feature coding according to claim 1, characterized in that "screening the M binary codes using a genetic algorithm to obtain an optimal binary code" in step S30 comprises:

Step S31: take the M binary codes as a gene population of M individuals, and compute the fitness of each individual in the population;

Step S32: based on the fitness of each individual in the population, obtain the optimal binary code using the genetic algorithm.

5. The text feature extraction method based on feature coding according to claim 4, characterized in that "obtaining the optimal binary code using roulette-wheel selection, based on the fitness of each individual in the population" in step S32 comprises:

where f(x_i) is the fitness function of the i-th individual in the population and f(x_j) is the fitness function of the j-th individual;

Step S322: from the probability that each individual is inherited into the next-generation population, compute the cumulative probability of each individual:

Step S323: generate a uniformly distributed pseudo-random number r in the interval [0,1]; if r < q_1, select individual 1; otherwise, select the individual k for which q_{k-1} < r ≤ q_k holds;

Step S324: repeat step S323 2M times in total to select M pairs of individuals; for the two individuals of each pair, trigger single-point crossover with crossover rate α to obtain one offspring binary code;

Step S325: with mutation rate β_m, trigger a mutation at one position of the offspring binary code, i.e. a binary 0-1 flip, to obtain the optimal binary code.

6. The text feature extraction method based on feature coding according to claim 4, characterized in that, after "computing the fitness of each individual in the population" in step S31, a gene mutation rate can also be computed to improve the efficiency of the genetic algorithm:

7. A text feature extraction system based on feature coding, characterized by comprising an acquisition module, a preprocessing module, a feature coding module, a feature screening module, a decoding module, and an output module;

The acquisition module is configured to acquire and input text;

The feature coding module is configured to generate M binary codes based on the candidate word feature sequence, where M is a positive integer;

8. A storage device storing a plurality of programs, characterized in that the programs are adapted to be loaded and executed by a processor to implement the text feature extraction method based on feature coding according to any one of claims 1-6.

9. A processing device, comprising

a processor adapted to execute programs; and

a storage device adapted to store a plurality of programs;

characterized in that the programs are adapted to be loaded and executed by the processor to implement:

the text feature extraction method based on feature coding according to any one of claims 1-6.