CN109977227A - Text feature extraction method, system, and device based on feature coding - Google Patents

Text feature extraction method, system, and device based on feature coding

Info

Publication number
CN109977227A
Authority
CN
China
Prior art keywords
feature
text
coding
sequence
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910205999.6A
Other languages
Chinese (zh)
Other versions
CN109977227B (en
Inventor
张旭
熊彦钧
何赛克
刘春阳
郑晓龙
陈志鹏
曾大军
彭鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation of Chinese Academy of Science
National Computer Network and Information Security Management Center
Original Assignee
Institute of Automation of Chinese Academy of Science
National Computer Network and Information Security Management Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science and National Computer Network and Information Security Management Center
Priority to CN201910205999.6A
Publication of CN109977227A
Application granted
Publication of CN109977227B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/004 Artificial life, i.e. computing arrangements simulating life
    • G06N3/006 Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/12 Computing arrangements based on biological models using genetic models
    • G06N3/126 Evolutionary algorithms, e.g. genetic algorithms or genetic programming

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computing Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Evolutionary Biology (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Genetics & Genomics (AREA)
  • Physiology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Medicines Containing Antibodies Or Antigens For Use As Internal Diagnostic Agents (AREA)

Abstract

The invention belongs to the field of information classification and specifically relates to a text feature extraction method, system, and device based on feature coding, intended to solve the problems of high computational complexity and low classification efficiency and accuracy in text feature extraction. The method comprises: preprocessing the acquired text to obtain a candidate word feature sequence; generating multiple binary codes based on the candidate word feature sequence; screening the binary codes with a genetic algorithm to obtain an optimal binary code; and decoding the optimal binary code to obtain and output the optimal word feature sequence. The invention converts a series of candidate features into an easily processed code sequence and uses the automatic screening capability of the genetic algorithm to perform global optimization over the features, so that a minimal effective feature set can be filtered out.

Description

Text feature extraction method, system, and device based on feature coding
Technical field
The invention belongs to the field of information classification, and specifically relates to a text feature extraction method, system, and device based on feature coding.
Background art
With the rapid development and spread of Internet technology, how to make full and effective use of ever-growing volumes of data has become an urgent task for major Internet companies and research institutions. Among these data, text is the largest category by volume and accounts for a major share of usage. Text classification is the process of automatically determining the category of a text from its content under a given classification scheme. Text classification now has extremely wide application, for example: automatically sorting the large numbers of news articles on a news website by subject, based on their content; classifying the product reviews that customers write after transactions on e-commerce websites; identifying and filtering spam among the junk advertisements frequently received by e-mail accounts; and automatically reviewing the large numbers of submissions that media outlets receive each day, so as to flag junk advertising and prohibited content such as pornography and violence.
Before the 1990s, the dominant approach to text classification was heuristic: with the help of domain experts, a large number of inference rules were defined for each category, and a document satisfying those rules was judged to belong to that category. This approach has obvious drawbacks: classification quality depends largely on the quality of the rules; many experts are needed to write the rules; and it does not generalize, since each domain requires an entirely different classification system, which wastes development resources and funding.
Machine learning techniques, now widely used, address these problems well. Built on statistical theory, machine learning uses algorithms to give machines a human-like ability to "learn" automatically: statistical analysis of known training data yields rules, which are then used to make predictions about unknown data. Applying machine learning to text classification involves three basic stages: labeling, in which a batch of documents is accurately classified by hand to serve as the training set (the material for machine learning); training, in which the computer mines rules from these documents that classify effectively and produces a classifier; and classification, in which the resulting classifier is applied to the set of documents to be classified to obtain their categories.
Feature extraction is an important part of text classification with machine learning. Most current Chinese text classification systems use words as feature items, called feature words. These feature words serve as an intermediate representation of a document and are used to compute similarity between documents, and between a document and a user target. If all words are used as feature items, the dimension of the feature vector becomes too high, which places great pressure on the computational performance of the classification system and reduces the timeliness of classification. An effective feature dimensionality-reduction method that lowers computational complexity and improves classification efficiency and accuracy is therefore urgently needed in this field.
Summary of the invention
To solve the above problem in the prior art, namely the high computational complexity and low classification efficiency and accuracy in text feature extraction, the present invention provides a text feature extraction method based on feature coding, comprising:
Step S10: obtain the candidate word feature sequence of the input text;
Step S20: based on the candidate word feature sequence, generate M binary codes, where M is a positive integer;
Step S30: screen the M binary codes with a genetic algorithm to obtain an optimal binary code;
Step S40: decode the optimal binary code to obtain the corresponding optimal word feature sequence, and output it as the extracted text features.
In some preferred embodiments, "obtaining the candidate word feature sequence of the input text" in step S10 comprises:
Step S11: split the input text into words using a text segmentation algorithm to form a text word set;
Step S12: compute a weight for each word in the text word set to obtain the weights corresponding to the text word set;
Step S13: in descending order of weight, select a preset number of words as the candidate word feature sequence.
In some preferred embodiments, "generating M binary codes based on the candidate word feature sequence" in step S20 comprises:
Step S21: randomly arrange the words in the candidate word feature sequence to obtain M random feature orderings;
Step S22: from the M random feature orderings, generate M binary codes whose length equals that of the candidate word feature sequence.
In some preferred embodiments, "screening the M binary codes with a genetic algorithm to obtain an optimal binary code" in step S30 comprises:
Step S31: treat the M binary codes as a gene population of M individuals, and compute the fitness of each individual in the population;
Step S32: based on the fitness of each individual in the population, obtain the optimal binary code using the genetic algorithm.
In some preferred embodiments, "obtaining the optimal binary code using the genetic algorithm, based on the fitness of each individual in the population" in step S32 comprises:
Step S321: compute the probability that each individual in the population is inherited into the next-generation population:
$$P(x_i) = \frac{f(x_i)}{\sum_{j=1}^{M} f(x_j)}$$
where $f(x_i)$ is the fitness function of the i-th individual in the population and $f(x_j)$ is the fitness function of the j-th individual;
Step S322: from the probability that each individual is inherited into the next-generation population, compute the cumulative probability of each individual:
$$q_i = \sum_{j=1}^{i} P(x_j)$$
Step S323: generate a uniformly distributed pseudo-random number r in the interval [0, 1]; if $r < q_1$, select individual 1; otherwise select individual k such that $q_{k-1} < r \le q_k$ holds;
Step S324: repeat step S323 a total of 2M times to select M pairs of individuals; for the two individuals in each of the M pairs, trigger a single-point crossover with crossover rate α to obtain one offspring binary code;
Step S325: with mutation rate $\beta_m$, trigger a mutation at one bit of the offspring binary code, flipping it between 0 and 1, to obtain the optimal binary code.
In some preferred embodiments, after "computing the fitness of each individual in the population" in step S31, a genetic mutation rate can also be computed to improve the efficiency of the genetic algorithm:
where $\beta_m$ is the genetic mutation rate, which varies dynamically with the fitness distribution in the population; $\beta$ is the fitness of the individual; $\beta_{max}$ is the maximum fitness in the population; $\beta_{avg}$ is the average fitness of the population; and $k_1$, $k_2$ are constants.
Another aspect of the invention provides a text feature extraction system based on feature coding, comprising an acquisition module, a preprocessing module, a feature coding module, a feature screening module, a decoding module, and an output module;
The acquisition module is configured to acquire and input text;
The preprocessing module is configured to preprocess the acquired text to obtain the candidate word feature sequence;
The feature coding module is configured to generate M binary codes based on the candidate word feature sequence, where M is a positive integer;
The feature screening module is configured to screen the M binary codes with a genetic algorithm to obtain an optimal binary code;
The decoding module is configured to decode the optimal binary code to obtain the corresponding optimal word feature sequence;
The output module is configured to output the optimal word feature sequence as the extracted text features.
A third aspect of the invention provides a storage device storing a plurality of programs, the programs being adapted to be loaded and executed by a processor to implement the above text feature extraction method based on feature coding.
A fourth aspect of the invention provides a processing device comprising a processor and a storage device; the processor is adapted to execute programs; the storage device is adapted to store a plurality of programs; and the programs are adapted to be loaded and executed by the processor to implement the above text feature extraction method based on feature coding.
Beneficial effects of the invention:
(1) The text feature extraction method based on feature coding combines feature coding with a genetic algorithm to select text features. It can effectively overcome the limitations faced in traditional text feature screening, improving the accuracy of text features as far as possible within a controllable range while achieving feature dimensionality reduction to the greatest extent, and thereby improves the efficiency with which features are used.
(2) To address the high redundancy and low precision of features obtained by existing text feature extraction methods, the invention proposes a feature screening method based on feature coding and a genetic algorithm. The method converts a series of candidate features into an easily processed code sequence and screens it automatically with the genetic algorithm, performing global optimization over the features so that a minimal effective feature set can be filtered out.
Brief description of the drawings
Other features, objects, and advantages of the application will become more apparent from the following detailed description of non-limiting embodiments, read with reference to the accompanying drawings:
Fig. 1 is a flow diagram of the text feature extraction method based on feature coding of the invention;
Fig. 2 is a flow diagram of the text preprocessing that obtains the candidate sequence in the text feature extraction method based on feature coding of the invention;
Fig. 3 is a flow diagram of feature coding in the text feature extraction method based on feature coding of the invention;
Fig. 4 is a flow diagram of the genetic algorithm in the text feature extraction method based on feature coding of the invention;
Fig. 5 is an example of the binary-code crossover process in one embodiment of the text feature extraction method based on feature coding of the invention;
Fig. 6 is an example of binary-code mutation in one embodiment of the text feature extraction method based on feature coding of the invention.
Detailed description of the embodiments
The application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the invention and do not limit it. It should also be noted that, for ease of description, the drawings show only the parts relevant to the invention.
It should be noted that, where no conflict arises, the embodiments of the application and the features in those embodiments may be combined with one another. The application is described in detail below with reference to the drawings and in conjunction with the embodiments.
The invention provides a text feature extraction method based on feature coding. It builds on a binary text feature coding method and combines it with a genetic algorithm to select text features. It can effectively overcome the limitations faced in traditional text feature screening, improving the accuracy of text features as far as possible within a controllable range while achieving feature dimensionality reduction to the greatest extent, and thereby improves the efficiency with which features are used.
A text feature extraction method based on feature coding according to the invention comprises:
Step S10: obtain the candidate word feature sequence of the input text;
Step S20: based on the candidate word feature sequence, generate M binary codes, where M is a positive integer;
Step S30: screen the M binary codes with a genetic algorithm to obtain an optimal binary code;
Step S40: decode the optimal binary code to obtain the corresponding optimal word feature sequence, and output it as the extracted text features.
To describe the text feature extraction method based on feature coding more clearly, each step of the method embodiment is detailed below with reference to Fig. 1.
The text feature extraction method based on feature coding of one embodiment of the invention comprises steps S10 to S40, each described in detail as follows:
Step S10: obtain the candidate word feature sequence of the input text. Fig. 2 shows the flow of text preprocessing that obtains the candidate sequence: the text is first segmented into words, word weights are then computed, and the candidate feature sequence is finally generated, as follows:
Step S11: split the input text into words using a text segmentation algorithm to form a text word set.
Word segmentation is a basic step in text processing and a basic module of human-machine natural language interaction. Unlike English text, Chinese sentences have no word boundaries, so Chinese natural language processing usually has to segment the text first, and segmentation quality directly affects downstream modules such as part-of-speech tagging and syntax-tree parsing. Segmentation is of course only a tool, and requirements differ from one scenario to another. In human-machine natural language interaction, a mature Chinese word segmentation algorithm yields better natural language processing results and helps the computer understand complex Chinese.
Text segmentation algorithms include: dictionary-based algorithms, such as forward maximum matching, reverse maximum matching, and bidirectional matching; statistical machine learning algorithms, such as the hidden Markov model (HMM), conditional random field (CRF), and deep learning algorithms; and neural-network-based segmentation methods, which are not described individually here.
Step S12: compute a weight for each word in the text word set to obtain the weights corresponding to the text word set.
Mature methods exist for computing word weights. The embodiment of the invention uses the common TF-IDF (Term Frequency-Inverse Document Frequency) method. TF-IDF is a statistical method for assessing how important a word is to one document in a document set or corpus. A word's importance increases in proportion to the number of times it appears in the document, but decreases in inverse proportion to how frequently it appears across the corpus.
Step S13: in descending order of weight, select a preset number of words as the candidate word feature sequence.
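Steps S12 and S13 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the patent: the function name `candidate_features` is hypothetical, the documents are assumed to be already segmented into word lists (step S11), and the weight uses the plain textbook form tf × log(N / df), which is only one of several TF-IDF variants.

```python
import math
from collections import Counter


def candidate_features(docs, doc_index, top_k):
    """Return the top_k words of docs[doc_index], ranked by TF-IDF weight.

    docs is a list of documents, each already segmented into a list of words
    (assumption: segmentation from step S11 has been done elsewhere).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each word.
    df = Counter(word for doc in docs for word in set(doc))
    # Term frequency inside the target document.
    tf = Counter(docs[doc_index])
    total = len(docs[doc_index])
    weights = {
        word: (count / total) * math.log(n_docs / df[word])
        for word, count in tf.items()
    }
    # Step S13: sort by weight, largest first; ties are broken alphabetically
    # so that the result is deterministic.
    ranked = sorted(weights, key=lambda w: (-weights[w], w))
    return ranked[:top_k]


docs = [
    ["stock", "market", "falls", "market"],
    ["team", "wins", "match"],
    ["market", "team", "news"],
]
print(candidate_features(docs, 0, 2))
```

In this toy corpus "market" is frequent in the first document but also appears in another document, so its weight is lower than that of the two words that occur only in the first document.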
Step S20: based on the candidate word feature sequence, generate M binary codes, where M is a positive integer. Fig. 3 shows the feature coding flow: random feature orderings are generated first, and multiple groups of random binary codes are then generated from those orderings, as follows:
Step S21: randomly arrange the words in the candidate word feature sequence to obtain M random feature orderings.
Step S22: from the M random feature orderings, generate M binary codes whose length equals that of the candidate word feature sequence.
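A minimal Python sketch of steps S21 and S22 follows. The text does not spell out how a bit maps to a word, so this sketch assumes that bit i of a code states whether the word at position i of the matching ordering is kept (1) or dropped (0). The name `generate_codes` and the `seed` parameter are hypothetical.

```python
import random


def generate_codes(candidates, m, seed=None):
    """Return m (ordering, bits) pairs for a list of candidate words."""
    rng = random.Random(seed)
    population = []
    for _ in range(m):
        # Step S21: a random arrangement of the candidate words.
        ordering = list(candidates)
        rng.shuffle(ordering)
        # Step S22: a random binary code with one bit per candidate word.
        bits = [rng.randint(0, 1) for _ in ordering]
        population.append((ordering, bits))
    return population


population = generate_codes(["falls", "stock", "market", "news"], m=5, seed=0)
for ordering, bits in population:
    print(ordering, bits)
```

Keeping each ordering next to its code is a design choice of this sketch: it lets the code be decoded back into words later without any extra bookkeeping.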
Step S30: screen the M binary codes with a genetic algorithm to obtain an optimal binary code. Fig. 4 shows the genetic algorithm flow of the text feature extraction method based on feature coding, as follows:
Step S31: treat the M binary codes as a gene population of M individuals, and compute the fitness of each individual in the population;
Step S32: based on the fitness of each individual in the population, obtain the optimal binary code using the genetic algorithm.
In the preferred embodiment of the invention, the optimal binary code is screened using roulette-wheel selection.
Step S321: compute the probability that each individual in the population is inherited into the next-generation population, as shown in formula (1):
$$P(x_i) = \frac{f(x_i)}{\sum_{j=1}^{M} f(x_j)} \tag{1}$$
where $f(x_i)$ is the fitness function of the i-th individual in the population and $f(x_j)$ is the fitness function of the j-th individual;
Step S322: from the probability that each individual is inherited into the next-generation population, compute the cumulative probability of each individual, as shown in formula (2):
$$q_i = \sum_{j=1}^{i} P(x_j) \tag{2}$$
Step S323: generate a uniformly distributed pseudo-random number r in the interval [0, 1]; if $r < q_1$, select individual 1; otherwise select individual k such that $q_{k-1} < r \le q_k$ holds;
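Steps S321 to S323 describe standard roulette-wheel selection, which can be sketched in Python as follows. The function names are hypothetical, fitness values are assumed to be positive, and individuals are indexed from 0 here rather than from 1 as in the text.

```python
import random
from bisect import bisect_left
from itertools import accumulate


def selection_probabilities(fitness):
    """Step S321: probability of each individual, f(x_i) / sum_j f(x_j)."""
    total = sum(fitness)
    return [f / total for f in fitness]


def cumulative_probabilities(probs):
    """Step S322: running sums q_i = P(x_1) + ... + P(x_i)."""
    return list(accumulate(probs))


def roulette_select(cumulative, r):
    """Step S323: 0-based index k of the first individual with r <= q_k."""
    k = bisect_left(cumulative, r)
    # Guard against floating-point rounding leaving the last q_k just below 1.
    return min(k, len(cumulative) - 1)


fitness = [1.0, 3.0, 6.0]
q = cumulative_probabilities(selection_probabilities(fitness))
picks = [roulette_select(q, random.random()) for _ in range(10)]
print(q, picks)
```

Because `bisect_left` returns the first position whose cumulative probability is at least r, the chosen index k satisfies q_{k-1} < r <= q_k, which is the condition stated in step S323. Individuals with higher fitness own a wider slice of [0, 1] and are therefore picked more often.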
Step S324: repeat step S323 a total of 2M times to select M pairs of individuals; for the two individuals in each of the M pairs, trigger a single-point crossover with crossover rate α to obtain one offspring binary code. Fig. 5 shows an example of the binary-code crossover process in one embodiment: a pair of binary codes is first copied, the crossover is then performed on the copies, and one of the codes resulting from the crossover is randomly retained.
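The crossover in step S324 can be sketched in Python as follows. The function name is hypothetical, and two details are assumptions of this sketch rather than statements from the text: the crossover point is drawn uniformly from the interior positions, and when no crossover is triggered a copy of the first parent is returned.

```python
import random


def single_point_crossover(parent_a, parent_b, alpha, rng=random):
    """Return one offspring code from two equal-length parent codes."""
    if len(parent_a) != len(parent_b):
        raise ValueError("parents must have the same length")
    # With probability 1 - alpha (or for codes too short to cut), no crossover
    # happens; returning a copy of parent_a is an assumption of this sketch.
    if len(parent_a) < 2 or rng.random() >= alpha:
        return list(parent_a)
    # Cut both copies at the same interior point and swap the tails.
    point = rng.randint(1, len(parent_a) - 1)
    child_1 = list(parent_a[:point]) + list(parent_b[point:])
    child_2 = list(parent_b[:point]) + list(parent_a[point:])
    # Randomly retain one of the two exchanged codes.
    return rng.choice([child_1, child_2])


child = single_point_crossover([0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1], alpha=1.0)
print(child)
```

With an all-zeros parent and an all-ones parent, the offspring shows the cut clearly: it is a run of one value followed by a run of the other.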
Step S325: with mutation rate $\beta_m$, trigger a mutation at one bit of the offspring binary code, flipping it between 0 and 1, to obtain the optimal binary code. Fig. 6 shows an example of binary-code mutation in one embodiment: the codes before and after mutation have opposite values only at the mutation point and are identical everywhere else.
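The mutation in step S325 can be sketched in Python as follows. The function name is hypothetical, and choosing the mutated position uniformly at random is an assumption of this sketch.

```python
import random


def mutate(code, beta_m, rng=random):
    """Return a copy of code in which, with probability beta_m, one bit is flipped."""
    child = list(code)
    if child and rng.random() < beta_m:
        position = rng.randrange(len(child))
        child[position] ^= 1  # 0 becomes 1, 1 becomes 0
    return child


before = [1, 0, 1, 1, 0]
after = mutate(before, beta_m=1.0)
print(before, after)
```

The input list is left unchanged, so the code before and after mutation can be compared bit by bit; they differ in at most one position.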
After "computing the fitness of each individual in the population" in step S31, a genetic mutation rate can also be computed to improve the efficiency of the genetic algorithm, as shown in formula (3):
where $\beta_m$ is the genetic mutation rate, which varies dynamically with the fitness distribution in the population; $\beta$ is the fitness of the individual; $\beta_{max}$ is the maximum fitness in the population; $\beta_{avg}$ is the average fitness of the population; and $k_1$, $k_2$ are constants.
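The expression for formula (3) does not appear in this text. As an illustration only, the Python sketch below uses one widely used adaptive form that fits the symbols defined above: individuals at or above the average fitness get a rate that shrinks as they approach the best fitness, and individuals below the average get a constant rate. This specific form, the function name, and the handling of the degenerate case are assumptions, not the patent's confirmed formula.

```python
def adaptive_mutation_rate(beta, beta_max, beta_avg, k1, k2):
    """Assumed adaptive rate:
    k1 * (beta_max - beta) / (beta_max - beta_avg) if beta >= beta_avg, else k2.
    """
    if beta < beta_avg:
        # Below-average individuals mutate at the constant rate k2.
        return k2
    if beta_max == beta_avg:
        # All individuals have the same fitness, so the ratio is undefined;
        # falling back to k2 is an arbitrary choice made for this sketch.
        return k2
    return k1 * (beta_max - beta) / (beta_max - beta_avg)


for beta in (2.0, 5.0, 7.5, 10.0):
    print(beta, adaptive_mutation_rate(beta, beta_max=10.0, beta_avg=5.0, k1=0.5, k2=0.1))
```

Under this assumed form the best individual is never mutated, which protects good codes, while weaker individuals keep mutating so that the population stays diverse.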
Step S40: decode the optimal binary code to obtain the corresponding optimal word feature sequence, and output it as the extracted text features.
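Decoding in step S40 can be sketched in Python as follows, assuming (as an interpretation, not a statement from the text) that a bit value of 1 means the word at the same position of the ordering is kept. The function name `decode` and the sample words are hypothetical.

```python
def decode(ordering, bits):
    """Return the words of ordering whose corresponding bit is 1."""
    if len(ordering) != len(bits):
        raise ValueError("ordering and bits must have the same length")
    return [word for word, bit in zip(ordering, bits) if bit == 1]


ordering = ["market", "falls", "news", "stock"]
best_code = [0, 1, 0, 1]
print(decode(ordering, best_code))
```

The number of 1 bits in the optimal code is the size of the reduced feature set, which is how the binary code expresses dimensionality reduction.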
The text feature extraction system based on feature coding of a second embodiment of the invention comprises an acquisition module, a preprocessing module, a feature coding module, a feature screening module, a decoding module, and an output module;
The acquisition module is configured to acquire and input text;
The preprocessing module is configured to preprocess the acquired text to obtain the candidate word feature sequence;
The feature coding module is configured to generate M binary codes based on the candidate word feature sequence, where M is a positive integer;
The feature screening module is configured to screen the M binary codes with a genetic algorithm to obtain an optimal binary code;
The decoding module is configured to decode the optimal binary code to obtain the corresponding optimal word feature sequence;
The output module is configured to output the optimal word feature sequence as the extracted text features.
Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working process of the system described above and the related explanations can be found in the corresponding processes of the foregoing method embodiment and are not repeated here.
It should be noted that the text feature extraction system based on feature coding provided by the above embodiment is illustrated only by the division into the functional modules described above. In practical applications, the functions may be assigned to different functional modules as needed; that is, the modules or steps in the embodiments of the invention may be further decomposed or combined. For example, the modules of the above embodiment may be merged into one module or further split into multiple sub-modules to complete all or part of the functions described above. The names of the modules and steps involved in the embodiments of the invention serve only to distinguish the modules or steps and are not to be regarded as improper limitations of the invention.
A storage device of a third embodiment of the invention stores a plurality of programs, the programs being adapted to be loaded and executed by a processor to implement the above text feature extraction method based on feature coding.
A processing device of a fourth embodiment of the invention comprises a processor and a storage device; the processor is adapted to execute programs; the storage device is adapted to store a plurality of programs; and the programs are adapted to be loaded and executed by the processor to implement the above text feature extraction method based on feature coding.
Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working process of the storage device and processing device described above and the related explanations can be found in the corresponding processes of the foregoing method embodiment and are not repeated here.
Those skilled in the art will recognize that the modules and method steps described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. Programs corresponding to software modules and method steps can be placed in random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the technical field. To illustrate the interchangeability of electronic hardware and software clearly, the composition and steps of each example have been described above generally in terms of function. Whether these functions are executed in electronic hardware or in software depends on the specific application and the design constraints of the technical solution. Those skilled in the art may use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of the invention.
The term "comprising" or any similar term is intended to cover a non-exclusive inclusion, so that a process, method, article, or device/apparatus that comprises a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to that process, method, article, or device/apparatus.
The technical solution of the invention has thus been described with reference to the preferred embodiments shown in the drawings. However, those skilled in the art will readily understand that the scope of protection of the invention is clearly not limited to these specific embodiments. Without departing from the principle of the invention, those skilled in the art may make equivalent changes or substitutions to the relevant technical features, and the technical solutions after such changes or substitutions will fall within the scope of protection of the invention.

Claims (9)

1. a kind of text feature based on feature coding characterized by comprising
Step S10 obtains the word candidate feature sequence of input text;
Step S20 is based on institute's predicate candidate feature sequence, generates M binary coding, M is positive integer;
Step S30 screens the M binary coding using Gene hepatitis B vaccine, obtains optimal binary coding;
Step S40 decodes the optimal binary coding, and it is special as the text extracted to obtain corresponding optimal word characteristic sequence It levies and exports.
2. the text feature according to claim 1 based on feature coding, which is characterized in that in step S10 " the word candidate feature sequence for obtaining input text ", the steps include:
The text of input is divided into word using text segmentation methods, constitutes text word set by step S11;
Step S12 carries out weight calculation to each word in the text word set, obtains the corresponding weight of text word set;
Step S13 chooses the word of preset quantity as word candidate feature sequence according to the sequence of weight from big to small.
3. The text feature extraction method based on feature coding according to claim 1, characterized in that step S20, "generating M binary codes based on the word candidate feature sequence", comprises:
Step S21: randomly arranging the words in the word candidate feature sequence to obtain M random feature sequences;
Step S22: generating, from the M random feature sequences, M binary codes whose length is the same as that of the word candidate feature sequence.
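One possible reading of steps S21 and S22 is sketched below. The claim does not say how a random arrangement becomes a binary code, so the sketch assumes that a random-length prefix of each shuffled sequence is treated as "selected" and that selected words are marked with 1 at their original positions. The name `generate_codes`, the prefix rule, and the `seed` parameter are all hypothetical.

```python
import random


def generate_codes(candidates, m, seed=None):
    """Generate m binary codes, each as long as the candidate sequence.

    Assumption: each code marks the words found in a random-length
    prefix of one random arrangement of the candidates.
    """
    n = len(candidates)
    if n == 0:
        raise ValueError("candidate sequence must not be empty")
    rng = random.Random(seed)
    codes = []
    for _ in range(m):
        order = list(range(n))
        rng.shuffle(order)                 # S21: one random arrangement
        keep = rng.randint(1, n)           # prefix length, at least one word
        chosen = set(order[:keep])
        # S22: binary code aligned with the original candidate positions
        codes.append([1 if i in chosen else 0 for i in range(n)])
    return codes
```

Keeping the bits aligned with the original candidate order means any such code can later be decoded against the same candidate sequence.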
4. The text feature extraction method based on feature coding according to claim 1, characterized in that step S30, "screening the M binary codes using a genetic algorithm to obtain an optimal binary code", comprises:
Step S31: taking the M binary codes as M groups of gene populations, and calculating the fitness of each individual in the M groups of gene populations;
Step S32: obtaining the optimal binary code using the genetic algorithm, based on the fitness of each individual in the M groups of gene populations.
5. The text feature extraction method based on feature coding according to claim 4, characterized in that step S32, "obtaining the optimal binary code using the roulette selection method, based on the fitness of each individual in the M groups of gene populations", comprises:
Step S321: calculating the probability that each individual in the M groups of gene populations is inherited into the next-generation population:
$$P(x_i) = \frac{f(x_i)}{\sum_{j=1}^{M} f(x_j)}$$
where f(x_i) is the fitness function of the i-th individual of the gene population and f(x_j) is the fitness function of the j-th individual of the gene population;
Step S322: calculating the cumulative probability of each individual from the probability that each individual is inherited into the next-generation population:
$$q_i = \sum_{j=1}^{i} P(x_j)$$
Step S323: generating a uniformly distributed pseudo-random number r in the interval [0,1]; if r < q_1, selecting individual 1; otherwise, selecting individual k such that q_{k-1} < r ≤ q_k holds;
Step S324: repeating step S323 a total of 2M times to select M groups of individuals, and, for the two individuals of each of the M groups, triggering a single-point crossover exchange with crossover rate α to obtain one offspring binary code;
Step S325: with mutation rate β_m, triggering a binary 0-1 flip at a certain bit of the offspring binary code, to obtain the optimal binary code.
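Steps S321 to S325 can be sketched as below. This is a minimal illustration that assumes non-empty binary codes, fitness values with a positive sum, and a uniformly chosen crossover point and mutation position; the function names and the `rng` parameter are hypothetical, and the fitness function itself is left to the caller.

```python
from bisect import bisect_left
from itertools import accumulate


def selection_probabilities(fitness):
    """S321: probability of each individual entering the next generation."""
    total = sum(fitness)
    if total <= 0:
        raise ValueError("fitness values must have a positive sum")
    return [f / total for f in fitness]


def cumulative(probs):
    """S322: cumulative probability q_i of each individual."""
    return list(accumulate(probs))


def roulette_pick(cum, r):
    """S323: smallest index k (0-based) with r <= cum[k]."""
    # min() guards against cum[-1] falling slightly below 1.0 by rounding.
    return min(bisect_left(cum, r), len(cum) - 1)


def single_point_crossover(a, b, point):
    """Child takes a[:point] followed by b[point:]."""
    return list(a[:point]) + list(b[point:])


def flip_bit(code, pos):
    """Return a copy of code with the bit at pos flipped (0 <-> 1)."""
    out = list(code)
    out[pos] ^= 1
    return out


def next_generation(codes, fitness, alpha, beta_m, rng):
    """S324 and S325: M parent pairs (2M picks), one child per pair."""
    cum = cumulative(selection_probabilities(fitness))
    children = []
    for _ in range(len(codes)):
        p1 = codes[roulette_pick(cum, rng.random())]
        p2 = codes[roulette_pick(cum, rng.random())]
        child = list(p1)
        if len(child) > 1 and rng.random() < alpha:   # crossover rate
            child = single_point_crossover(p1, p2, rng.randrange(1, len(child)))
        if rng.random() < beta_m:                     # mutation rate
            child = flip_bit(child, rng.randrange(len(child)))
        children.append(child)
    return children
```

Here `rng` is expected to be a `random.Random` instance, so that runs can be made repeatable by seeding it. Iterating `next_generation` and keeping the fittest code seen so far is one way to arrive at the "optimal" binary code, although the stopping rule is not stated in the claim.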
6. The text feature extraction method based on feature coding according to claim 4, characterized in that, after "calculating the fitness of each individual in the M groups of gene populations" in step S31, a genetic mutation rate may further be calculated to improve the efficiency of the genetic algorithm:
where β_m is the genetic mutation rate, which changes dynamically with the distribution of fitness in the population, β is the individual fitness, β_max is the maximum fitness in the population, β_avg is the average fitness of the population, and k_1 and k_2 are constants.
7. A text feature extraction system based on feature coding, characterized by comprising an acquisition module, a preprocessing module, a feature coding module, a feature screening module, a decoding module, and an output module;
the acquisition module is configured to acquire and input text;
the preprocessing module is configured to preprocess the acquired text to obtain a word candidate feature sequence;
the feature coding module is configured to generate M binary codes based on the word candidate feature sequence, where M is a positive integer;
the feature screening module is configured to screen the M binary codes using a genetic algorithm to obtain an optimal binary code;
the decoding module is configured to decode the optimal binary code to obtain the corresponding optimal word feature sequence;
the output module is configured to output the optimal word feature sequence as the extracted text feature.
8. A storage device storing a plurality of programs, characterized in that the programs are adapted to be loaded and executed by a processor to implement the text feature extraction method based on feature coding according to any one of claims 1-6.
9. A processing device, comprising:
a processor adapted to execute programs; and
a storage device adapted to store a plurality of programs;
characterized in that the programs are adapted to be loaded and executed by the processor to implement:
the text feature extraction method based on feature coding according to any one of claims 1-6.
CN201910205999.6A 2019-03-19 2019-03-19 Text feature extraction method, system and device based on feature coding Active CN109977227B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910205999.6A CN109977227B (en) 2019-03-19 2019-03-19 Text feature extraction method, system and device based on feature coding

Publications (2)

Publication Number Publication Date
CN109977227A true CN109977227A (en) 2019-07-05
CN109977227B CN109977227B (en) 2021-06-22

Family

ID=67079264

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910205999.6A Active CN109977227B (en) 2019-03-19 2019-03-19 Text feature extraction method, system and device based on feature coding

Country Status (1)

Country Link
CN (1) CN109977227B (en)

Patent Citations (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020133297A1 (en) * 2001-01-17 2002-09-19 Jinn-Moon Yang Ligand docking method using evolutionary algorithm
WO2004053766A1 (en) * 2002-12-06 2004-06-24 London Health Sciences Centre Research Inc. Reverse translation of protein sequences to nucleotide code
US20070031042A1 (en) * 2005-08-02 2007-02-08 Edmundo Simental Efficient imagery exploitation employing wavelet-based feature indices
US20070061319A1 (en) * 2005-09-09 2007-03-15 Xerox Corporation Method for document clustering based on page layout attributes
CN101068108A (en) * 2007-06-18 2007-11-07 北京中星微电子有限公司 Orthogonal mirror image filter group realizing method and device based on genetic algorithm
CN101246555A (en) * 2008-03-11 2008-08-20 中国科学技术大学 Characteristic optimization method based on coevolution for foot passenger detection
CN101271572A (en) * 2008-03-28 2008-09-24 西安电子科技大学 Image segmentation method based on immunity clone selection clustering
CN101256648A (en) * 2008-04-09 2008-09-03 永凯软件技术(上海)有限公司 Genetic operation operator based on indent structure for producing quening system
CN101315557A (en) * 2008-06-25 2008-12-03 浙江大学 Propylene polymerization production process optimal soft survey instrument and method based on genetic algorithm optimization BP neural network
CN101436345A (en) * 2008-12-19 2009-05-20 天津市市政工程设计研究院 System for forecasting harbor district road traffic requirement based on TransCAD macroscopic artificial platform
CN101533423A (en) * 2009-04-14 2009-09-16 江苏大学 Method for optimizing structure of metallic-plastic composite material
CN101587545A (en) * 2009-06-19 2009-11-25 中国农业大学 Method and system for selecting feature of cotton heterosexual fiber target image
CN101599078A (en) * 2009-07-10 2009-12-09 腾讯科技(深圳)有限公司 A kind of method of text retrieval and device
CN101710333A (en) * 2009-11-26 2010-05-19 西北工业大学 Network text segmenting method based on genetic algorithm
CN101814086A (en) * 2010-02-05 2010-08-25 山东师范大学 Chinese WEB information filtering method based on fuzzy genetic algorithm
CN101882791A (en) * 2010-07-13 2010-11-10 东北电力大学 Controllable serial capacitor optimal configuration method capable of improving available transmission capacity
CN101968853A (en) * 2010-10-15 2011-02-09 吉林大学 Improved immune algorithm based expression recognition method for optimizing support vector machine parameters
CN104063472A (en) * 2014-06-30 2014-09-24 电子科技大学 KNN text classifying method for optimizing training sample set
CN104657472A (en) * 2015-02-13 2015-05-27 南京邮电大学 EA (Evolutionary Algorithm)-based English text clustering method
CN105005792A (en) * 2015-07-13 2015-10-28 河南科技大学 KNN algorithm based article translation method
CN105740227A (en) * 2016-01-21 2016-07-06 云南大学 Genetic simulated annealing method for solving new words in Chinese segmentation
CN105787088A (en) * 2016-03-14 2016-07-20 南京理工大学 Text information classifying method based on segmented encoding genetic algorithm
CN106971170A (en) * 2017-04-07 2017-07-21 西北工业大学 A kind of method for carrying out target identification using one-dimensional range profile based on genetic algorithm

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116738354A (en) * 2023-08-15 2023-09-12 国网江西省电力有限公司信息通信分公司 Method and system for detecting abnormal behavior of electric power Internet of things terminal
CN116738354B (en) * 2023-08-15 2023-12-08 国网江西省电力有限公司信息通信分公司 Method and system for detecting abnormal behavior of electric power Internet of things terminal

Also Published As

Publication number Publication date
CN109977227B (en) 2021-06-22

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant