CN109977227A - Text feature extraction method, system, and device based on feature coding - Google Patents
Text feature extraction method, system, and device based on feature coding
- Publication number
- CN109977227A (application number CN201910205999.6A)
- Authority
- CN
- China
- Prior art keywords
- feature
- text
- coding
- sequence
- word
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/004—Artificial life, i.e. computing arrangements simulating life
- G06N3/006—Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/12—Computing arrangements based on biological models using genetic models
- G06N3/126—Evolutionary algorithms, e.g. genetic algorithms or genetic programming
Abstract
The invention belongs to the field of information classification and specifically relates to a text feature extraction method, system, and device based on feature coding. It aims to solve the problems of high computational complexity and low classification efficiency and precision in text feature extraction. The method comprises: preprocessing the acquired text to obtain a candidate word feature sequence; generating multiple binary codes based on the candidate word feature sequence; screening the binary codes with a genetic algorithm to obtain an optimal binary code; and decoding the optimal binary code to obtain the optimal word feature sequence, which is output. The invention converts a series of candidate features into an easily processed coded sequence and uses the automatic screening capability of the genetic algorithm to perform a global optimal selection over the features, so that a minimal effective feature set can be screened out.
Description
Technical field
The invention belongs to the field of information classification and specifically relates to a text feature extraction method, system, and device based on feature coding.
Background
With the rapid development and spread of Internet technology, how to make full and effective use of ever-growing volumes of data has become a top priority for major Internet companies and research institutions. Among these data, text is one of the largest categories, and classification accounts for a large share of how text data is used. Text classification refers to the process of automatically determining the category of a text from its content under a given classification scheme. Text classification today has extremely wide application, for example: automatically classifying by subject the large number of news reports on a news website, based on article content; classifying the product reviews that customers write after transactions on e-commerce websites; identifying and filtering spam among the junk advertising that e-mail inboxes frequently receive; and automatically reviewing the large number of submissions that media outlets receive daily, thereby flagging prohibited content in submissions such as junk advertising, pornography, and violence.
Before the 1990s, the prevailing text classification methods were heuristic: with the help of domain experts, a large number of inference rules were defined for each category, and a document that satisfied these rules was judged to belong to that category. This approach has obvious drawbacks: classification quality depends heavily on the quality of the rules; a large number of experts are needed to formulate the rules; and it does not generalize, since different fields require entirely different classification systems, which wastes considerable development and financial resources.
The now-popular machine learning techniques solve these problems well. Machine learning is grounded in statistical theory and uses algorithms to give machines a human-like ability to "learn" automatically: it performs statistical analysis on known training data to derive rules, then applies those rules to make predictions on unknown data. The basic process of applying machine learning to text classification is: labeling, in which a batch of documents is accurately classified by hand to serve as the training set (the material for machine learning); training, in which the computer mines rules that classify effectively from these documents and generates a classifier; and classification, in which the generated classifier is applied to the set of documents to be classified to obtain their classification results.
Feature extraction is an important part of text classification with machine learning. Most current Chinese text classification systems use words as feature items, called feature words. These feature words serve as an intermediate representation of a document and are used to compute similarity between documents and between a document and the user's target. If all words are used as feature items, the dimensionality of the feature vector becomes too high, which puts great pressure on the computational performance of the classification system and reduces the timeliness of text classification. An effective feature dimensionality reduction method that lowers computational complexity and improves classification efficiency and precision is therefore urgently needed in this field.
Summary of the invention
To solve the above problem in the prior art, namely high computational complexity and low classification efficiency and precision in text feature extraction, the invention provides a text feature extraction method based on feature coding, comprising:
Step S10: obtain the candidate word feature sequence of the input text;
Step S20: based on the candidate word feature sequence, generate M binary codes, where M is a positive integer;
Step S30: screen the M binary codes using a genetic algorithm to obtain the optimal binary code;
Step S40: decode the optimal binary code to obtain the corresponding optimal word feature sequence, which is output as the extracted text feature.
In some preferred embodiments, "obtain the candidate word feature sequence of the input text" in step S10 comprises:
Step S11: divide the input text into words using a text segmentation method to form the text word set;
Step S12: compute a weight for each word in the text word set to obtain the weights corresponding to the text word set;
Step S13: select a preset number of words, in descending order of weight, as the candidate word feature sequence.
In some preferred embodiments, "based on the candidate word feature sequence, generate M binary codes" in step S20 comprises:
Step S21: randomly arrange the words in the candidate word feature sequence to obtain M random feature sequences;
Step S22: from the M random feature sequences, generate M binary codes of the same length as the candidate word feature sequence.
In some preferred embodiments, "screen the M binary codes using a genetic algorithm to obtain the optimal binary code" in step S30 comprises:
Step S31: take the M binary codes as M gene populations and calculate the fitness of each individual in the M gene populations;
Step S32: based on the fitness of each individual in the M gene populations, obtain the optimal binary code using the genetic algorithm.
In some preferred embodiments, "based on the fitness of each individual in the M gene populations, obtain the optimal binary code using the genetic algorithm" in step S32 comprises:
Step S321: calculate the probability that each individual in the M gene populations is inherited into the next-generation population:
$$P_i = \frac{f(x_i)}{\sum_{j=1}^{M} f(x_j)}$$
where f(x_i) is the fitness function of the i-th gene population individual and f(x_j) is the fitness function of the j-th gene population individual;
Step S322: from the probability that each individual is inherited into the next-generation population, calculate the cumulative probability of each individual:
$$q_i = \sum_{j=1}^{i} P_j$$
Step S323: generate a uniformly distributed pseudo-random number r in the interval [0,1]; if r < q_1, select individual 1; otherwise, select individual k such that q_{k-1} < r ≤ q_k holds;
Step S324: repeat step S323 a total of 2M times to select M pairs of individuals, and for the two individuals of each of the M pairs trigger single-point crossover with crossover rate α to obtain one child binary code;
Step S325: with mutation rate β_m, trigger a bit in the child binary code to undergo a binary 0-1 flip, obtaining the optimal binary code.
In some preferred embodiments, after "calculate the fitness of each individual in the M gene populations" in step S31, a gene mutation rate can also be calculated to improve the efficiency of the genetic algorithm:
where β_m is the gene mutation rate, which changes dynamically with the distribution of fitness in the population, β is the individual's fitness, β_max is the maximum fitness in the population, β_avg is the average fitness of the population, and k_1 and k_2 are constants.
Another aspect of the invention proposes a text feature extraction system based on feature coding, comprising an acquisition module, a preprocessing module, a feature coding module, a feature screening module, a decoding module, and an output module;
The acquisition module is configured to acquire text and input it;
The preprocessing module is configured to preprocess the acquired text to obtain the candidate word feature sequence;
The feature coding module is configured to generate M binary codes based on the candidate word feature sequence, where M is a positive integer;
The feature screening module is configured to screen the M binary codes using a genetic algorithm to obtain the optimal binary code;
The decoding module is configured to decode the optimal binary code to obtain the corresponding optimal word feature sequence;
The output module is configured to output the optimal word feature sequence as the extracted text feature.
A third aspect of the invention proposes a storage device storing a plurality of programs, the programs being adapted to be loaded and executed by a processor to implement the above text feature extraction method based on feature coding.
A fourth aspect of the invention proposes a processing device comprising a processor and a storage device; the processor is adapted to execute each program; the storage device is adapted to store a plurality of programs; the programs are adapted to be loaded and executed by the processor to implement the above text feature extraction method based on feature coding.
Beneficial effects of the invention:
(1) The text feature extraction method based on feature coding of the invention, combined with a genetic algorithm, performs selection of text features. It effectively overcomes the limitations of traditional text feature selection, improves the accuracy of text features as far as possible within a controllable range, achieves feature dimensionality reduction to the greatest extent, and effectively improves feature usage efficiency.
(2) To address the high redundancy and low precision of the features produced by existing text feature extraction methods, the invention proposes a feature selection method based on feature coding and a genetic algorithm. The method converts a series of candidate features into an easily processed coded sequence and screens them automatically with the genetic algorithm, performing a global optimal selection over the features, so that a minimal effective feature set can be screened out.
Brief description of the drawings
Other features, objects, and advantages of the application will become more apparent from the following detailed description of non-limiting embodiments, read with reference to the accompanying drawings:
Fig. 1 is a flow diagram of the text feature extraction method based on feature coding of the invention;
Fig. 2 is a flow diagram of text preprocessing to obtain the candidate sequence in the text feature extraction method based on feature coding of the invention;
Fig. 3 is a feature coding flow diagram of the text feature extraction method based on feature coding of the invention;
Fig. 4 is a genetic algorithm flow diagram of the text feature extraction method based on feature coding of the invention;
Fig. 5 is an example diagram of the binary code crossover process in one embodiment of the text feature extraction method based on feature coding of the invention;
Fig. 6 is an example diagram of binary code mutation in one embodiment of the text feature extraction method based on feature coding of the invention.
Detailed description of the embodiments
The application is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the related invention and do not limit it. It should also be noted that, for ease of description, the drawings show only the parts relevant to the invention.
It should be noted that, where no conflict arises, the embodiments of the application and the features in the embodiments may be combined with one another. The application is described in detail below with reference to the drawings and in conjunction with the embodiments.
The invention provides a text feature extraction method based on feature coding, which uses a binary text feature coding method combined with a genetic algorithm to select text features. It effectively overcomes the limitations of traditional text feature selection, improves the accuracy of text features as far as possible within a controllable range, achieves feature dimensionality reduction to the greatest extent, and effectively improves feature usage efficiency.
A text feature extraction method based on feature coding of the invention comprises:
Step S10: obtain the candidate word feature sequence of the input text;
Step S20: based on the candidate word feature sequence, generate M binary codes, where M is a positive integer;
Step S30: screen the M binary codes using a genetic algorithm to obtain the optimal binary code;
Step S40: decode the optimal binary code to obtain the corresponding optimal word feature sequence, which is output as the extracted text feature.
To describe the text feature extraction method based on feature coding more clearly, each step of the method embodiment is described in detail below with reference to Fig. 1.
The text feature extraction method based on feature coding of one embodiment of the invention comprises steps S10 to S40, each described in detail as follows:
Step S10: obtain the candidate word feature sequence of the input text. Fig. 2 shows the flow of text preprocessing to obtain the candidate sequence in the text feature extraction method based on feature coding: the text is first segmented into words, word weights are then computed, and the candidate feature sequence is finally generated, as follows:
Step S11: divide the input text into words using a text segmentation method to form the text word set.
Text segmentation is a basic step of text processing and a basic module of human-machine natural language interaction. Chinese text differs from English text in that Chinese sentences have no word boundaries, so Chinese natural language processing usually requires segmentation first, and segmentation quality directly affects downstream modules such as part-of-speech tagging and syntax tree parsing. Segmentation is of course only a tool, and requirements differ across scenarios. In human-machine natural language interaction, a mature Chinese word segmentation algorithm achieves better natural language processing results and helps the computer understand complex Chinese.
Text segmentation methods include: dictionary-based methods, such as forward maximum matching, reverse maximum matching, and bidirectional matching; statistical machine learning algorithms, such as the hidden Markov model (HMM), the conditional random field (CRF), and deep learning algorithms; and neural-network-based segmentation methods, which are not described one by one here.
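As an illustration of the dictionary-based family, the following is a minimal sketch of forward maximum matching. The function name `forward_max_match`, the `max_len` window, and the toy dictionary are assumptions made for this example and are not taken from the patent.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy left-to-right segmentation: take the longest dictionary word at each position."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest window first and shrink until a dictionary word is found.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted so the scan cannot stall.
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words


# Hypothetical toy dictionary for demonstration only.
toy_dictionary = {"文本", "特征", "提取", "方法", "特征提取"}
segments = forward_max_match("文本特征提取方法", toy_dictionary)
```

Because the longest match wins, "特征提取" is kept as one word here instead of being split into "特征" and "提取".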
Step S12: compute a weight for each word in the text word set to obtain the weights corresponding to the text word set.
Mature methods exist for word weight calculation; this embodiment uses the common TF-IDF (Term Frequency-Inverse Document Frequency) method. TF-IDF is a statistical method for assessing how important a word is to one document in a document set or corpus. A word's importance increases in proportion to the number of times it appears in the document, but decreases in inverse proportion to its frequency across the corpus.
Step S13: select a preset number of words, in descending order of weight, as the candidate word feature sequence.
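Steps S12 and S13 can be sketched as follows. This is one common TF-IDF variant, not necessarily the one the embodiment uses: the `+1` smoothing in the IDF denominator, the function names, and the toy corpus are all assumptions for illustration.

```python
import math
from collections import Counter


def tfidf_weights(doc_words, corpus):
    """Return {word: tf * idf} for one segmented document against a corpus of segmented documents."""
    counts = Counter(doc_words)
    total = len(doc_words)
    weights = {}
    for word, count in counts.items():
        tf = count / total
        docs_with_word = sum(1 for doc in corpus if word in doc)
        # The +1 in the denominator is a smoothing assumption that avoids division by zero.
        idf = math.log(len(corpus) / (1 + docs_with_word))
        weights[word] = tf * idf
    return weights


def candidate_sequence(weights, preset_count):
    """Step S13: keep the preset number of words with the largest weights."""
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    return [word for word, _ in ranked[:preset_count]]


# Hypothetical toy data: one document and a four-document corpus.
doc = ["gene", "gene", "text", "code"]
corpus = [doc, ["text", "news"], ["text", "mail"], ["gene", "text"]]
weights = tfidf_weights(doc, corpus)
top_two = candidate_sequence(weights, 2)
```

In this toy corpus "text" appears in every document, so its weight is the lowest and it is dropped, which is the behavior the paragraph above describes.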
Step S20: based on the candidate word feature sequence, generate M binary codes, where M is a positive integer. Fig. 3 shows the feature coding flow of the text feature extraction method based on feature coding: random feature orderings are generated first, and multiple groups of random binary codes are then generated from those orderings, as follows:
Step S21: randomly arrange the words in the candidate word feature sequence to obtain M random feature sequences.
Step S22: from the M random feature sequences, generate M binary codes of the same length as the candidate word feature sequence.
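A minimal sketch of steps S21 and S22 follows. The text does not say how each bit is derived from a random ordering, so this sketch assumes that one bit is drawn uniformly at random per position and that a 1 marks the word at that position as kept; the function name and the `(order, bits)` pairing are likewise illustrative.

```python
import random


def generate_binary_codes(candidates, m, rng):
    """Return m (order, bits) pairs, each bit string as long as the candidate sequence."""
    codes = []
    for _ in range(m):
        order = list(candidates)
        rng.shuffle(order)                           # S21: random arrangement of the words
        bits = [rng.randint(0, 1) for _ in order]    # S22: assumed one random bit per position
        codes.append((order, bits))
    return codes


# A seeded generator keeps the example reproducible.
codes = generate_binary_codes(["a", "b", "c", "d"], m=5, rng=random.Random(0))
```

Keeping the ordering next to its bit string makes later decoding unambiguous, since each code refers to its own arrangement of the words.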
Step S30: screen the M binary codes using a genetic algorithm to obtain the optimal binary code. Fig. 4 shows the genetic algorithm flow of the text feature extraction method based on feature coding, as follows:
Step S31: take the M binary codes as M gene populations and calculate the fitness of each individual in the M gene populations;
Step S32: based on the fitness of each individual in the M gene populations, obtain the optimal binary code using the genetic algorithm.
In a preferred embodiment of the invention, the optimal binary code is screened using roulette wheel selection.
Step S321: calculate the probability that each individual in the M gene populations is inherited into the next-generation population, as shown in formula (1):
$$P_i = \frac{f(x_i)}{\sum_{j=1}^{M} f(x_j)} \tag{1}$$
where f(x_i) is the fitness function of the i-th gene population individual and f(x_j) is the fitness function of the j-th gene population individual;
Step S322: from the probability that each individual is inherited into the next-generation population, calculate the cumulative probability of each individual, as shown in formula (2):
$$q_i = \sum_{j=1}^{i} P_j \tag{2}$$
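Steps S321 and S322 describe the standard roulette-wheel quantities. The sketch below assumes that each individual's selection probability is its fitness divided by the total fitness of the population and that the cumulative probability is a running sum; the function names and sample fitness values are hypothetical.

```python
from itertools import accumulate


def selection_probabilities(fitness):
    """S321: share of the total fitness held by each individual (assumes positive fitness values)."""
    total = sum(fitness)
    return [value / total for value in fitness]


def cumulative_probabilities(probabilities):
    """S322: running sum of the selection probabilities."""
    return list(accumulate(probabilities))


# Hypothetical fitness values for four individuals.
probs = selection_probabilities([1.0, 2.0, 3.0, 4.0])
cumulative = cumulative_probabilities(probs)
```

The last cumulative value is 1 up to floating-point rounding, so the cumulative values partition the interval [0,1] into one slice per individual, sized by fitness.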
Step S323: generate a uniformly distributed pseudo-random number r in the interval [0,1]; if r < q_1, select individual 1; otherwise, select individual k such that q_{k-1} < r ≤ q_k holds;
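The selection rule in step S323 amounts to a search over the cumulative probabilities. A minimal sketch using Python's standard `bisect` module is shown below; the function name and the 0-based indexing are choices made for this example.

```python
import bisect


def roulette_select(cumulative, r):
    """Return the 0-based index k of the first cumulative value q_k with r <= q_k."""
    index = bisect.bisect_left(cumulative, r)
    # Guard against rounding that leaves the last cumulative value slightly below 1.
    return min(index, len(cumulative) - 1)


# Hypothetical cumulative probabilities for four individuals.
cumulative = [0.1, 0.3, 0.6, 1.0]
picks = [roulette_select(cumulative, r) for r in (0.05, 0.3, 0.31, 1.0)]
```

`bisect_left` returns the first position whose cumulative value is at least r, which matches the condition q_{k-1} < r ≤ q_k; index 0 corresponds to individual 1 in the text.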
Step S324: repeat step S323 a total of 2M times to select M pairs of individuals, and for the two individuals of each of the M pairs trigger single-point crossover with crossover rate α to obtain one child binary code. Fig. 5 is an example of the binary code crossover process in one embodiment of the text feature extraction method based on feature coding: a pair of binary codes is first copied, crossover is then performed on the copies, and one of the exchanged binary codes is kept at random.
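Single-point crossover as described for step S324 can be sketched as follows. What happens when crossover is not triggered is not stated in the text; returning one parent at random is an assumption of this sketch, as are the function and parameter names.

```python
import random


def single_point_crossover(parent_a, parent_b, alpha, rng):
    """Return one child code from two equal-length parent codes."""
    copy_a, copy_b = list(parent_a), list(parent_b)   # work on copies of the pair
    if len(copy_a) < 2 or rng.random() >= alpha:
        # Crossover not triggered: keeping one parent at random is an assumption.
        return rng.choice([copy_a, copy_b])
    point = rng.randint(1, len(copy_a) - 1)           # cut strictly inside the code
    child_1 = copy_a[:point] + copy_b[point:]
    child_2 = copy_b[:point] + copy_a[point:]
    return rng.choice([child_1, child_2])             # keep one exchanged code at random


rng = random.Random(42)
child = single_point_crossover([0] * 6, [1] * 6, alpha=1.0, rng=rng)
```

With all-zero and all-one parents, any triggered crossover yields a child that switches value exactly once, which makes the cut point easy to see.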
Step S325: with mutation rate β_m, trigger a bit in the child binary code to undergo a binary 0-1 flip, obtaining the optimal binary code. Fig. 6 is an example of binary code mutation in one embodiment of the text feature extraction method based on feature coding: the binary codes before and after mutation are inverted only at the mutation point and are otherwise identical.
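The mutation in step S325 can be sketched as a single bit flip that fires with probability β_m. The function name and the uniform choice of the flipped position are assumptions of this example.

```python
import random


def mutate(code, beta_m, rng):
    """Flip one randomly chosen bit with probability beta_m; otherwise return an unchanged copy."""
    child = list(code)
    if child and rng.random() < beta_m:
        position = rng.randrange(len(child))
        child[position] ^= 1          # 0 becomes 1 and 1 becomes 0
    return child


original = [1, 0, 1, 1, 0]
mutated = mutate(original, beta_m=1.0, rng=random.Random(7))
unchanged = mutate(original, beta_m=0.0, rng=random.Random(7))
```

The input list is copied first, so the parent code stays available for comparison, as in the before and after codes of Fig. 6.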
After "calculate the fitness of each individual in the M gene populations" in step S31, a gene mutation rate can also be calculated to improve the efficiency of the genetic algorithm, as shown in formula (3):
where β_m is the gene mutation rate, which changes dynamically with the distribution of fitness in the population, β is the individual's fitness, β_max is the maximum fitness in the population, β_avg is the average fitness of the population, and k_1 and k_2 are constants.
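Formula (3) itself is not reproduced in this text, so the following is only a sketch of one widely used adaptive scheme that is consistent with the symbols defined above; it should not be read as the patent's formula. It assumes that individuals at or above the average fitness get a rate that shrinks linearly to zero at the maximum fitness, scaled by k_1, and that below-average individuals get the constant rate k_2.

```python
def adaptive_mutation_rate(beta, beta_max, beta_avg, k1, k2):
    """Assumed adaptive form: fitter individuals mutate less, weak ones mutate at rate k2."""
    if beta < beta_avg or beta_max == beta_avg:
        # Below-average individuals, or a population with no fitness spread (assumed fallback).
        return k2
    return k1 * (beta_max - beta) / (beta_max - beta_avg)


# Hypothetical values: maximum fitness 10, average fitness 5, k1 = 0.1, k2 = 0.2.
best_rate = adaptive_mutation_rate(10.0, 10.0, 5.0, 0.1, 0.2)
middle_rate = adaptive_mutation_rate(7.5, 10.0, 5.0, 0.1, 0.2)
weak_rate = adaptive_mutation_rate(3.0, 10.0, 5.0, 0.1, 0.2)
```

Under this assumed form the best individual is never mutated, which protects good codes while still disturbing weak ones.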
Step S40: decode the optimal binary code to obtain the corresponding optimal word feature sequence, which is output as the extracted text feature.
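Decoding in step S40 can be sketched as a mask over the word ordering that the code refers to. The convention that a 1 bit means the word is kept is an assumption of this example, and the sample ordering and bits are hypothetical.

```python
def decode(order, bits):
    """Keep the words whose bit is 1 (assumed convention), preserving their order."""
    if len(order) != len(bits):
        raise ValueError("code length must match the candidate sequence length")
    return [word for word, bit in zip(order, bits) if bit == 1]


# Hypothetical optimal code over a four-word ordering.
optimal_features = decode(["gene", "code", "text", "news"], [1, 0, 1, 0])
```

The length check reflects the requirement in step S22 that every binary code has the same length as the candidate word feature sequence.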
The text feature extraction system based on feature coding of a second embodiment of the invention comprises an acquisition module, a preprocessing module, a feature coding module, a feature screening module, a decoding module, and an output module;
The acquisition module is configured to acquire text and input it;
The preprocessing module is configured to preprocess the acquired text to obtain the candidate word feature sequence;
The feature coding module is configured to generate M binary codes based on the candidate word feature sequence, where M is a positive integer;
The feature screening module is configured to screen the M binary codes using a genetic algorithm to obtain the optimal binary code;
The decoding module is configured to decode the optimal binary code to obtain the corresponding optimal word feature sequence;
The output module is configured to output the optimal word feature sequence as the extracted text feature.
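One way to picture how the six modules fit together is a small composition object in which each module is a callable. The class name, field names, and the stub callables below are hypothetical and stand in for real implementations; they only show the order in which data flows between modules.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class FeatureExtractionSystem:
    acquire: Callable[[], str]                           # acquisition module
    preprocess: Callable[[str], List[str]]               # preprocessing module
    encode: Callable[[List[str]], List[List[int]]]       # feature coding module
    select: Callable[[List[List[int]]], List[int]]       # feature screening module
    decode: Callable[[List[str], List[int]], List[str]]  # decoding module
    output: Callable[[List[str]], None]                  # output module

    def run(self) -> List[str]:
        text = self.acquire()
        candidates = self.preprocess(text)
        codes = self.encode(candidates)
        best = self.select(codes)
        features = self.decode(candidates, best)
        self.output(features)
        return features


# Stub modules for demonstration only; a real select() would run the genetic algorithm.
emitted = []
system = FeatureExtractionSystem(
    acquire=lambda: "gene code text",
    preprocess=lambda text: text.split(),
    encode=lambda candidates: [[1, 0, 1], [0, 1, 0]],
    select=lambda codes: codes[0],
    decode=lambda candidates, bits: [w for w, bit in zip(candidates, bits) if bit],
    output=emitted.append,
)
features = system.run()
```

Passing the modules in as callables mirrors the later remark that modules may be merged or split without changing the overall function.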
Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working process and related explanation of the system described above can be found in the corresponding process of the foregoing method embodiment and are not repeated here.
It should be noted that the text feature extraction system based on feature coding provided by the above embodiment is illustrated only with the above division of functional modules. In practical applications, the above functions may be assigned to different functional modules as needed; that is, the modules or steps in the embodiments of the invention may be further decomposed or combined. For example, the modules of the above embodiment may be merged into one module or further split into multiple sub-modules to complete all or part of the functions described above. The names of the modules and steps involved in the embodiments of the invention serve only to distinguish the modules or steps and are not to be regarded as improper limitations of the invention.
A storage device of a third embodiment of the invention stores a plurality of programs, the programs being adapted to be loaded and executed by a processor to implement the above text feature extraction method based on feature coding.
A processing device of a fourth embodiment of the invention comprises a processor and a storage device; the processor is adapted to execute each program; the storage device is adapted to store a plurality of programs; the programs are adapted to be loaded and executed by the processor to implement the above text feature extraction method based on feature coding.
Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working process and related explanation of the storage device and processing device described above can be found in the corresponding process of the foregoing method embodiment and are not repeated here.
Those skilled in the art should recognize that the modules and method steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. Programs corresponding to software modules and method steps may reside in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disks, removable disks, CD-ROM, or any other form of storage medium known in the technical field. To clearly illustrate the interchangeability of electronic hardware and software, the composition and steps of each example have been described above generally in terms of function. Whether these functions are executed in electronic hardware or in software depends on the specific application and design constraints of the technical solution. Those skilled in the art may implement the described functions in different ways for each specific application, but such implementations should not be considered beyond the scope of the invention.
The term "comprising" or any similar term is intended to cover a non-exclusive inclusion, so that a process, method, article, or device/apparatus comprising a series of elements includes not only those elements but also other elements not explicitly listed, or further includes elements inherent to such a process, method, article, or device/apparatus.
Thus far, the technical solution of the present invention has been described with reference to the preferred embodiments shown in the drawings. However, those skilled in the art will readily understand that the scope of protection of the present invention is obviously not limited to these specific embodiments. Without departing from the principle of the invention, those skilled in the art may make equivalent changes or substitutions to the relevant technical features, and the technical solutions after such changes or substitutions will fall within the scope of protection of the present invention.
Claims (9)
1. A text feature extraction method based on feature coding, characterized by comprising:
Step S10: obtaining a word candidate feature sequence of an input text;
Step S20: generating M binary codes based on the word candidate feature sequence, M being a positive integer;
Step S30: screening the M binary codes using a genetic algorithm to obtain an optimal binary code;
Step S40: decoding the optimal binary code to obtain the corresponding optimal word feature sequence, and outputting it as the extracted text feature.
2. The text feature extraction method based on feature coding according to claim 1, characterized in that "obtaining a word candidate feature sequence of an input text" in step S10 comprises the steps of:
Step S11: dividing the input text into words using a text segmentation method to form a text word set;
Step S12: calculating a weight for each word in the text word set to obtain the weights corresponding to the text word set;
Step S13: selecting a preset number of words, in descending order of weight, as the word candidate feature sequence.
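The steps of claim 2 can be sketched as follows. This is a minimal illustration, not the patented implementation: the claim names neither the segmentation method nor the weighting scheme, so whitespace splitting and raw term frequency are used here purely as stand-in assumptions, and the function name `candidate_features` is hypothetical.

```python
from collections import Counter


def candidate_features(text, n):
    """Return the n highest-weighted words of `text` (sketch of steps S11-S13)."""
    # S11 (assumed stand-in): segment the text by whitespace.
    words = text.lower().split()
    # S12 (assumed stand-in): use raw term frequency as the word weight.
    weights = Counter(words)
    # S13: sort by descending weight; ties are broken alphabetically
    # so that the result is deterministic.
    ranked = sorted(weights, key=lambda w: (-weights[w], w))
    return ranked[:n]
```

For Chinese input, a real word segmenter would replace the whitespace split, and any other per-word weight could replace the term frequency without changing the selection step.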
3. The text feature extraction method based on feature coding according to claim 1, characterized in that "generating M binary codes based on the word candidate feature sequence" in step S20 comprises the steps of:
Step S21: randomly arranging the words in the word candidate feature sequence to obtain M random feature sequences;
Step S22: generating, from the M random feature sequences, M binary codes whose length is the same as that of the word candidate feature sequence.
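One possible reading of steps S21 and S22, together with the decoding of step S40, is sketched below. The claim does not state how a random feature sequence maps to a binary code, so the mapping used here is an assumption: the candidate words are shuffled, a random-length prefix is kept, and bit i is set to 1 when the i-th candidate word is in that prefix. The names `encode_population` and `decode` are hypothetical, and the candidate words are assumed to be distinct.

```python
import random


def encode_population(candidates, m, rng):
    """Generate m binary codes, each as long as `candidates` (sketch of S21-S22)."""
    population = []
    for _ in range(m):
        # S21: a random arrangement of the candidate words.
        shuffled = rng.sample(candidates, len(candidates))
        # Assumption: keep a random-length, non-empty prefix of the arrangement.
        keep = set(shuffled[: rng.randint(1, len(candidates))])
        # S22: bit i is 1 when the i-th candidate word was kept.
        population.append([1 if w in keep else 0 for w in candidates])
    return population


def decode(code, candidates):
    """Map a binary code back to its word feature sequence (sketch of S40)."""
    return [w for w, bit in zip(candidates, code) if bit]
```

Passing a seeded `random.Random` instance as `rng` makes the generated population reproducible.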
4. The text feature extraction method based on feature coding according to claim 1, characterized in that "screening the M binary codes using a genetic algorithm to obtain an optimal binary code" in step S30 comprises the steps of:
Step S31: taking the M binary codes as a gene population of M individuals, and calculating the fitness of each individual in the gene population;
Step S32: obtaining the optimal binary code using the genetic algorithm, based on the fitness of each individual in the gene population.
5. The text feature extraction method based on feature coding according to claim 4, characterized in that "obtaining the optimal binary code using the roulette selection method, based on the fitness of each individual in the gene population" in step S32 comprises the steps of:
Step S321: calculating the probability that each individual in the gene population is inherited into the next-generation population:
where $f(x_i)$ is the fitness function of the i-th individual of the gene population and $f(x_j)$ is the fitness function of the j-th individual of the gene population;
Step S322: calculating the cumulative probability of each individual from the probability that each individual is inherited into the next-generation population:
Step S323: generating a uniformly distributed pseudo-random number $r$ in the interval [0, 1]; if $r < q_1$, selecting individual 1; otherwise, selecting individual $k$ such that $q_{k-1} < r \le q_k$ holds;
Step S324: repeating step S323 a total of 2M times to select M pairs of individuals, and, for the two individuals of each of the M pairs, triggering a single-point crossover exchange with crossover rate $\alpha$ to obtain one offspring binary code;
Step S325: with mutation rate $\beta_m$, triggering a 0-1 flip at one bit of the offspring binary code to obtain the optimal binary code.
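Steps S321 to S325 can be sketched as one generation of a simple genetic algorithm. The formulas of steps S321 and S322 are not reproduced in this text, so the sketch assumes the usual fitness-proportional forms $P(x_i) = f(x_i) / \sum_j f(x_j)$ and $q_i = \sum_{j \le i} P(x_j)$. All function names are hypothetical, the crossover point and the mutated bit are assumed to be chosen uniformly at random, and the fitness values are assumed to be non-negative with a positive sum.

```python
import random
from itertools import accumulate


def cumulative_probabilities(fitness):
    """S321-S322 (assumed form): selection probabilities, then their running sum."""
    total = sum(fitness)
    return list(accumulate(f / total for f in fitness))


def roulette_pick(cumulative, r):
    """S323: index of the first individual k whose cumulative probability covers r."""
    for k, q in enumerate(cumulative):
        if r <= q:
            return k
    return len(cumulative) - 1  # guard against floating-point rounding


def single_point_crossover(a, b, point):
    """Offspring takes a's bits before `point` and b's bits from `point` on."""
    return a[:point] + b[point:]


def next_generation(population, fitness, alpha, beta_m, rng):
    """S324-S325: 2M roulette draws give M pairs; each pair yields one offspring."""
    q = cumulative_probabilities(fitness)
    children = []
    for _ in range(len(population)):
        a = population[roulette_pick(q, rng.random())]
        b = population[roulette_pick(q, rng.random())]
        child = list(a)
        # Crossover is triggered with probability alpha.
        if len(a) > 1 and rng.random() < alpha:
            child = single_point_crossover(a, b, rng.randint(1, len(a) - 1))
        # Mutation is triggered with probability beta_m and flips one bit.
        if rng.random() < beta_m:
            i = rng.randrange(len(child))
            child[i] = 1 - child[i]
        children.append(child)
    return children
```

The fitness function itself is left to the caller, since the claims do not define it; in practice `next_generation` would be iterated and the fittest code of the final population decoded.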
6. The text feature extraction method based on feature coding according to claim 4, characterized in that, after "calculating the fitness of each individual in the gene population" in step S31, a gene mutation rate may further be calculated to improve the efficiency of the genetic algorithm:
where $\beta_m$ is the dynamically changing gene mutation rate provided for different fitness distributions within the population, $\beta$ is the individual fitness, $\beta_{max}$ is the maximum fitness in the population, $\beta_{avg}$ is the average fitness of the population, and $k_1$ and $k_2$ are constants.
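The formula of claim 6 is not reproduced in this text. The sketch below therefore assumes one widely used adaptive form that is consistent with the listed symbols: individuals at or above the average fitness receive a rate of $k_1 (\beta_{max} - \beta) / (\beta_{max} - \beta_{avg})$, and all others receive the constant $k_2$. This form is an assumption, not the claimed formula, and the function name `adaptive_mutation_rate` is hypothetical.

```python
def adaptive_mutation_rate(beta, beta_max, beta_avg, k1, k2):
    """Assumed adaptive rule: fitter individuals are mutated less often."""
    # Below-average individuals, and degenerate populations in which every
    # individual has the same fitness, fall back to the constant rate k2.
    if beta < beta_avg or beta_max == beta_avg:
        return k2
    # At or above average: the rate shrinks linearly to 0 at the maximum fitness.
    return k1 * (beta_max - beta) / (beta_max - beta_avg)
```

Under this assumed rule the best individual of the population is never mutated, while poorly adapted individuals keep a fixed exploration rate.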
7. A text feature extraction system based on feature coding, characterized by comprising an acquisition module, a preprocessing module, a feature coding module, a feature screening module, a decoding module, and an output module;
the acquisition module is configured to acquire and input a text;
the preprocessing module is configured to preprocess the acquired text to obtain a word candidate feature sequence;
the feature coding module is configured to generate M binary codes based on the word candidate feature sequence, M being a positive integer;
the feature screening module is configured to screen the M binary codes using a genetic algorithm to obtain an optimal binary code;
the decoding module is configured to decode the optimal binary code to obtain the corresponding optimal word feature sequence;
the output module is configured to output the optimal word feature sequence as the extracted text feature.
8. A storage device in which a plurality of programs are stored, characterized in that the programs are adapted to be loaded and executed by a processor to implement the text feature extraction method based on feature coding according to any one of claims 1-6.
9. A processing device, comprising:
a processor adapted to execute programs; and
a storage device adapted to store a plurality of programs;
characterized in that the programs are adapted to be loaded and executed by the processor to implement:
the text feature extraction method based on feature coding according to any one of claims 1-6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910205999.6A CN109977227B (en) | 2019-03-19 | 2019-03-19 | Text feature extraction method, system and device based on feature coding |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910205999.6A CN109977227B (en) | 2019-03-19 | 2019-03-19 | Text feature extraction method, system and device based on feature coding |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109977227A (en) | 2019-07-05 |
CN109977227B (en) | 2021-06-22 |
Family
ID=67079264
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910205999.6A Active CN109977227B (en) | 2019-03-19 | 2019-03-19 | Text feature extraction method, system and device based on feature coding |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109977227B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116738354A (en) * | 2023-08-15 | 2023-09-12 | 国网江西省电力有限公司信息通信分公司 | Method and system for detecting abnormal behavior of electric power Internet of things terminal |
Citations (23)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20020133297A1 (en) * | 2001-01-17 | 2002-09-19 | Jinn-Moon Yang | Ligand docking method using evolutionary algorithm |
WO2004053766A1 (en) * | 2002-12-06 | 2004-06-24 | London Health Sciences Centre Research Inc. | Reverse translation of protein sequences to nucleotide code |
US20070031042A1 (en) * | 2005-08-02 | 2007-02-08 | Edmundo Simental | Efficient imagery exploitation employing wavelet-based feature indices |
US20070061319A1 (en) * | 2005-09-09 | 2007-03-15 | Xerox Corporation | Method for document clustering based on page layout attributes |
CN101068108A (en) * | 2007-06-18 | 2007-11-07 | 北京中星微电子有限公司 | Orthogonal mirror image filter group realizing method and device based on genetic algorithm |
CN101246555A (en) * | 2008-03-11 | 2008-08-20 | 中国科学技术大学 | Characteristic optimization method based on coevolution for foot passenger detection |
CN101256648A (en) * | 2008-04-09 | 2008-09-03 | 永凯软件技术(上海)有限公司 | Genetic operation operator based on indent structure for producing quening system |
CN101271572A (en) * | 2008-03-28 | 2008-09-24 | 西安电子科技大学 | Image segmentation method based on immunity clone selection clustering |
CN101315557A (en) * | 2008-06-25 | 2008-12-03 | 浙江大学 | Propylene polymerization production process optimal soft survey instrument and method based on genetic algorithm optimization BP neural network |
CN101436345A (en) * | 2008-12-19 | 2009-05-20 | 天津市市政工程设计研究院 | System for forecasting harbor district road traffic requirement based on TransCAD macroscopic artificial platform |
CN101533423A (en) * | 2009-04-14 | 2009-09-16 | 江苏大学 | Method for optimizing structure of metallic-plastic composite material |
CN101587545A (en) * | 2009-06-19 | 2009-11-25 | 中国农业大学 | Method and system for selecting feature of cotton heterosexual fiber target image |
CN101599078A (en) * | 2009-07-10 | 2009-12-09 | 腾讯科技(深圳)有限公司 | A kind of method of text retrieval and device |
CN101710333A (en) * | 2009-11-26 | 2010-05-19 | 西北工业大学 | Network text segmenting method based on genetic algorithm |
CN101814086A (en) * | 2010-02-05 | 2010-08-25 | 山东师范大学 | Chinese WEB information filtering method based on fuzzy genetic algorithm |
CN101882791A (en) * | 2010-07-13 | 2010-11-10 | 东北电力大学 | Controllable serial capacitor optimal configuration method capable of improving available transmission capacity |
CN101968853A (en) * | 2010-10-15 | 2011-02-09 | 吉林大学 | Improved immune algorithm based expression recognition method for optimizing support vector machine parameters |
CN104063472A (en) * | 2014-06-30 | 2014-09-24 | 电子科技大学 | KNN text classifying method for optimizing training sample set |
CN104657472A (en) * | 2015-02-13 | 2015-05-27 | 南京邮电大学 | EA (Evolutionary Algorithm)-based English text clustering method |
CN105005792A (en) * | 2015-07-13 | 2015-10-28 | 河南科技大学 | KNN algorithm based article translation method |
CN105740227A (en) * | 2016-01-21 | 2016-07-06 | 云南大学 | Genetic simulated annealing method for solving new words in Chinese segmentation |
CN105787088A (en) * | 2016-03-14 | 2016-07-20 | 南京理工大学 | Text information classifying method based on segmented encoding genetic algorithm |
CN106971170A (en) * | 2017-04-07 | 2017-07-21 | 西北工业大学 | A kind of method for carrying out target identification using one-dimensional range profile based on genetic algorithm |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116738354A (en) * | 2023-08-15 | 2023-09-12 | 国网江西省电力有限公司信息通信分公司 | Method and system for detecting abnormal behavior of electric power Internet of things terminal |
CN116738354B (en) * | 2023-08-15 | 2023-12-08 | 国网江西省电力有限公司信息通信分公司 | Method and system for detecting abnormal behavior of electric power Internet of things terminal |
Also Published As
Publication number | Publication date |
---|---|
CN109977227B (en) | 2021-06-22 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
Akhtar et al. | Feature selection and ensemble construction: A two-step method for aspect based sentiment analysis | |
Day et al. | Deep learning for financial sentiment analysis on finance news providers | |
Demir et al. | Improving named entity recognition for morphologically rich languages using word embeddings | |
CN104951548B (en) | A kind of computational methods and system of negative public sentiment index | |
CN107844559A (en) | A kind of file classifying method, device and electronic equipment | |
CN102123172B (en) | Implementation method of Web service discovery based on neural network clustering optimization | |
CN112632228A (en) | Text mining-based auxiliary bid evaluation method and system | |
Ghosh et al. | Natural language processing fundamentals: build intelligent applications that can interpret the human language to deliver impactful results | |
Jerzak et al. | An improved method of automated nonparametric content analysis for social science | |
Sharma et al. | SentiDraw: Using star ratings of reviews to develop domain specific sentiment lexicon for polarity determination | |
US20120089620A1 (en) | Extracting data | |
CN114969275A (en) | Conversation method and system based on bank knowledge graph | |
Huang et al. | Text classification with document embeddings | |
Ekbal et al. | Simultaneous feature and parameter selection using multiobjective optimization: application to named entity recognition | |
Abid et al. | Semi-automatic classification and duplicate detection from human loss news corpus | |
CN116629258B (en) | Structured analysis method and system for judicial document based on complex information item data | |
CN109977227A (en) | Text feature, system, device based on feature coding | |
Liu | Automatic argumentative-zoning using word2vec | |
Makrehchi et al. | Text classification using small number of features | |
CN113312903B (en) | Method and system for constructing word stock of 5G mobile service product | |
Nevezhin et al. | Topic-driven ensemble for online advertising generation | |
CN114579729A (en) | FAQ question-answer matching method and system fusing multi-algorithm model | |
CN112818215A (en) | Product data processing method, device, equipment and storage medium | |
Alharithi | Performance analysis of machine learning approaches in automatic classification of Arabic language | |
Yuan et al. | Big Data Aspect‐Based Opinion Mining Using the SLDA and HME‐LDA Models |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||