CN109977227B - Text feature extraction method, system and device based on feature coding - Google Patents

Text feature extraction method, system and device based on feature coding Download PDF

Info

Publication number
CN109977227B
Authority
CN
China
Prior art keywords
feature
text
fitness
individual
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910205999.6A
Other languages
Chinese (zh)
Other versions
CN109977227A (en)
Inventor
张旭
熊彦钧
何赛克
刘春阳
郑晓龙
陈志鹏
曾大军
彭鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation of Chinese Academy of Science
National Computer Network and Information Security Management Center
Original Assignee
Institute of Automation of Chinese Academy of Science
National Computer Network and Information Security Management Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science, National Computer Network and Information Security Management Center filed Critical Institute of Automation of Chinese Academy of Science
Priority to CN201910205999.6A priority Critical patent/CN109977227B/en
Publication of CN109977227A publication Critical patent/CN109977227A/en
Application granted granted Critical
Publication of CN109977227B publication Critical patent/CN109977227B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/004 Artificial life, i.e. computing arrangements simulating life
    • G06N3/006 Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/12 Computing arrangements based on biological models using genetic models
    • G06N3/126 Evolutionary algorithms, e.g. genetic algorithms or genetic programming

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Biology (AREA)
  • Genetics & Genomics (AREA)
  • Physiology (AREA)
  • Medicines Containing Antibodies Or Antigens For Use As Internal Diagnostic Agents (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention belongs to the field of information classification, and particularly relates to a text feature extraction method, system and device based on feature coding, aiming at solving the problems of high computational complexity, low classification efficiency and low precision in text feature extraction. The method comprises the following steps: preprocessing the acquired text to obtain a word candidate feature sequence; generating a plurality of binary codes based on the word candidate feature sequence; screening the binary codes with a genetic algorithm to obtain the optimal binary code; and decoding the optimal binary code to obtain and output the optimal word feature sequence. The invention converts a series of candidate features into an easily processed coding sequence and uses the automatic screening capability of the genetic algorithm to perform a globally optimal selection over the features, thereby effectively extracting a minimal effective feature set.

Description

Text feature extraction method, system and device based on feature coding
Technical Field
The invention belongs to the field of information classification, and particularly relates to a text feature extraction method, system and device based on feature coding.
Background
With the rapid development and popularization of internet technology, making full use of the ever-growing mass of data has become an urgent task for large internet companies and related research institutions. Among these data, text data is the most voluminous. In the use of text data, classification accounts for a large share; it refers to the process of automatically determining the category of a text from its content under a given classification system. Text classification currently has very wide application scenarios: for example, the large number of articles on a news website can be automatically classified by topic based on their content; the reviews users leave after transactions on e-commerce websites can be classified; spam can be identified among many emails by text classification techniques, filtering the junk advertising messages that mailboxes frequently receive; and the large number of posts that media platforms receive every day can be automatically reviewed with text classification techniques, so that illegal content such as junk advertisements, pornography and violence is flagged.
Until the 1990s, the dominant text classification methods were heuristic: with the help of domain experts, a large number of inference rules were defined for each category, and a document satisfying those rules was judged to belong to that category. However, this approach has significant disadvantages: the quality of classification depends largely on the quality of the rules; a large number of experts are required to formulate the rules; and the approach does not generalize, since different fields need completely different classification systems, causing a huge waste of development resources and capital.
Current machine learning techniques are well suited to solving the above problems. Machine learning is grounded in statistical theory: algorithms give a machine a human-like ability to learn automatically, that is, rules are obtained by statistical analysis of known training data and are then used for predictive analysis of unknown data. The basic process of applying machine learning to text classification is as follows: labeling, in which a batch of documents is accurately classified by hand to serve as the training set (the material for machine learning); training, in which the computer mines rules that classify effectively from these documents and generates a classifier; and classification, in which the generated classifier is applied to the document set to be classified to obtain the classification result of each document.
Feature extraction is a crucial link when machine learning is used for text classification. Most current Chinese text classification systems use words as feature items, called feature words. Feature words serve as an intermediate representation of a document and are used to compute similarity between documents and between a document and a user target. If all words are used as feature items, the dimension of the feature vector becomes too high, which puts great pressure on the runtime performance of the classification system and reduces the timeliness of text classification. Therefore, an effective feature dimension reduction method that lowers computational complexity while improving classification efficiency and precision is urgently needed in this field.
Disclosure of Invention
In order to solve the above problems in the prior art, that is, the problems of high operation complexity, low classification efficiency and low precision in text feature extraction, the present invention provides a text feature extraction method based on feature coding, which includes:
step S10, acquiring a word candidate characteristic sequence of an input text;
step S20, generating M binary codes based on the word candidate characteristic sequence, wherein M is a positive integer;
step S30, screening the M binary codes by adopting a genetic algorithm to obtain the optimal binary code;
and step S40, decoding the optimal binary code to obtain a corresponding optimal word feature sequence as the extracted text feature and outputting the extracted text feature.
In some preferred embodiments, the step of "obtaining a word candidate feature sequence of the input text" in step S10 includes the steps of:
step S11, dividing the input text into words by text word segmentation algorithm to form a text word set;
step S12, carrying out weight calculation on each word in the text word set to obtain the weight corresponding to the text word set;
and step S13, selecting words with preset number as word candidate characteristic sequences according to the sequence of the weights from big to small.
In some preferred embodiments, step S20, "generate M binary codes based on the word candidate feature sequence", includes the steps of:
step S21, randomly arranging words in the word candidate characteristic sequence to obtain M random characteristic sequences;
and step S22, generating M binary codes with the length same as that of the word candidate characteristic sequence from the M random characteristic sequences.
In some preferred embodiments, the step S30 "screening the M binary codes by using genetic algorithm to obtain the optimal binary code" includes the steps of:
step S31, taking the M binary codes as an M group gene population, and calculating the fitness of each individual in the M group gene population;
and step S32, obtaining the optimal binary code by adopting a genetic algorithm method based on the fitness of each individual in the M groups of gene populations.
In some preferred embodiments, the step S32 "obtaining the optimal binary code by using genetic algorithm based on the fitness of each individual in the M-group gene population" comprises the following steps:
step S321, calculating the probability that each individual in the M groups of gene populations is inherited into the next generation population:
P(x_i) = f(x_i) / Σ_{j=1}^{M} f(x_j)
wherein f(x_i) is the fitness of the i-th individual of the gene population and f(x_j) is the fitness of the j-th individual;
step S322, calculating the cumulative probability of each individual according to the probability of each individual being inherited to the next generation group:
q_i = Σ_{j=1}^{i} P(x_j)
step S323, generating a uniformly distributed pseudo-random number r in the interval [0, 1]; if r < q_1, individual 1 is selected; otherwise individual k is selected such that q_{k-1} < r ≤ q_k holds;
step S324, repeatedly executing step S323 2M times to select M pairs of individuals, and triggering a single-point crossover on the two individuals of each of the M pairs at crossover rate α to obtain offspring binary codes;
and step S325, triggering a bit in the offspring binary code at mutation rate β_m and flipping its binary value (0 to 1 or 1 to 0) to obtain the optimal binary code.
In some preferred embodiments, after "calculating the fitness of each individual in the M groups of gene populations" in step S31, a gene mutation rate can also be calculated to improve the efficiency of the genetic algorithm:
β_m = k_1 (β_max - β) / (β_max - β_avg), if β ≥ β_avg;  β_m = k_2, if β < β_avg
wherein β_m is the dynamically changing gene mutation rate provided for the different distributions of fitness in the population, β is the fitness of the individual, β_max is the maximum fitness in the population, β_avg is the average fitness of the population, and k_1, k_2 are constants.
In a second aspect, the invention provides a text feature extraction system based on feature coding, which comprises an acquisition module, a preprocessing module, a feature coding module, a feature screening module, a decoding module and an output module;
the acquisition module is configured to acquire and input a text;
the preprocessing module is configured to preprocess the acquired text to obtain a word candidate characteristic sequence;
the feature coding module is configured to generate M binary codes based on the word candidate feature sequence, wherein M is a positive integer;
the characteristic screening module is configured to screen the M binary codes by adopting a genetic algorithm to obtain optimal binary codes;
the decoding module is configured to decode the optimal binary code to obtain a corresponding optimal word feature sequence;
and the output module is configured to take the optimal word feature sequence as the extracted text feature and output the extracted text feature.
In a third aspect of the present invention, a storage device is provided, in which a plurality of programs are stored, the programs being suitable for being loaded and executed by a processor to implement the above-mentioned feature-coding-based text feature extraction method.
In a fourth aspect of the present invention, a processing apparatus is provided, which includes a processor, a storage device; the processor is suitable for executing various programs; the storage device is suitable for storing a plurality of programs; the program is suitable to be loaded and executed by a processor to implement the above-mentioned feature encoding-based text feature extraction method.
The invention has the beneficial effects that:
(1) The text feature extraction method based on feature coding of the present invention realizes the selection of text features in combination with a genetic algorithm. It can effectively overcome the limitations of traditional text feature selection, improves the accuracy of the text features as far as possible within a controllable range, achieves feature dimension reduction to the greatest extent, and effectively improves the efficiency of feature use.
(2) Aiming at the defects of high redundancy and low precision of the features obtained by existing text feature extraction methods, the invention provides a feature screening method based on feature coding and a genetic algorithm.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 is a schematic flow chart of a feature-coding-based text feature extraction method according to the present invention;
FIG. 2 is a schematic flow chart of a candidate sequence obtained by text preprocessing of the feature-coding-based text feature extraction method according to the present invention;
FIG. 3 is a schematic diagram of a feature encoding flow of the text feature extraction method based on feature encoding according to the present invention;
FIG. 4 is a schematic diagram of a genetic algorithm of the feature-coding-based text feature extraction method according to the present invention;
FIG. 5 is a diagram illustrating an exemplary cross-exchange process of binary encoding according to an embodiment of the feature-coding-based text feature extraction method of the present invention;
fig. 6 is a diagram illustrating a binary coding mutation according to an embodiment of the method for extracting text features based on feature coding.
Detailed Description
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
The invention provides a text feature extraction method based on feature coding: a binary text feature coding method combined with a genetic algorithm to realize the selection of text features. It can effectively overcome the limitations of traditional text feature selection, improves the accuracy of the text features as far as possible within a controllable range, achieves feature dimension reduction to the greatest extent, and effectively improves the efficiency of feature use.
The invention relates to a text feature extraction method based on feature coding, which comprises the following steps:
step S10, acquiring a word candidate characteristic sequence of an input text;
step S20, generating M binary codes based on the word candidate characteristic sequence, wherein M is a positive integer;
step S30, screening the M binary codes by adopting a genetic algorithm to obtain the optimal binary code;
and step S40, decoding the optimal binary code to obtain a corresponding optimal word feature sequence as the extracted text feature and outputting the extracted text feature.
In order to more clearly describe the feature-coding-based text feature extraction method of the present invention, each step in the embodiment of the method of the present invention is described in detail below with reference to fig. 1.
The text feature extraction method based on feature coding of an embodiment of the invention comprises steps S10-S40, and the steps are described in detail as follows:
step S10, a word candidate feature sequence of the input text is obtained. As shown in fig. 2, which is a schematic flow diagram of obtaining a candidate sequence by text preprocessing of the feature coding-based text feature extraction method of the present invention, the text is first segmented into words, then word weight calculation is performed, and finally a candidate feature sequence is generated, specifically as follows:
and step S11, dividing the input text into words by adopting a text word segmentation algorithm to form a text word set.
Text word segmentation is a basic step of text processing and a basic module of human-machine natural language interaction. Unlike English text, a Chinese sentence contains no explicit word boundaries, so Chinese natural language processing usually begins with word segmentation, and the quality of segmentation directly affects downstream modules such as part-of-speech tagging and syntactic parsing. Word segmentation is only a tool; different scenarios impose different requirements. In human-machine natural language interaction, a mature Chinese word segmentation algorithm can achieve a better natural language processing effect and help a computer understand complex Chinese language.
The text word segmentation algorithm comprises the following steps: dictionary-based word segmentation algorithms such as a forward maximum matching method, a reverse maximum matching method, a bidirectional matching word segmentation method and the like; statistical-based machine learning algorithms such as Hidden Markov Model (HMM), Conditional Random Field (CRF), deep learning algorithms, and the like; there are also word segmentation methods based on neural networks, which are not described one by one here.
And step S12, performing weight calculation on each word in the text word set to obtain the weight corresponding to the text word set.
Mature methods exist for word weight calculation; the embodiment of the invention adopts the commonly used TF-IDF (Term Frequency-Inverse Document Frequency) method. TF-IDF is a statistical method for assessing how important a word is to a document in a collection or corpus. The importance of a word increases in proportion to the number of times it appears in the document, but decreases in inverse proportion to its frequency in the corpus.
And step S13, selecting words with preset number as word candidate characteristic sequences according to the sequence of the weights from big to small.
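The following sketch (illustrative only, not part of the patent text) shows one way steps S11-S13 could be realized in Python; the function name preprocess, the parameter top_n and the smoothed IDF variant are assumptions introduced for this example, and the documents are assumed to be already segmented into word lists by one of the word segmentation algorithms named above. The weight dictionary returned alongside the candidate sequence is reused by the fitness sketch given further below.

import math
from collections import Counter

def preprocess(documents, top_n=100):
    """Steps S11-S13 (sketch): weight segmented words by TF-IDF and keep the
    top_n heaviest words as the word candidate feature sequence."""
    n_docs = len(documents)
    # Document frequency: number of documents containing each word (step S12).
    df = Counter()
    for words in documents:
        df.update(set(words))
    # Aggregate a smoothed TF-IDF weight for each word over the corpus.
    weights = Counter()
    for words in documents:
        tf = Counter(words)
        for word, count in tf.items():
            tf_w = count / len(words)
            idf_w = math.log((1 + n_docs) / (1 + df[word])) + 1.0
            weights[word] += tf_w * idf_w
    # Step S13: words ordered by weight, largest first, truncated to top_n.
    top = weights.most_common(top_n)
    return [w for w, _ in top], dict(top)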
And step S20, generating M binary codes based on the word candidate characteristic sequence, wherein M is a positive integer. As shown in fig. 3, which is a schematic diagram of a feature encoding flow of the feature encoding-based text feature extraction method of the present invention, a random feature sequence is first generated, and then a plurality of groups of random binary codes are generated according to the sequence, specifically as follows:
and step S21, randomly arranging the words in the word candidate characteristic sequence to obtain M random characteristic sequences.
And step S22, generating M binary codes with the length same as that of the word candidate characteristic sequence from the M random characteristic sequences.
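A minimal sketch of steps S21-S22, together with the inverse decoding used later in step S40, is given below; note that the patent first builds M randomly arranged feature sequences, whereas this simplified example keeps a fixed word order and directly draws M random bit strings of the same length, a simplification rather than the patented procedure. Function names are illustrative.

import random

def generate_binary_codes(candidate_features, m):
    """Steps S21-S22 (simplified sketch): M random binary codes, one bit per
    candidate word, each code as long as the candidate feature sequence."""
    length = len(candidate_features)
    return [[random.randint(0, 1) for _ in range(length)] for _ in range(m)]

def decode(binary_code, candidate_features):
    """Step S40 (sketch): a set bit means the corresponding word is kept."""
    return [w for w, bit in zip(candidate_features, binary_code) if bit == 1]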
Step S30, screening the M binary codes by using a genetic algorithm to obtain an optimal binary code, as shown in fig. 4, which is a schematic flow chart of the genetic algorithm of the feature-coding-based text feature extraction method of the present invention, and specifically includes the following steps:
step S31, taking the M binary codes as an M group gene population, and calculating the fitness of each individual in the M group gene population;
and step S32, obtaining the optimal binary code by adopting a genetic algorithm method based on the fitness of each individual in the M groups of gene populations.
In the preferred embodiment of the invention, the optimal binary code is selected using roulette selection.
Step S321, calculating the probability that each individual in the M groups of gene populations is inherited into the next generation population, as shown in formula (1):
P(x_i) = f(x_i) / Σ_{j=1}^{M} f(x_j)    (1)
wherein f(x_i) is the fitness of the i-th individual of the gene population and f(x_j) is the fitness of the j-th individual;
step S322, calculating the cumulative probability of each individual according to the probability that each individual is inherited to the next generation group, as shown in formula (2):
q_i = Σ_{j=1}^{i} P(x_j)    (2)
step S323, generating a uniformly distributed pseudo-random number r in the interval [0, 1]; if r < q_1, individual 1 is selected; otherwise individual k is selected such that q_{k-1} < r ≤ q_k holds.
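Formulas (1) and (2) together with the selection rule of step S323 amount to standard roulette-wheel selection; a small sketch (illustrative, with names chosen for this example) follows.

import random

def roulette_select(population, fitness_values):
    """Steps S321-S323 (sketch): select one individual by roulette wheel."""
    total = sum(fitness_values)
    if total <= 0:                      # degenerate case: fall back to a uniform choice
        return random.choice(population)
    # Formula (1): probability of each individual being inherited.
    probs = [f / total for f in fitness_values]
    # Formula (2): cumulative probabilities q_1 <= q_2 <= ... <= q_M.
    cumulative, acc = [], 0.0
    for p in probs:
        acc += p
        cumulative.append(acc)
    # Step S323: draw r uniformly in [0, 1] and return the first individual k
    # whose cumulative probability satisfies r <= q_k.
    r = random.random()
    for individual, q in zip(population, cumulative):
        if r <= q:
            return individual
    return population[-1]               # guard against floating-point round-off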
Step S324, repeatedly executing step S323 2M times to select M pairs of individuals, and triggering a single-point crossover on the two individuals of each of the M pairs at crossover rate α to obtain offspring binary codes. As shown in fig. 5, which is an exemplary diagram of a binary code crossover process according to an embodiment of the feature-coding-based text feature extraction method of the present invention, a pair of binary codes is first copied, the copies are then crossed over at a single point, and one of the exchanged binary codes is randomly retained.
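A single-point crossover consistent with the Fig. 5 description could be sketched as follows; the crossover rate default and the choice of returning one offspring are illustrative assumptions.

import random

def single_point_crossover(parent_a, parent_b, crossover_rate=0.8):
    """Step S324 (sketch): copy the parents, and with probability crossover_rate
    (the rate alpha) swap their tails after a random cut point; one of the two
    exchanged codes is then kept at random, as in the Fig. 5 description."""
    child_a, child_b = parent_a[:], parent_b[:]
    if random.random() < crossover_rate and len(parent_a) > 1:
        cut = random.randint(1, len(parent_a) - 1)
        child_a[cut:], child_b[cut:] = parent_b[cut:], parent_a[cut:]
    return random.choice([child_a, child_b])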
Step S325, triggering a bit in the offspring binary code at mutation rate β_m and flipping its binary value (0 to 1 or 1 to 0) to obtain the optimal binary code. As shown in fig. 6, which is an exemplary diagram of a binary code mutation according to an embodiment of the feature-coding-based text feature extraction method of the present invention, the binary codes before and after mutation differ only at the mutation point, where the bit takes the opposite value; the remaining bits are identical.
After "calculating the fitness of each individual in the group M of gene populations" in step S31, the gene variation rate may also be calculated, so as to improve the efficiency of the genetic algorithm, as shown in formula (3):
β_m = k_1 (β_max - β) / (β_max - β_avg), if β ≥ β_avg;  β_m = k_2, if β < β_avg    (3)
wherein β_m is the dynamically changing gene mutation rate provided for the different distributions of fitness in the population, β is the fitness of the individual, β_max is the maximum fitness in the population, β_avg is the average fitness of the population, and k_1, k_2 are constants.
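The adaptive mutation rate of formula (3) as reconstructed above, together with the bit-flip mutation of step S325, could be sketched as follows; the k1 and k2 defaults are illustrative, and the sketch flips each bit independently with probability beta_m, whereas the Fig. 6 description suggests a single flipped bit per mutation event.

import random

def adaptive_mutation_rate(beta, beta_max, beta_avg, k1=0.1, k2=0.1):
    """Formula (3) as reconstructed (an assumption where the original image is
    unavailable): fitter individuals receive a smaller mutation rate, and
    below-average individuals mutate at the constant rate k2."""
    if beta_max == beta_avg:            # degenerate population: fall back to k2
        return k2
    if beta >= beta_avg:
        return k1 * (beta_max - beta) / (beta_max - beta_avg)
    return k2

def mutate(binary_code, mutation_rate):
    """Step S325 (sketch): flip bits (0-1 replacement) at the mutation rate."""
    return [bit ^ 1 if random.random() < mutation_rate else bit
            for bit in binary_code]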
And step S40, decoding the optimal binary code to obtain a corresponding optimal word feature sequence, and outputting the optimal word feature sequence as the extracted text feature.
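Putting the pieces together, an end-to-end sketch of steps S10-S40 using the helper functions introduced above might look as follows; the population size m, the number of generations and the other parameters are illustrative choices, not values prescribed by the patent.

def extract_features(documents, m=30, generations=50, top_n=100,
                     crossover_rate=0.8):
    """End-to-end sketch of the feature-coding-based extraction method."""
    candidates, word_weights = preprocess(documents, top_n)        # step S10
    population = generate_binary_codes(candidates, m)              # step S20

    for _ in range(generations):                                   # step S30
        scores = [fitness(ind, candidates, word_weights) for ind in population]
        beta_max, beta_avg = max(scores), sum(scores) / len(scores)
        next_population = []
        for _ in range(m):
            parent_a = roulette_select(population, scores)
            parent_b = roulette_select(population, scores)
            child = single_point_crossover(parent_a, parent_b, crossover_rate)
            rate = adaptive_mutation_rate(
                fitness(child, candidates, word_weights), beta_max, beta_avg)
            next_population.append(mutate(child, rate))
        population = next_population

    scores = [fitness(ind, candidates, word_weights) for ind in population]
    best = population[scores.index(max(scores))]
    return decode(best, candidates)                                # step S40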
The text feature extraction system based on feature coding of the second embodiment of the invention comprises an acquisition module, a preprocessing module, a feature coding module, a feature screening module, a decoding module and an output module;
the acquisition module is configured to acquire and input a text;
the preprocessing module is configured to preprocess the acquired text to obtain a word candidate characteristic sequence;
the feature coding module is configured to generate M binary codes based on the word candidate feature sequence, wherein M is a positive integer;
the characteristic screening module is configured to screen the M binary codes by adopting a genetic algorithm to obtain optimal binary codes;
the decoding module is configured to decode the optimal binary code to obtain a corresponding optimal word feature sequence;
and the output module is configured to take the optimal word feature sequence as the extracted text feature and output the extracted text feature.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process and related description of the system described above may refer to the corresponding process in the foregoing method embodiments, and will not be described herein again.
It should be noted that, the text feature extraction system based on feature coding provided in the foregoing embodiment is only illustrated by the division of the foregoing functional modules, and in practical applications, the above functions may be allocated to different functional modules according to needs, that is, the modules or steps in the embodiment of the present invention are further decomposed or combined, for example, the modules in the foregoing embodiment may be combined into one module, or may be further split into multiple sub-modules, so as to complete all or part of the above described functions. The names of the modules and steps involved in the embodiments of the present invention are only for distinguishing the modules or steps, and are not to be construed as unduly limiting the present invention.
A storage device according to a third embodiment of the present invention stores a plurality of programs, and the programs are suitable for being loaded and executed by a processor to realize the above-mentioned feature-coding-based text feature extraction method.
A processing apparatus according to a fourth embodiment of the present invention includes a processor, a storage device; a processor adapted to execute various programs; a storage device adapted to store a plurality of programs; the program is suitable to be loaded and executed by a processor to implement the above-mentioned feature encoding-based text feature extraction method.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes and related descriptions of the storage device and the processing device described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
Those of skill in the art would appreciate that the various illustrative modules and method steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that programs corresponding to the software modules and method steps may be located in random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. To clearly illustrate this interchangeability of electronic hardware and software, various illustrative components and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as electronic hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The terms "comprises," "comprising," or any other similar term are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
So far, the technical solutions of the present invention have been described in connection with the preferred embodiments shown in the drawings, but it is easily understood by those skilled in the art that the scope of the present invention is obviously not limited to these specific embodiments. Equivalent changes or substitutions of related technical features can be made by those skilled in the art without departing from the principle of the invention, and the technical scheme after the changes or substitutions can fall into the protection scope of the invention.

Claims (7)

1. A text feature extraction method based on feature coding is characterized by comprising the following steps:
step S10, acquiring a word candidate characteristic sequence of an input text;
step S20, generating M binary codes based on the word candidate characteristic sequence, wherein M is a positive integer;
step S30, taking the M binary codes as an M group gene population, calculating the fitness of each individual in the M group gene population, and obtaining the optimal binary codes by adopting a genetic algorithm based on the fitness of each individual in the M group gene population;
step S40, decoding the optimal binary code to obtain a corresponding optimal word feature sequence as an extracted text feature and outputting the extracted text feature;
after the fitness of each individual in the M groups of gene populations is calculated, a gene mutation rate can also be calculated:
β_m = k_1 (β_max - β) / (β_max - β_avg), if β ≥ β_avg;  β_m = k_2, if β < β_avg
wherein β_m is the dynamically changing gene mutation rate provided for the different distributions of fitness in the population, β is the fitness of the individual, β_max is the maximum fitness in the population, β_avg is the average fitness of the population, and k_1, k_2 are constants.
2. The method for extracting text features based on feature coding according to claim 1, wherein the step of "obtaining word candidate feature sequences of the input text" in step S10 comprises:
step S11, dividing the input text into words by text word segmentation algorithm to form a text word set;
step S12, carrying out weight calculation on each word in the text word set to obtain the weight corresponding to the text word set;
and step S13, selecting words with preset number as word candidate characteristic sequences according to the sequence of the weights from big to small.
3. The method for extracting text features based on feature coding according to claim 1, wherein "generating M binary codes based on the word candidate feature sequence" in step S20 includes the steps of:
step S21, randomly arranging words in the word candidate characteristic sequence to obtain M random characteristic sequences;
and step S22, generating M binary codes with the length same as that of the word candidate characteristic sequence from the M random characteristic sequences.
4. The method for extracting text features based on feature coding according to claim 1, wherein in step S30, "based on the fitness of each individual in the M groups of gene populations, an optimal binary code is obtained by using a genetic algorithm", and the steps are as follows:
step S31, calculating the probability that each individual in the M group gene population is inherited into the next generation population:
P(x_i) = f(x_i) / Σ_{j=1}^{M} f(x_j)
wherein f(x_i) is the fitness of the i-th individual of the gene population and f(x_j) is the fitness of the j-th individual;
step S32, calculating the cumulative probability of each individual according to the probability of each individual being inherited to the next generation group:
q_i = Σ_{j=1}^{i} P(x_j)
step S33, generating a uniformly distributed pseudo-random number r in the interval [0, 1]; if r < q_1, individual 1 is selected; otherwise individual k is selected such that q_{k-1} < r ≤ q_k holds;
step S34, repeating step S33 2M times to select M pairs of individuals, and triggering a single-point crossover on the two individuals of each of the M pairs at crossover rate α to obtain offspring binary codes;
and step S35, triggering a bit in the offspring binary code at mutation rate β_m and flipping its binary value (0 to 1 or 1 to 0) to obtain the optimal binary code.
5. A text feature extraction system based on feature coding is characterized by comprising an acquisition module, a preprocessing module, a feature coding module, a feature screening module, a decoding module and an output module;
the acquisition module is configured to acquire and input a text;
the preprocessing module is configured to preprocess the acquired text to obtain a word candidate characteristic sequence;
the feature coding module is configured to generate M binary codes based on the word candidate feature sequence, wherein M is a positive integer;
the characteristic screening module is configured to take the M binary codes as an M group gene population, calculate the fitness of each individual in the M group gene population, and obtain the optimal binary codes by adopting a genetic algorithm based on the fitness of each individual in the M group gene population;
the decoding module is configured to decode the optimal binary code to obtain a corresponding optimal word feature sequence;
the output module is configured to take the optimal word feature sequence as the extracted text feature and output the extracted text feature;
after the fitness of each individual in the M groups of gene populations is calculated, a gene mutation rate can also be calculated:
β_m = k_1 (β_max - β) / (β_max - β_avg), if β ≥ β_avg;  β_m = k_2, if β < β_avg
wherein β_m is the dynamically changing gene mutation rate provided for the different distributions of fitness in the population, β is the fitness of the individual, β_max is the maximum fitness in the population, β_avg is the average fitness of the population, and k_1, k_2 are constants.
6. A storage device having stored therein a plurality of programs, wherein the programs are adapted to be loaded and executed by a processor to implement the feature encoding-based text feature extraction method of any one of claims 1-4.
7. A processing apparatus, comprising
A processor adapted to execute various programs; and
a storage device adapted to store a plurality of programs;
wherein the program is adapted to be loaded and executed by a processor to perform:
the feature coding-based text feature extraction method of any one of claims 1 to 4.
CN201910205999.6A 2019-03-19 2019-03-19 Text feature extraction method, system and device based on feature coding Active CN109977227B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910205999.6A CN109977227B (en) 2019-03-19 2019-03-19 Text feature extraction method, system and device based on feature coding

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910205999.6A CN109977227B (en) 2019-03-19 2019-03-19 Text feature extraction method, system and device based on feature coding

Publications (2)

Publication Number Publication Date
CN109977227A CN109977227A (en) 2019-07-05
CN109977227B (en) 2021-06-22

Family

ID=67079264

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910205999.6A Active CN109977227B (en) 2019-03-19 2019-03-19 Text feature extraction method, system and device based on feature coding

Country Status (1)

Country Link
CN (1) CN109977227B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116738354B (en) * 2023-08-15 2023-12-08 国网江西省电力有限公司信息通信分公司 Method and system for detecting abnormal behavior of electric power Internet of things terminal

Family Cites Families (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020133297A1 (en) * 2001-01-17 2002-09-19 Jinn-Moon Yang Ligand docking method using evolutionary algorithm
WO2004053766A1 (en) * 2002-12-06 2004-06-24 London Health Sciences Centre Research Inc. Reverse translation of protein sequences to nucleotide code
US7805005B2 (en) * 2005-08-02 2010-09-28 The United States Of America As Represented By The Secretary Of The Army Efficient imagery exploitation employing wavelet-based feature indices
US20070061319A1 (en) * 2005-09-09 2007-03-15 Xerox Corporation Method for document clustering based on page layout attributes
CN101068108A (en) * 2007-06-18 2007-11-07 北京中星微电子有限公司 Orthogonal mirror image filter group realizing method and device based on genetic algorithm
CN101246555B (en) * 2008-03-11 2010-07-07 中国科学技术大学 Characteristic optimization method based on coevolution for pedestrian detection
CN101271572B (en) * 2008-03-28 2011-06-01 西安电子科技大学 Image segmentation method based on immunity clone selection clustering
CN101256648A (en) * 2008-04-09 2008-09-03 永凯软件技术(上海)有限公司 Genetic operation operator based on indent structure for producing quening system
CN101315557B (en) * 2008-06-25 2010-10-13 浙江大学 Propylene polymerization production process optimal soft survey instrument and method based on genetic algorithm optimization BP neural network
CN101436345B (en) * 2008-12-19 2010-08-18 天津市市政工程设计研究院 System for forecasting harbor district road traffic requirement based on TransCAD macroscopic artificial platform
CN101533423A (en) * 2009-04-14 2009-09-16 江苏大学 Method for optimizing structure of metallic-plastic composite material
CN101587545B (en) * 2009-06-19 2011-08-31 中国农业大学 Method and system for selecting feature of cotton heterosexual fiber target image
CN101599078B (en) * 2009-07-10 2011-04-20 腾讯科技(深圳)有限公司 Method and device for text retrieval
CN101710333B (en) * 2009-11-26 2012-07-04 西北工业大学 Network text segmenting method based on genetic algorithm
CN101814086A (en) * 2010-02-05 2010-08-25 山东师范大学 Chinese WEB information filtering method based on fuzzy genetic algorithm
CN101882791B (en) * 2010-07-13 2012-12-19 东北电力大学 Controllable serial capacitor optimal configuration method capable of improving available transmission capacity
CN101968853B (en) * 2010-10-15 2013-06-05 吉林大学 Improved immune algorithm based expression recognition method for optimizing support vector machine parameters
CN104063472B (en) * 2014-06-30 2017-02-15 电子科技大学 KNN text classifying method for optimizing training sample set
CN104657472A (en) * 2015-02-13 2015-05-27 南京邮电大学 EA (Evolutionary Algorithm)-based English text clustering method
CN105005792A (en) * 2015-07-13 2015-10-28 河南科技大学 KNN algorithm based article translation method
CN105740227B (en) * 2016-01-21 2019-05-07 云南大学 A kind of genetic simulated annealing method of neologisms in solution Chinese word segmentation
CN105787088B (en) * 2016-03-14 2018-12-07 南京理工大学 A kind of text information classification method based on segment encoding genetic algorithm
CN106971170A (en) * 2017-04-07 2017-07-21 西北工业大学 A kind of method for carrying out target identification using one-dimensional range profile based on genetic algorithm

Also Published As

Publication number Publication date
CN109977227A (en) 2019-07-05

Similar Documents

Publication Publication Date Title
CN108628971B (en) Text classification method, text classifier and storage medium for unbalanced data set
CN108363790B (en) Method, device, equipment and storage medium for evaluating comments
CN110569353A (en) Attention mechanism-based Bi-LSTM label recommendation method
CN105893609A (en) Mobile APP recommendation method based on weighted mixing
CN104361037B (en) Microblogging sorting technique and device
CN107273352B (en) Word embedding learning model based on Zolu function and training method
CN115952291B (en) Financial public opinion classification method and system based on multi-head self-attention and LSTM
CN110046356B (en) Label-embedded microblog text emotion multi-label classification method
CN110008473B (en) Medical text named entity identification and labeling method based on iteration method
CN111860981B (en) Enterprise national industry category prediction method and system based on LSTM deep learning
CN111859967A (en) Entity identification method and device and electronic equipment
CN113297351A (en) Text data labeling method and device, electronic equipment and storage medium
CN110910175A (en) Tourist ticket product portrait generation method
CN113836896A (en) Patent text abstract generation method and device based on deep learning
CN114722198A (en) Method, system and related device for determining product classification code
CN111325019A (en) Word bank updating method and device and electronic equipment
Sharma et al. Resume Classification using Elite Bag-of-Words Approach
CN109977227B (en) Text feature extraction method, system and device based on feature coding
CN113761875A (en) Event extraction method and device, electronic equipment and storage medium
Shatalov et al. Named entity recognition problem for long entities in english texts
Mehedi et al. Automatic bangla article content categorization using a hybrid deep learning model
CN111104422A (en) Training method, device, equipment and storage medium of data recommendation model
Kühl et al. Automatically quantifying customer need tweets: Towards a supervised machine learning approach
CN113361270A (en) Short text optimization topic model method oriented to service data clustering
Andrian et al. Implementation Of Naïve Bayes Algorithm In Sentiment Analysis Of Twitter Social Media Users Regarding Their Interest To Pay The Tax

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant