CN101710333A - Network text segmenting method based on genetic algorithm - Google Patents

Network text segmenting method based on genetic algorithm

Info

Publication number
CN101710333A
Authority
CN
China
Prior art keywords
text
population
vocabulary
individuality
corpus
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN200910219163A
Other languages
Chinese (zh)
Other versions
CN101710333B (en)
Inventor
蔡皖东
赵煜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
NANTONG LONGXIANG ELECTRIC EQUIPMENT CO., LTD.
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwestern Polytechnical University
Priority to CN2009102191638A
Publication of CN101710333A
Application granted
Publication of CN101710333B
Legal status: Active
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a network text segmentation method based on a genetic algorithm, used for segmenting short network texts. The method comprises the following steps: estimating the Latent Dirichlet Allocation (LDA) model corresponding to a corpus by Gibbs sampling, inferring latent topic information with the model, and representing texts by that latent topic information; then converting the text segmentation process into a multi-objective optimization process by means of a parallel genetic algorithm, and using the deeper semantic information to compute the coherence within segmentation units, the divergence among segmentation units, and the fitness function; and carrying out the genetic iteration of the segmentation process, deciding whether it terminates from the similarity among the results of multiple iterations or from an upper limit on the number of iterations, so as to obtain the globally optimal segmentation. The invention thereby improves the accuracy of segmenting short network texts.

Description

Network text segmenting method based on genetic algorithm
Technical field
The present invention relates to a network text segmentation method, in particular one based on a genetic algorithm, suitable for segmenting short network texts.
Background art
Network text segmentation is an important technique for network public-opinion monitoring and network text sentiment analysis, and helps uncover the deeper semantic information in network text.
The document "Text segmentation model based on multiple discriminant analysis, Journal of Software, 2007, 18(3), pp. 555-564" discloses a method that uses word-frequency information for text segmentation. The method adopts multiple discriminant analysis, represents text in a vector space model built from word-frequency information, and defines four global evaluation functions from three factors (the distance within segmentation units, the distance between segmentation units, and the length of segmentation units) in order to evaluate segmentation patterns globally. However, short texts on the network suffer from data sparsity and cannot supply enough word-frequency information. Moreover, word frequency is shallow semantic information: computing the similarity between segmentation units from word frequency alone reduces the accuracy of the similarity and, in turn, of the segmentation result.
Summary of the invention
To address the low accuracy of prior-art methods on short network texts, the present invention proposes a network text segmentation method based on a genetic algorithm. The Gibbs sampling method is used to estimate the Latent Dirichlet Allocation (LDA) model of a corpus, the model is used to infer the latent topic information of the target text, and the text is represented by that latent topic information. A parallel genetic algorithm then converts the segmentation process into a multi-objective optimization process: deep semantic information is used to compute the coherence within segmentation units, the divergence between segmentation units, and the fitness function; genetic iteration is carried out; and termination is decided from the similarity among the results of multiple iterations or from the upper limit on the number of iterations. This yields the globally optimal segmentation and can improve the accuracy of segmenting short network texts.
The technical solution of the invention is a network text segmentation method based on a genetic algorithm, characterized by comprising the following steps:
(a) use a web spider to collect web pages from the network; preprocess the collected pages so that only the text information is kept; classify the denoised text information with a naive Bayes text classification method; and build expansion corpora by category;
(b) cluster the expansion corpus with a hierarchical clustering method to determine the number of sub-topics, and estimate the LDA model of the corpus with the Gibbs sampling method; the parameters involved in the estimation take the empirical values α = 0.01 and β = 0.01, with a burn-in interval of 2000 and a thinning interval of 100;
(c) preprocess the text to be segmented (word segmentation, part-of-speech tagging, named entity recognition, word sense disambiguation), count the frequencies of the nouns and verbs in the text, and select the high-frequency words as the feature vocabulary of the text; then, based on HowNet, compute the similarity between the feature vocabulary of the text and the feature vocabulary of each expansion corpus, and choose the corpus with the highest similarity as the external corpus for segmentation; finally, use the Gibbs sampling method and the LDA model of that expansion corpus to infer the semantic structure information contained in the text to be segmented, which includes the type of sub-topic to which each word belongs and the probability of the word within a segmentation unit; the sub-topic types of the words are used to represent the text to be segmented: the sub-topic types of the words are counted sentence by sentence, and each sentence is expressed as a sub-topic space vector, sentence $S_j = s_{j1} s_{j2} \ldots s_{jt} \ldots s_{jT}$, where $s_{jt}$ is the frequency with which the words of sentence j belong to sub-topic t;
(d) segment the text with a parallel genetic algorithm; the algorithm uses binary encoding, the population is initialized by random number generation, and two criteria, the minimum length of a semantic paragraph and the minimum number of semantic paragraphs in the text, are used together to filter out unqualified initial individuals; the coherence within semantic paragraphs is computed by the formula

$$C_{oh} = 1 - \sum_{n=1}^{j} \frac{1}{k} \sum_{s_i \in b_n} \sum_{l=1}^{T} (s_{il} - a_{nl})^2$$

In the formula, [formula image G2009102191638D0000022 not reproduced], $|b_n|$ is the number of sentences in the n-th semantic paragraph, $a_n$ is the mean vector of that semantic paragraph, and $a_{nl}$ is the l-th component of this vector;
the divergence between semantic paragraphs is computed by the formula

$$D_{is} = \sum_{n=1}^{j} \frac{|b_n|}{k} \sum_{l=1}^{T} (a_{nl} - c_l)^2$$

In the formula, $c_l = \frac{1}{k} \sum_{i=1}^{k} s_{il}$;
The fitness value of each individual in the genetic iteration is computed from the coherence within semantic paragraphs and the divergence between semantic paragraphs, as follows:

$$F(x_i) = \begin{cases} \dfrac{\left|\{x_j \mid x_j \in P_t \wedge C_{oh}(x_i) \ge C_{oh}(x_j) \wedge D_{is}(x_i) \ge D_{is}(x_j)\}\right|}{|P_t| + 1}, & x_i \in \bar{P}_t \\[2ex] 1 + \sum\limits_{x_j \in \bar{P}_t \wedge C_{oh}(x_j) \ge C_{oh}(x_i) \wedge D_{is}(x_j) \ge D_{is}(x_i)} F(x_j), & x_i \in P_t \end{cases}$$

In the formula, $\bar{P}_t$ denotes the expansion population, which stores the optimal solutions found during the iteration;
In the population selection process, an elitism strategy is applied first: the elite individuals in the population and in the expansion population are retained and pass directly into the next generation. Roulette-wheel selection is then used to pick one individual from the population and one from the expansion population; the fitness values of the two are compared, and the one with the smaller fitness is selected for crossover and mutation;
Crossover uses the single-point method. To prevent inbreeding, crossover between the population and the expansion population is allowed only when the Hamming distance between the individuals exceeds a threshold, which is usually set to 20% of the average Hamming distance between individuals. The mutation operator is adjusted adaptively according to the population similarity, which is computed as follows:

$$Sim(P) = \frac{2 \times \sum_{i \ne j \wedge x_i, x_j \in P} Sim(x_i, x_j)}{|P| \times (|P| - 1)}$$

When the similarity exceeds the threshold for 50 consecutive rounds, the iteration ends and an individual in the expansion population is taken as the segmentation result; in the binary representation of the individual, a sentence corresponding to the digit "1" is a segmentation boundary.
The beneficial effects of the invention are as follows. The Gibbs sampling method is used to estimate the Latent Dirichlet Allocation (LDA) model of the corpus, the model is used to infer the latent topic information of the target text, and the text is represented by that information. A parallel genetic algorithm then converts segmentation into a multi-objective optimization process; deep semantic information is used to compute the coherence within segmentation units, the divergence between segmentation units, and the fitness function; genetic iteration is carried out, and termination is decided from the similarity among the results of multiple iterations or from the upper limit on the number of iterations, yielding the globally optimal segmentation. The accuracy of segmenting short network texts is thereby improved.
The accuracy of text segmentation is usually measured by precision and recall. Besides these two measures, the background art also uses the $P_\mu$ value as an evaluation criterion. In tests on 50 texts to be segmented in the above environment, the method of the invention outperforms the background art on all three measures, and in particular leads by 15% on the $P_\mu$ value.
The invention is described in detail below with reference to the drawing and an embodiment.
Description of drawings
The drawing is a flowchart of the genetic-algorithm-based network text segmentation method of the invention.
Embodiment
With reference to the drawing, this embodiment targets a text on the topic "Beijing Olympics"; its language use is standard and the text is short. The concrete segmentation steps are as follows:
Step 1: set the search topic of the web spider to vocabulary associated with the Olympics, and use the spider to collect web pages from the network. The Olympic topic vocabulary is determined in three steps: 1) manually select several texts representative of the search topic, usually 10 to 20; 2) count the frequencies of the nouns and verbs in these texts and collect the high-frequency words as the candidate topic-word set, with the word-frequency threshold set to 30; 3) manually choose 10 to 15 words from the candidate set as the topic vocabulary.
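The frequency filter in sub-step 2) can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `candidate_topic_words`, the `(word, pos)` input format, the tag names `"n"` and `"v"`, and the choice to keep words whose count is at or above the threshold are all assumptions.

```python
from collections import Counter

def candidate_topic_words(tagged_docs, min_freq=30, keep_pos=("n", "v")):
    """Count nouns and verbs over all documents and keep the frequent ones.

    tagged_docs: list of documents, each a list of (word, pos) pairs
                 (hypothetical input format; tag names are assumed).
    """
    counts = Counter(
        word
        for doc in tagged_docs
        for word, pos in doc
        if pos in keep_pos
    )
    frequent = [w for w, c in counts.items() if c >= min_freq]
    # Most frequent first; ties broken alphabetically for a stable order.
    return sorted(frequent, key=lambda w: (-counts[w], w))

docs = [[("olympics", "n"), ("hold", "v"), ("the", "u")]] * 30
docs += [[("beijing", "n")]] * 5
print(candidate_topic_words(docs))
```

The manual choice of 10 to 15 topic words in sub-step 3) would then be made from the returned list.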
Web pages are HTML documents, so the collected pages must be preprocessed and the HTML markup filtered out when the text is extracted. Besides the title and the body, a web page also contains many links that are unrelated to the body text; these useless links must also be filtered out when the page content is extracted.
A naive Bayes binary text classifier is applied to the denoised text; web pages unrelated to the topic are removed according to the classification result, and the topic corpus is built. Feature selection may use methods such as information gain (IG) or mutual information (MI). The topic corpus contains at least 1000 texts.
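As a rough illustration of the relevance filter, the sketch below implements a small multinomial naive Bayes classifier. It is not taken from the patent: the function names, the labels `"on"` and `"off"`, and the use of add-one (Laplace) smoothing are assumptions, and feature selection by IG or MI is omitted.

```python
import math
from collections import Counter

def train_nb(docs, labels):
    """docs: list of token lists; labels: parallel list of class labels."""
    vocab = {w for d in docs for w in d}
    word_counts = {c: Counter() for c in set(labels)}
    doc_counts = Counter(labels)
    for d, c in zip(docs, labels):
        word_counts[c].update(d)
    return vocab, word_counts, doc_counts

def classify_nb(model, doc):
    """Return the label with the highest log posterior for a token list."""
    vocab, word_counts, doc_counts = model
    n_docs = sum(doc_counts.values())
    best, best_score = None, -math.inf
    for c in doc_counts:
        total = sum(word_counts[c].values())
        score = math.log(doc_counts[c] / n_docs)      # class prior
        for w in doc:
            if w in vocab:                            # ignore unseen words
                # Add-one smoothing (an assumption, not stated in the text).
                score += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = c, score
    return best

train_docs = [["olympic", "games", "beijing"], ["olympic", "medal"],
              ["stock", "market"], ["market", "price"]]
train_labels = ["on", "on", "off", "off"]
model = train_nb(train_docs, train_labels)
print(classify_nb(model, ["olympic", "medal", "price"]))
```

Pages labelled off-topic would be dropped before the topic corpus is assembled.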
Step 2: estimate the LDA model of the corpus with the Gibbs sampling method. The Gibbs sampling iteration follows the formula

$$P(z_i = j \mid z_{-i}, w_i) = \frac{\dfrac{n_{-i,j}^{(w_i)} + \beta}{n_{-i,j}^{(*)} + W\beta} \cdot \dfrac{n_{-i,j}^{(d_i)} + \alpha}{n_{-i,*}^{(d_i)} + \alpha}}{\sum\limits_{j'=1}^{T} \dfrac{n_{-i,j'}^{(w_i)} + \beta}{n_{-i,j'}^{(*)} + W\beta} \cdot \dfrac{n_{-i,j'}^{(d_i)} + \alpha}{n_{-i,*}^{(d_i)} + \alpha}}$$

where $n_{-i,j}^{(w_i)}$ is the number of times the word corresponding to $w_i$ is assigned to topic j, $n_{-i,j}^{(*)}$ is the total number of words assigned to topic j, $n_{-i,j}^{(d_i)}$ is the number of words in text $d_i$ assigned to topic j, and $n_{-i,*}^{(d_i)}$ is the total number of words in text $d_i$. All of these can be obtained by counting over the text, and the counts exclude the current word $w_i$.
The Gibbs sampling process has three steps:
1) at the start of the iteration, each $z_i$ is assigned an arbitrary value from 1 to T;
2) using the formula, compute the probability that $w_i$ is assigned to each of topics 1 to T, and update the topic assignment of word $w_i$ with the topic of maximum probability, which gives the next state of the Markov chain;
3) decide whether the iteration ends from the similarity between successive Markov-chain states and the burn-in interval: the iteration ends when the similarity exceeds a threshold or the burn-in interval is reached.
In Gibbs sampling, the number of sub-topics is determined by hierarchical clustering; the other parameters take the empirical values α = 0.01 and β = 0.01; the burn-in and thinning intervals are 2000 and 100 respectively; and the iteration is run with the GibbsLDA++ tool.
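The per-word update in step 2) of the sampling process can be sketched as below. This is only an illustration under assumptions, not the GibbsLDA++ code: the function and argument names are hypothetical, W is taken to be the vocabulary size (the text does not define it), and the count lists are assumed to already exclude the current word.

```python
def topic_conditional(n_wj, n_j, n_dj, n_d, W, alpha=0.01, beta=0.01):
    """Normalized P(z_i = j | z_-i, w_i) for topics j = 0..T-1.

    n_wj[j]: times the current word type is assigned to topic j
    n_j[j]:  total words assigned to topic j
    n_dj[j]: words of the current text assigned to topic j
    n_d:     total words in the current text
    W:       vocabulary size (assumed meaning of W)
    All counts exclude the word being updated.
    """
    weights = [
        (n_wj[j] + beta) / (n_j[j] + W * beta)
        # The text-level denominator is identical for every topic,
        # so it cancels once the weights are normalized.
        * (n_dj[j] + alpha) / (n_d + alpha)
        for j in range(len(n_j))
    ]
    total = sum(weights)
    return [w / total for w in weights]

def update_topic(probs):
    """Pick the most probable topic, as step 2) of the text describes."""
    return max(range(len(probs)), key=probs.__getitem__)

probs = topic_conditional(n_wj=[3, 0], n_j=[10, 10], n_dj=[2, 2], n_d=4, W=5)
print(update_topic(probs))
```

A conventional Gibbs sampler would draw the new topic at random from `probs`; the maximum is used here only because that is what the description states.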
Step 3: preprocess the text to be segmented (word segmentation, part-of-speech tagging, named entity recognition, word sense disambiguation), count the frequencies of the nouns and verbs in the text, and select the high-frequency words as the feature vocabulary of the text. Based on HowNet, the contextual relations between sememes are used to compute the similarity between the feature vocabulary of the text and the feature vocabulary of each corpus. Because the text to be segmented is most similar to the "Beijing Olympics" expansion corpus generated in step 1, this corpus is chosen as the external corpus for segmentation.
The Gibbs sampling method and the LDA model estimated in step 2 are used to infer the semantic structure information contained in the text to be segmented; the inferred information includes the type of sub-topic to which each word belongs. The inference still uses the formula of step 2, except that in step 3 $d_i$ denotes sentence i, that is, words are counted sentence by sentence.
The sub-topic type of each word in a sentence is counted and the sub-topic space vector is constructed: sentence $S_j = s_{j1} s_{j2} \ldots s_{jt} \ldots s_{jT}$, where $s_{jt}$ is the frequency with which the words of sentence j belong to sub-topic t.
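Building the sub-topic space vectors amounts to counting topic assignments per sentence. The sketch below assumes that each sentence is already given as a list of 0-based topic indices, one per word; the function name is hypothetical.

```python
def sentence_topic_vectors(assignments, T):
    """Turn per-word topic assignments into one length-T count vector per sentence.

    assignments: list of sentences, each a list of topic indices in 0..T-1.
    """
    vectors = []
    for sentence in assignments:
        vec = [0] * T
        for topic in sentence:
            vec[topic] += 1   # one more word of this sentence in this sub-topic
        vectors.append(vec)
    return vectors

print(sentence_topic_vectors([[0, 2, 2], [1]], T=3))
```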
Step 4: segment the text with a parallel genetic algorithm. The algorithm uses binary encoding, and the population is initialized by random number generation. Two criteria, the minimum length of a semantic paragraph and the minimum number of semantic paragraphs in the text, are used together to filter out unqualified initial individuals: the minimum paragraph length is no less than 3 and the number of paragraphs is no less than 5. The coherence within semantic paragraphs is computed by the formula

$$C_{oh} = 1 - \sum_{n=1}^{j} \frac{1}{k} \sum_{s_i \in b_n} \sum_{l=1}^{T} (s_{il} - a_{nl})^2$$

In the formula, [formula image G2009102191638D0000052 not reproduced], $|b_n|$ is the number of sentences in the n-th semantic paragraph, $a_n$ is the mean vector of that semantic paragraph, and $a_{nl}$ is the l-th component of this vector.
The divergence between semantic paragraphs is computed by the formula

$$D_{is} = \sum_{n=1}^{j} \frac{|b_n|}{k} \sum_{l=1}^{T} (a_{nl} - c_l)^2$$

In the formula, $c_l = \frac{1}{k} \sum_{i=1}^{k} s_{il}$.
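The two objectives can be computed together, as in the sketch below. It rests on assumptions: $a_n$ is taken as the component-wise mean of the sentence vectors in paragraph n (the text calls it the average vector), k is taken as the total number of sentences, and $c_l$ as the mean of component l over all sentences, following the claim formula $c_l = \frac{1}{k}\sum_{i=1}^{k} s_{il}$. The function names are hypothetical.

```python
def mean_vector(rows):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def coherence_and_divergence(paragraphs):
    """paragraphs: list of semantic paragraphs, each a list of sentence topic vectors.

    Returns (Coh, Dis) for one candidate segmentation.
    """
    sentences = [s for b in paragraphs for s in b]
    k = len(sentences)                  # assumed: total number of sentences
    c = mean_vector(sentences)          # overall mean vector
    spread = 0.0
    dis = 0.0
    for b in paragraphs:
        a = mean_vector(b)              # assumed: paragraph mean vector a_n
        spread += sum((s_l - a_l) ** 2 for s in b for s_l, a_l in zip(s, a)) / k
        dis += len(b) / k * sum((a_l - c_l) ** 2 for a_l, c_l in zip(a, c))
    return 1.0 - spread, dis

clean = [[[1, 0], [1, 0]], [[0, 1], [0, 1]]]   # each paragraph on one topic
mixed = [[[1, 0], [0, 1]], [[1, 0], [0, 1]]]   # topics mixed inside paragraphs
print(coherence_and_divergence(clean), coherence_and_divergence(mixed))
```

In the toy data the topically clean split scores higher on both objectives than the mixed one, which is the behaviour the two formulas are meant to reward.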
The fitness value of the genetic algorithm is computed from the coherence within semantic paragraphs and the divergence between semantic paragraphs, as follows:

$$F(x_i) = \begin{cases} \dfrac{\left|\{x_j \mid x_j \in P_t \wedge C_{oh}(x_i) \ge C_{oh}(x_j) \wedge D_{is}(x_i) \ge D_{is}(x_j)\}\right|}{|P_t| + 1}, & x_i \in \bar{P}_t \\[2ex] 1 + \sum\limits_{x_j \in \bar{P}_t \wedge C_{oh}(x_j) \ge C_{oh}(x_i) \wedge D_{is}(x_j) \ge D_{is}(x_i)} F(x_j), & x_i \in P_t \end{cases}$$
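A direct reading of the fitness formula is sketched below, with each individual reduced to its (Coh, Dis) pair. The names `archive` (for the expansion population) and `strength_fitness` are hypothetical, and treating $P_t$ as the ordinary population is an interpretation of the formula rather than something the text states outright.

```python
def covers(a, b):
    """True when a is at least as good as b on both objectives (Coh, Dis)."""
    return a[0] >= b[0] and a[1] >= b[1]

def strength_fitness(population, archive):
    """Return (population fitness, archive fitness); smaller values are better.

    population, archive: lists of (coh, dis) pairs.
    """
    # Archive member: share of the population it covers, always below 1.
    archive_fit = [
        sum(covers(x, p) for p in population) / (len(population) + 1)
        for x in archive
    ]
    # Population member: 1 plus the fitness of every archive member covering it.
    population_fit = [
        1.0 + sum(f for x, f in zip(archive, archive_fit) if covers(x, p))
        for p in population
    ]
    return population_fit, archive_fit

pop_fit, arc_fit = strength_fitness(
    population=[(0.5, 0.2), (0.9, 0.9)],
    archive=[(0.8, 0.5)],
)
print(pop_fit, arc_fit)
```

Under this reading every archive member scores below 1 and every population member at or above 1, which matches the later rule of preferring the individual with the smaller fitness.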
In the population selection process, an elitism strategy is applied first: the individuals with the smallest fitness value in the population and in the expansion population are taken as elites and pass directly into the next generation. Next, roulette-wheel selection picks one individual from the population and one from the expansion population; their fitness values are compared, and the one with the smaller fitness is selected for crossover and mutation.
Crossover is done by single-point crossover. To prevent inbreeding, the individuals taking part in crossover must belong to different populations, and crossover between two individuals is allowed only when their Hamming distance exceeds a threshold, usually set to 20% of the average Hamming distance between individuals.
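The guarded crossover can be sketched as follows. The function names are hypothetical, and computing the average Hamming distance over all cross-population pairs is an assumption, since the text does not say which pairs enter the average.

```python
import random

def hamming(a, b):
    """Number of positions at which two equal-length bit lists differ."""
    return sum(x != y for x, y in zip(a, b))

def crossover_threshold(population, archive, ratio=0.2):
    """20% of the average Hamming distance over cross-population pairs (assumed)."""
    dists = [hamming(p, q) for p in population for q in archive]
    return ratio * sum(dists) / len(dists)

def single_point_crossover(a, b, threshold, rng=random):
    """Swap tails at one random cut point; return None if a and b are too similar."""
    if hamming(a, b) <= threshold:
        return None                      # blocked to avoid inbreeding
    cut = rng.randrange(1, len(a))       # cut strictly inside the string
    return a[:cut] + b[cut:], b[:cut] + a[cut:]

a, b = [0, 0, 0, 0], [1, 1, 1, 1]
limit = crossover_threshold([a], [b])
children = single_point_crossover(a, b, limit)
print(limit, children)
```

Here `a` would come from the population and `b` from the expansion population, reflecting the requirement that the parents belong to different populations.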
The mutation operator is adjusted adaptively according to the population similarity, which is computed as follows:

$$Sim(P) = \frac{2 \times \sum_{i \ne j \wedge x_i, x_j \in P} Sim(x_i, x_j)}{|P| \times (|P| - 1)}$$

where [formula image G2009102191638D0000057 not reproduced] and $x_i$, $x_j$ are two individuals in the population. Mutation also checks whether the mutated individual satisfies the requirements on a segmentation result, which are the same as the filtering requirements at population initialization; if it does not, a new individual is generated to replace the mutated one.
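A sketch of the population similarity and of mutation with the validity check follows. Several parts are assumptions: the pairwise measure $Sim(x_i, x_j)$ is given only as an image in the source, so the fraction of matching bits is used as a stand-in; each unordered pair is counted once so that the result is an average; and `is_valid` and `make_individual` are hypothetical callbacks supplied by the caller. How the mutation rate is derived from the similarity is not specified in the text and is left out.

```python
import random
from itertools import combinations

def pair_similarity(a, b):
    """Stand-in measure (assumption): fraction of positions where a and b agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def population_similarity(population):
    """Average pairwise similarity, counting each unordered pair once."""
    n = len(population)
    total = sum(pair_similarity(a, b) for a, b in combinations(population, 2))
    return 2 * total / (n * (n - 1))

def mutate(individual, rate, is_valid, make_individual, rng=random):
    """Flip each bit with probability `rate`; replace an invalid mutant with a new individual."""
    mutant = [bit ^ 1 if rng.random() < rate else bit for bit in individual]
    return mutant if is_valid(mutant) else make_individual()

pop = [[1, 0, 1, 0], [1, 0, 1, 0], [0, 1, 0, 1]]
print(population_similarity(pop))
```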
The similarity of the optimal individuals in the expansion population across iteration rounds is computed by the formula

[formula image G2009102191638D0000058 not reproduced]

When the similarity exceeds the threshold for 50 consecutive rounds, the iteration ends. An individual in the expansion population is taken as the segmentation result; in the binary representation of the individual, a sentence corresponding to the digit "1" is a segmentation boundary.
The accuracy of text segmentation is usually measured by precision and recall. Besides these two measures, the background art also uses the $P_\mu$ value as an evaluation criterion. In tests on 50 texts to be segmented in the above environment, the method of the invention outperforms the background art on all three measures, and in particular leads by 15% on the $P_\mu$ value.
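Decoding an individual, checking the initialization constraints, and testing the stopping rule can be sketched as below. The function names are hypothetical, and one point is an assumption: a sentence marked "1" is treated as the last sentence of its paragraph, since the text only says that such a sentence is a boundary. The optional iteration cap reflects the upper limit on iterations mentioned in the abstract.

```python
def decode(individual):
    """Split sentence indices into paragraphs; a 1 closes the current paragraph (assumed)."""
    paragraphs, current = [], []
    for i, bit in enumerate(individual):
        current.append(i)
        if bit == 1:
            paragraphs.append(current)
            current = []
    if current:                      # trailing sentences form the last paragraph
        paragraphs.append(current)
    return paragraphs

def is_valid(individual, min_len=3, min_paragraphs=5):
    """Constraints from the embodiment: paragraph length >= 3, paragraph count >= 5."""
    paragraphs = decode(individual)
    return len(paragraphs) >= min_paragraphs and all(len(p) >= min_len for p in paragraphs)

def should_stop(similarities, threshold, patience=50, max_iterations=None):
    """similarities: one value per completed round, comparing successive best individuals."""
    if max_iterations is not None and len(similarities) >= max_iterations:
        return True
    recent = similarities[-patience:]
    return len(recent) == patience and all(s > threshold for s in recent)

print(decode([0, 0, 1, 0, 1, 0]))
```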

Claims (1)

1. A network text segmentation method based on a genetic algorithm, characterized by comprising the following steps:
(a) use a web spider to collect web pages from the network; preprocess the collected pages so that only the text information is kept; classify the denoised text information with a naive Bayes text classification method; and build expansion corpora by category;
(b) cluster the expansion corpus with a hierarchical clustering method to determine the number of sub-topics, and estimate the LDA model of the corpus with the Gibbs sampling method; the parameters involved in the estimation take the empirical values α = 0.01 and β = 0.01, with a burn-in interval of 2000 and a thinning interval of 100;
(c) preprocess the text to be segmented (word segmentation, part-of-speech tagging, named entity recognition, word sense disambiguation), count the frequencies of the nouns and verbs in the text, and select the high-frequency words as the feature vocabulary of the text; then, based on HowNet, compute the similarity between the feature vocabulary of the text and the feature vocabulary of each expansion corpus, and choose the corpus with the highest similarity as the external corpus for segmentation; finally, use the Gibbs sampling method and the LDA model of said expansion corpus to infer the semantic structure information contained in the text to be segmented, which includes the type of sub-topic to which each word belongs and the probability of the word within a segmentation unit; the sub-topic types of the words are used to represent the text to be segmented: the sub-topic types of the words are counted sentence by sentence, and each sentence is expressed as a sub-topic space vector, sentence $S_j = s_{j1} s_{j2} \ldots s_{jt} \ldots s_{jT}$, where $s_{jt}$ is the frequency with which the words of sentence j belong to sub-topic t;
(d) segment the text with a parallel genetic algorithm; the algorithm uses binary encoding, the population is initialized by random number generation, and two criteria, the minimum length of a semantic paragraph and the minimum number of semantic paragraphs in the text, are used together to filter out unqualified initial individuals; the coherence within semantic paragraphs is computed by the formula

$$C_{oh} = 1 - \sum_{n=1}^{j} \frac{1}{k} \sum_{s_i \in b_n} \sum_{l=1}^{T} (s_{il} - a_{nl})^2$$

In the formula, [formula image F2009102191638C0000012 not reproduced], $|b_n|$ is the number of sentences in the n-th semantic paragraph, $a_n$ is the mean vector of that semantic paragraph, and $a_{nl}$ is the l-th component of this vector;
the divergence between semantic paragraphs is computed by the formula

$$D_{is} = \sum_{n=1}^{j} \frac{|b_n|}{k} \sum_{l=1}^{T} (a_{nl} - c_l)^2$$

In the formula, $c_l = \frac{1}{k} \sum_{i=1}^{k} s_{il}$;
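The fitness value of each individual in the genetic iteration is computed from the coherence within semantic paragraphs and the divergence between semantic paragraphs, as follows:

$$F(x_i) = \begin{cases} \dfrac{\left|\{x_j \mid x_j \in P_t \wedge C_{oh}(x_i) \ge C_{oh}(x_j) \wedge D_{is}(x_i) \ge D_{is}(x_j)\}\right|}{|P_t| + 1}, & x_i \in \bar{P}_t \\[2ex] 1 + \sum\limits_{x_j \in \bar{P}_t \wedge C_{oh}(x_j) \ge C_{oh}(x_i) \wedge D_{is}(x_j) \ge D_{is}(x_i)} F(x_j), & x_i \in P_t \end{cases}$$

In the formula, $\bar{P}_t$ denotes the expansion population, which stores the optimal solutions found during the iteration;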
In the population selection process, an elitism strategy is applied first: the elite individuals in the population and in the expansion population are retained and pass directly into the next generation. Roulette-wheel selection is then used to pick one individual from the population and one from the expansion population; the fitness values of the two are compared, and the one with the smaller fitness is selected for crossover and mutation;
Crossover uses the single-point method. To prevent inbreeding, crossover between the population and the expansion population is allowed only when the Hamming distance between the individuals exceeds a threshold, which is usually set to 20% of the average Hamming distance between individuals. The mutation operator is adjusted adaptively according to the population similarity, which is computed as follows:

$$Sim(P) = \frac{2 \times \sum_{i \ne j \wedge x_i, x_j \in P} Sim(x_i, x_j)}{|P| \times (|P| - 1)}$$

When the similarity exceeds the threshold for 50 consecutive rounds, the iteration ends and an individual in the expansion population is taken as the segmentation result; in the binary representation of the individual, a sentence corresponding to the digit "1" is a segmentation boundary.
CN2009102191638A 2009-11-26 2009-11-26 Network text segmenting method based on genetic algorithm Active CN101710333B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2009102191638A CN101710333B (en) 2009-11-26 2009-11-26 Network text segmenting method based on genetic algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2009102191638A CN101710333B (en) 2009-11-26 2009-11-26 Network text segmenting method based on genetic algorithm

Publications (2)

Publication Number Publication Date
CN101710333A true CN101710333A (en) 2010-05-19
CN101710333B CN101710333B (en) 2012-07-04

Family

ID=42403123

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2009102191638A Active CN101710333B (en) 2009-11-26 2009-11-26 Network text segmenting method based on genetic algorithm

Country Status (1)

Country Link
CN (1) CN101710333B (en)

Cited By (37)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101968798A (en) * 2010-09-10 2011-02-09 中国科学技术大学 Community recommendation method based on on-line soft constraint LDA algorithm
CN102024065A (en) * 2011-01-18 2011-04-20 中南大学 SIMD optimization-based webpage duplication elimination and concurrency method
CN102439597A (en) * 2011-07-13 2012-05-02 华为技术有限公司 Parameter deducing method, computing device and system based on potential dirichlet model
CN102609407A (en) * 2012-02-16 2012-07-25 复旦大学 Fine-grained semantic detection method of harmful text contents in network
CN102855312A (en) * 2012-08-24 2013-01-02 武汉大学 Domain-and-theme-oriented Web service clustering method
CN102929937A (en) * 2012-09-28 2013-02-13 福州博远无线网络科技有限公司 Text-subject-model-based data processing method for commodity classification
CN103365978A (en) * 2013-07-01 2013-10-23 浙江大学 Traditional Chinese medicine data mining method based on LDA (Latent Dirichlet Allocation) topic model
CN103914445A (en) * 2014-03-05 2014-07-09 中国人民解放军装甲兵工程学院 Data semantic processing method
CN104281692A (en) * 2014-10-13 2015-01-14 安徽华贞信息科技有限公司 Method and system for realizing paragraph dimensionalized description
CN104281567A (en) * 2014-10-13 2015-01-14 安徽华贞信息科技有限公司 Latent semantic analysis method and system
CN104317579A (en) * 2014-10-13 2015-01-28 安徽华贞信息科技有限公司 Method and system for business performance of text document
CN104317785A (en) * 2014-10-13 2015-01-28 安徽华贞信息科技有限公司 Internet paragraph level topic identifying system
WO2015165230A1 (en) * 2014-04-28 2015-11-05 华为技术有限公司 Social contact message monitoring method and device
CN105136714A (en) * 2015-09-06 2015-12-09 河南工业大学 Terahertz spectral wavelength selection method based on genetic algorithm
CN105389306A (en) * 2015-11-02 2016-03-09 国网福建省电力有限公司 Latent semantic analysis based intelligent parsing method for application form
CN105787088A (en) * 2016-03-14 2016-07-20 南京理工大学 Text information classifying method based on segmented encoding genetic algorithm
CN106355628A (en) * 2015-07-16 2017-01-25 中国石油化工股份有限公司 Image-text knowledge point marking method and device and image-text mark correcting method and system
WO2017035922A1 (en) * 2015-09-02 2017-03-09 杨鹏 Online internet topic mining method based on improved lda model
CN106502983A (en) * 2016-10-17 2017-03-15 清华大学 The event driven collapse Gibbs sampling method of implicit expression Di Li Cray model
CN106709011A (en) * 2016-12-26 2017-05-24 武汉大学 Positional concept hierarchy disambiguation calculation method based on spatial locating cluster
CN106815310A (en) * 2016-12-20 2017-06-09 华南师范大学 A kind of hierarchy clustering method and system to magnanimity document sets
CN107239438A (en) * 2016-03-28 2017-10-10 阿里巴巴集团控股有限公司 A kind of document analysis method and device
CN108009151A (en) * 2017-11-29 2018-05-08 深圳中泓在线股份有限公司 Newsletter archive automatic segmentation method and apparatus, server and readable storage medium storing program for executing
CN108038173A (en) * 2017-12-07 2018-05-15 广东工业大学 A kind of Web page classification method, system and a kind of Web page classifying equipment
CN109299239A (en) * 2018-09-29 2019-02-01 福建弘扬软件股份有限公司 ES-based electronic medical record retrieval method
CN109325092A (en) * 2018-11-27 2019-02-12 中山大学 Merge the nonparametric parallelization level Di Li Cray process topic model system of phrase information
CN109829151A (en) * 2018-11-27 2019-05-31 国网浙江省电力有限公司 A kind of text segmenting method based on layering Di Li Cray model
CN109918659A (en) * 2019-02-28 2019-06-21 华南理工大学 A method of based on not retaining optimum individual genetic algorithm optimization term vector
CN109977227A (en) * 2019-03-19 2019-07-05 中国科学院自动化研究所 Text feature, system, device based on feature coding
CN110110326A (en) * 2019-04-25 2019-08-09 西安交通大学 A kind of text cutting method based on subject information
CN110222654A (en) * 2019-06-10 2019-09-10 北京百度网讯科技有限公司 Text segmenting method, device, equipment and storage medium
CN111797634A (en) * 2020-06-04 2020-10-20 语联网(武汉)信息技术有限公司 Document segmentation method and device
CN112667817A (en) * 2020-12-31 2021-04-16 杭州电子科技大学 Text emotion classification integration system based on roulette attribute selection
CN112988981A (en) * 2021-05-14 2021-06-18 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Automatic labeling method based on genetic algorithm
CN113191133A (en) * 2021-04-21 2021-07-30 北京邮电大学 Audio text alignment method and system based on Doc2Vec
CN113366511A (en) * 2020-01-07 2021-09-07 支付宝(杭州)信息技术有限公司 Named entity identification and extraction using genetic programming
CN113673255A (en) * 2021-08-25 2021-11-19 北京市律典通科技有限公司 Text function region splitting method and device, computer equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101287229A (en) * 2008-05-26 2008-10-15 北京捷讯畅达科技发展有限公司 Natural language processing technique and device applying to query by short message service of mobile phone

Cited By (59)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101968798A (en) * 2010-09-10 2011-02-09 中国科学技术大学 Community recommendation method based on on-line soft constraint LDA algorithm
CN102024065B (en) * 2011-01-18 2013-01-02 中南大学 SIMD optimization-based webpage duplication elimination and concurrency method
CN102024065A (en) * 2011-01-18 2011-04-20 中南大学 SIMD optimization-based webpage duplication elimination and concurrency method
CN102439597B (en) * 2011-07-13 2014-12-24 华为技术有限公司 Parameter deducing method, computing device and system based on potential dirichlet model
WO2012106885A1 (en) * 2011-07-13 2012-08-16 华为技术有限公司 Latent dirichlet allocation-based parameter inference method, calculation device and system
US9213943B2 (en) 2011-07-13 2015-12-15 Huawei Technologies Co., Ltd. Parameter inference method, calculation apparatus, and system based on latent dirichlet allocation model
CN102439597A (en) * 2011-07-13 2012-05-02 华为技术有限公司 Parameter deducing method, computing device and system based on potential dirichlet model
CN102609407A (en) * 2012-02-16 2012-07-25 复旦大学 Fine-grained semantic detection method of harmful text contents in network
CN102609407B (en) * 2012-02-16 2014-10-29 复旦大学 Fine-grained semantic detection method of harmful text contents in network
CN102855312A (en) * 2012-08-24 2013-01-02 武汉大学 Domain-and-theme-oriented Web service clustering method
CN102855312B (en) * 2012-08-24 2013-08-14 武汉大学 Domain-and-theme-oriented Web service clustering method
CN102929937A (en) * 2012-09-28 2013-02-13 福州博远无线网络科技有限公司 Text-subject-model-based data processing method for commodity classification
CN102929937B (en) * 2012-09-28 2015-09-16 福州博远无线网络科技有限公司 Text-subject-model-based data processing method for commodity classification
CN103365978B (en) * 2013-07-01 2017-03-29 浙江大学 Traditional Chinese medicine data mining method based on LDA (Latent Dirichlet Allocation) topic model
CN103365978A (en) * 2013-07-01 2013-10-23 浙江大学 Traditional Chinese medicine data mining method based on LDA (Latent Dirichlet Allocation) topic model
CN103914445A (en) * 2014-03-05 2014-07-09 中国人民解放军装甲兵工程学院 Data semantic processing method
US10250550B2 (en) 2014-04-28 2019-04-02 Huawei Technologies Co., Ltd. Social message monitoring method and apparatus
WO2015165230A1 (en) * 2014-04-28 2015-11-05 华为技术有限公司 Social contact message monitoring method and device
CN104317785A (en) * 2014-10-13 2015-01-28 安徽华贞信息科技有限公司 Internet paragraph level topic identifying system
CN104281567A (en) * 2014-10-13 2015-01-14 安徽华贞信息科技有限公司 Latent semantic analysis method and system
CN104281692A (en) * 2014-10-13 2015-01-14 安徽华贞信息科技有限公司 Method and system for realizing paragraph dimensionalized description
CN104317579A (en) * 2014-10-13 2015-01-28 安徽华贞信息科技有限公司 Method and system for business performance of text document
CN106355628A (en) * 2015-07-16 2017-01-25 中国石油化工股份有限公司 Image-text knowledge point marking method and device and image-text mark correcting method and system
CN106355628B (en) * 2015-07-16 2019-07-05 中国石油化工股份有限公司 Image-text knowledge point marking method and device and image-text mark correcting method and system
WO2017035922A1 (en) * 2015-09-02 2017-03-09 杨鹏 Online internet topic mining method based on improved lda model
CN105136714A (en) * 2015-09-06 2015-12-09 河南工业大学 Terahertz spectral wavelength selection method based on genetic algorithm
CN105136714B (en) * 2015-09-06 2017-10-10 河南工业大学 Terahertz spectral wavelength selection method based on genetic algorithm
CN105389306A (en) * 2015-11-02 2016-03-09 国网福建省电力有限公司 Latent semantic analysis based intelligent parsing method for application form
CN105787088A (en) * 2016-03-14 2016-07-20 南京理工大学 Text information classifying method based on segmented encoding genetic algorithm
CN105787088B (en) * 2016-03-14 2018-12-07 南京理工大学 Text information classifying method based on segmented encoding genetic algorithm
CN107239438A (en) * 2016-03-28 2017-10-10 阿里巴巴集团控股有限公司 Document analysis method and device
CN106502983A (en) * 2016-10-17 2017-03-15 清华大学 Event-driven collapsed Gibbs sampling method for latent Dirichlet model
CN106502983B (en) * 2016-10-17 2019-05-10 清华大学 Event-driven collapsed Gibbs sampling method for latent Dirichlet model
CN106815310A (en) * 2016-12-20 2017-06-09 华南师范大学 Hierarchical clustering method and system for massive document sets
CN106815310B (en) * 2016-12-20 2020-04-21 华南师范大学 Hierarchical clustering method and system for massive document sets
CN106709011A (en) * 2016-12-26 2017-05-24 武汉大学 Positional concept hierarchy disambiguation calculation method based on spatial locating cluster
CN106709011B (en) * 2016-12-26 2019-07-23 武汉大学 Positional concept hierarchy disambiguation calculation method based on spatial locating cluster
CN108009151A (en) * 2017-11-29 2018-05-08 深圳中泓在线股份有限公司 Automatic news text segmentation method and apparatus, server, and readable storage medium
CN108038173A (en) * 2017-12-07 2018-05-15 广东工业大学 Web page classification method and system, and Web page classification device
CN109299239A (en) * 2018-09-29 2019-02-01 福建弘扬软件股份有限公司 ES-based electronic medical record retrieval method
CN109299239B (en) * 2018-09-29 2021-11-23 福建弘扬软件股份有限公司 ES-based electronic medical record retrieval method
CN109829151A (en) * 2018-11-27 2019-05-31 国网浙江省电力有限公司 Text segmenting method based on hierarchical Dirichlet model
CN109325092A (en) * 2018-11-27 2019-02-12 中山大学 Nonparametric parallelized hierarchical Dirichlet process topic model system incorporating phrase information
CN109918659A (en) * 2019-02-28 2019-06-21 华南理工大学 Method for optimizing word vector based on unreserved optimal individual genetic algorithm
CN109918659B (en) * 2019-02-28 2023-06-20 华南理工大学 Method for optimizing word vector based on unreserved optimal individual genetic algorithm
CN109977227A (en) * 2019-03-19 2019-07-05 中国科学院自动化研究所 Text feature, system, device based on feature coding
CN110110326A (en) * 2019-04-25 2019-08-09 西安交通大学 Text cutting method based on subject information
CN110110326B (en) * 2019-04-25 2020-10-27 西安交通大学 Text cutting method based on subject information
CN110222654A (en) * 2019-06-10 2019-09-10 北京百度网讯科技有限公司 Text segmenting method, device, equipment and storage medium
CN113366511A (en) * 2020-01-07 2021-09-07 支付宝(杭州)信息技术有限公司 Named entity identification and extraction using genetic programming
CN113366511B (en) * 2020-01-07 2022-03-25 支付宝(杭州)信息技术有限公司 Named entity identification and extraction using genetic programming
CN111797634A (en) * 2020-06-04 2020-10-20 语联网(武汉)信息技术有限公司 Document segmentation method and device
CN111797634B (en) * 2020-06-04 2023-09-08 语联网(武汉)信息技术有限公司 Document segmentation method and device
CN112667817B (en) * 2020-12-31 2022-05-31 杭州电子科技大学 Text emotion classification integration system based on roulette attribute selection
CN112667817A (en) * 2020-12-31 2021-04-16 杭州电子科技大学 Text emotion classification integration system based on roulette attribute selection
CN113191133A (en) * 2021-04-21 2021-07-30 北京邮电大学 Audio text alignment method and system based on Doc2Vec
CN112988981A (en) * 2021-05-14 2021-06-18 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Automatic labeling method based on genetic algorithm
CN113673255A (en) * 2021-08-25 2021-11-19 北京市律典通科技有限公司 Text function region splitting method and device, computer equipment and storage medium
CN113673255B (en) * 2021-08-25 2023-06-30 北京市律典通科技有限公司 Text function area splitting method and device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN101710333B (en) 2012-07-04

Similar Documents

Publication Publication Date Title
CN101710333B (en) Network text segmenting method based on genetic algorithm
CN106844424B (en) LDA-based text classification method
CN100353361C (en) New method of characteristic vector weighting for text classification and its device
Zamani et al. Neural query performance prediction using weak supervision from multiple signals
CN103984681B (en) News event evolution analysis method based on time sequence distribution information and topic model
CN104268197B (en) Fine-grained sentiment analysis method for industry comment data
CN102073730B (en) Method for constructing topic web crawler system
Takanobu et al. A Weakly Supervised Method for Topic Segmentation and Labeling in Goal-oriented Dialogues via Reinforcement Learning.
CN105045812A (en) Text topic classification method and system
CN105760493A (en) Automatic work order classification method for electricity marketing service hot spot 95598
CN109670039A (en) Semi-supervised e-commerce review sentiment analysis method based on tripartite graph and clustering
CN105608200A (en) Network public opinion tendency prediction analysis method
CN102591862A (en) Control method and device of Chinese entity relationship extraction based on word co-occurrence
CN103514183A (en) Information search method and system based on interactive document clustering
CN101980199A (en) Method and system for discovering network hot topic based on situation assessment
CN103995876A (en) Text classification method based on chi square statistics and SMO algorithm
CN101714135B (en) Emotional orientation analytical method of cross-domain texts
CN105095183A (en) Text emotional tendency determination method and system
Fitriyani et al. The K-means with mini batch algorithm for topics detection on online news
CN109446423A (en) A kind of Judgment by emotion system and method for news and text
CN106202530A (en) Data processing method and device
CN102436512A (en) Preference-based web page text content control method
Foong et al. Text summarization using latent semantic analysis model in mobile android platform
CN117474126A (en) LLaMa2 big data model design method for initial examination and evaluation of manuscript
Tizhoosh et al. Poetic features for poem recognition: A comparative study

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
ASS Succession or assignment of patent right

Owner name: NANTONG LONGXIANG ELECTRICAL EQUIPMENT CO., LTD.

Free format text: FORMER OWNER: NORTHWESTERN POLYTECHNICAL UNIVERSITY

Effective date: 20140814

Owner name: NORTHWESTERN POLYTECHNICAL UNIVERSITY

Effective date: 20140814

C41 Transfer of patent application or patent right or utility model
COR Change of bibliographic data

Free format text: CORRECT: ADDRESS; FROM: 710072 XI AN, SHAANXI PROVINCE TO: 226600 NANTONG, JIANGSU PROVINCE

TR01 Transfer of patent right

Effective date of registration: 20140814

Address after: 226600 No. 69 Donghai Road, Haian Development Zone, Nantong, Jiangsu

Patentee after: NANTONG LONGXIANG ELECTRIC EQUIPMENT CO., LTD.

Patentee after: Northwestern Polytechnical University

Address before: 710072 No. 127 Youyi West Road, Xi'an, Shaanxi

Patentee before: Northwestern Polytechnical University