CN101710333A

CN101710333A - Network text segmenting method based on genetic algorithm

Info

Publication number: CN101710333A
Application number: CN200910219163A
Authority: CN
Inventors: 蔡皖东; 赵煜
Original assignee: Northwestern Polytechnical University
Current assignee: NANTONG LONGXIANG ELECTRIC EQUIPMENT CO., LTD.; Northwestern Polytechnical University
Priority date: 2009-11-26
Filing date: 2009-11-26
Publication date: 2010-05-19
Anticipated expiration: 2029-11-26
Also published as: CN101710333B

Abstract

The invention discloses a network text segmenting method based on a genetic algorithm, used for segmenting short network texts. The method comprises the following steps: estimating the latent Dirichlet allocation (LDA) model corresponding to a corpus by Gibbs sampling, inferring latent topic information with the model, and representing texts by the latent topic information; then transforming the text segmenting process into a multi-objective optimization process by a parallel genetic algorithm, and calculating the coherency within segmented units, the divergence among segmented units, and the fitness function from this deeper semantic information; and carrying out the genetic iteration of the segmenting process, determining whether it terminates from the similarity among the results of multiple iterations or from an upper limit on the number of iterations, to obtain the globally optimal segmentation of the text. The invention thereby improves the accuracy of segmenting short network texts.

Description

Network text segmenting method based on genetic algorithm

Technical field

The present invention relates to a network text segmenting method, and in particular to a network text segmenting method based on a genetic algorithm that is suitable for segmenting short network texts.

Background technology

Network text segmentation is an important technique for network public-opinion monitoring and network text sentiment analysis, and helps to uncover deeper semantic information in network texts.

The document "Text segmentation model based on multiple discriminant analysis, Journal of Software, 2007, 18(3), pp. 555-564" discloses a method of segmenting text using word-frequency information. The method adopts multiple discriminant analysis, represents text with a vector space model built from word frequencies, and defines four global evaluation functions from three factors (the distance within segmentation units, the distance between segmentation units, and the segmentation unit length) to evaluate segmentation patterns globally. However, short network texts suffer from data sparseness and cannot provide enough word-frequency information. Moreover, because word frequency is shallow semantic information, computing the similarity between segmentation units from word frequency alone reduces the accuracy of the similarity calculation and hence of the segmentation result.

Summary of the invention

To overcome the low accuracy of prior-art methods in segmenting short network texts, the present invention proposes a network text segmenting method based on a genetic algorithm. The Gibbs sampling method is used to estimate the latent Dirichlet allocation (LDA) model corresponding to a corpus; the model is used to infer the latent topic information of the target text; and the text is represented by this latent topic information. A parallel genetic algorithm then converts the text segmentation process into a multi-objective optimization process: the coherency within segmentation units, the divergence between segmentation units, and the fitness function are calculated from the deeper semantic information, and the genetic iteration of the segmentation process is carried out. Whether the segmentation process ends is determined from the similarity between the results of multiple iterations or from an upper limit on the number of iterations. This yields the globally optimal text segmentation and can improve the accuracy of segmenting short network texts.

The technical scheme of the present invention is a network text segmenting method based on a genetic algorithm, characterized by comprising the following steps:

(a) A web spider is used to collect web pages from the network; the collected pages undergo text preprocessing so that only the text information is kept; a naive Bayes text classification method is used to classify the denoised text information; and an expansion corpus is built for each category;

(b) Hierarchical clustering is applied to the expansion corpus to determine the number of sub-topics, and the Gibbs sampling method is used to estimate the LDA model of the corpus. The estimation uses the empirical parameter values α = 0.01 and β = 0.01, a burn-in interval of 2000, and a thinning interval of 100;

(c) The text to be segmented undergoes text preprocessing comprising word segmentation, part-of-speech tagging, named entity recognition, and word sense disambiguation; the frequencies of nouns and verbs in the text are counted, and high-frequency words are selected as the feature vocabulary of the text. Then, based on HowNet, the similarity between the feature vocabulary of the text and the feature vocabulary of each expansion corpus is calculated, and the corpus with the maximum similarity is chosen as the external corpus for text segmentation. Finally, the Gibbs sampling method and the LDA model corresponding to that expansion corpus are used to infer the semantic structure information contained in the text to be segmented; the inferred information comprises the sub-topic type to which each word belongs and the probability of the word within a segmentation unit. The sub-topic types of the words are used to represent the text to be segmented: the sub-topic type of each word is counted sentence by sentence, and each sentence is expressed as a sub-topic space vector $S_i = (s_{i1}, s_{i2}, \ldots, s_{il}, \ldots, s_{iT})$, where $s_{il}$ denotes the frequency with which the words in sentence $i$ belong to sub-topic $l$;

(d) A parallel genetic algorithm is used to segment the text. The algorithm uses a binary coding scheme; the population is initialized by random number generation, and unqualified initial individuals are filtered out using two indices: the minimum length of a semantic paragraph and the minimum number of semantic paragraphs contained in the text. According to the formula

$$C_{oh} = 1 - \sum_{n=1}^{j} \frac{1}{k} \sum_{s_i \in b_n} \sum_{l=1}^{T} (s_{il} - a_{nl})^2$$

the coherency within the semantic paragraphs is calculated. In the formula, $a_{nl} = \frac{1}{|b_n|} \sum_{s_i \in b_n} s_{il}$, $|b_n|$ denotes the number of sentences contained in the $n$-th semantic paragraph, $a_n$ denotes the mean vector of that semantic paragraph, and $a_{nt}$ is the $t$-th component of this vector;

According to the formula

$$D_{is} = \sum_{n=1}^{j} \frac{|b_n|}{k} \sum_{l=1}^{T} (a_{nl} - c_l)^2$$

the divergence between the semantic paragraphs is calculated. In the formula, $c_l = \frac{1}{k} \sum_{i=1}^{k} s_{il}$;

The fitness function value of each individual in the genetic iteration is calculated from the coherency within the semantic paragraphs and the divergence between the semantic paragraphs, as follows:

$$F(x_i) = \begin{cases} \dfrac{\left|\{x_j \mid x_j \in P_t \wedge C_{oh}(x_i) \ge C_{oh}(x_j) \wedge D_{is}(x_i) \ge D_{is}(x_j)\}\right|}{|P_t| + 1}, & x_i \in \bar{P}_t \\ 1 + \sum\limits_{x_j \in \bar{P}_t \wedge C_{oh}(x_j) \ge C_{oh}(x_i) \wedge D_{is}(x_j) \ge D_{is}(x_i)} F(x_j), & x_i \in P_t \end{cases}$$

In the formula, $\bar{P}_t$ denotes the expansion population, which stores the optimal solutions found during the iteration;

In the population selection process, an elite retention strategy is applied first: the elite individuals of the population and of the expansion population are kept and pass directly into the next generation. The roulette method is then used to select one individual from the population and one from the expansion population; the fitness values of the two individuals are compared, and the individual with the smaller fitness is selected for crossover and mutation;

Crossover uses the single-point crossover method. To prevent inbreeding, crossover between the population and the expansion population is permitted only when the Hamming distance between the two individuals exceeds a threshold, which is usually set to 20% of the average Hamming distance between individuals. The mutation operator is adjusted adaptively according to the similarity of the population, which is calculated as follows:

$$Sim(P) = \frac{2 \times \sum_{i \ne j \wedge x_i, x_j \in P} Sim(x_i, x_j)}{|P| \times (|P| - 1)}$$

When the similarity exceeds the threshold for 50 consecutive rounds, the iteration ends and an individual in the expansion population is chosen as the text segmentation result. In the binary representation of the individual, each sentence corresponding to the digit "1" is a segmentation boundary.

The beneficial effects of the invention are as follows. The Gibbs sampling method is used to estimate the latent Dirichlet allocation (LDA) model corresponding to a corpus, the model is used to infer the latent topic information of the target text, and the text is represented by this latent topic information. A parallel genetic algorithm then converts the text segmentation process into a multi-objective optimization process: the coherency within segmentation units, the divergence between segmentation units, and the fitness function are calculated from the deeper semantic information, and the genetic iteration of the segmentation process is carried out. Whether the segmentation process ends is determined from the similarity between the results of multiple iterations or from an upper limit on the number of iterations. The globally optimal text segmentation is thereby obtained, and the accuracy of segmenting short network texts is improved.

The accuracy of text segmentation is usually measured by precision and recall. In addition to these two measures, the background art also uses the $P_\mu$ value as a criterion. In tests on 50 texts to be segmented in the above environment, the method of the invention outperforms the background art on all three measures, and by 15% on the $P_\mu$ value in particular.

The invention is described in detail below with reference to the drawing and an embodiment.

Description of drawings

The drawing is a flowchart of the network text segmenting method based on a genetic algorithm according to the invention.

Embodiment

With reference to the drawing, this embodiment addresses a target text on the theme "Beijing Olympics". The text uses standard language and is short. The specific steps of text segmentation are as follows:

In the first step, the search topic of the web spider is set to vocabulary related to the Olympics, and the web spider is used to collect web pages from the network. Determining the Olympic topic vocabulary comprises three steps: 1) manually selecting several texts that represent the search topic, usually 10 to 20; 2) counting the frequencies of nouns and verbs in those texts and choosing the high-frequency words as the candidate topic word set, with the word-frequency threshold set to 30; 3) manually choosing 10 to 15 words from the candidate topic word set as the topic vocabulary.
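The frequency count in sub-step 2) can be sketched as follows. This is a minimal illustration, assuming the seed texts have already been segmented and part-of-speech tagged into `(word, tag)` pairs; the function name `candidate_topic_words`, the tag labels `"n"` and `"v"`, and the reading of the threshold as "at least 30 occurrences" are assumptions, not part of the described method.

```python
from collections import Counter

def candidate_topic_words(tagged_docs, min_freq=30, keep_tags=("n", "v")):
    """Count nouns and verbs over the seed texts and keep the frequent ones.

    tagged_docs: list of documents, each a list of (word, tag) pairs.
    The tag names and the >= comparison are assumptions for illustration.
    """
    counts = Counter(
        word
        for doc in tagged_docs
        for word, tag in doc
        if tag in keep_tags
    )
    # most_common() sorts by descending frequency.
    return [word for word, count in counts.most_common() if count >= min_freq]
```

The manual choice of 10 to 15 topic words in sub-step 3) would then be made from the returned list.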

Web pages are HTML documents, so the collected pages need text preprocessing, and the HTML markup must be filtered out when the text information is extracted. Besides the title and body text, a web page also contains many links that are unrelated to the body text; these useless links must also be filtered out when the page content is extracted.
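A minimal sketch of this filtering, using Python's standard `html.parser` module, is given below. The class name `TextExtractor`, the helper `extract_text`, and the choice to drop everything inside `<a>`, `<script>`, and `<style>` elements are illustrative assumptions; the text does not specify how links are recognized.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text while skipping links, scripts, and styles."""

    SKIP = {"a", "script", "style"}  # assumed set of elements to discard

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside every skipped element.
        if self.skip_depth == 0 and data.strip():
            self.parts.append(data.strip())

def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)
```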

A naive Bayes binary text classification method is used to classify the denoised text. Web pages unrelated to the topic are removed according to the classification result, and the topic corpus is built. Feature selection can use methods such as information gain (IG) or mutual information (MI). The topic corpus contains at least 1000 texts.
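The classification step can be illustrated with a small multinomial naive Bayes classifier. This is a sketch under stated assumptions: documents are already tokenized, Laplace (add-one) smoothing is used, and no IG or MI feature selection is applied; the class name `NaiveBayesFilter` and the label strings are hypothetical.

```python
import math
from collections import Counter

class NaiveBayesFilter:
    """Multinomial naive Bayes with add-one smoothing (illustrative only)."""

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.n_docs = len(labels)
        self.prior = Counter(labels)
        self.word_counts = {c: Counter() for c in self.classes}
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {w for c in self.classes for w in self.word_counts[c]}
        self.total = {c: sum(self.word_counts[c].values()) for c in self.classes}
        return self

    def predict(self, tokens):
        def log_posterior(c):
            score = math.log(self.prior[c] / self.n_docs)
            for w in tokens:
                if w in self.vocab:  # ignore words never seen in training
                    score += math.log(
                        (self.word_counts[c][w] + 1)
                        / (self.total[c] + len(self.vocab))
                    )
            return score
        return max(self.classes, key=log_posterior)
```

Pages predicted as off-topic would be dropped before the topic corpus is assembled.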

In the second step, the Gibbs sampling method is used to estimate the LDA model of the corpus. The Gibbs sampling iteration proceeds according to the following formula:

$$P(z_i = j \mid z_{-i}, w_i) = \frac{\dfrac{n^{w_i}_{-i,j} + \beta}{n^{*}_{-i,j} + W\beta} \cdot \dfrac{n^{d_i}_{-i,j} + \alpha}{n^{d_i}_{-i,*} + \alpha}}{\sum_{j'=1}^{T} \dfrac{n^{w_i}_{-i,j'} + \beta}{n^{*}_{-i,j'} + W\beta} \cdot \dfrac{n^{d_i}_{-i,j'} + \alpha}{n^{d_i}_{-i,*} + \alpha}}$$

where $n^{w_i}_{-i,j}$ denotes the number of times the word corresponding to $w_i$ is assigned to topic $j$, $n^{*}_{-i,j}$ denotes the total number of words assigned to topic $j$, $n^{d_i}_{-i,j}$ denotes the number of words in text $d_i$ assigned to topic $j$, and $n^{d_i}_{-i,*}$ denotes the total number of words in text $d_i$. All of these quantities can be obtained by counting over the texts, and the counts exclude the current word $w_i$.

The Gibbs sampling process comprises three steps:

1) At the start of the iteration, $z_i$ is assigned an arbitrary value from 1 to $T$;

2) The probabilities that $w_i$ is assigned to topics 1 to $T$ are calculated from the formula, and the topic with the maximum probability is taken to update the topic assignment of word $w_i$, giving the next state of the Markov chain;

3) Whether the iteration ends is judged from the similarity between successive states of the Markov chain and from the burn-in interval: the iteration ends when the similarity exceeds a threshold or the burn-in interval is reached.
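The conditional probability and the update in step 2) can be sketched as follows. The function names and the list-based count arguments are illustrative; `W` is taken to be the vocabulary size, which is the usual LDA convention but is not stated in the text, and the document-level denominator follows the formula as printed above.

```python
def topic_conditional(n_wt, n_t, n_dt, n_d, W, alpha=0.01, beta=0.01):
    """Probability of assigning the current word to each topic.

    All counts exclude the current word:
      n_wt[j]  times this word is assigned to topic j
      n_t[j]   total words assigned to topic j
      n_dt[j]  words of the current text assigned to topic j
      n_d      total words in the current text
      W        vocabulary size (assumed meaning of W)
    """
    weights = [
        (n_wt[j] + beta) / (n_t[j] + W * beta)
        * (n_dt[j] + alpha) / (n_d + alpha)
        for j in range(len(n_t))
    ]
    total = sum(weights)
    return [w / total for w in weights]

def update_assignment(probs):
    """Step 2 as described: move the word to its most probable topic (0-based)."""
    return max(range(len(probs)), key=probs.__getitem__)
```

A conventional Gibbs sampler would instead draw the new topic at random from these probabilities, for example with `random.choices`; the sketch follows the text in taking the maximum.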

In the Gibbs sampling, hierarchical clustering is used to determine the number of sub-topics; the other parameters take the empirical values α = 0.01 and β = 0.01; the burn-in interval and the thinning interval are 2000 and 100 respectively; and the iteration is carried out with the GibbsLDA++ tool.

In the third step, the text to be segmented undergoes text preprocessing such as word segmentation, part-of-speech tagging, named entity recognition, and word sense disambiguation; the frequencies of nouns and verbs in the text are counted, and high-frequency words are selected as the feature vocabulary of the text. Based on HowNet, the contextual relations between sememes are used to calculate the similarity between the feature vocabulary of the text and the feature vocabulary of each corpus. Because the text to be segmented is most similar to the "Beijing Olympics" expansion corpus generated in the first step, that corpus is chosen as the external corpus for text segmentation.

The Gibbs sampling method and the LDA model estimated in the second step are used to infer the semantic structure information contained in the text to be segmented; the inferred information includes the sub-topic type to which each word belongs. The inference still uses the formula of the second step, except that in the third step $d_i$ denotes sentence $i$; that is, the word statistics are computed per sentence.

The sub-topic type of each word in a sentence is counted to construct the sub-topic space vector of the sentence, $S_i = (s_{i1}, s_{i2}, \ldots, s_{il}, \ldots, s_{iT})$, where $s_{il}$ denotes the frequency with which the words in sentence $i$ belong to sub-topic $l$.
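Building these vectors from the inferred word-level sub-topic labels can be sketched as below. The function name `sentence_topic_vectors`, the 0-based topic indices, and the use of raw counts as the "frequency" are assumptions made for illustration.

```python
def sentence_topic_vectors(sentence_topics, num_topics):
    """Turn per-word sub-topic labels into one count vector per sentence.

    sentence_topics: list of sentences, each a list of 0-based topic indices,
                     one index per word of the sentence.
    """
    vectors = []
    for topics in sentence_topics:
        vec = [0] * num_topics
        for t in topics:
            vec[t] += 1  # s_il: number of words of sentence i in sub-topic l
        vectors.append(vec)
    return vectors
```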

In the fourth step, a parallel genetic algorithm is used to segment the text. The algorithm uses a binary coding scheme; the population is initialized by random number generation, and unqualified initial individuals are filtered out using two indices: the minimum length of a semantic paragraph and the minimum number of semantic paragraphs contained in the text. The minimum paragraph length is no less than 3, and the number of paragraphs is no less than 5. According to the formula
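The initialization and filtering described here can be sketched as follows, before the objective formulas are applied. The sketch assumes that an individual has one bit per sentence, that a "1" marks the last sentence of a semantic paragraph, and that the final sentence always closes a paragraph; the boundary probability `p_boundary` and the retry cap `max_tries` are hypothetical parameters.

```python
import random

def segment_lengths(individual):
    """Lengths of the paragraphs encoded by a binary individual."""
    lengths, current = [], 0
    for i, bit in enumerate(individual):
        current += 1
        if bit == 1 or i == len(individual) - 1:
            lengths.append(current)
            current = 0
    return lengths

def is_valid(individual, min_len=3, min_segments=5):
    """Filter: enough paragraphs, and every paragraph long enough."""
    lengths = segment_lengths(individual)
    return len(lengths) >= min_segments and min(lengths) >= min_len

def init_population(num_sentences, size, p_boundary=0.2, rng=None,
                    max_tries=100000):
    """Generate random individuals and keep only those passing the filter."""
    rng = rng or random.Random()
    population = []
    for _ in range(max_tries):
        if len(population) == size:
            break
        ind = [1 if rng.random() < p_boundary else 0
               for _ in range(num_sentences)]
        if is_valid(ind):
            population.append(ind)
    return population
```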

$$C_{oh} = 1 - \sum_{n=1}^{j} \frac{1}{k} \sum_{s_i \in b_n} \sum_{l=1}^{T} (s_{il} - a_{nl})^2$$

the coherency within the semantic paragraphs is calculated. In the formula, $a_{nl} = \frac{1}{|b_n|} \sum_{s_i \in b_n} s_{il}$, $|b_n|$ denotes the number of sentences contained in the $n$-th semantic paragraph, $a_n$ denotes the mean vector of that semantic paragraph, and $a_{nt}$ is the $t$-th component of this vector.
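As an illustration, the coherency of one candidate segmentation could be computed as below. The sketch assumes that $k$ is the total number of sentences in the text and that $a_n$ is the component-wise mean of the sentence vectors in paragraph $n$; the function names are hypothetical.

```python
def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def coherency(paragraphs):
    """C_oh for a segmentation given as a list of paragraphs,
    each paragraph being a list of sentence sub-topic vectors."""
    k = sum(len(p) for p in paragraphs)  # assumed: total sentence count
    spread = 0.0
    for sentences in paragraphs:
        a_n = mean_vector(sentences)
        for s_i in sentences:
            spread += sum((s - a) ** 2 for s, a in zip(s_i, a_n))
    return 1.0 - spread / k
```

A segmentation whose paragraphs contain near-identical sentence vectors scores close to 1, while paragraphs that mix sub-topics lower the value.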

According to the formula

$$D_{is} = \sum_{n=1}^{j} \frac{|b_n|}{k} \sum_{l=1}^{T} (a_{nl} - c_l)^2$$

the divergence between the semantic paragraphs is calculated. In the formula, $c_l = \frac{1}{k} \sum_{i=1}^{k} s_{il}$.
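A corresponding sketch for the divergence follows. It assumes that $c$ is the mean of all sentence vectors in the text and that $k$ is the total number of sentences; the function names are hypothetical.

```python
def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def divergence(paragraphs):
    """D_is for a segmentation given as a list of paragraphs,
    each paragraph being a list of sentence sub-topic vectors."""
    all_sentences = [s for p in paragraphs for s in p]
    k = len(all_sentences)
    c = mean_vector(all_sentences)  # text-level mean vector
    total = 0.0
    for sentences in paragraphs:
        a_n = mean_vector(sentences)
        dist = sum((a - c_l) ** 2 for a, c_l in zip(a_n, c))
        total += len(sentences) / k * dist  # weight by paragraph size
    return total
```

The value grows as the paragraph means move away from the mean of the whole text, so segmentations that separate distinct sub-topics score higher.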

The fitness function value of the genetic algorithm is calculated from the coherency within the semantic paragraphs and the divergence between the semantic paragraphs, as follows:

$$F(x_i) = \begin{cases} \dfrac{\left|\{x_j \mid x_j \in P_t \wedge C_{oh}(x_i) \ge C_{oh}(x_j) \wedge D_{is}(x_i) \ge D_{is}(x_j)\}\right|}{|P_t| + 1}, & x_i \in \bar{P}_t \\ 1 + \sum\limits_{x_j \in \bar{P}_t \wedge C_{oh}(x_j) \ge C_{oh}(x_i) \wedge D_{is}(x_j) \ge D_{is}(x_i)} F(x_j), & x_i \in P_t \end{cases}$$
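The two-case fitness can be sketched as follows, with each individual represented only by its `(coherency, divergence)` pair. The sketch assumes that the barred set in the formula is the expansion population and the unbarred set is the current population, a structure that resembles the strength-based fitness of the Strength Pareto Evolutionary Algorithm; the function names are hypothetical.

```python
def covers(a, b):
    """True if a is at least as good as b in both coherency and divergence."""
    return a[0] >= b[0] and a[1] >= b[1]

def strength_fitness(population, archive):
    """Return (fitness of archive members, fitness of population members).

    population, archive: lists of (C_oh, D_is) pairs; archive stands for
    the expansion population.
    """
    # First case: share of the population that an archive member covers.
    archive_fit = [
        sum(1 for p in population if covers(a, p)) / (len(population) + 1)
        for a in archive
    ]
    # Second case: 1 plus the fitness of every archive member covering p.
    population_fit = [
        1 + sum(f for a, f in zip(archive, archive_fit) if covers(a, p))
        for p in population
    ]
    return archive_fit, population_fit
```

Under this reading, smaller values are better: archive members always score below 1, and population members score at least 1.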

In the population selection process, an elite retention strategy is applied first: the individuals with the smallest fitness function value in the population and in the expansion population are chosen as the elites, and the elite individuals pass directly into the next generation. Next, the roulette method is used to select one individual from the population and one from the expansion population; the fitness values of the two individuals are compared, and the individual with the smaller fitness is selected for crossover and mutation.
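A possible sketch of this selection is shown below. The text does not say how the roulette wheel is weighted, so the weighting `1 / (1 + fitness)`, which favors smaller fitness values, is an assumption, as are the function names.

```python
import random

def elite(individuals, fitness):
    """Return the individual with the smallest fitness value."""
    best = min(range(len(individuals)), key=fitness.__getitem__)
    return individuals[best]

def roulette(fitness, rng):
    """Pick an index; smaller fitness gets a larger slice (assumed weighting)."""
    weights = [1.0 / (1.0 + f) for f in fitness]
    return rng.choices(range(len(fitness)), weights=weights, k=1)[0]

def select_parent(population, pop_fit, archive, arch_fit, rng):
    """Draw one candidate from each set and keep the one with smaller fitness."""
    i = roulette(pop_fit, rng)
    j = roulette(arch_fit, rng)
    return archive[j] if arch_fit[j] < pop_fit[i] else population[i]
```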

Crossover uses the single-point crossover method. To prevent inbreeding, the individuals taking part in a crossover must belong to different populations, and crossover between the two is permitted only when the Hamming distance between them exceeds a threshold, which is usually set to 20% of the average Hamming distance between individuals.
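The distance-gated crossover can be sketched as follows. The sketch assumes that the average Hamming distance is taken over all pairs formed by one population member and one expansion-population member, which the text does not specify; the function names are hypothetical.

```python
import random

def hamming(a, b):
    """Number of positions where two equal-length bit lists differ."""
    return sum(x != y for x, y in zip(a, b))

def distance_threshold(population, archive, fraction=0.2):
    """20% of the average cross-population Hamming distance (assumed pairing)."""
    pairs = [(a, b) for a in population for b in archive]
    return fraction * sum(hamming(a, b) for a, b in pairs) / len(pairs)

def single_point_crossover(a, b, threshold, rng):
    """Swap tails at a random cut point, or return None if a and b are too close."""
    if hamming(a, b) <= threshold:
        return None
    point = rng.randrange(1, len(a))  # cut strictly inside the string
    return a[:point] + b[point:], b[:point] + a[point:]
```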

The mutation operator is adjusted adaptively according to the similarity of the population, which is calculated as follows:

$$Sim(P) = \frac{2 \times \sum_{i \ne j \wedge x_i, x_j \in P} Sim(x_i, x_j)}{|P| \times (|P| - 1)}$$

where $x_i$ and $x_j$ denote two individuals in the population. Mutation takes into account whether the mutated result satisfies the requirements on segmentation results, which are the same as the filtering requirements of population initialization; if it does not, a new individual is generated to replace the mutated one.
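The population similarity can be illustrated as below. The pairwise measure $Sim(x_i, x_j)$ is not defined in the surviving text, so the sketch assumes the fraction of matching bits, and it reads the sum as running over unordered pairs so that $Sim(P)$ is the average pairwise similarity.

```python
from itertools import combinations

def individual_similarity(a, b):
    """Assumed pairwise measure: share of positions where the bits agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def population_similarity(population):
    """Average pairwise similarity over all unordered pairs of individuals."""
    n = len(population)
    total = sum(individual_similarity(a, b)
                for a, b in combinations(population, 2))
    return 2 * total / (n * (n - 1))
```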

The similarity between the optimal individuals of the expansion population in different iteration rounds is calculated; when the similarity exceeds the threshold for 50 consecutive rounds, the iteration ends. An individual in the expansion population is then chosen as the text segmentation result; in the binary representation of the individual, each sentence corresponding to the digit "1" is a segmentation boundary.
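The stopping rule and the decoding of the final individual can be sketched as follows. The sketch assumes that a "1" marks the last sentence of a segment; the class name `StopTracker` and the threshold value passed to it are hypothetical, since the text does not give the threshold.

```python
def decode(individual, sentences):
    """Split sentences into segments; a 1 closes the current segment."""
    segments, current = [], []
    for bit, sentence in zip(individual, sentences):
        current.append(sentence)
        if bit == 1:
            segments.append(current)
            current = []
    if current:  # trailing sentences after the last boundary
        segments.append(current)
    return segments

class StopTracker:
    """Signal termination after `patience` consecutive rounds above threshold."""

    def __init__(self, threshold, patience=50):
        self.threshold = threshold
        self.patience = patience
        self.streak = 0

    def update(self, similarity):
        self.streak = self.streak + 1 if similarity > self.threshold else 0
        return self.streak >= self.patience
```

In a full loop, `update` would be called once per round with the similarity of the current best individual to the previous one, alongside a cap on the total number of rounds.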

Claims

1. A network text segmenting method based on a genetic algorithm, characterized by comprising the following steps:

$$C_{oh} = 1 - \sum_{n=1}^{j} \frac{1}{k} \sum_{s_i \in b_n} \sum_{l=1}^{T} (s_{il} - a_{nl})^2$$

the coherency within the semantic paragraphs is calculated. In the formula, $a_{nl} = \frac{1}{|b_n|} \sum_{s_i \in b_n} s_{il}$, $|b_n|$ denotes the number of sentences contained in the $n$-th semantic paragraph, $a_n$ denotes the mean vector of that semantic paragraph, and $a_{nt}$ is the $t$-th component of this vector;

According to the formula

$$D_{is} = \sum_{n=1}^{j} \frac{|b_n|}{k} \sum_{l=1}^{T} (a_{nl} - c_l)^2$$

the divergence between the semantic paragraphs is calculated. In the formula,

$$c_l = \frac{1}{k} \sum_{i=1}^{k} s_{il};$$