CN101710317A - Word partial weight calculating method based on word distribution

Info

Publication number: CN101710317A
Application number: CN200910198890A
Authority: CN
Inventors: 夏天
Original assignee: Shanghai Polytechnic University
Current assignee: Shanghai Polytechnic University
Priority date: 2009-11-17
Filing date: 2009-11-17
Publication date: 2010-05-19

Abstract

The invention discloses a word partial weight calculating method based on word distribution, which comprises the following steps of: (1) calculating a distribution uniformity coefficient of words in a word sequence; (2) calculating a distribution extent coefficient of the words in the word sequence; and (3) calculating the word partial weight based on word distribution. The invention effectively optimizes the current word weight calculating method, improves the accuracy and promotes the research and application of natural language processing.

Description

Local (partial) word weight calculation method based on word distribution

Technical field:

The present invention relates to a natural language processing method, and in particular to a method for calculating term weights.

Background technology:

Since the 1990s, with the explosion of information on the Internet, people have needed to obtain accurate information from the network. This has driven the rapid development of natural language processing, and applied technologies such as information retrieval, information filtering, text classification, automatic summarization, and question answering systems have become a focus of recent research. New models such as support vector machines, the vector space model, and latent semantic analysis keep emerging.

These new models are all built on term weight calculation, and whether the term weights are accurate directly affects the final result of natural language processing, as shown in Fig. 1. Each word in a document conveys a different amount of information about that document. Term weights represent the importance of words; only when the weight of each word is calculated accurately can the semantic information in the document stand out clearly.

Common weighting algorithms, such as Boolean weight, feature frequency, TF-IDF, and entropy, each take into account some factor that reflects how much information a word carries, for example term frequency, document frequency, or word position. Some weighting methods compute the weight from the regularities of a word within a single document; the result is called the local (partial) word weight. Others compute the weight from the regularities of a word across a document collection; the result is called the global word weight.

The results produced by existing term weighting methods are not accurate enough, which directly affects the results of natural language processing models built on term weighting algorithms.

Summary of the invention:

To address the insufficient accuracy of existing term weighting methods, the present invention provides a local word weight calculation method based on word distribution. The method improves the accuracy of term weight calculation and thereby effectively improves the accuracy of the corresponding natural language processing models.

To achieve the above object, the present invention adopts the following technical solution:

A local word weight calculation method based on word distribution, comprising the following steps:

(1) before the local word weight is calculated, the document to be analyzed must undergo preprocessing operations such as Chinese word segmentation, part-of-speech tagging, stop-word removal, and information extraction, so that the document becomes a word sequence containing its main content;

(2) calculate the distribution uniformity coefficient of the words in the word sequence;

(3) calculate the distribution extent coefficient of the words in the word sequence;

(4) calculate the local word weight based on word distribution.

The present invention obtained from the above technical solution can effectively optimize current term weighting methods, improve their accuracy, and promote research on and application of natural language processing. It enables natural language processing applications built on term weighting algorithms, such as information retrieval, text classification, and spam filtering, to obtain better results. In practice, the invention can be combined with other weighting algorithms to obtain higher accuracy.

Description of drawings:

The present invention is further described below with reference to the drawings and specific embodiments.

Fig. 1 is a schematic diagram of the relationship between term weighting methods and the various technologies of the natural language field.

Fig. 2 is a schematic diagram of the relationship between the distribution uniformity of a word in a document and its term weight.

Fig. 3 is a schematic diagram of the relationship between the distribution uniformity of a word in a document and its term weight.

Fig. 4 is a flow chart of the present invention.

Embodiment:

To make the technical means, inventive features, objectives, and effects of the present invention easy to understand, the invention is further described below with reference to the figures.

The local word weight is calculated from the statistical regularities of a word within a single document. It takes into account factors within that document that influence the term weight, such as term frequency, word length, and word position. Within a document, a word that is distributed uniformly over a wide range carries more information and is more likely to be related to the content of the document as a whole; a word concentrated in a small range carries less information and is more likely to be related only to a particular passage.

This patent studies the distribution of words in a document and, based on the K. Pearson theorem, designs a distribution uniformity coefficient and a method for computing it, in order to measure how a word is distributed. Different words in a document have different distribution uniformity coefficients. The larger the value of the coefficient, the more uniform the distribution of the word and, for the local weight, the larger its weight.

On the other hand, this statistic describes only how uniform a word's distribution is. This patent therefore also uses the extent of a word's distribution to raise the weight of the corresponding word appropriately.

Based on the above principle, the local word weight algorithm based on word distribution provided by the invention is implemented as follows (see Fig. 4):

(1) Before the local word weight is calculated, the document to be analyzed must undergo preprocessing operations such as Chinese word segmentation, part-of-speech tagging, stop-word removal, and information extraction, so that the document becomes a word sequence containing its main content (these are mature techniques in the field and are not described further here).
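The preprocessing in step (1) can be pictured with a minimal Python sketch. It is a stand-in under stated assumptions, not the patent's pipeline: whitespace splitting replaces Chinese word segmentation, part-of-speech tagging and information extraction are omitted, and `STOP_WORDS` and `preprocess` are hypothetical names introduced here. Paragraph boundaries are kept because the later steps count occurrences per paragraph.

```python
# Hypothetical stop-word list; a real system would use a full list for its language.
STOP_WORDS = {"the", "a", "of", "and", "is", "in"}


def preprocess(document, stop_words=STOP_WORDS):
    """Turn raw text into a list of paragraphs, each a list of content tokens.

    Assumptions: paragraphs are separated by blank lines and tokens by
    whitespace. Chinese text would need a real word segmenter instead.
    """
    paragraphs = []
    for block in document.split("\n\n"):
        # Lowercase each token and trim surrounding punctuation.
        tokens = [t.lower().strip(".,;:!?\"'()") for t in block.split()]
        # Drop empty strings and stop words.
        tokens = [t for t in tokens if t and t not in stop_words]
        if tokens:
            paragraphs.append(tokens)
    return paragraphs
```

Calling `preprocess` on a two-paragraph string yields two token lists, which is the shape the per-paragraph counts in the later steps rely on.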

(2) Calculate the distribution uniformity coefficient of the words in the word sequence.

Suppose a document has $m$ paragraphs and $C_m$ words, and that step (1) has turned it into a word sequence. The distribution uniformity coefficient of the $j$-th word in the word sequence is obtained as follows.

Let the interval $(C_{i-1}+1,\ C_i)$ denote the $i$-th paragraph of the document, running from the $(C_{i-1}+1)$-th word to the $C_i$-th word (where $C_0 = 0$ and $i = 1, 2, \ldots, m$), so that the total number of words in the document is $C_m$. For any word in the document, including the $j$-th word in the word sequence, if the word were uniformly distributed, its probability of appearing in the $i$-th paragraph would be $r_i$ ($i = 1, 2, \ldots, m$). Let $n$ be the total number of times the word actually occurs in the document, and $v_i$ the number of times it actually occurs in the $i$-th paragraph. The distribution uniformity coefficient of the $j$-th word is then:

$$X_j^2 = f(v_1, \ldots, v_m, r_1, \ldots, r_m, m, n, a, b)$$

Here $X_j^2$ is the distribution uniformity coefficient of the $j$-th word in the word sequence, obtained by applying a mathematical transformation to the above variables according to statistical principles. $v_1, \ldots, v_m$, $r_1, \ldots, r_m$, $m$, and $n$ are variables that depend on the statistics of the $j$-th word in the document to be analyzed. $a$ and $b$ are parameters that relate to the preferred embodiment of this patent and must be set according to the specific application.
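The text does not spell out the uniform-distribution probability or the form of $f$. One reading consistent with the paragraph intervals, offered as an assumption rather than as the patent's own formula, is that the probability for paragraph $i$ equals that paragraph's share of the document's words. Pearson's classical chi-square statistic then compares the observed counts $v_i$ with the expected counts $n r_i$:

```latex
r_i = \frac{C_i - C_{i-1}}{C_m}, \qquad \sum_{i=1}^{m} r_i = 1,
\qquad
\chi^2 = \sum_{i=1}^{m} \frac{(v_i - n\,r_i)^2}{n\,r_i}
```

The classical $\chi^2$ grows as a distribution departs from uniformity, whereas the coefficient $X_j^2$ is described as growing with uniformity. The undisclosed transformation $f$, with its parameters $a$ and $b$, would therefore have to reverse that direction in some way; how it does so is not stated.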

The distribution uniformity coefficient $X_j^2$ that this patent computes for the $j$-th word in the document to be analyzed has the following property: the larger its value, the more uniformly the $j$-th word occurs in the document. As stated earlier, "within a document, a word that is distributed uniformly over a wide range carries more information and is more likely to be related to the content of the document as a whole; a word concentrated in a small range carries less information and is more likely to be related only to a particular passage" (as shown in Figs. 2 and 3). In other words, the more uniformly a word is distributed, the larger its local weight. The distribution uniformity coefficient computed by this patent is therefore consistent with the actual situation.
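A rough Python sketch of a coefficient with this property follows. It rests on two assumptions that are not taken from the patent: the expected probability for a paragraph is that paragraph's share of the document's tokens, and the undisclosed function $f$ is replaced by the hypothetical transform `a / (b + chi2)`, chosen only because it grows as the distribution becomes more uniform. All function names are illustrative.

```python
def paragraph_counts(paragraphs, word):
    """Return observed counts v_i and assumed uniform probabilities r_i.

    paragraphs: list of token lists, one per paragraph.
    r_i is assumed to be each paragraph's share of all tokens.
    """
    lengths = [len(p) for p in paragraphs]
    total = sum(lengths)
    r = [length / total for length in lengths]
    v = [p.count(word) for p in paragraphs]
    return v, r


def pearson_chi_square(v, r):
    """Classical Pearson statistic: sum of (observed - expected)^2 / expected."""
    n = sum(v)
    if n == 0:
        raise ValueError("the word does not occur in the document")
    return sum((vi - n * ri) ** 2 / (n * ri) for vi, ri in zip(v, r) if ri > 0)


def uniformity_coefficient(v, r, a=1.0, b=1.0):
    """Hypothetical stand-in for f: larger when the word is spread more evenly."""
    return a / (b + pearson_chi_square(v, r))
```

With three equally long paragraphs, a word occurring once in each gets a chi-square of zero and the highest coefficient, while a word occurring only in the first paragraph gets a lower one.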

(3) Calculate the distribution extent coefficient of the words in the word sequence.

In the article to be analyzed, the extent of a word's distribution is related to the total number of paragraphs in which the word occurs, the distance between the first and last paragraphs in which it occurs, and the total number of paragraphs in the article. Based on statistical principles, this patent designs the following method for computing the distribution extent coefficient.

For the $j$-th word in the word sequence, the distribution extent coefficient is given by:

$$B_j = \varphi(p, m; c, d, e)$$

Here $p$ and $m$ are variables: $p$ is the total number of paragraphs of the document in which the word occurs, and $m$ is the number of paragraphs in the document. $c$, $d$, and $e$ are parameters that relate to the preferred embodiment of this patent and must be set according to the specific application.
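The form of the extent function is not disclosed, so the following Python sketch only illustrates the inputs. The expression `c * (p / m) ** d + e` is a hypothetical choice made here, not the patent's formula, and both function names are illustrative; only the meanings of `p` and `m` come from the text.

```python
def paragraphs_containing(paragraphs, word):
    """p: the number of paragraphs in which the word occurs at least once."""
    return sum(1 for tokens in paragraphs if word in tokens)


def extent_coefficient(p, m, c=1.0, d=1.0, e=0.0):
    """Hypothetical stand-in for the extent function.

    Grows with the fraction of paragraphs that contain the word.
    c, d, e play the role of the patent's tunable parameters.
    """
    if m <= 0:
        raise ValueError("the document must have at least one paragraph")
    return c * (p / m) ** d + e
```

Under these defaults a word found in 2 of 4 paragraphs gets a coefficient of 0.5, and a word found in every paragraph gets 1.0.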

(4) Calculate the local word weight based on word distribution.

The local word weight based on word distribution combines the distribution uniformity coefficient and the distribution extent coefficient computed above in a single calculation designed according to statistical principles.

From the distribution uniformity coefficient and the distribution extent coefficient obtained above for the $j$-th word, the local weight of the $j$-th word can be calculated:

Here $X_j^2$ and $B_j$ are variables, namely the distribution uniformity coefficient and the distribution extent coefficient respectively. $f$, $g$, and $h$ are parameters that relate to the preferred embodiment of this patent and must be set according to the specific application.
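The combining formula is not reproduced in the text, so any concrete form is a guess. Purely for illustration, the Python sketch below assumes a weighted product of the two coefficients, reusing the parameter names `f`, `g`, `h` from the description; the product form and the function name `local_weight` are assumptions made here.

```python
def local_weight(x2, b, f=1.0, g=1.0, h=1.0):
    """Hypothetical combination of the two coefficients.

    x2: distribution uniformity coefficient of the word.
    b:  distribution extent coefficient of the word.
    f scales the result; g and h control how strongly each coefficient counts.
    """
    return f * (x2 ** g) * (b ** h)
```

Any form that increases with both coefficients would match the stated intent that evenly and widely distributed words receive larger local weights.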

To test the present invention, more than 1.5 million web documents published over nearly one year by well-known Chinese online media such as sina and sohu were used as the corpus. The procedure was as follows:

1000 documents were selected at random from the more than 1.5 million web documents, covering 12 major categories such as news, entertainment, automobiles, and sports. The weights of the words were computed in two ways. The first was manual: the 20 most important words in each document were selected by hand and annotated with weights. The term weights of every document were annotated by 10 different employees, and the mean was taken as the final manually annotated term weight. The second was to compute the term weights with the method provided by the invention and with the Boolean weight, feature frequency, TF, and entropy weighting algorithms. The results were then compared with the manual annotations. The tests show that the values produced by the local word weight algorithm based on word distribution provided by the invention are closer to the manual annotations than those of the other weighting methods.
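The comparison against manual annotation could be scored along the following lines. The text does not say how closeness was measured, so the mean absolute error used in this Python sketch is an assumption, as are the function names and the choice to count a word that an annotator did not mark as weight 0.

```python
from statistics import mean


def average_annotations(annotations):
    """Average per-word weights over annotators.

    annotations: list of dicts mapping word -> weight, one dict per annotator.
    A word an annotator did not mark is treated as weight 0.0 (assumption).
    """
    words = set().union(*annotations)
    return {w: mean(a.get(w, 0.0) for a in annotations) for w in words}


def mean_absolute_error(predicted, reference):
    """Average gap between a method's weights and the annotated reference."""
    return mean(abs(predicted.get(w, 0.0) - ref) for w, ref in reference.items())
```

A lower error for one weighting method than for another would correspond to that method being closer to the manual annotations; weights from different methods would first need to be brought onto a comparable scale.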

The local word weight calculation method based on word distribution of this patent can effectively identify the important words in an article and assign them suitable weights. The technique is applicable to application systems involving information retrieval and semantic matching, such as intelligent search engines, anti-spam systems, junk information filtering, expert systems, information security, and text data mining.

The basic principles, main features, and advantages of the present invention have been shown and described above. Those skilled in the art should understand that the present invention is not limited to the embodiments described above; the embodiments and the description merely illustrate the principles of the invention. Various changes and improvements may be made without departing from the spirit and scope of the invention, and all such changes and improvements fall within the scope of the claimed invention. The scope of protection claimed is defined by the appended claims and their equivalents.

Claims

1. A local word weight calculation method based on word distribution, characterized in that the method comprises the following steps:

(1) performing preprocessing operations on the document to be analyzed, so that the document becomes a word sequence containing its main content;

(4) calculating the local word weight based on word distribution.