CN101710317A - Word partial weight calculating method based on word distribution - Google Patents
- Publication number
- CN101710317A
- Authority
- CN
- China
- Prior art keywords
- word
- document
- distribution
- calculating
- partial weight
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Machine Translation (AREA)
Abstract
The invention discloses a word partial weight calculating method based on word distribution, which comprises the following steps of: (1) calculating a distribution uniformity coefficient of words in a word sequence; (2) calculating a distribution extent coefficient of the words in the word sequence; and (3) calculating the word partial weight based on word distribution. The invention effectively optimizes the current word weight calculating method, improves the accuracy and promotes the research and application of natural language processing.
Description
Technical field:
The present invention relates to a natural language processing method, and in particular to a method for calculating term weights.
Background art:
Since the 1990s, with the explosion of information on the network, people have needed to obtain accurate information from it. This has driven the rapid development of natural language processing, and research on applied techniques such as information retrieval, information filtering, text classification, automatic summarization, and question answering systems has become a focus of recent work. New models such as support vector machines, the vector space model, and latent semantic analysis keep emerging.
These new models are all built on term weight calculation, and whether term weights are calculated accurately directly affects the final result of natural language processing, as shown in Fig. 1. Each word in a document conveys a different amount of information about that document. The term weight expresses how important a word is; only when the weight of each word is calculated accurately can the semantic information in the document be brought out clearly.
Common weighting algorithms, such as Boolean weighting, feature frequency, TF-IDF, and entropy, each consider some factor that describes the amount of information a word carries, for example term frequency, document frequency, or the position of the word. Some weight calculation methods compute the weight from the behavior of a word within a single document; the result is called the word partial weight. Others compute the weight from the behavior of a word across a document collection; the result is called the word global weight.
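The distinction between partial and global weights can be illustrated with a minimal Python sketch. It uses the standard textbook forms of Boolean weighting, relative term frequency, and inverse document frequency; the function names and the toy corpus are assumptions made for illustration and do not come from the patent.

```python
import math
from collections import Counter

def boolean_weights(doc_tokens):
    """Partial weight: 1.0 for every distinct word that occurs in the document."""
    return {w: 1.0 for w in set(doc_tokens)}

def tf_weights(doc_tokens):
    """Partial weight: relative term frequency within a single document."""
    counts = Counter(doc_tokens)
    total = len(doc_tokens)
    return {w: c / total for w, c in counts.items()}

def idf_weights(corpus):
    """Global weight: log(N / df), computed over the whole document collection."""
    n_docs = len(corpus)
    df = Counter(w for doc in corpus for w in set(doc))
    return {w: math.log(n_docs / d) for w, d in df.items()}

# Toy corpus (hypothetical): each document is an already segmented token list.
corpus = [["cat", "sat", "cat"], ["dog", "sat"], ["cat", "dog", "ran"]]
tf = tf_weights(corpus[0])        # uses only document 0
idf = idf_weights(corpus)         # uses the whole collection
tfidf = {w: tf[w] * idf[w] for w in tf}
```

Here `tf_weights` and `boolean_weights` look only at one document, while `idf_weights` needs the whole collection, which is the partial/global split described above.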
The results obtained by existing term weight calculation methods are not accurate enough, which directly affects the results of natural language processing models built on term weighting algorithms.
Summary of the invention:
To address the insufficient accuracy of existing term weight calculation methods, the present invention provides a word partial weight calculating method based on word distribution. The method improves the accuracy of term weight calculation and thereby effectively improves the accuracy of the corresponding natural language processing models.
To achieve the above object, the present invention adopts the following technical solution:
A word partial weight calculating method based on word distribution, comprising the following steps:
(1) before the word partial weight is calculated, the document to be analyzed undergoes preprocessing operations such as Chinese word segmentation, part-of-speech tagging, stop-word removal, and information extraction, so that it becomes a word sequence containing the main content of the document;
(2) calculate the distribution uniformity coefficient of each word in the word sequence;
(3) calculate the distribution extent coefficient of each word in the word sequence;
(4) calculate the word partial weight based on word distribution.
The present invention, obtained according to the above technical solution, can effectively optimize current term weight calculation methods, improve their accuracy, and promote the research and application of natural language processing. It allows natural language processing applications built on term weighting algorithms, such as information retrieval, text classification, and spam filtering, to obtain better results. In practice the invention can be combined with other weighting algorithms to obtain higher accuracy.
Description of drawings:
The present invention is further described below with reference to the drawings and specific embodiments.
Fig. 1 is a schematic diagram of the relationship between term weight calculation methods and the technologies of the natural language field.
Fig. 2 is a schematic diagram of the relationship between the distribution uniformity of a word in a document and its term weight.
Fig. 3 is a schematic diagram of the relationship between the distribution uniformity of a word in a document and its term weight.
Fig. 4 is a flow chart of the present invention.
Embodiment:
To make the technical means, inventive features, objects, and effects of the present invention easy to understand, the invention is further set forth below with reference to the figures.
The word partial weight is calculated from the statistical behavior of a word within a single document. It considers factors within that document that affect term weight, such as term frequency, word length, and word position. Within a document, a word that is evenly distributed over a wide range carries more information and is more likely to relate to the content of the document as a whole; a word concentrated in a small range carries less information and is more likely to relate to only one passage of the text.
This patent studies the distribution of words in a document and, based on the K. Pearson theorem, designs a distribution uniformity coefficient and a method for computing it, in order to measure how a word is distributed. Different words in a document have different distribution uniformity coefficients. The larger the coefficient, the more even the distribution of the word and the larger its partial weight.
On the other hand, this statistic describes only how evenly a word is distributed. This patent therefore also uses the extent of a word's distribution to raise the weight of the corresponding word appropriately.
Based on the above principle, the word partial weight algorithm based on word distribution provided by the present invention is implemented as follows (see Fig. 4):
(1) Before the word partial weight is calculated, the document to be analyzed undergoes preprocessing operations such as Chinese word segmentation, part-of-speech tagging, stop-word removal, and information extraction, so that it becomes a word sequence containing the main content of the document (these are mature techniques in the art and are not described further here).
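As a rough illustration of what step (1) hands to the later steps, the following Python sketch turns already segmented paragraphs into a word sequence plus cumulative paragraph boundaries. Segmentation, part-of-speech tagging, and information extraction are assumed to have happened upstream; the stop-word list, the function name `preprocess`, and the choice to drop paragraphs that become empty are all assumptions, not details from the patent.

```python
from itertools import accumulate

# Hypothetical stop-word list; a real system would use a full list for its language.
STOP_WORDS = {"the", "a", "of", "and", "is"}

def preprocess(paragraphs):
    """Filter stop words and return (word_sequence, boundaries).

    `paragraphs` is a list of token lists, one per paragraph, assumed to be
    segmented already. boundaries[i] is the cumulative word count through
    paragraph i, with boundaries[0] == 0.
    """
    kept = [[w for w in para if w not in STOP_WORDS] for para in paragraphs]
    kept = [para for para in kept if para]  # assumption: drop paragraphs left empty
    word_sequence = [w for para in kept for w in para]
    boundaries = [0] + list(accumulate(len(para) for para in kept))
    return word_sequence, boundaries

paragraphs = [["the", "cat", "sat"], ["a", "dog", "and", "the", "cat"], ["the"]]
words, C = preprocess(paragraphs)
```

For this toy input the word sequence is `["cat", "sat", "dog", "cat"]` and the boundaries are `[0, 2, 4]`, so the third paragraph, which held only a stop word, no longer counts as a paragraph.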
(2) Calculate the distribution uniformity coefficient of each word in the word sequence.
Suppose a document has $m$ paragraphs and $C_m$ words, and that step (1) has been applied to it to obtain a word sequence. The distribution uniformity coefficient of the $j$-th word in the word sequence is obtained as follows:
Let the interval $(C_{i-1}+1, C_i)$ denote that paragraph $i$ of the document runs from word $C_{i-1}+1$ to word $C_i$, where $C_0 = 0$ and $i = 1, 2, \ldots, m$; the total number of words in the document is $C_m$. For any word in the document, including the $j$-th word of the word sequence, if the word is uniformly distributed, the probability that it appears in paragraph $i$ is
$$r_i = \frac{C_i - C_{i-1}}{C_m} \quad (i = 1, 2, \ldots, m).$$
Let $n$ be the total number of times the word actually occurs in the document, and $v_i$ the number of times it actually occurs in paragraph $i$. The distribution uniformity coefficient of the $j$-th word is then:
$$X_j^2 = f(v_1, \ldots, v_m, r_1, \ldots, r_m, m, n, a, b)$$
where $X_j^2$ is the distribution uniformity coefficient of the $j$-th word in the word sequence, derived by applying a mathematical transformation to the above variables according to statistical principles. $v_1, \ldots, v_m$, $r_1, \ldots, r_m$, $m$, and $n$ are variables that depend on the statistics of the $j$-th word in the document to be analyzed. $a$ and $b$ are parameters that relate to the preferred embodiment of this patent and must be set according to its specific application.
The distribution uniformity coefficient $X_j^2$ of the $j$-th word in the document to be analyzed, as calculated by this patent, has the following property: the larger its value, the more evenly the $j$-th word occurs in the document. As stated earlier, "within a document, a word that is evenly distributed over a wide range carries more information and is more likely to relate to the content of the document as a whole; a word concentrated in a small range carries less information and is more likely to relate to only one passage of the text" (as shown in Figs. 2 and 3). That is, the more evenly a word is distributed, the larger its partial weight. The distribution uniformity coefficient calculated by this patent is therefore consistent with the actual situation.
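The patent does not disclose the function f, so the following Python sketch is only one possible reading. It computes the classical Pearson statistic $\sum_i (v_i - n r_i)^2 / (n r_i)$, taking $r_i$ as each paragraph's share of the document's words, and then applies an assumed transformation $a / (b + \chi^2)$ so that a larger value means a more even distribution, as the text requires. Pearson's statistic on its own grows as the distribution becomes less even, which is why some such inversion is needed. The transformation, the default parameter values, and all function names are assumptions.

```python
def paragraph_counts(word, word_sequence, boundaries):
    """v_i: occurrences of `word` in each paragraph, given cumulative boundaries."""
    return [
        word_sequence[start:end].count(word)
        for start, end in zip(boundaries, boundaries[1:])
    ]

def pearson_chi_square(v, boundaries):
    """Pearson statistic: sum over paragraphs of (v_i - n*r_i)^2 / (n*r_i)."""
    n = sum(v)
    if n == 0:
        raise ValueError("the word must occur at least once")
    total_words = boundaries[-1]
    chi2 = 0.0
    for v_i, start, end in zip(v, boundaries, boundaries[1:]):
        expected = n * (end - start) / total_words  # n * r_i, paragraphs non-empty
        chi2 += (v_i - expected) ** 2 / expected
    return chi2

def uniformity_coefficient(v, boundaries, a=1.0, b=1.0):
    """ASSUMED form of f: a / (b + chi2). Not the patent's undisclosed formula."""
    return a / (b + pearson_chi_square(v, boundaries))

# Toy document (hypothetical): three paragraphs of two words each.
words = ["x", "y", "x", "z", "x", "w"]
C = [0, 2, 4, 6]
v_x = paragraph_counts("x", words, C)   # once in every paragraph
v_y = paragraph_counts("y", words, C)   # only in the first paragraph
u_x = uniformity_coefficient(v_x, C)
u_y = uniformity_coefficient(v_y, C)
```

In this toy document the word `x` appears once in every paragraph and gets a higher coefficient than `y`, which appears only in the first paragraph.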
(3) Calculate the distribution extent coefficient of each word in the word sequence.
In the document to be analyzed, the extent of a word's distribution is related to the total number of paragraphs in which the word occurs, the distance between the first and last paragraphs in which it occurs, and the total number of paragraphs in the document. Based on statistical principles, this patent designs the calculation of the distribution extent coefficient as follows:
For the $j$-th word in the word sequence, the distribution extent coefficient is obtained from:
$$B_j = \varphi(p, m; c, d, e)$$
where $p$ and $m$ are variables: $p$ is the total number of paragraphs in which the word occurs, and $m$ is the number of paragraphs in the document. $c$, $d$, and $e$ are parameters that relate to the preferred embodiment of this patent and must be set according to its specific application.
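Since the form of the extent function is likewise not disclosed, the sketch below simply assumes $c \cdot (p/m)^d + e$, which increases with the fraction of paragraphs that contain the word. The formula, the default values of `c`, `d`, and `e`, and the helper names are hypothetical and chosen only to make the roles of p and m concrete.

```python
def paragraphs_containing(v):
    """p: number of paragraphs in which the word occurs, from per-paragraph counts."""
    return sum(1 for v_i in v if v_i > 0)

def extent_coefficient(p, m, c=1.0, d=1.0, e=0.0):
    """ASSUMED form of the extent function: c * (p / m) ** d + e."""
    if not 1 <= p <= m:
        raise ValueError("p must satisfy 1 <= p <= m")
    return c * (p / m) ** d + e

# A word found in 1 of 4 paragraphs versus one found in all 4 (toy counts).
b_narrow = extent_coefficient(paragraphs_containing([3, 0, 0, 0]), 4)
b_wide = extent_coefficient(paragraphs_containing([1, 1, 1, 1]), 4)
```

With the default parameters the narrowly distributed word gets 0.25 and the widely distributed word gets 1.0; other parameter choices would change the scale but, for positive `c` and `d`, not the ordering.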
(4) Calculate the word partial weight based on word distribution.
The word partial weight based on word distribution combines the distribution uniformity coefficient and the distribution extent coefficient calculated above, in a calculation designed according to statistical principles.
From the distribution uniformity coefficient and distribution extent coefficient obtained above for the $j$-th word, the partial weight of the $j$-th word can be calculated as a function of these two coefficients.
Here $X_j^2$ and $B_j$ are variables, namely the distribution uniformity coefficient and the distribution extent coefficient; $f$, $g$, and $h$ are parameters that relate to the preferred embodiment of this patent and must be set according to its specific application.
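The combining formula is not reproduced in the text, so the sketch below assumes a simple multiplicative form, $f \cdot (X_j^2)^g \cdot B_j^h$, purely to show how two coefficients and three tunable parameters could interact. The form, the defaults, and the function name are assumptions and should not be read as the patent's method.

```python
def partial_weight(x2, b, f=1.0, g=1.0, h=1.0):
    """ASSUMED combination: f * x2**g * b**h (the patent's formula is not given)."""
    return f * (x2 ** g) * (b ** h)

# Toy coefficients (hypothetical): an even, wide word versus an uneven, narrow one.
w_even_wide = partial_weight(1.0, 1.0)
w_uneven_narrow = partial_weight(0.5, 0.25)
```

Under this assumed form, raising `g` makes the weight more sensitive to uniformity and raising `h` makes it more sensitive to extent.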
To test the present invention, more than 1.5 million web documents published over nearly one year by well-known Chinese online media such as sina and sohu were used as the corpus. The procedure was as follows:
1000 documents were randomly selected from the more than 1.5 million web documents, covering 12 major categories such as news, entertainment, automobiles, and sports. The weights of the corresponding words were obtained in two ways. The first was manual: the 20 most important words in each document were selected by hand and annotated with weights. The annotation process ensured that the term weights of every document were annotated by 10 different employees, and the mean was taken as the final manually annotated term weight. The second was to calculate the term weights of the corresponding words with the method provided by the present invention and with the Boolean weighting, feature frequency, TF, and entropy weighting algorithms, and finally to compare the results with the manual annotations. The test results show that the word partial weight algorithm based on word distribution provided by the present invention yields values closer to the manual annotations than the other weight calculation methods.
The "word partial weight calculating method based on word distribution" of this patent can effectively identify the important words in an article and assign them suitable weights. The technique is applicable to application systems involving information retrieval and semantic matching, such as intelligent search engines, anti-spam, junk information filtering, expert systems, information security, and text data mining.
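The comparison with manual annotation described in the test above might be scored as in the following Python sketch. The text says only that annotator weights were averaged and that the invention's values were closer to the manual ones; it does not name a distance measure, so the use of mean absolute error here is an assumption, as are the function names and the toy numbers.

```python
from statistics import mean

def average_annotations(annotations):
    """Average per-word weights over annotators (one dict per annotator)."""
    words = annotations[0].keys()
    return {w: mean(a[w] for a in annotations) for w in words}

def mean_absolute_error(predicted, reference):
    """ASSUMED closeness measure: mean |predicted - reference| over reference words."""
    return mean(abs(predicted[w] - reference[w]) for w in reference)

# Toy data (hypothetical): two annotators, two candidate weighting methods.
annotations = [{"cat": 1.0, "dog": 0.5}, {"cat": 0.5, "dog": 0.0}]
reference = average_annotations(annotations)
method_a = {"cat": 0.75, "dog": 0.5}
method_b = {"cat": 0.25, "dog": 0.75}
error_a = mean_absolute_error(method_a, reference)
error_b = mean_absolute_error(method_b, reference)
```

A lower error means the method's weights sit closer to the averaged human judgment; in this toy data `method_a` is the closer of the two.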
The foregoing shows and describes the basic principles, principal features, and advantages of the present invention. Those skilled in the art should understand that the invention is not limited by the above embodiments; the embodiments and the description merely illustrate its principles. Various changes and improvements may be made without departing from the spirit and scope of the invention, and all such changes and improvements fall within the scope of the claimed invention. The scope of protection claimed is defined by the appended claims and their equivalents.
Claims (1)
1. A word partial weight calculating method based on word distribution, characterized in that the method comprises the following steps:
(1) preprocessing the document to be analyzed so that it becomes a word sequence containing the main content of the document;
(2) calculating the distribution uniformity coefficient of each word in the word sequence;
(3) calculating the distribution extent coefficient of each word in the word sequence;
(4) calculating the word partial weight based on word distribution.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN200910198890A CN101710317A (en) | 2009-11-17 | 2009-11-17 | Word partial weight calculating method based on word distribution |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN200910198890A CN101710317A (en) | 2009-11-17 | 2009-11-17 | Word partial weight calculating method based on word distribution |
Publications (1)
Publication Number | Publication Date |
---|---|
CN101710317A true CN101710317A (en) | 2010-05-19 |
Family
ID=42403109
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN200910198890A Pending CN101710317A (en) | 2009-11-17 | 2009-11-17 | Word partial weight calculating method based on word distribution |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN101710317A (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104376024A (en) * | 2013-08-16 | 2015-02-25 | 交通运输部科学研究院 | Document similarity detecting method based on seed words |
CN104376024B (en) * | 2013-08-16 | 2017-12-15 | 交通运输部科学研究院 | A kind of document similarity detection method based on seed words |
CN105426379A (en) * | 2014-10-22 | 2016-03-23 | 武汉理工大学 | Keyword weight calculation method based on position of word |
CN106598949A (en) * | 2016-12-22 | 2017-04-26 | 北京金山办公软件股份有限公司 | Method and device for confirming contribution degree of words to text |
CN106598949B (en) * | 2016-12-22 | 2019-01-04 | 北京金山办公软件股份有限公司 | A kind of determination method and device of word to text contribution degree |
CN109409622A (en) * | 2017-08-17 | 2019-03-01 | 北京小度信息科技有限公司 | Test method and device |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106055673B (en) | A kind of Chinese short text sensibility classification method based on text feature insertion | |
CN104268197B (en) | A kind of industry comment data fine granularity sentiment analysis method | |
CN105512311B (en) | A kind of adaptive features select method based on chi-square statistics | |
CN104915448B (en) | A kind of entity based on level convolutional network and paragraph link method | |
CN109933670B (en) | Text classification method for calculating semantic distance based on combined matrix | |
CN101661513B (en) | Detection method of network focus and public sentiment | |
CN107153658A (en) | A kind of public sentiment hot word based on weighted keyword algorithm finds method | |
CN104679738B (en) | Internet hot words mining method and device | |
CN105488092A (en) | Time-sensitive self-adaptive on-line subtopic detecting method and system | |
CN101630312A (en) | Clustering method for question sentences in question-and-answer platform and system thereof | |
CN109165294A (en) | Short text classification method based on Bayesian classification | |
CN106021410A (en) | Source code annotation quality evaluation method based on machine learning | |
CN100495405C (en) | Hierarchy clustering method of successive dichotomy for document in large scale | |
CN107291886A (en) | A kind of microblog topic detecting method and system based on incremental clustering algorithm | |
CN103034726B (en) | Text filtering system and method | |
CN102411621A (en) | Chinese inquiry oriented multi-document automatic abstraction method based on cloud mode | |
CN103294817A (en) | Text feature extraction method based on categorical distribution probability | |
CN103136359A (en) | Generation method of single document summaries | |
CN109508378A (en) | A kind of sample data processing method and processing device | |
CN102662960A (en) | On-line supervised theme-modeling and evolution-analyzing method | |
CN104050556A (en) | Feature selection method and detection method of junk mails | |
CN108388914A (en) | A kind of grader construction method, grader based on semantic computation | |
CN105787121B (en) | A kind of microblogging event summary extracting method based on more story lines | |
CN107145560A (en) | A kind of file classification method and device | |
CN104111925A (en) | Item recommendation method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C12 | Rejection of a patent application after its publication | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20100519 |