CN101710317A - Word partial weight calculating method based on word distribution - Google Patents

Publication number
CN101710317A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN200910198890A
Other languages
Chinese (zh)
Inventor
夏天
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Polytechnic University
Original Assignee
Shanghai Polytechnic University


Abstract

The invention discloses a word partial weight calculating method based on word distribution, which comprises the following steps of: (1) calculating a distribution uniformity coefficient of words in a word sequence; (2) calculating a distribution extent coefficient of the words in the word sequence; and (3) calculating the word partial weight based on word distribution. The invention effectively optimizes the current word weight calculating method, improves the accuracy and promotes the research and application of natural language processing.

Description

Word partial weight calculating method based on word distribution
Technical field:
The present invention relates to a natural language processing method, and in particular to a method for calculating term weights.
Background art:
Since the 1990s, the explosive growth of information on the network has created a need to retrieve accurate information from it. This has driven the rapid development of natural language processing, and applied techniques such as information retrieval, information filtering, text classification, automatic summarization, and question answering have become a focus of recent research. New models such as support vector machines, the vector space model, and latent semantic analysis keep emerging.
All of these models are built on term weight calculation, and the accuracy of the term weights directly affects the final result of natural language processing, as shown in Fig. 1. Each word in a document carries a different amount of the document's information. The importance of a word is expressed by its term weight, and only when the weight of every word is calculated accurately does the semantic information of the document stand out clearly.
Common weighting algorithms, such as Boolean weight, feature frequency, TF-IDF, and entropy, each consider some factor that describes the amount of information a word carries, for example term frequency, document frequency, or the position of the word. Some methods calculate the weight from the behavior of a word within a single document; the result is called the word partial weight (local weight). Others calculate it from the behavior of the word across a document collection; the result is called the word global weight.
The results of existing term weight calculation methods are not accurate enough, which directly affects the results of natural language processing models built on term weighting algorithms.
Summary of the invention:
To address the insufficient accuracy of existing term weight calculation methods, the present invention provides a word partial weight calculating method based on word distribution. The method improves the accuracy of term weight calculation and thereby the accuracy of the natural language processing models that rely on it.
To achieve this object, the present invention adopts the following technical scheme:
A word partial weight calculating method based on word distribution, comprising the following steps:
(1) before the word partial weight is calculated, preprocessing the document to be analyzed by Chinese word segmentation, part-of-speech tagging, stop-word removal, information extraction, and similar operations, so that it becomes a word sequence containing the main content of the document;
(2) calculating the distribution uniformity coefficient of each word in the word sequence;
(3) calculating the distribution extent coefficient of each word in the word sequence;
(4) calculating the word partial weight based on word distribution.
The invention obtained by this technical scheme optimizes current term weight calculation methods, improves their accuracy, and promotes the research and application of natural language processing. It allows natural language processing applications built on term weighting algorithms, such as information retrieval, text classification, and spam filtering, to obtain better results. In practical applications the invention can also be combined with other weighting algorithms to obtain higher accuracy.
Description of drawings:
The present invention is further described below with reference to the drawings and specific embodiments.
Fig. 1 is a schematic diagram of the relationship between term weight calculation methods and the technologies of the natural language processing field.
Fig. 2 is a schematic diagram of the relationship between the distribution uniformity of a word in a document and its term weight.
Fig. 3 is a schematic diagram of the relationship between the distribution uniformity of a word in a document and its term weight.
Fig. 4 is a flow chart of the present invention.
Embodiment:
To make the technical means, inventive features, objects, and effects of the present invention easy to understand, the invention is further described below with reference to the drawings.
The word partial weight is calculated from the statistical behavior of a word within a single document. It takes into account factors within that document that affect the term weight, such as term frequency, word length, and word position. Within a document, a word that is distributed evenly over a wide range carries more information and is more likely to relate to the content of the document as a whole; a word concentrated in a small range carries less information and is more likely to relate only to a particular passage.
This patent studies the distribution of words within a document. Based on the K. Pearson theorem, it designs a distribution uniformity coefficient and a method for calculating it, in order to measure how a word is distributed. Different words in a document have different distribution uniformity coefficients. The larger the coefficient, the more even the distribution of the word, and the larger its partial weight.
However, this statistic describes only how evenly a word is distributed. This patent therefore also uses the extent of a word's distribution to raise the weight of the corresponding word appropriately.
Based on the above principles, the word partial weight algorithm based on word distribution provided by the invention is implemented as follows (see Fig. 4):
(1) Before the word partial weight is calculated, the document to be analyzed is preprocessed by Chinese word segmentation, part-of-speech tagging, stop-word removal, information extraction, and similar operations, so that it becomes a word sequence containing the main content of the document. These are mature techniques in the field and are not described further here.
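As a rough illustration of the output that step (1) is expected to produce, the sketch below turns a document into one list of content words per paragraph and derives the cumulative word counts C_1, ..., C_m used in the next step. It is only a stand-in under stated assumptions: a regular-expression tokenizer replaces real Chinese word segmentation and part-of-speech tagging, and the stop-word list and the function names `preprocess` and `cumulative_word_counts` are hypothetical.

```python
import re
from itertools import accumulate
from typing import List

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "of", "and", "is", "in", "to"}


def preprocess(document: str) -> List[List[str]]:
    """Split a document into paragraphs, each a list of content words.

    Assumption: paragraphs are separated by blank lines, and a simple
    regex tokenizer stands in for word segmentation and POS tagging.
    """
    paragraphs = [p for p in re.split(r"\n\s*\n", document) if p.strip()]
    result = []
    for paragraph in paragraphs:
        tokens = re.findall(r"\w+", paragraph.lower())
        # Drop stop words so that only content-bearing words remain.
        result.append([t for t in tokens if t not in STOP_WORDS])
    return result


def cumulative_word_counts(paragraphs: List[List[str]]) -> List[int]:
    """Return C_1..C_m, the running word totals at each paragraph end (C_0 = 0)."""
    return list(accumulate(len(p) for p in paragraphs))
```

Keeping the paragraph structure, rather than flattening to a single list, preserves the boundaries that the later distribution statistics depend on.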
(2) Calculate the distribution uniformity coefficient of each word in the word sequence.
Suppose a document has m paragraphs and $C_m$ words in total, and that step (1) has produced its word sequence. The distribution uniformity coefficient of the j-th word in the word sequence is obtained as follows.
Let the interval $(C_{i-1}+1,\ C_i)$ denote the i-th paragraph of the document, running from word $C_{i-1}+1$ to word $C_i$, where $C_0 = 0$ and $i = 1, 2, \ldots, m$, so that the total number of words in the document is $C_m$. If any word in the document, including the j-th word of the word sequence, were distributed evenly, the probability $r_i$ of its appearing in the i-th paragraph would be
[Formula image G2009101988900D0000031 is missing from the source text.]
$(i = 1, 2, \ldots, m)$. Let n be the total number of times the word actually occurs in the document and $v_i$ the number of times it actually occurs in the i-th paragraph. The distribution uniformity coefficient of the j-th word is then:
$X_j^2 = f(v_1, \ldots, v_m, r_1, \ldots, r_m, m, n, a, b)$
Here $X_j^2$ is the distribution uniformity coefficient of the j-th word in the word sequence, obtained by applying a mathematical transformation to the above variables according to statistical principles. $v_1, \ldots, v_m$, $r_1, \ldots, r_m$, m, and n are variables that depend on the statistics of the j-th word in the document to be analyzed. a and b are parameters that belong to the preferred embodiment of this patent and must be chosen according to the specific application.
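For orientation, the classical Pearson goodness-of-fit statistic that the text appears to build on can be written out. This is a sketch under assumptions, not the patent's own formula: the expression for r_i is inferred from the paragraph intervals, because the original formula image is missing, and the function f with its parameters a and b is not disclosed.

```latex
r_i = \frac{C_i - C_{i-1}}{C_m},
\qquad
\chi^2 = \sum_{i=1}^{m} \frac{(v_i - n\,r_i)^2}{n\,r_i}
```

The classical statistic is 0 when every observed count v_i equals its expected value n r_i and grows as occurrences concentrate in a few paragraphs. The patent describes its coefficient as growing with evenness, so f presumably applies some decreasing transformation of this quantity.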
The distribution uniformity coefficient $X_j^2$ of the j-th word in the document to be analyzed, as calculated by this patent, has the following property: the larger its value, the more evenly the j-th word occurs throughout the document. As stated above, within a document, a word that is distributed evenly over a wide range carries more information and is more likely to relate to the content of the document as a whole, whereas a word concentrated in a small range carries less information and is more likely to relate only to a particular passage (see Figs. 2 and 3). In other words, the more evenly a word is distributed, the larger its partial weight. The distribution uniformity coefficient calculated by this patent is therefore consistent with actual usage.
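The property described above, a coefficient that increases as a word's occurrences become more even, can be illustrated with a minimal sketch. The patent does not disclose the function f, so everything below is an assumption: the expected probabilities r_i are taken as each paragraph's share of the document's words, the Pearson chi-square statistic measures the deviation from evenness, and the hypothetical transformation a / (b + chi2) turns it into a value that grows with evenness.

```python
from typing import List, Sequence


def expected_probabilities(paragraph_lengths: Sequence[int]) -> List[float]:
    """Assumed r_i: each paragraph's share of the document's words."""
    total = sum(paragraph_lengths)
    if total <= 0 or any(length <= 0 for length in paragraph_lengths):
        raise ValueError("every paragraph must contain at least one word")
    return [length / total for length in paragraph_lengths]


def pearson_chi_square(counts: Sequence[int], probabilities: Sequence[float]) -> float:
    """Classical Pearson statistic: sum of (v_i - n*r_i)^2 / (n*r_i)."""
    n = sum(counts)
    if n == 0:
        raise ValueError("the word does not occur in the document")
    return sum((v - n * r) ** 2 / (n * r) for v, r in zip(counts, probabilities))


def uniformity_coefficient(
    counts: Sequence[int],
    paragraph_lengths: Sequence[int],
    a: float = 1.0,
    b: float = 1.0,
) -> float:
    """Hypothetical stand-in for f: a / (b + chi2), larger when more even.

    counts[i] is v_i, the occurrences of the word in paragraph i.
    a and b are placeholder parameters, not the patent's values.
    """
    if len(counts) != len(paragraph_lengths):
        raise ValueError("counts and paragraph_lengths must have equal length")
    chi2 = pearson_chi_square(counts, expected_probabilities(paragraph_lengths))
    return a / (b + chi2)
```

With three paragraphs of equal length, a word occurring twice in each paragraph scores higher than a word whose six occurrences all fall in the first paragraph, which matches the behavior the text describes.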
(3) Calculate the distribution extent coefficient of each word in the word sequence.
In the document to be analyzed, the extent of a word's distribution is related to the number of paragraphs in which the word occurs, the distance between the first and last paragraphs in which it occurs, and the total number of paragraphs in the document. Based on statistical principles, this patent designs the following method for calculating the distribution extent coefficient.
For the j-th word in the word sequence, the distribution extent coefficient is given by:
$B_j = \phi(p, m; c, d, e)$
where p and m are variables: p is the number of paragraphs of the document in which the word occurs, and m is the number of paragraphs in the document. c, d, and e are parameters that belong to the preferred embodiment of this patent and must be chosen according to the specific application.
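The dependence of B_j on p and m can be illustrated with a small sketch. The actual function is not given in the text, so the form c * (p / m) ** d + e used here is purely hypothetical, as are the default parameter values and the function names; it only shows how the two variables could be derived from per-paragraph counts and combined.

```python
from typing import Sequence


def paragraphs_containing(counts: Sequence[int]) -> int:
    """p: the number of paragraphs in which the word occurs at least once."""
    return sum(1 for v in counts if v > 0)


def extent_coefficient(
    counts: Sequence[int],
    c: float = 1.0,
    d: float = 1.0,
    e: float = 0.0,
) -> float:
    """Hypothetical B_j = c * (p / m) ** d + e.

    counts[i] is the number of occurrences of the word in paragraph i,
    so m = len(counts). c, d, e are placeholder parameters.
    """
    m = len(counts)
    if m == 0:
        raise ValueError("the document must have at least one paragraph")
    p = paragraphs_containing(counts)
    return c * (p / m) ** d + e
```

Under these placeholder parameters, a word present in two of four paragraphs gets 0.5 and a word present in all four gets 1.0.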
(4) Calculate the word partial weight based on word distribution.
The word partial weight based on word distribution combines the distribution uniformity coefficient and the distribution extent coefficient calculated above, using a calculation method designed according to statistical principles.
From the distribution uniformity coefficient and the distribution extent coefficient of the j-th word obtained above, the partial weight of the j-th word is calculated as:
[Formula image G2009101988900D0000041 is missing from the source text.]
where $X_j^2$ and $B_j$ are variables, namely the distribution uniformity coefficient and the distribution extent coefficient, and f, g, and h are parameters that belong to the preferred embodiment of this patent and must be chosen according to the specific application.
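Because the combining formula survives only as a missing image, the following is a sketch of one way the two coefficients and three parameters might be combined, not the patent's formula. The weighted product f * X2 ** g * B ** h and the function names are assumptions chosen for illustration; they merely respect the stated behavior that a larger X_j^2 or B_j yields a larger weight.

```python
from typing import Dict, List, Tuple


def partial_weight(x2: float, b: float, f: float = 1.0, g: float = 1.0, h: float = 1.0) -> float:
    """Hypothetical combination: f * x2**g * b**h.

    x2 is the uniformity coefficient X_j^2 and b the coefficient B_j.
    f, g, h are placeholder parameters, not the patent's values.
    """
    if x2 < 0 or b < 0:
        raise ValueError("coefficients are expected to be non-negative")
    return f * (x2 ** g) * (b ** h)


def rank_words(coefficients: Dict[str, Tuple[float, float]]) -> List[Tuple[str, float]]:
    """Rank words by partial weight, highest first.

    coefficients maps each word to its (X_j^2, B_j) pair.
    """
    weights = {word: partial_weight(x2, b) for word, (x2, b) in coefficients.items()}
    return sorted(weights.items(), key=lambda item: item[1], reverse=True)
```

A multiplicative form means a word must score well on both evenness and extent to receive a high weight; an additive form would be an equally plausible choice.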
The invention was tested on a corpus of more than 1.5 million web documents published over nearly one year by well-known Chinese online media such as Sina and Sohu. The procedure was as follows.
1,000 documents were selected at random from the more than 1.5 million web documents, covering 12 major categories including news, entertainment, automobiles, and sports. The weights of the words were obtained in two ways. The first was manual: the 20 most important words in each document were selected by hand and assigned weights. The term weights of every document were annotated by 10 different staff members, and the mean of their values was taken as the final manually annotated term weight. The second was computational: the method provided by the invention, Boolean weight, feature frequency, TF, and the entropy weighting algorithm were each used to calculate the weights of the same words, and the results were compared with the manual annotations. The test results show that the word partial weight algorithm based on word distribution provided by the invention comes closer to the manually annotated values than the other weight calculation methods.
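The comparison described above can be sketched as follows. The text does not say how closeness to the manual annotation was measured, so mean absolute error is used here as an assumed metric, and the sketch assumes that computed and annotated weights are already on a comparable scale. The function names and data layout are hypothetical.

```python
from statistics import mean
from typing import Dict, List


def consensus_weights(annotations: Dict[str, List[float]]) -> Dict[str, float]:
    """Average the weights assigned by the annotators to each word."""
    return {word: mean(values) for word, values in annotations.items()}


def mean_absolute_error(predicted: Dict[str, float], reference: Dict[str, float]) -> float:
    """Assumed closeness metric: mean |predicted - reference| over shared words."""
    shared = [word for word in reference if word in predicted]
    if not shared:
        raise ValueError("no words in common between prediction and reference")
    return mean(abs(predicted[word] - reference[word]) for word in shared)


def closest_method(methods: Dict[str, Dict[str, float]], reference: Dict[str, float]) -> str:
    """Return the name of the method whose weights have the smallest error."""
    return min(methods, key=lambda name: mean_absolute_error(methods[name], reference))
```

In the test reported in the text, each document would contribute 20 annotated words with 10 annotator values per word, and the error would be aggregated over the 1,000 sampled documents.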
The word partial weight calculating method based on word distribution of this patent can effectively identify the important words in an article and assign them suitable weights. The technique is applicable to systems involving information retrieval and semantic matching, such as intelligent search engines, anti-spam systems, junk information filtering, expert systems, information security, and text data mining.
The basic principles, main features, and advantages of the present invention have been shown and described above. Those skilled in the art should understand that the invention is not limited to the embodiments described above. The embodiments and the description merely illustrate the principles of the invention. Various changes and improvements may be made without departing from the spirit and scope of the invention, and all of them fall within the scope of the claimed invention. The scope of protection claimed is defined by the appended claims and their equivalents.

Claims (1)

1. A word partial weight calculating method based on word distribution, characterized in that the method comprises the following steps:
(1) preprocessing the document to be analyzed so that it becomes a word sequence containing the main content of the document;
(2) calculating the distribution uniformity coefficient of each word in the word sequence;
(3) calculating the distribution extent coefficient of each word in the word sequence;
(4) calculating the word partial weight based on word distribution.
CN200910198890A 2009-11-17 2009-11-17 Word partial weight calculating method based on word distribution Pending CN101710317A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN200910198890A CN101710317A (en) 2009-11-17 2009-11-17 Word partial weight calculating method based on word distribution

Publications (1)

Publication Number Publication Date
CN101710317A true CN101710317A (en) 2010-05-19

Family

ID=42403109

Country Status (1)

Country Link
CN (1) CN101710317A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104376024A (en) * 2013-08-16 2015-02-25 交通运输部科学研究院 Document similarity detecting method based on seed words
CN104376024B (en) * 2013-08-16 2017-12-15 交通运输部科学研究院 A kind of document similarity detection method based on seed words
CN105426379A (en) * 2014-10-22 2016-03-23 武汉理工大学 Keyword weight calculation method based on position of word
CN106598949A (en) * 2016-12-22 2017-04-26 北京金山办公软件股份有限公司 Method and device for confirming contribution degree of words to text
CN106598949B (en) * 2016-12-22 2019-01-04 北京金山办公软件股份有限公司 A kind of determination method and device of word to text contribution degree
CN109409622A (en) * 2017-08-17 2019-03-01 北京小度信息科技有限公司 Test method and device

Similar Documents

Publication Publication Date Title
CN106055673B (en) A kind of Chinese short text sensibility classification method based on text feature insertion
CN104268197B (en) A kind of industry comment data fine granularity sentiment analysis method
CN105512311B (en) A kind of adaptive features select method based on chi-square statistics
CN104915448B (en) A kind of entity based on level convolutional network and paragraph link method
CN109933670B (en) Text classification method for calculating semantic distance based on combined matrix
CN101661513B (en) Detection method of network focus and public sentiment
CN107153658A (en) A kind of public sentiment hot word based on weighted keyword algorithm finds method
CN104679738B (en) Internet hot words mining method and device
CN105488092A (en) Time-sensitive self-adaptive on-line subtopic detecting method and system
CN101630312A (en) Clustering method for question sentences in question-and-answer platform and system thereof
CN109165294A (en) Short text classification method based on Bayesian classification
CN106021410A (en) Source code annotation quality evaluation method based on machine learning
CN100495405C (en) Hierarchy clustering method of successive dichotomy for document in large scale
CN107291886A (en) A kind of microblog topic detecting method and system based on incremental clustering algorithm
CN103034726B (en) Text filtering system and method
CN102411621A (en) Chinese inquiry oriented multi-document automatic abstraction method based on cloud mode
CN103294817A (en) Text feature extraction method based on categorical distribution probability
CN103136359A (en) Generation method of single document summaries
CN109508378A (en) A kind of sample data processing method and processing device
CN102662960A (en) On-line supervised theme-modeling and evolution-analyzing method
CN104050556A (en) Feature selection method and detection method of junk mails
CN108388914A (en) A kind of grader construction method, grader based on semantic computation
CN105787121B (en) A kind of microblogging event summary extracting method based on more story lines
CN107145560A (en) A kind of file classification method and device
CN104111925A (en) Item recommendation method and device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C12 Rejection of a patent application after its publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20100519