CN102053978A - Method and device for extracting subject term from simple sentence - Google Patents

Publication number
CN102053978A
Authority
CN
China
Prior art keywords
polynary
keyword
combination
current keyword
polynary combination
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN200910209406XA
Other languages
Chinese (zh)
Other versions
CN102053978B (en)
Inventor
姜中博
刘怀军
方高林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Shiji Guangsu Information Technology Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN200910209406.XA
Publication of CN102053978A
Application granted
Publication of CN102053978B
Legal status: Active
Anticipated expiration

Abstract

The invention discloses a method and a device for extracting a subject term from a simple sentence, belonging to the technical field of subject term extraction. The method comprises: counting a plurality of keywords and a plurality of multi-word combinations in a corpus, calculating the feature value of each keyword, and determining the sequence number of each multi-word combination; taking each keyword of a simple sentence in turn as the current keyword, extracting the multi-word combinations that contain the current keyword, calculating the rank of the current keyword from the sequence numbers of the extracted combinations, and calculating the weight of the current keyword from its feature value and rank; and, once the weights of all keywords of the simple sentence have been obtained, selecting the subject term of the simple sentence from those keywords according to the weights. The device comprises a statistics module, a feature value calculation module, a sequence calculation module, an extraction module, a rank calculation module, a weight calculation module and a subject term selection module. Because the text information of the simple sentence itself is fully used, the accuracy and efficiency of extracting a subject term from a simple sentence can be improved.

Description

Method and device for extracting a subject term from a simple sentence
Technical field
The present invention relates to the technical field of subject term extraction, and in particular to a method and a device for extracting a subject term from a simple sentence.
Background
A corpus is a body of linguistic data, that is, text. The objective and detailed linguistic evidence provided by a large-scale corpus is commonly used to support linguistic research and to guide the development of natural language processing systems. A simple sentence is a sentence made up of a phrase or a single word. A subject term is a word in a text that represents the characteristics of its content, describes its topic, and plays a key role. Existing subject term extraction usually uses one of two methods: dependency analysis, or TF-IDF (Term Frequency-Inverse Document Frequency).
Dependency analysis parses the modification relationships between the parts of a sentence. It analyzes the modification relationships among the words of a simple sentence and then determines the subject term from those relationships. It has two shortcomings. First, it is not stable enough on complex real-world sentences and often fails to achieve the expected effect. Second, a result determined from modification relationships may not be accurate enough, because the subject term is sometimes the modifier and sometimes the modified word.
The TF-IDF method first counts the TF (term frequency) and DF (document frequency) in the corpus, derives the IDF from the DF, and determines the subject term from the values of TF, DF and IDF. This method is well suited to extracting subject terms from an article: usually, the higher the TF of a word, the more important the word and the more likely it is to be a subject term. When a subject term is extracted from a simple sentence, however, the method has a serious defect: the TF of each word in a simple sentence is usually close to 1, so words cannot be distinguished statistically and TF can no longer indicate the importance of a word.
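The weakness of TF on a simple sentence can be illustrated with a minimal sketch. The tokenized sentence below is a hypothetical example, not taken from the patent, and the helper name `term_frequencies` is likewise an assumption made for illustration.

```python
from collections import Counter


def term_frequencies(tokens):
    """Return the raw occurrence count of each token in one tokenized text."""
    return Counter(tokens)


# Hypothetical simple sentence, already split into words.
sentence = ["beijing", "film", "release", "date"]
tf = term_frequencies(sentence)

# Every word occurs exactly once, so TF alone cannot separate the words.
print(tf)
```

Because all counts are equal, any ordering of the words by TF is arbitrary, which is the defect described above.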
Summary of the invention
To overcome the defects of the prior art, embodiments of the present invention provide a method and a device for extracting a subject term from a simple sentence. The technical solution is as follows:
A method for extracting a subject term from a simple sentence, the method comprising:
counting a plurality of keywords and a plurality of multi-word combinations in a corpus collected in advance;
calculating the feature value of each of the plurality of keywords;
determining, according to the frequency with which each of the plurality of multi-word combinations occurs in the corpus, the sequence number of each multi-word combination among the plurality of multi-word combinations;
taking each keyword in the simple sentence in turn as the current keyword, extracting from the plurality of multi-word combinations those that contain the current keyword, calculating the rank of the current keyword from the sequence numbers of the extracted multi-word combinations, and calculating the weight of the current keyword from its feature value and rank;
after the weights of all keywords in the simple sentence have been obtained, selecting the subject term of the simple sentence from all keywords of the simple sentence according to the weights.
Calculating the feature value of each of the plurality of keywords specifically comprises:
calculating the feature value of each keyword from the chi-square statistic, information gain, entropy and document frequency of the keyword.
Determining the sequence number of each multi-word combination according to the frequency with which it occurs in the corpus specifically comprises:
calculating the frequency with which each of the plurality of multi-word combinations occurs in the corpus;
arranging the plurality of multi-word combinations in descending order of frequency, and using the position of each multi-word combination in that arrangement as its sequence number.
Further, after the plurality of multi-word combinations are arranged in descending order of frequency, the method also comprises:
filtering the plurality of multi-word combinations by a preset frequency threshold to retain the multi-word combinations whose frequency is higher than the threshold, or, starting from the multi-word combination with the highest frequency, retaining a preset number of multi-word combinations from the arranged combinations;
Correspondingly, extracting from the plurality of multi-word combinations those that contain the current keyword is specifically:
selecting, from the multi-word combinations retained by the filtering, those that contain the current keyword.
Calculating the rank of the current keyword from the sequence numbers of the extracted multi-word combinations specifically comprises:
taking the reciprocal of the sequence number of each extracted multi-word combination, summing all the reciprocals, and using the sum as the rank of the current keyword.
Alternatively, calculating the rank of the current keyword from the sequence numbers of the extracted multi-word combinations specifically comprises:
for each extracted multi-word combination, adding up the feature values of all words in the combination, dividing the feature value of the current keyword by that sum to obtain the selection preference of the current keyword in the combination, and dividing the selection preference by the sequence number of the combination to obtain an intermediate value;
summing the intermediate values of all extracted multi-word combinations, and using the sum as the rank of the current keyword.
Calculating the weight of the current keyword from its feature value and rank specifically comprises:
multiplying the feature value of the current keyword by its rank, and using the product as the weight of the current keyword.
A device for extracting a subject term from a simple sentence, the device comprising:
a statistics module, configured to count a plurality of keywords and a plurality of multi-word combinations in a corpus collected in advance;
a feature value calculation module, configured to calculate the feature value of each of the plurality of keywords;
a sequence calculation module, configured to determine, according to the frequency with which each of the plurality of multi-word combinations occurs in the corpus, the sequence number of each multi-word combination among the plurality of multi-word combinations;
an extraction module, configured to take each keyword in the simple sentence in turn as the current keyword and to extract, from the multi-word combinations obtained by the sequence calculation module, those that contain the current keyword;
a rank calculation module, configured to calculate the rank of the current keyword from the sequence numbers of the multi-word combinations extracted by the extraction module;
a weight calculation module, configured to calculate the weight of the current keyword from the feature value obtained by the feature value calculation module and the rank obtained by the rank calculation module;
a subject term selection module, configured to select, after the weights of all keywords in the simple sentence have been obtained, the subject term of the simple sentence from all keywords of the simple sentence according to the weights.
The feature value calculation module specifically comprises:
a feature value calculation unit, configured to calculate the feature value of each of the plurality of keywords from the chi-square statistic, information gain, entropy and document frequency of the keyword.
The sequence calculation module specifically comprises:
a sequence calculation unit, configured to calculate the frequency with which each of the plurality of multi-word combinations occurs in the corpus, to arrange the plurality of multi-word combinations in descending order of frequency, and to use the position of each multi-word combination in that arrangement as its sequence number.
Further, the sequence calculation module also comprises:
a multi-word combination filtering unit, configured to filter the plurality of multi-word combinations, after the sequence calculation unit has arranged them, by a preset frequency threshold to retain the multi-word combinations whose frequency is higher than the threshold, or, starting from the multi-word combination with the highest frequency, to retain a preset number of multi-word combinations from the arranged combinations.
The rank calculation module specifically comprises:
a first rank calculation unit, configured to take the reciprocal of the sequence number of each multi-word combination extracted by the extraction module, to sum all the reciprocals, and to use the sum as the rank of the current keyword.
Alternatively, the rank calculation module specifically comprises:
a second rank calculation unit, configured to, for each multi-word combination extracted by the extraction module, add up the feature values of all words in the combination, divide the feature value of the current keyword by that sum to obtain the selection preference of the current keyword in the combination, and divide the selection preference by the sequence number of the combination to obtain an intermediate value; and to sum the intermediate values of all extracted multi-word combinations and use the sum as the rank of the current keyword.
The weight calculation module specifically comprises:
a weight calculation unit, configured to multiply the feature value of the current keyword obtained by the feature value calculation module by the rank of the current keyword obtained by the rank calculation module, and to use the product as the weight of the current keyword.
The method and device provided by the embodiments of the present invention extract the subject term using keyword weights calculated from feature value and rank. They make full use of the text information of the simple sentence itself and improve the accuracy and efficiency of subject term extraction from a simple sentence. Compared with the existing dependency analysis method, they overcome its poor stability and low accuracy. Compared with the existing TF-IDF method, they are better suited to extracting subject terms from simple sentences and overcome the low accuracy of that method on simple sentences.
Description of the drawings
Fig. 1 is a flowchart of a method for extracting a subject term from a simple sentence according to an embodiment of the present invention;
Fig. 2 is another flowchart of the method for extracting a subject term from a simple sentence according to an embodiment of the present invention;
Fig. 3 is a structural diagram of a device for extracting a subject term from a simple sentence according to an embodiment of the present invention;
Fig. 4 is another structural diagram of the device for extracting a subject term from a simple sentence according to an embodiment of the present invention.
Detailed description of the embodiments
To make the purpose, technical solutions and advantages of the present invention clearer, embodiments of the present invention are described in further detail below with reference to the accompanying drawings.
Referring to Fig. 1, an embodiment of the present invention provides a method for extracting a subject term from a simple sentence, comprising:
101: counting a plurality of keywords and a plurality of multi-word combinations in a corpus;
102: calculating the feature value of each of the keywords;
103: determining the sequence number of each multi-word combination according to the frequency with which the multi-word combinations occur in the corpus;
104: taking each keyword in the simple sentence in turn as the current keyword, extracting from the multi-word combinations obtained in 101 those that contain the current keyword, calculating the rank of the current keyword from the sequence numbers of the extracted combinations, and calculating the weight of the current keyword from its feature value and rank;
105: after the weights of all keywords in the simple sentence have been obtained, selecting the subject term of the simple sentence from all the keywords according to the weights.
In this embodiment, 101 to 103 are usually performed at initialization as offline steps, and their results are saved for use when subject terms are extracted from simple sentences; 104 and 105 are the online steps that extract the subject term from a simple sentence. The steps are described here for one simple sentence; to extract subject terms from several simple sentences, 104 and 105 are repeated for each sentence.
Referring to Fig. 2, the method for extracting a subject term from a simple sentence may proceed in detail as follows:
201: counting a plurality of keywords and a plurality of multi-word combinations in a corpus collected in advance;
The corpus is collected in advance and includes articles, simple sentences and the like. A large corpus is usually collected to improve the accuracy of subject term extraction. The pre-collected corpus may be static or dynamic. For example, text may be collected from websites periodically to update the saved corpus, so that the sources of the corpus are richer and the extracted subject terms better meet current requirements.
The embodiment does not limit the number of keywords and multi-word combinations that are counted. Preferably, all keywords and all multi-word combinations in the corpus are counted, to improve accuracy. A multi-word combination is a combination made up of several words, including but not limited to binary combinations and/or ternary combinations. A binary combination is made up of two words, which may or may not be adjacent in the corpus; that is, it covers both adjacent and non-adjacent binary combinations. A ternary combination is made up of three words, any two of which may or may not be adjacent in the corpus; that is, it covers both adjacent and non-adjacent ternary combinations. The words in a multi-word combination have fixed positions. For example, in the binary combination Trigger(x, y), x is the first word and y is the second word, and x and y may or may not be adjacent in the corpus.
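The counting in 201 can be sketched as follows. This is only an illustration under stated assumptions: the corpus is assumed to be a list of already-tokenized sentences, only binary combinations are counted, and a pair is formed from any two words of the same sentence with the earlier word in the first position. The function name and the sample corpus are hypothetical.

```python
from collections import Counter
from itertools import combinations


def count_keywords_and_pairs(corpus):
    """Count keywords and position-sensitive binary combinations.

    corpus: list of sentences, each a list of word strings (assumed format).
    A pair (x, y) is counted whenever x occurs before y in the same sentence,
    whether or not the two words are adjacent.
    """
    keyword_counts = Counter()
    pair_counts = Counter()
    for tokens in corpus:
        keyword_counts.update(tokens)
        # combinations() keeps the input order, so x always precedes y.
        pair_counts.update(combinations(tokens, 2))
    return keyword_counts, pair_counts


corpus = [
    ["beijing", "film", "release"],
    ["film", "release", "date"],
]
keyword_counts, pair_counts = count_keywords_and_pairs(corpus)
print(pair_counts[("film", "release")])     # adjacent pair seen in both sentences
print(pair_counts[("beijing", "release")])  # non-adjacent pair seen once
```

Ternary combinations could be counted the same way with `combinations(tokens, 3)`; restricting pairs to a sentence is a design choice of this sketch, not something the text prescribes.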
202: calculating the feature value of each of the keywords; in this embodiment, specifically: calculating the feature value of each keyword from its chi-square statistic ($\chi^2$), information gain (IG, Information Gain), entropy (Entropy) and document frequency (DF);
The four features, chi-square statistic, information gain, entropy and document frequency, describe a keyword effectively.
The calculation can be expressed by the following formula:
$$\mathrm{Feature}(t) = f\big(\chi^2(t),\ \mathrm{IG}(t),\ \mathrm{Entropy}(t),\ \mathrm{DF}(t)\big) \tag{1}$$
where t denotes a keyword, Feature(t) is the feature value of the keyword, and f is a user-defined feature value function or algorithm, for example the sum or the product of the four values. An existing feature value algorithm may also be used; the embodiment does not specifically limit this.
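As a minimal sketch of formula (1), the function below takes the four statistics as already-computed numbers and combines them by sum or by product, the two options named in the text. The function name, the `combine` parameter and the sample values are assumptions made for illustration; how the four statistics themselves are computed is outside this sketch.

```python
import math


def feature_value(chi_square, info_gain, entropy, doc_freq, combine="sum"):
    """Combine the four keyword statistics into one feature value.

    combine="sum" adds the four values; combine="product" multiplies them.
    """
    values = (chi_square, info_gain, entropy, doc_freq)
    if combine == "sum":
        return sum(values)
    if combine == "product":
        return math.prod(values)
    raise ValueError(f"unknown combine mode: {combine!r}")


# Hypothetical statistics for one keyword.
print(feature_value(1.0, 2.0, 3.0, 4.0))                     # sum of the four values
print(feature_value(1.0, 2.0, 3.0, 4.0, combine="product"))  # product of the four values
```

In practice the four statistics have very different scales, so a real choice of f would probably normalize them first; that step is not described in the text and is left out here.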
203: calculating the frequency with which each of the multi-word combinations occurs in the corpus;
Usually, the higher the frequency of a multi-word combination in the corpus, the more important the combination is considered to be.
204: arranging the multi-word combinations in descending order of frequency, and using the position of each combination in that arrangement as its sequence number;
Specifically, the sequence number of a multi-word combination n-gram can be written as $\mathrm{Seq}_{n\text{-gram}}(a_1, a_2, \ldots, a_i, \ldots, a_n)$, where $a_1, a_2, \ldots, a_i, \ldots, a_n$ are the n words of the combination. When the multi-word combination is a binary combination Trigger, its sequence number can be written as $\mathrm{Seq}_{\mathrm{Trigger}}(t_1, t_2)$, where $t_1$ and $t_2$ are the two words of the binary combination.
The arrangement is illustrated as follows: there are five binary combinations, Trigger1 to Trigger5, whose frequencies in descending order are 1500, 920, 350, 20 and 1. Their positions in the arrangement are 1, 2, 3, 4 and 5, and these five numbers are used as the sequence numbers of Trigger1 to Trigger5 respectively.
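Steps 203 and 204 can be sketched with the five-combination example above. The function name is an assumption; the frequencies are the ones given in the text, entered in shuffled order to show that the sort determines the numbering.

```python
def assign_sequence_numbers(freqs):
    """Map each combination to its 1-based position in descending frequency order."""
    ordered = sorted(freqs, key=freqs.get, reverse=True)
    return {combo: position for position, combo in enumerate(ordered, start=1)}


freqs = {
    "Trigger3": 350,
    "Trigger1": 1500,
    "Trigger5": 1,
    "Trigger2": 920,
    "Trigger4": 20,
}
seq = assign_sequence_numbers(freqs)
print(seq["Trigger1"], seq["Trigger5"])
```

The text does not say how ties in frequency are numbered; in this sketch tied combinations keep their input order because Python's sort is stable.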
205: filtering the multi-word combinations according to a preset rule, which includes but is not limited to filtering by a preset frequency threshold or by a preset number, as follows:
filtering the multi-word combinations by the preset frequency threshold, retaining the combinations whose frequency is higher than the threshold;
or, starting from the combination with the highest frequency, retaining a preset number of combinations from the arranged multi-word combinations.
For example, if the frequency threshold is set to 500, the multi-word combinations with a frequency greater than 500 are retained and the rest are discarded; or, if the number is set to 700 and there are 1000 multi-word combinations, the first 700 combinations counted from the highest frequency are retained and the remaining 300 are discarded.
The preset frequency threshold and the preset number can be changed at any time as required.
In this step, the filtering may retain all of the multi-word combinations or only some of them, and the retained part may consist of one combination or several.
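The two filtering rules of 205 can be sketched as below. The function names and the sample frequencies (reused from the Trigger1 to Trigger5 example) are assumptions for illustration.

```python
def filter_by_threshold(freqs, threshold):
    """Retain the combinations whose frequency is strictly above the threshold."""
    return {combo: f for combo, f in freqs.items() if f > threshold}


def filter_top_n(freqs, n):
    """Retain the n most frequent combinations, highest frequency first."""
    ordered = sorted(freqs.items(), key=lambda item: item[1], reverse=True)
    return dict(ordered[:n])


freqs = {"Trigger1": 1500, "Trigger2": 920, "Trigger3": 350, "Trigger4": 20, "Trigger5": 1}
above_500 = filter_by_threshold(freqs, 500)
top_3 = filter_top_n(freqs, 3)
print(sorted(above_500))
print(list(top_3))
```

Both functions return the retained combinations with their frequencies, so sequence numbers assigned before filtering can still be looked up afterwards.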
206: splitting a simple sentence into separate keywords; the split may yield one keyword or several, and usually yields several;
Further, the keywords obtained by splitting may be filtered to remove words that carry no meaning or little meaning, including but not limited to stop words and irrelevant parts of speech. Stop words are common words or symbols without meaning of their own, for example "he", "with", "can", and so on. Irrelevant parts of speech include conjunctions, descriptive words, pronouns and the like; for example, the pronouns "you", "I" and "it", and the conjunctions "and" and "with". The subsequent steps are then performed on the keywords that remain after filtering.
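Step 206 and the optional filtering can be sketched as follows. This is a simplification under loud assumptions: the sentence is English and split on whitespace, whereas the patent's Chinese text would need a word segmenter, and the stop-word set is a small hypothetical list rather than the one used by the inventors. Part-of-speech filtering is not shown.

```python
# Hypothetical stop-word list for illustration only.
STOP_WORDS = {"the", "of", "and", "with", "he", "can"}


def split_and_filter(sentence, stop_words=STOP_WORDS):
    """Split a sentence on whitespace and drop stop words."""
    return [word for word in sentence.lower().split() if word not in stop_words]


keywords = split_and_filter("the release date of the film")
print(keywords)
```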
207: taking each keyword obtained by splitting in turn as the current keyword, and extracting, from the multi-word combinations retained by the filtering, those that contain the current keyword;
If the keywords split in 206 were filtered, each keyword remaining after that filtering is taken in turn as the current keyword in this step.
The current keyword may be at any position in a selected multi-word combination. The selected combinations may thus include binary combinations in which the current keyword is the first word, binary combinations in which it is the second word, and multi-word combinations in which it is the i-th word. For example, the selected combinations include the binary combination (Beijing, film), the binary combination (film, release date) and the ternary combination (premiere, film, theater), where the current keyword "film" is at the second position in the first binary combination, at the first position in the second binary combination, and at the second position in the ternary combination.
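The extraction in 207 amounts to a membership test at any position. A minimal sketch, using the combinations from the example above plus one unrelated pair added for contrast (the extra pair and the function name are assumptions):

```python
def combos_containing(keyword, combos):
    """Return the combinations in which the keyword occurs at any position."""
    return [combo for combo in combos if keyword in combo]


combos = [
    ("beijing", "film"),
    ("film", "release date"),
    ("premiere", "film", "theater"),
    ("beijing", "weather"),  # hypothetical pair that does not contain "film"
]
selected = combos_containing("film", combos)
print(selected)
```

For a large set of combinations an inverted index from word to combinations would avoid scanning the whole list for every keyword; the linear scan here is only for clarity.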
208: taking the reciprocal of the sequence number of each multi-word combination extracted in 207, summing all the reciprocals, and using the sum as the rank of the current keyword;
The rank of a keyword indicates the role the word plays in the simple sentence, that is, how likely it is to be the subject term. It is calculated as follows:
$$\mathrm{Rank}(t) = \sum_{i=1}^{n} \sum \frac{1}{\mathrm{Seq}_{n\text{-gram}}(a_1, a_2, \ldots, a_i, \ldots, a_n)}, \quad a_i = t \tag{2}$$
where t is the current keyword, Rank(t) is the rank of the current keyword, and $a_i = t$ means that the i-th word $a_i$ of the multi-word combination n-gram is the current keyword t. The inner sum $\sum \frac{1}{\mathrm{Seq}_{n\text{-gram}}(a_1, a_2, \ldots, a_i, \ldots, a_n)}$ runs over all multi-word combinations in which t appears at the i-th word position and gives the rank obtained by summing the reciprocals of their sequence numbers. Because the current keyword t can appear at any position of a multi-word combination, from 1 to n, the ranks obtained for all positions are summed: the outer sum $\sum_{i=1}^{n}$ adds the ranks for t at positions $a_1, a_2, \ldots, a_i, \ldots, a_n$ to give the rank of the current keyword.
Specifically, if the multi-word combination is a binary combination Trigger, formula (2) can be written as:
$$\mathrm{Rank}(t) = \sum_{i=1}^{m} \frac{1}{\mathrm{Seq}_{\mathrm{Trigger}}(t, t_i)} + \sum_{j=1}^{n} \frac{1}{\mathrm{Seq}_{\mathrm{Trigger}}(t_j, t)} \tag{3}$$
where t is the current keyword and Rank(t) is its rank. $\mathrm{Seq}_{\mathrm{Trigger}}(t, t_i)$ is the sequence number of the binary combination whose first word is the current keyword and whose second word is $t_i$, with i = 1, 2, ..., m, where m is the number of binary combinations in which t appears at the first position; $\sum_{i=1}^{m} \frac{1}{\mathrm{Seq}_{\mathrm{Trigger}}(t, t_i)}$ is the sum of the reciprocals of the sequence numbers of all binary combinations in which t appears at the first position. $\mathrm{Seq}_{\mathrm{Trigger}}(t_j, t)$ is the sequence number of the binary combination whose first word is $t_j$ and whose second word is the current keyword t, with j = 1, 2, ..., n, where n is the number of binary combinations in which t appears at the second position; $\sum_{j=1}^{n} \frac{1}{\mathrm{Seq}_{\mathrm{Trigger}}(t_j, t)}$ is the sum of the reciprocals of the sequence numbers of all binary combinations in which t appears at the second position.
If several of the keywords obtained by splitting in 206 appear in the multi-word combinations retained by the filtering in 205, 207 and 208 are repeated for each of them to obtain the rank of each keyword.
If some keywords obtained by splitting in 206 do not appear in any multi-word combination retained by the filtering in 205, the rank of those keywords is set directly to a default value. The default value can be specified and changed as required, for example set to 1 for all such keywords; the present invention does not specifically limit this.
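The reciprocal-sum rank of formulas (2) and (3), including the default for keywords that appear in no retained combination, can be sketched as below. The function name, the default of 1 and the sample sequence numbers are assumptions chosen so that the arithmetic is easy to follow.

```python
def rank(keyword, seq, default=1.0):
    """Sum 1/Seq over every position at which the keyword occurs.

    seq: dict mapping each retained combination (a tuple of words) to its
         sequence number.
    default: value returned when no retained combination contains the keyword.
    """
    total = 0.0
    found = False
    for combo, number in seq.items():
        for word in combo:
            if word == keyword:
                total += 1.0 / number
                found = True
    return total if found else default


seq = {
    ("beijing", "film"): 1,
    ("film", "date"): 2,
    ("premiere", "film", "theater"): 4,
}
print(rank("film", seq))     # 1/1 + 1/2 + 1/4
print(rank("date", seq))     # 1/2
print(rank("weather", seq))  # not in any combination, so the default applies
```

A keyword that occurs in many high-frequency combinations (small sequence numbers) therefore receives a high rank, while one that occurs only in rare combinations contributes small reciprocals.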
209: for each keyword in the simple sentence, whether or not it appears in the above multi-word combinations, multiplying the feature value of the word by its rank and using the product as the weight of the keyword, as follows:
$$\mathrm{Weight}(t) = \mathrm{Feature}(t) \times \mathrm{Rank}(t) \tag{4}$$
where t denotes a keyword and Weight(t) is the weight of the keyword. Usually, the larger the weight of a keyword, the more likely the keyword is to be a subject term. Besides multiplying the feature value by the rank, other ways of calculating the weight may be used; the present invention does not specifically limit this.
In this embodiment, the keywords of a simple sentence fall into two classes. The first class is the keywords that appear in the counted multi-word combinations; the second class is the keywords that do not. For the first class, the operations above are repeated to obtain the weight. For the second class, the rank is set to the default value and the weight is then calculated from the feature value and the rank. In the end, every keyword has a weight. For example, if a simple sentence contains 10 keywords, 10 weights are obtained, one for each keyword.
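Formula (4) and the two-class handling can be sketched as follows. The feature values and ranks are hypothetical inputs; a keyword missing from `ranks` is treated as belonging to the second class and receives the default rank.

```python
def keyword_weights(keywords, features, ranks, default_rank=1.0):
    """Weight(t) = Feature(t) * Rank(t), with a default rank for unseen keywords."""
    return {t: features[t] * ranks.get(t, default_rank) for t in keywords}


features = {"film": 2.0, "date": 1.5, "weather": 3.0}  # hypothetical feature values
ranks = {"film": 1.75, "date": 0.5}                    # "weather" has no rank
weights = keyword_weights(["film", "date", "weather"], features, ranks)
print(weights)
```

Every keyword ends up with exactly one weight, matching the one-to-one correspondence described above.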
210: after the weights of all keywords in the simple sentence have been obtained, selecting the subject term of the simple sentence from these keywords according to the weights.
The subject term is selected from these keywords as follows:
The keywords are filtered by a preset weight threshold, and the keywords whose weight is higher than the threshold are taken as the subject terms of the simple sentence; or the keywords are sorted in descending order of weight and, starting from the keyword with the highest weight, a preset number of keywords are taken as the subject terms of the simple sentence. For example, with a preset weight threshold X, the keywords whose weight is greater than X are selected as subject terms, where the value of X can be set as required and is not specifically limited by the present invention; or, with a preset number N, all keywords are sorted in descending order of weight and the first N are selected as subject terms, where N is greater than or equal to 1 and less than or equal to the total number of keywords in the simple sentence, and its specific value is likewise not limited by the present invention.
In this embodiment, the selected subject terms may be all of the keywords or only some of them. There may be a single keyword in total, though there are usually several, and the selected part may consist of one keyword or several.
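The two selection rules of 210 can be sketched as below, with hypothetical weights. The threshold X and the number N are parameters the text leaves open; the values used here are arbitrary.

```python
def select_by_threshold(weights, threshold):
    """Return the keywords whose weight is strictly above the threshold."""
    return [t for t, w in weights.items() if w > threshold]


def select_top_n(weights, n):
    """Return the n keywords with the highest weights, highest first."""
    return sorted(weights, key=weights.get, reverse=True)[:n]


weights = {"film": 3.5, "date": 0.75, "weather": 3.0}  # hypothetical weights
print(select_by_threshold(weights, 1.0))
print(select_top_n(weights, 1))
```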
In said method, further, can also calculate the grade of current keyword in polynary combination according to the selection tendentiousness that keyword occurs, promptly 208 can also replace by following steps:
To each the polynary combination that extracts in 207, eigenwert addition with all speech in this polynary combination, the eigenwert of current keyword and the result of described addition are divided by, the selection tendentiousness that the result that will be divided by occurs in this polynary combination as described current keyword, the order of described selection tendentiousness and this polynary combination is divided by obtains intermediate value, to the intermediate value summation of all polynary combinations of obtaining, with the result of summation grade as current keyword.
Wherein, the selection tendentiousness of keyword is represented the significance level of this keyword in polynary combination, and high more this keyword that shows of its value is important more in polynary combination, and this parameter can embody the difference of other speech on semantic degree in this keyword and the polynary combination.Particularly, calculate the selection tendentiousness that current keyword occurs according to following formula in polynary combination earlier:
$$\mathrm{Pref}\bigl(t,(a_1,a_2,\ldots,a_i,\ldots,a_n)\bigr)=\frac{\mathrm{Feature}(t)}{\mathrm{Feature}(a_1)+\mathrm{Feature}(a_2)+\cdots+\mathrm{Feature}(a_i)+\cdots+\mathrm{Feature}(a_n)},\quad a_i=t \tag{5}$$
where Pref(t, (a_1, a_2, ..., a_i, ..., a_n)) is the selection tendency of the current keyword t when it appears at the i-th word position of the multi-element combination, i.e. the importance of t within a_1, a_2, ..., a_i, ..., a_n.
The grade of the current keyword is then calculated according to the following formula:
$$\mathrm{Rank}(t)=\sum_{i=1}^{n}\sum\frac{1}{\mathrm{Seq}_{n\text{-}gram}(a_1,a_2,\ldots,a_i,\ldots,a_n)}\,\mathrm{Pref}\bigl(t,(a_1,a_2,\ldots,a_i,\ldots,a_n)\bigr),\quad a_i=t \tag{6}$$
where the inner sum $\sum\frac{1}{\mathrm{Seq}_{n\text{-}gram}(a_1,\ldots,a_n)}\mathrm{Pref}(t,(a_1,\ldots,a_n))$ runs over all the multi-element combinations in which t appears at the i-th word position, adding up the selection tendency of the current keyword divided by the order of each such combination. Because the current keyword t may appear at any position of a multi-element combination, from 1 to n, the grades obtained for all the positions are summed; accordingly, the outer sum $\sum_{i=1}^{n}$ in formula (6) adds up the grades of t over all the positions of a_1, a_2, ..., a_i, ..., a_n to give the grade of the current keyword.
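As an illustration of formulas (5) and (6), the following sketch computes Pref and Rank(t) from two lookup tables. The names `pref`, `rank`, `feature` (mapping a word to Feature) and `seq` (mapping a combination to its Seq value) are hypothetical, and the data layout is an assumption made only for this example.

```python
from typing import Dict, Tuple

NGram = Tuple[str, ...]


def pref(t: str, ngram: NGram, feature: Dict[str, float]) -> float:
    """Formula (5): Feature(t) divided by the summed Feature of all words in the combination."""
    return feature[t] / sum(feature[w] for w in ngram)


def rank(t: str, seq: Dict[NGram, int], feature: Dict[str, float]) -> float:
    """Formula (6): sum Pref / Seq over every combination and position where t occurs."""
    total = 0.0
    for ngram, order in seq.items():
        for word in ngram:
            if word == t:  # one term for each position i with a_i = t
                total += pref(t, ngram, feature) / order
    return total
```

For instance, with Feature values 3, 1 and 2 for "apple", "pie" and "red", and Seq values 1 for ("apple", "pie") and 2 for ("red", "apple"), Rank("apple") is (3/4)/1 + (3/5)/2. If t occurs at several positions of the same combination, this sketch adds one term per position, which follows the sum over positions in formula (6).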
Specifically, if the multi-element combination is a binary combination (Trigger), formula (6) can be expressed as:
$$\mathrm{Rank}(t)=\sum_{i=1}^{m}\frac{1}{\mathrm{Seq}_{\mathrm{Trigger}}(t,t_i)}\,\mathrm{Pref}\bigl(t,(t,t_i)\bigr)+\sum_{j=1}^{n}\frac{1}{\mathrm{Seq}_{\mathrm{Trigger}}(t_j,t)}\,\mathrm{Pref}\bigl(t,(t_j,t)\bigr) \tag{7}$$
where i = 1, 2, ..., m, and m is the number of binary combinations in which the current keyword t appears at the first word position; j = 1, 2, ..., n, and n is the number of binary combinations in which t appears at the second word position. Pref(t, (t, t_i)) and Pref(t, (t_j, t)) are the selection tendencies of the current keyword t when it appears at the first and at the second word position of a binary combination, respectively, and are calculated as follows:
$$\mathrm{Pref}\bigl(t,(t,t_i)\bigr)=\frac{\mathrm{Feature}(t)}{\mathrm{Feature}(t)+\mathrm{Feature}(t_i)} \tag{8}$$
$$\mathrm{Pref}\bigl(t,(t_j,t)\bigr)=\frac{\mathrm{Feature}(t)}{\mathrm{Feature}(t_j)+\mathrm{Feature}(t)} \tag{9}$$
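A small worked example may help to read formulas (7) to (9). The numbers are purely hypothetical: suppose m = n = 1, the current keyword t has Feature(t) = 3, its partner words u and v have Feature(u) = 1 and Feature(v) = 2, and the two binary combinations have orders Seq_Trigger(t, u) = 1 and Seq_Trigger(v, t) = 2. Then:

$$\begin{aligned}
\mathrm{Pref}\bigl(t,(t,u)\bigr)&=\frac{3}{3+1}=0.75,\\
\mathrm{Pref}\bigl(t,(v,t)\bigr)&=\frac{3}{2+3}=0.6,\\
\mathrm{Rank}(t)&=\frac{0.75}{1}+\frac{0.6}{2}=1.05.
\end{aligned}$$

The combination with the smaller order (the more frequent one) contributes more to the grade, and within each combination the contribution grows with the share of t in the summed Feature values.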
Further, in the embodiments of the present invention, to prevent an excessively large grade value from distorting the weight calculation, the grade of each keyword may be converted after it is obtained and before the weight is calculated, as follows:
For a keyword that appears in the above multi-element combinations, the grade is mapped onto a preset interval and thereby converted into a multiple. The preset interval can be set and changed as required. For example, with the interval [1.1, 1.5], the smallest of the grades Rank(t) of all the keywords is converted to 1.1, the largest is converted to 1.5, and the remaining grades are converted proportionally. For a keyword that does not appear in the above multi-element combinations, the grade is set to a default value; for example, all such grades are set to 1.
Correspondingly, after the conversion, the weight of each keyword is calculated according to formula (4) in step 209, except that Rank(t) in formula (4) is replaced by the converted multiple.
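The conversion just described amounts to a linear rescaling of the grades onto the preset interval. A minimal sketch follows, assuming the example interval [1.1, 1.5] and the example default of 1; the function name `convert_grades`, its parameters, and the handling of the case where all grades are equal are assumptions, since the text does not cover that case.

```python
from typing import Dict, Iterable


def convert_grades(grades: Dict[str, float],
                   all_keywords: Iterable[str],
                   low: float = 1.1,
                   high: float = 1.5,
                   default: float = 1.0) -> Dict[str, float]:
    """Map Rank(t) linearly onto [low, high]; keywords without a grade get the default."""
    g_min = min(grades.values()) if grades else 0.0
    g_max = max(grades.values()) if grades else 0.0
    multiples: Dict[str, float] = {}
    for kw in all_keywords:
        if kw not in grades:
            # Keyword does not appear in any combination: use the default value.
            multiples[kw] = default
        elif g_max == g_min:
            # Assumption: when all grades are equal, map them to the lower bound.
            multiples[kw] = low
        else:
            # Smallest grade -> low, largest grade -> high, the rest proportionally.
            multiples[kw] = low + (grades[kw] - g_min) * (high - low) / (g_max - g_min)
    return multiples
```

With grades 0.2, 0.6 and 1.0, the converted multiples are 1.1, 1.3 and 1.5, and a keyword with no grade receives 1.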
Referring to Fig. 3, an embodiment of the present invention further provides a device for extracting subject terms from a simple sentence, comprising:
A statistical module 301, configured to count a plurality of keywords and a plurality of multi-element combinations in a corpus;
A feature value calculation module 302, configured to calculate the feature value of each of the plurality of keywords;
An order calculation module 303, configured to determine the order of each of the plurality of multi-element combinations according to the frequency with which the multi-element combinations occur in the corpus;
An extraction module 304, configured to take each keyword of the simple sentence in turn as the current keyword and to extract, from the multi-element combinations obtained by the order calculation module 303, those that contain the current keyword;
A grade calculation module 305, configured to calculate the grade of the current keyword according to the orders of the multi-element combinations extracted by the extraction module 304;
A weight calculation module 306, configured to calculate the weight of the current keyword according to the feature value of the current keyword obtained by the feature value calculation module 302 and the grade of the current keyword obtained by the grade calculation module 305;
A subject term selection module 307, configured to select, after the weights of all the keywords in the simple sentence have been obtained, the subject terms of the simple sentence from all the keywords according to the weights obtained.
In this embodiment, the corpus is collected in advance and may be either a static corpus or a dynamic corpus.
All of these keywords include keywords that appear in the above multi-element combinations and keywords that do not. For a keyword that appears in the multi-element combinations, the weight is obtained by calculation in the weight calculation module 306; for a keyword that does not, the weight calculation module 306 may set its weight to a default value. The default value can be set as required and is not specifically limited by the present invention.
Referring to Fig. 4, in this embodiment the feature value calculation module 302 may specifically comprise:
A feature value calculation unit 302a, configured to calculate the feature value of each of the plurality of keywords according to the chi-square statistic, information gain, entropy and document frequency of the keyword.
In this embodiment, the order calculation module 303 may specifically comprise:
An order calculation unit 303a, configured to calculate the frequency with which each of the plurality of multi-element combinations occurs in the corpus, to arrange the multi-element combinations in descending order of frequency, and to take the sequence number of each combination in this arrangement as its order. Further, the order calculation module 303 may also comprise:
A multi-element combination filtering unit 303b, configured to, after the order calculation unit 303a has arranged the multi-element combinations, either filter them against a preset frequency threshold to obtain the combinations whose frequency is higher than the threshold, or select a preset number of combinations from the arrangement, starting from the combination with the highest frequency.
The preset frequency threshold and the preset number can both be set and modified as required, and the present invention does not limit their specific values.
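The behaviour of units 303a and 303b can be sketched as below: count how often each combination occurs, arrange the combinations from the highest frequency to the lowest, use the 1-based position as the order, and optionally filter. The function name `assign_orders`, its parameters and the tie-breaking rule are assumptions for illustration; the text does not say how combinations with equal frequency are ordered.

```python
from collections import Counter
from typing import Dict, Iterable, Optional, Tuple

NGram = Tuple[str, ...]


def assign_orders(occurrences: Iterable[NGram],
                  min_freq: Optional[int] = None,
                  top_n: Optional[int] = None) -> Dict[NGram, int]:
    """Return {combination: order}, where order 1 is the most frequent combination."""
    freq = Counter(occurrences)
    # Arrange from high to low frequency.
    # Assumption: equal frequencies are ordered lexicographically.
    ranked = sorted(freq, key=lambda g: (-freq[g], g))
    if min_freq is not None:
        # Keep combinations whose frequency is higher than the preset threshold.
        ranked = [g for g in ranked if freq[g] > min_freq]
    elif top_n is not None:
        # Keep the preset number of combinations, starting from the most frequent.
        ranked = ranked[:top_n]
    # Both filters keep a prefix of the arrangement, so positions are unchanged.
    return {g: position for position, g in enumerate(ranked, start=1)}
```

Because either filter keeps only the front of the sorted arrangement, the surviving combinations retain the sequence numbers they had before filtering.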
In this embodiment, the grade calculation module 305 may specifically comprise:
A first grade calculation unit 305a, configured to take the reciprocal of the order of each multi-element combination extracted by the extraction module 304, to sum all the reciprocals obtained, and to take the sum as the grade of the current keyword.
Alternatively, the grade calculation module 305 specifically comprises:
A second grade calculation unit 305b, configured to, for each multi-element combination extracted by the extraction module 304, add up the feature values of all the words in the combination, divide the feature value of the current keyword by that sum to obtain the selection tendency of the current keyword in the combination, and divide the selection tendency by the order of the combination to obtain an intermediate value; and to sum the intermediate values of all the extracted combinations and take the sum as the grade of the current keyword.
In this embodiment, the weight calculation module 306 specifically comprises:
A weight calculation unit 306a, configured to multiply the feature value of the current keyword obtained by the feature value calculation module 302 by the grade of the current keyword obtained by the grade calculation module 305, and to take the product as the weight of the current keyword. Correspondingly, for a keyword that does not appear in the above multi-element combinations, the weight calculation unit 306a may set its weight to a default value.
In this embodiment, besides multiplying the feature value by the grade, other ways of calculating the weight may also be used; the present invention does not specifically limit this.
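For illustration, the first grade calculation (sum of reciprocal orders) and the multiplicative weight can be sketched together as follows. The names `first_grade`, `weight`, `feature` and `seq` are hypothetical, and the default weight of 0.0 is only a placeholder assumption, since the text leaves the default value open.

```python
from typing import Dict, Tuple

NGram = Tuple[str, ...]


def first_grade(t: str, seq: Dict[NGram, int]) -> float:
    """Sum the reciprocals of the orders of all combinations that contain t."""
    return sum(1.0 / order for ngram, order in seq.items() if t in ngram)


def weight(t: str, feature: Dict[str, float], seq: Dict[NGram, int],
           default_weight: float = 0.0) -> float:
    """Weight = Feature(t) * grade; a keyword in no combination gets the default."""
    if not any(t in ngram for ngram in seq):
        # Assumption: 0.0 is a placeholder; the default is configurable in the text.
        return default_weight
    return feature[t] * first_grade(t, seq)
```

With Feature("apple") = 3 and orders 1 and 2 for the two combinations that contain "apple", the grade is 1 + 1/2 and the weight is 3 times that value.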
The method and device provided by the embodiments of the present invention calculate the weight of each keyword from its feature value and grade and extract subject terms accordingly. They make full use of the text information of the simple sentence itself and improve the accuracy and efficiency of subject term extraction from a simple sentence. Compared with the existing dependency analysis method, they overcome its poor stability and low accuracy; compared with the existing TFIDF method, they are better suited to extracting subject terms from a simple sentence and overcome the low accuracy of that method on simple sentences.
All or part of the technical solutions provided by the embodiments of the present invention may be implemented by hardware controlled by program instructions. The program may be stored in a readable storage medium, which includes any medium capable of storing program code, such as ROM, RAM, a magnetic disk or an optical disc.
The above are only preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within its protection scope.

Claims (14)

1. A method for extracting subject terms from a simple sentence, characterized in that the method comprises:
Counting a plurality of keywords and a plurality of multi-element combinations in a corpus collected in advance;
Calculating the feature value of each of the plurality of keywords;
Determining the order of each multi-element combination among the plurality of multi-element combinations according to the frequency with which each of them occurs in the corpus;
Taking each keyword in the simple sentence in turn as the current keyword, extracting from the plurality of multi-element combinations those that contain the current keyword, calculating the grade of the current keyword according to the orders of the extracted multi-element combinations, and calculating the weight of the current keyword according to its feature value and grade;
After the weights of all the keywords in the simple sentence have been obtained, selecting the subject terms of the simple sentence from all the keywords of the simple sentence according to the weights obtained.
2. The method according to claim 1, characterized in that calculating the feature value of each of the plurality of keywords specifically comprises:
Calculating the feature value of each of the plurality of keywords according to the chi-square statistic, information gain, entropy and document frequency of the keyword.
3. The method according to claim 1, characterized in that determining the order of each of the plurality of multi-element combinations according to the frequency with which the multi-element combinations occur in the corpus specifically comprises:
Calculating the frequency with which each of the plurality of multi-element combinations occurs in the corpus;
Arranging the plurality of multi-element combinations in descending order of frequency, and taking the sequence number of each combination in this arrangement as its order.
4. The method according to claim 3, characterized in that, after the plurality of multi-element combinations are arranged in descending order of frequency, the method further comprises:
Filtering the plurality of multi-element combinations against a preset frequency threshold to obtain the multi-element combinations whose frequency is higher than the frequency threshold, or selecting a preset number of multi-element combinations from the arrangement, starting from the combination with the highest frequency;
Correspondingly, extracting from the plurality of multi-element combinations those that contain the current keyword is specifically:
Selecting, from the multi-element combinations obtained by the filtering, those that contain the current keyword.
5. The method according to claim 1, characterized in that calculating the grade of the current keyword according to the orders of the extracted multi-element combinations specifically comprises:
Taking the reciprocal of the order of each extracted multi-element combination, summing all the reciprocals obtained, and taking the sum as the grade of the current keyword.
6. The method according to claim 1, characterized in that calculating the grade of the current keyword according to the orders of the extracted multi-element combinations specifically comprises:
For each extracted multi-element combination, adding up the feature values of all the words in the combination, dividing the feature value of the current keyword by that sum to obtain the selection tendency of the current keyword in the combination, and dividing the selection tendency by the order of the combination to obtain an intermediate value;
Summing the intermediate values of all the extracted multi-element combinations, and taking the sum as the grade of the current keyword.
7. The method according to claim 1, characterized in that calculating the weight of the current keyword according to its feature value and grade specifically comprises:
Multiplying the feature value of the current keyword by its grade, and taking the product as the weight of the current keyword.
8. A device for extracting subject terms from a simple sentence, characterized in that the device comprises:
A statistical module, configured to count a plurality of keywords and a plurality of multi-element combinations in a corpus collected in advance;
A feature value calculation module, configured to calculate the feature value of each of the plurality of keywords;
An order calculation module, configured to determine the order of each multi-element combination among the plurality of multi-element combinations according to the frequency with which each of them occurs in the corpus;
An extraction module, configured to take each keyword of the simple sentence in turn as the current keyword and to extract, from the multi-element combinations obtained by the order calculation module, those that contain the current keyword;
A grade calculation module, configured to calculate the grade of the current keyword according to the orders of the multi-element combinations extracted by the extraction module;
A weight calculation module, configured to calculate the weight of the current keyword according to the feature value of the current keyword obtained by the feature value calculation module and the grade of the current keyword obtained by the grade calculation module;
A subject term selection module, configured to select, after the weights of all the keywords in the simple sentence have been obtained, the subject terms of the simple sentence from all the keywords of the simple sentence according to the weights obtained.
9. The device according to claim 8, characterized in that the feature value calculation module specifically comprises:
A feature value calculation unit, configured to calculate the feature value of each of the plurality of keywords according to the chi-square statistic, information gain, entropy and document frequency of the keyword.
10. The device according to claim 8, characterized in that the order calculation module specifically comprises:
An order calculation unit, configured to calculate the frequency with which each of the plurality of multi-element combinations occurs in the corpus, to arrange the plurality of multi-element combinations in descending order of frequency, and to take the sequence number of each combination in this arrangement as its order.
11. The device according to claim 10, characterized in that the order calculation module further comprises:
A multi-element combination filtering unit, configured to, after the order calculation unit has arranged the plurality of multi-element combinations, either filter them against a preset frequency threshold to obtain the multi-element combinations whose frequency is higher than the frequency threshold, or select a preset number of multi-element combinations from the arrangement, starting from the combination with the highest frequency.
12. The device according to claim 8, characterized in that the grade calculation module specifically comprises:
A first grade calculation unit, configured to take the reciprocal of the order of each multi-element combination extracted by the extraction module, to sum all the reciprocals obtained, and to take the sum as the grade of the current keyword.
13. The device according to claim 8, characterized in that the grade calculation module specifically comprises:
A second grade calculation unit, configured to, for each multi-element combination extracted by the extraction module, add up the feature values of all the words in the combination, divide the feature value of the current keyword by that sum to obtain the selection tendency of the current keyword in the combination, and divide the selection tendency by the order of the combination to obtain an intermediate value; and to sum the intermediate values of all the extracted multi-element combinations and take the sum as the grade of the current keyword.
14. The device according to claim 8, characterized in that the weight calculation module specifically comprises:
A weight calculation unit, configured to multiply the feature value of the current keyword obtained by the feature value calculation module by the grade of the current keyword obtained by the grade calculation module, and to take the product as the weight of the current keyword.
CN200910209406.XA 2009-10-27 2009-10-27 Method and device for extracting subject term from simple sentence Active CN102053978B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN200910209406.XA CN102053978B (en) 2009-10-27 2009-10-27 Method and device for extracting subject term from simple sentence

Publications (2)

Publication Number Publication Date
CN102053978A true CN102053978A (en) 2011-05-11
CN102053978B CN102053978B (en) 2014-04-30

Family

ID=43958315

Family Applications (1)

Application Number Title Priority Date Filing Date
CN200910209406.XA Active CN102053978B (en) 2009-10-27 2009-10-27 Method and device for extracting subject term from simple sentence

Country Status (1)

Country Link
CN (1) CN102053978B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1211769A (en) * 1997-06-26 1999-03-24 香港中文大学 Method and equipment for file retrieval based on Bayesian network
CN1560762A (en) * 2004-02-26 2005-01-05 上海交通大学 Subject extract method based on word simultaneous occurences frequency
US20090259655A1 (en) * 2008-04-10 2009-10-15 Kabushiki Kaisha Toshiba Data creating apparatus and data creating method
CN101464898A (en) * 2009-01-12 2009-06-24 腾讯科技(深圳)有限公司 Method for extracting feature word of text

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
张云涛等: "基于综合方法的文本主题句的自动抽取", 《上海交通大学学报》 *
张其文等: "文本主题的自动提取方法研究与实现", 《计算机工程与设计》 *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105138537A (en) * 2015-07-08 2015-12-09 上海大学 Self-information based discovery method for co-occurrent topic in interdisciplinary field
CN105138537B (en) * 2015-07-08 2018-12-07 上海大学 Interdisciplinary fields co-occurrence motif discovery method based on self-information
CN105335358A (en) * 2015-11-18 2016-02-17 成都优译信息技术有限公司 Method for grade evaluation of linguistic data used in translation system
CN106484768A (en) * 2016-09-09 2017-03-08 天津海量信息技术股份有限公司 The local feature abstracting method of content of text salient region and system
CN106484768B (en) * 2016-09-09 2019-12-31 天津海量信息技术股份有限公司 Local feature extraction method and system for text content saliency region
CN108021545A (en) * 2016-11-03 2018-05-11 北京国双科技有限公司 A kind of case of administration of justice document is by extracting method and device
CN108021545B (en) * 2016-11-03 2021-08-10 北京国双科技有限公司 Case course extraction method and device for judicial writing
CN106504140A (en) * 2016-11-17 2017-03-15 中知厚德知识产权投资管理(天津)有限公司 The intellectual property data system of various dimensions technology correlation evaluation
CN106779580A (en) * 2016-11-17 2017-05-31 中知厚德知识产权投资管理(天津)有限公司 Multi-level intellectual property data system
CN107577814A (en) * 2017-09-29 2018-01-12 崔昊洋 A kind of registered user's management method and system based on ranking
CN110019677A (en) * 2017-11-30 2019-07-16 南京大学 Microblogging advertisement publishers recognition methods and device based on clustering

Also Published As

Publication number Publication date
CN102053978B (en) 2014-04-30

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
ASS Succession or assignment of patent right

Owner name: SHENZHEN SHIJI LIGHT SPEED INFORMATION TECHNOLOGY

Free format text: FORMER OWNER: TENGXUN SCI-TECH (SHENZHEN) CO., LTD.

Effective date: 20131104

C41 Transfer of patent application or patent right or utility model
COR Change of bibliographic data

Free format text: CORRECT: ADDRESS; FROM: 518000 SHENZHEN, GUANGDONG PROVINCE TO: 518057 SHENZHEN, GUANGDONG PROVINCE

TA01 Transfer of patent application right

Effective date of registration: 20131104

Address after: A Tencent Building in Shenzhen Nanshan District City, Guangdong streets in Guangdong province science and technology 518057 16

Applicant after: Shenzhen Shiji Guangsu Information Technology Co., Ltd.

Address before: 518000 Guangdong city of Shenzhen province Futian District SEG Science Park 2 East Room 403

Applicant before: Tencent Technology (Shenzhen) Co., Ltd.

C14 Grant of patent or utility model
GR01 Patent grant