CN102081627A

CN102081627A - Method and system for determining contribution degree of word in text

Info

Publication number: CN102081627A
Application number: CN2009102412861A
Authority: CN
Inventors: 张宇峰; 于亮; 王海洲
Original assignee: Beijing Kingsoft Software Co Ltd; Beijing Jinshan Digital Entertainment Technology Co Ltd
Current assignee: Beijing Kingsoft Office Software Inc
Priority date: 2009-11-27
Filing date: 2009-11-27
Publication date: 2011-06-01
Anticipated expiration: 2029-11-27
Also published as: CN102081627B

Abstract

Embodiments of the invention disclose a method and a system for determining the contribution degree of a word in a text. The method comprises: obtaining a first text, and selecting at least one word from the first text; dividing the first text into at least one text fragment; counting the positions at which the word occurs in the text fragments of the first text and the number of occurrences; and calculating the contribution degree of the word to the first text according to the counted parameters, wherein the parameters comprise the positions at which the word occurs in the text fragments of the first text and the number of occurrences. In the method provided by the embodiments, the contribution degree of the word to the first text is calculated from the positions and number of occurrences of the word in the text fragments of the first text, together with the length of the word. Compared with the existing term frequency/inverse document frequency (TF/IDF) approach, the method can reflect the contribution degree of the word more truly.

Description

Method and system for determining the contribution degree of a word in a text

Technical field

The present invention relates to the field of information recognition, and in particular to a method and system for determining the contribution degree of a word in a text.

Background art

The development of the Internet and the progress of information technology have brought a surge in the amount of information, making it difficult for people to find what they really need in a vast body of information. Although the emergence of search engines has alleviated this problem to some extent, the results a search engine returns are generally very numerous, which does not help the user find the required information. One solution to this problem is automatic text classification. One of the main characteristics and difficulties of automatic text classification is the high dimensionality of the feature space and the sparsity of the document representation vectors. Finding an effective method for calculating word contribution degree, so as to reduce the dimensionality of the feature space and improve the efficiency and precision of classification, has become a primary issue in automatic text classification.

Through study of the prior art, the inventors found that the prior art mainly uses the TF/IDF formula to calculate the contribution degree of a word to a text. The TF/IDF formula assumes that feature words are mutually orthogonal, and that the contribution degree of a feature word to a text is independent of where the word occurs in the text. Existing contribution degree calculation methods therefore consider too few factors and cannot truly reflect the contribution degree of a word to a text.

Summary of the invention

In view of this, embodiments of the present invention provide a method and system for determining the contribution degree of a word in a text, so that the calculated contribution degree of a word to a text is more faithful.

To achieve the above object, embodiments of the present invention provide the following technical solutions:

A method for determining the contribution degree of a word in a text, comprising:

obtaining a first text, and selecting at least one word from the first text;

dividing the first text into at least one text fragment;

counting the positions at which the word occurs in the text fragments of the first text and the number of occurrences;

calculating the contribution degree of the word to the first text according to the counted parameters, the parameters comprising the positions at which the word occurs in the text fragments of the first text and the number of occurrences.

Optionally, the method further comprises: determining the length of the word; the parameters then also comprise the length of the word.

Calculating the contribution degree of the word to the first text according to the counted parameters comprises:

calculating the contribution degree of the word according to the following formula:

$$\mathrm{Weight}(CW_i) = \frac{1}{m} \sum_{q=1}^{m} w_{iq} \cdot f_{iq}$$

where $CW_i$ is the i-th word of the at least one word; $\mathrm{Weight}(CW_i)$ is the contribution degree of the i-th word $CW_i$; $m$ is the number of text fragments in the first text; $q$ is the position of a text fragment in the text; $w_{iq}$ is the contribution degree of the i-th word when it is located in the text fragment at position $q$; and $f_{iq}$ is the number of occurrences of the i-th word in the text fragment at position $q$.

Alternatively, the contribution degree of the word to the first text is calculated according to the following formula:

$$\mathrm{Weight}(CW_i) = \frac{1}{m} \sum_{q=1}^{m} w_{iq} \cdot f_{iq} \cdot \log_{a}(l_i + \alpha)$$

where $CW_i$ is the i-th word of the at least one word; $\mathrm{Weight}(CW_i)$ is the contribution degree of the i-th word $CW_i$; $m$ is the number of text fragments in the first text; $q$ is the position of a text fragment in the text; $w_{iq}$ is the contribution degree of the i-th word when it is located in the text fragment at position $q$; $f_{iq}$ is the number of occurrences of the i-th word in the text fragment at position $q$; $l_i$ is the word length of the i-th word; $\alpha$ is an adjustment constant; and $a$ is an adjustable base.

The adjustable base $a$ is taken to be the natural constant $e$.

When the positions of the text fragments in the text are divided into a first area and a second area, $w_{iq}$ is calculated by the following formula:

$$w_{iq} = \begin{cases} \ln\left(\dfrac{m - q}{m} + \beta\right), & q = y,\ y \in \{1, \dots, m\} \\ \lambda^{\sqrt{f_{iq}}}, & q = -1 \end{cases}$$

where $q = -1$ when the i-th word is located in the first area of the first text; otherwise $q = y$, $y \in \{1, \dots, m\}$, with $y$ being the position in the first text of the text fragment containing the i-th word; $\lambda$ is a weighting constant; and $\beta$ is an adjustment constant.

A field word ranking method, comprising:

selecting N texts in field A as the first text;

selecting, from the N texts, M candidate field words as the at least one word, and calculating the contribution degree of the M candidate field words to field A according to the method of any one of claims 1 to 6;

ranking the M candidate field words according to their contribution degrees.

Optionally, the method further comprises:

determining the top L candidate field words by contribution degree among the M candidate field words as the field words of field A, where L is not greater than M.

A text evaluation method, comprising:

obtaining a text to be evaluated in field A and Y self-defined keywords of the text;

performing a matching search on the text to be evaluated using the field words of field A, and determining the X field words of field A contained in the text to be evaluated, wherein the field words of field A are determined according to the method of claim 8;

if X = 0, directly giving a result that the text to be evaluated is unqualified;

if X is less than or equal to Y, that is, the X field words are included in the Y keywords, directly giving an evaluation result that the text to be evaluated is qualified;

otherwise, taking the text to be evaluated as the first text and the X field words as the at least one word, and calculating the contribution degree of the X field words to the text to be evaluated according to the method of any one of claims 1 to 6;

ranking the X field words according to the contribution degree of each field word to the text to be evaluated, determining evaluation parameters from the X field words, comparing the evaluation parameters with the Y keywords, and giving an evaluation result for the text to be evaluated according to the comparison result.

Determining evaluation parameters from the X field words, comparing the evaluation parameters with the Y keywords, and giving an evaluation result for the text to be evaluated according to the comparison result comprises:

selecting the top Z field words among the X field words as the evaluation parameters;

calculating the conformity between the Y keywords and the evaluation parameters; when the conformity reaches a preset threshold, giving an evaluation result that the text to be evaluated is qualified; otherwise giving an evaluation result that the text to be evaluated is unqualified.

A text classification method, comprising:

selecting N texts in field A, and determining, according to the method of claim 8, M field words of field A and their contribution degrees in field A;

forming a standard text vector of field A from the contribution degrees of the M field words of field A;

calculating, according to the method of any one of claims 1 to 6, the contribution degree of the M field words in each of the N texts of field A to the text in which they occur, and forming N contrast text vectors;

calculating the similarity between each contrast text vector and the standard text vector, and determining a text classification similarity threshold from the N calculated similarities;

when a text to be classified is classified, the method comprises:

calculating, according to the method of any one of claims 1 to 6, the contribution degree of the M field words in the text to be classified, and forming a judgment text vector for the text to be classified;

calculating the similarity between the judgment text vector and the standard text vector;

comparing the similarity with the text classification similarity threshold, and determining according to the comparison result whether the text to be classified belongs to field A.

Determining according to the comparison result whether the text to be classified belongs to field A comprises:

when the similarity is not greater than the similarity threshold, determining that the text to be classified belongs to field A; otherwise, the text to be classified does not belong to field A.

A method for automatically generating a text summary, comprising:

obtaining a text T to be processed, and performing word segmentation on the text T;

taking the text T as the first text and the words obtained by segmentation as the at least one word, and calculating the contribution degree of each word in the text T to the text T according to the method of any one of claims 1 to 6;

ranking the words in the text T by contribution degree, and selecting the top M words as summary keywords;

determining summary candidate sentences according to the summary keywords;

organizing the summary candidate sentences into a summary.

Determining summary candidate sentences according to the summary keywords comprises:

determining the sentence in which each summary keyword is located;

when a summary keyword is located in multiple sentences, selecting the sentence containing the most keywords as the summary candidate sentence.
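The candidate sentence selection described above can be sketched as follows. This is only an illustrative sketch: the function name `candidate_sentences`, the representation of sentences as token lists, the tie-breaking rule (earliest sentence wins) and the output order (original sentence order) are assumptions not stated in the text.

```python
def candidate_sentences(sentences, keywords):
    """Return indices of summary candidate sentences.

    sentences: list of token lists (one list per sentence); assumed input format.
    keywords:  summary keywords already chosen by contribution degree.
    """
    keyword_set = set(keywords)
    chosen = []
    for kw in keywords:
        # Sentences in which this summary keyword is located.
        holders = [i for i, s in enumerate(sentences) if kw in s]
        if not holders:
            continue
        # Among several sentences, keep the one containing the most keywords.
        # max() returns the earliest index on ties (an assumed tie-break).
        best = max(holders, key=lambda i: len(keyword_set & set(sentences[i])))
        if best not in chosen:
            chosen.append(best)
    # Assumed organization: keep candidates in their original order.
    return sorted(chosen)


sentences = [["a", "b", "x"], ["a", "y"], ["c", "z"]]
picked = candidate_sentences(sentences, ["a", "b", "c"])
```

Here the keyword "a" occurs in two sentences, and the first is preferred because it also contains "b".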

A text evaluation system, comprising:

an acquiring unit, configured to obtain a text to be evaluated in field A and Y self-defined keywords of the text;

a field word determining unit, configured to perform a matching search on the text to be evaluated using the field words of field A, and to determine the X field words of field A contained in the text to be evaluated, wherein the field words of field A are determined according to the method of claim 8;

a preview unit, configured to directly give a result that the text to be evaluated is unqualified when X = 0, and, when X is less than or equal to Y, that is, the X field words are included in the Y keywords, to directly give an evaluation result that the text to be evaluated is qualified;

an evaluation unit, configured to, when X does not meet the conditions of the preview unit, take the text to be evaluated as the first text and the X field words as the at least one word, and calculate the contribution degree of the X field words to the text to be evaluated according to the method of any one of claims 1 to 6; rank the X field words according to the contribution degree of each field word to the text to be evaluated, determine evaluation parameters from the X field words, compare the evaluation parameters with the Y keywords, and give an evaluation result for the text to be evaluated according to the comparison result.

The evaluation unit comprises:

an evaluation parameter calculation subunit, configured to take the text to be evaluated as the first text and the X field words as the at least one word, and calculate the contribution degree of the X field words to the text to be evaluated according to the method of any one of claims 1 to 6; and to rank the X field words according to the contribution degree of each field word to the text to be evaluated and determine the evaluation parameters from the X field words;

a comparison subunit, configured to calculate the conformity between the Y keywords and the evaluation parameters; when the conformity reaches a preset threshold, to give an evaluation result that the text to be evaluated is qualified; otherwise to give an evaluation result that the text to be evaluated is unqualified.

A text classification system, comprising:

a field parameter determining unit, configured to select N texts in field A, and to determine, according to the method of claim 8, M field words of field A and their contribution degrees in field A;

a standard text vector generation unit, configured to form a standard text vector of field A from the contribution degrees of the M field words of field A;

a contrast text vector generation unit, configured to calculate, according to the method of any one of claims 1 to 6, the contribution degree of the M field words in each of the N texts of field A to the text in which they occur, and to form N contrast text vectors;

a similarity threshold determining unit, configured to calculate the similarity between each contrast text vector and the standard text vector, and to determine a text classification similarity threshold from the N calculated similarities;

a judgment vector calculation unit, configured to calculate, according to the method of any one of claims 1 to 6, the contribution degree of the M field words in the text to be classified, and to form a judgment text vector for the text to be classified;

a similarity calculation unit, configured to calculate the similarity between the judgment text vector and the standard text vector;

a classification unit, configured to compare the similarity with the text classification similarity threshold, and to determine according to the comparison result whether the text to be classified belongs to field A.

The classification unit is configured such that: when the similarity is not greater than the similarity threshold, the text to be classified is determined to belong to field A; otherwise, the text to be classified does not belong to field A.

A system for automatically generating a text summary, comprising:

a text acquiring unit, configured to obtain a text T to be processed and to perform word segmentation on the text T;

a contribution degree calculation unit, configured to take the text T as the first text and the words obtained by segmentation as the at least one word, and to calculate the contribution degree of each word in the text T to the text T according to the method of any one of claims 1 to 6;

a keyword determining unit, configured to rank the words in the text T by contribution degree and to select the top M words as summary keywords;

a candidate sentence determining unit, configured to determine summary candidate sentences according to the summary keywords;

a summary forming unit, configured to organize the summary candidate sentences into a summary.

The candidate sentence determining unit comprises:

a locating subunit, configured to determine the sentence in which each summary keyword is located;

a determining subunit, configured to, when a summary keyword is located in multiple sentences, select the sentence containing the most keywords as the summary candidate sentence.

As can be seen, in the embodiments of the present invention, a first text is obtained and at least one word is selected from the first text; the first text is divided into at least one text fragment; the positions at which the word occurs in the text fragments of the first text and the number of occurrences are counted; and the contribution degree of the word to the first text is calculated according to the counted parameters, the parameters comprising the positions at which the word occurs in the text fragments of the first text and the number of occurrences. By combining the positions and number of occurrences of a word in the text fragments of the first text with the length of the word itself, the method provided by the embodiments calculates the contribution degree of the word to the first text; compared with the existing TF/IDF, the method can reflect the contribution degree of the word more truly.

Brief description of the drawings

To illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings used in describing the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a flowchart of the method provided by an embodiment of the present invention;

Fig. 2 is a flowchart of the method provided by another embodiment of the present invention;

Fig. 3 is a flowchart of one application of the method provided by an embodiment of the present invention;

Fig. 4 is a flowchart of another application of the method provided by an embodiment of the present invention;

Fig. 5 is a flowchart of a further application of the method provided by an embodiment of the present invention;

Fig. 6 is a flowchart of yet another application of the method provided by an embodiment of the present invention;

Fig. 7 is a structural diagram of a system provided by an embodiment of the present invention;

Fig. 8 is a structural diagram of another system provided by an embodiment of the present invention;

Fig. 9 is a structural diagram of yet another system provided by an embodiment of the present invention.

Detailed description of the embodiments

Fig. 1 shows a method for determining the contribution degree of a word in a text provided by an embodiment of the present invention, comprising:

S101: obtain a first text, and select at least one word from the first text.

The embodiments of the present invention determine the contribution degree of a word in a text. For ease of description, the text information is referred to as the first text. The first text may be any single article, a collection of written material made up of several articles, or even a text library. The first text may also take various forms, for example a web page or a paper; the present invention does not limit this.

The word may be any word in the first text, without restriction.

It should be noted that, to improve processing speed and efficiency, the first text may be preprocessed after it is obtained, including word segmentation, stop-word removal, punctuation removal and similar operations.

S102: divide the first text into at least one text fragment.

A text fragment is a subdivision of the first text. When the first text is an article, a text fragment may correspond to one paragraph of the article. When the first text is a paper, a text fragment may be the abstract of the paper or a paragraph of its body; when the first text is a web page, a text fragment may be the content corresponding to each link in the page, and so on. The concrete form of a text fragment depends on the concrete form of the first text; the present invention does not limit this.

S103: count the positions at which the word occurs in the text fragments of the first text and the number of occurrences.

S104: calculate the contribution degree of the word to the first text according to the counted parameters, the parameters comprising the positions at which the word occurs in the text fragments of the first text and the number of occurrences.
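Steps S102 and S103 can be illustrated with a minimal sketch. It assumes the first text has already been preprocessed and split into fragments, each given as a list of tokens; the function name `count_by_segment` and the sample data are hypothetical.

```python
from collections import Counter


def count_by_segment(segments, words):
    """Return {word: [occurrences in fragment 1, fragment 2, ...]}.

    segments: list of token lists, one per text fragment (assumed format).
    words:    the words whose contribution degree is to be determined.
    """
    counts = {w: [0] * len(segments) for w in words}
    for q, tokens in enumerate(segments):
        seen = Counter(tokens)
        for w in words:
            counts[w][q] = seen[w]  # Counter yields 0 for absent words
    return counts


# Hypothetical first text, already tokenised and split into three fragments.
segments = [
    ["text", "classification", "uses", "word", "weights"],
    ["word", "weights", "depend", "on", "word", "position"],
    ["position", "matters"],
]
counts = count_by_segment(segments, ["word", "position", "classification"])
```

The list index plays the role of the fragment position q, and each entry is the occurrence count of the word in that fragment.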

In one embodiment of the invention, when the position that in the text fragments of described first text, occurs for this word of parameter of statistics and occurrence number, can be according to the contribution degree of the described field of formula 1 calculating speech keyword:

Weight ({CW}_{i}) = \frac{1}{m} Σ_{q = 1}^{m} w_{iq} * f_{iq}

Formula 1

Alternatively, in another embodiment of the present invention, when adding up the parameter of this word, also comprise the length of adding up described word, the parameter of this word also comprises the length of described word so.

When the parameter of statistics comprise position that this word occurs, occurrence number and this word length in the text fragments of described first text time, calculate the contribution degree of described word according to formula 2 to described first text:

Weight ({CW}_{i}) = \frac{1}{m} Σ_{q = 1}^{m} w_{iq} * f_{iq} * \log_{a} (l_{i} + α)

Formula 2

Wherein, adjustable truth of a matter a can adjust setting according to actual conditions, and preferably, it is natural constant e that a can be set.
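Formula 1 and Formula 2 can be sketched in a few lines. The sketch assumes the position weights and occurrence counts of one word are supplied as equal-length lists, one entry per text fragment; the function name `contribution` and the default value of the adjustment constant `alpha` are hypothetical, since the text does not fix them.

```python
import math


def contribution(w, f, word_length=None, alpha=1.0, base=math.e):
    """Contribution degree of one word to the first text.

    w: position weights w_iq, one per text fragment.
    f: occurrence counts f_iq, one per text fragment.
    word_length: l_i; if None, Formula 1 is used, otherwise Formula 2.
    alpha, base: adjustment constant and log base (default values are assumptions).
    """
    m = len(f)
    score = sum(wq * fq for wq, fq in zip(w, f)) / m  # Formula 1
    if word_length is not None:
        score *= math.log(word_length + alpha, base)  # extra factor of Formula 2
    return score


weights = [1.0, 0.5, 0.25]  # hypothetical w_iq values
freqs = [1, 2, 0]           # hypothetical f_iq values
plain = contribution(weights, freqs)
with_length = contribution(weights, freqs, word_length=4)
```

With these inputs Formula 1 gives (1.0·1 + 0.5·2 + 0.25·0)/3 = 2/3, and Formula 2 multiplies that value by ln(4 + 1).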

In Formula 1 and Formula 2, the parameter $w_{iq}$ represents the contribution degree of the word as influenced by position. In an embodiment of the present invention, when the positions of the text fragments in the text are divided into a first area and a second area, $w_{iq}$ is calculated by Formula 3:

$$w_{iq} = \begin{cases} \ln\left(\dfrac{m - q}{m} + \beta\right), & q = y,\ y \in \{1, \dots, m\} \\ \lambda^{\sqrt{f_{iq}}}, & q = -1 \end{cases}$$

Formula 3

As this example shows, in the embodiments of the present invention the position-determined contribution degree of a word takes different values depending on the area of the first text in which its text fragment lies. Here the first text is divided into two areas, a first area and a second area; for a paper, for example, the abstract may serve as the first area and the remaining parts as the second area. In other embodiments, the text fragments may of course be divided into more areas as word processing requires; the present invention does not limit this either.
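The two branches of Formula 3 can be sketched as follows. In this sketch a boolean flag stands in for the q = -1 marker of the first area, which is an implementation assumption; the function name `position_weight` and the default values of `beta` and `lam` are likewise hypothetical, since the text gives no concrete constants.

```python
import math


def position_weight(q, m, f_iq, in_first_area, beta=math.e, lam=2.0):
    """Position weight w_iq of a word for one text fragment.

    q: position of the fragment (1..m), used only in the second area.
    m: number of text fragments in the first text.
    f_iq: occurrence count of the word in the fragment.
    in_first_area: True when the fragment lies in the first area (q = -1 in Formula 3).
    beta, lam: adjustment and weighting constants (default values are assumptions).
    """
    if in_first_area:
        return lam ** math.sqrt(f_iq)        # first-area branch
    return math.log((m - q) / m + beta)      # second-area branch


early = position_weight(q=1, m=4, f_iq=1, in_first_area=False)
late = position_weight(q=3, m=4, f_iq=1, in_first_area=False)
abstract = position_weight(q=1, m=4, f_iq=4, in_first_area=True)
```

In the second area the weight decreases as q grows, so earlier fragments count for more; note that beta must be positive for the logarithm to be defined when q = m.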

By combining the positions and number of occurrences of a word in the text fragments of the first text with the length of the word itself, the method provided by the embodiments calculates the contribution degree of the word to the first text; compared with the existing TF/IDF, the method can reflect the contribution degree of the word more truly.

Referring to Fig. 2, an embodiment of the present invention also provides a field word ranking method, comprising:

S201: select N texts in field A as the first text.

Here, for ease of expression, field A denotes any field, such as the computer field or the medical field. Note that the N texts must be texts in field A; that is, when field A is the computer field, the N texts are N texts of the computer field.

S202: select M candidate field words from the N texts as the at least one word, and calculate the contribution degree of the M candidate field words to field A.

Specifically, the contribution degree of the M candidate field words to field A may be calculated according to the method of the embodiment corresponding to Fig. 1; details are not repeated here.

S203: rank the M candidate field words according to their contribution degrees.

Further, the method of Fig. 2 also comprises:

S204: determine the top L candidate field words by contribution degree among the M candidate field words as the field words of field A, where L is not greater than M.
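Steps S203 and S204 amount to a sort followed by a cut-off. The sketch below assumes the contribution degrees of the M candidate field words have already been computed and are passed in as a dictionary; the function name `top_field_words` and the sample values are hypothetical.

```python
def top_field_words(contributions, L):
    """Rank candidate field words by contribution degree and keep the top L.

    contributions: {candidate field word: contribution degree to field A}.
    L: number of field words to keep; must not exceed M = len(contributions).
    """
    if L > len(contributions):
        raise ValueError("L must not be greater than M")
    ranked = sorted(contributions.items(), key=lambda kv: kv[1], reverse=True)
    return [word for word, _ in ranked[:L]]


# Hypothetical contribution degrees for M = 4 candidate field words.
candidates = {"kernel": 0.9, "cpu": 0.8, "weather": 0.1, "thread": 0.7}
field_words = top_field_words(candidates, L=3)
```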

The method provided by this embodiment determines the contribution degree of each candidate field word to its field according to the positions and number of occurrences of the word in the text, and optionally the length of the word itself, and then orders the candidate field words of the field by contribution degree, which makes it much easier to understand a field through its candidate field words.

In addition, from the ranking of the candidate field words by contribution degree in their field, the field value of the field can be further determined, which greatly facilitates subsequent study of the field.

The embodiments of the present invention also provide several applications based on the contribution degree of a word to a text and on the contribution degrees of the words a text contains. Each application is described in detail below with reference to the drawings.

Referring to Fig. 3, one application provided by the embodiments of the present invention is a text evaluation method, comprising:

S301: obtain a text to be evaluated in field A and Y self-defined keywords of the text.

S302: perform a matching search on the text to be evaluated using the field words of field A, and determine the X field words of field A contained in the text to be evaluated.

The field words of field A may be determined according to the method of the embodiment corresponding to Fig. 2: the M candidate field words of field A are ranked in advance, and the top L candidate field words are then selected as the field words of field A. For the detailed process, refer to Fig. 2; it is not repeated here.

S303: if X = 0, directly give a result that the text to be evaluated is unqualified.

S304: if X ≤ Y, that is, the X field words are included in the Y keywords, directly give an evaluation result that the text to be evaluated is qualified; otherwise go to step S305.

For example, if the text to be evaluated belongs to the computer field and has 5 keywords, and the matching search finds X = 3 with all three words contained in these 5 keywords, the result that the text to be evaluated is qualified is given directly.

S305: take the text to be evaluated as the first text and the X field words as the at least one word, and calculate the contribution degree of the X field words to the text to be evaluated.

Specifically, the contribution degree of the X field words to the text to be evaluated may be calculated according to the method of the embodiment corresponding to Fig. 1.

Continuing the example above, if S305 is executed, the matching search has found X > 5, for example X = 20: the text to be evaluated itself proposes 5 keywords, and the matching search finds that it contains 20 field keywords of the computer field. The contribution degrees of these 20 field keywords to the text to be evaluated are then calculated using the word contribution degree calculation method provided by the embodiments of the present invention.

S306: rank the X field words according to the contribution degree of each field word to the text to be evaluated, determine evaluation parameters from the X field words, compare the evaluation parameters with the Y keywords, and give an evaluation result for the text to be evaluated according to the comparison result.

Specifically, determining evaluation parameters from the X field words, comparing the evaluation parameters with the Y keywords, and giving an evaluation result for the text to be evaluated according to the comparison result comprises:

S1: select the top Z field words among the X field words as the evaluation parameters.

S2: calculate the conformity between the Y keywords and the evaluation parameters; when the conformity reaches a preset threshold, give an evaluation result that the text to be evaluated is qualified; otherwise give an evaluation result that the text to be evaluated is unqualified.

The conformity between the Y keywords and the evaluation parameters is determined by comparing how many of the keywords are contained in the Z evaluation parameters. Continuing the example, if Z is 10, that is, there are 10 evaluation parameters, and comparison finds that 4 of the 5 keywords appear among the evaluation parameters, the conformity is 4/5 = 80%. If the preset conformity threshold is 60%, the text to be evaluated is qualified and a qualified result can be given.
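Steps S303 to S306 can be sketched as a single decision function. The sketch assumes the field words found in the text and their contribution degrees are given as a dictionary; the function name `evaluate_text` is hypothetical, and the S304 condition is implemented as a subset test on the assumption that "the X field words are included in the Y keywords" is the operative condition.

```python
def evaluate_text(keywords, field_word_scores, z, threshold):
    """Return True if the text to be evaluated is judged qualified.

    keywords: the Y self-defined keywords of the text.
    field_word_scores: {field word found in the text: contribution degree}.
    z: number of top-ranked field words used as evaluation parameters.
    threshold: preset conformity threshold, e.g. 0.6.
    """
    keyword_set = set(keywords)
    found = set(field_word_scores)
    if not found:                 # S303: X = 0
        return False
    if found <= keyword_set:      # S304: all X field words are among the Y keywords
        return True
    # S305/S306: rank by contribution degree, keep the top Z as evaluation parameters.
    ranked = sorted(field_word_scores, key=field_word_scores.get, reverse=True)
    parameters = set(ranked[:z])
    conformity = len(parameters & keyword_set) / len(keyword_set)
    return conformity >= threshold


# Mirrors the worked example: Y = 5, X = 20, Z = 10, 4 keywords among the top 10.
scores = {f"w{i}": 20 - i for i in range(20)}  # hypothetical contribution degrees
keywords = ["w0", "w1", "w2", "w3", "other"]
qualified = evaluate_text(keywords, scores, z=10, threshold=0.6)
```

With these inputs the conformity is 4/5 = 0.8, which reaches the 60% threshold.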

Existing text evaluation is generally carried out manually: a reviewer must read through a text at least once before being able to evaluate it, so the workload is heavy and the efficiency is low. The method provided by this embodiment determines the field words occurring in the text to be evaluated according to the field to which the text belongs, and determines the contribution degree of each field word to the text using the contribution degree calculation method provided by the embodiments of the present invention. After the field words are ranked by contribution degree, evaluation parameters are selected from them and compared with the keywords the text defines for itself, and an evaluation result for the text is given accordingly. The embodiment thus evaluates how well the text grasps its own content, based on the field to which the text belongs and on information in the text itself. When a large number of texts are to be evaluated, it can serve as a preview step that gives texts a preliminary classification, which greatly improves the efficiency of existing text evaluation.

Referring to Fig. 4, an embodiment of the invention further provides a text classification method, comprising:

S401: selecting N texts in domain A, and determining M domain words of domain A and their contribution degrees in domain A;

The specific process of determining the M domain words of domain A and their contribution degrees in domain A may refer to the embodiments corresponding to Fig. 1 and Fig. 2, and is not repeated here.

S402: forming a standard text vector of domain A according to the contribution degrees of the M domain words of domain A;

The standard text vector is an M-dimensional vector with M elements, whose values are the contribution degrees of the M domain words to domain A. Taking the computer domain as an example, suppose the computer domain has 5 domain words, i.e. M=5, whose contribution degrees to the computer domain are 0.8, 0.1, 0.4, 0.9 and 0.7 respectively; the standard text vector of the computer domain formed from these contribution degrees is then [0.8, 0.1, 0.4, 0.9, 0.7].

S403: calculating, for each of the N texts in domain A, the contribution degrees of the M domain words in that text, and forming N comparison text vectors;

In practice, not every one of the N texts contains all M domain words; for a domain word that does not appear in a text, its contribution degree to that text is recorded as zero.

Continuing the example above, suppose one of the N texts does not contain the 3rd domain word; its comparison text vector is then [x_1, x_2, 0, x_4, x_5], where x_i is the contribution degree of the i-th domain word to that text.

The contribution degrees are calculated according to the method for calculating the contribution degree of a word in a text shown in Fig. 1, which is not repeated here.
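The construction of a comparison text vector in S403 can be sketched as follows. The sketch assumes the contribution degrees of the words found in one text have already been computed (by the method of Fig. 1) and are supplied as a dictionary; the function name `text_vector` and the sample values are hypothetical.

```python
def text_vector(domain_words, contributions):
    """Build an M-dimensional vector in the fixed order of the domain words.

    domain_words  -- ordered list of the M domain words of the domain
    contributions -- dict mapping a word to its contribution degree in one text
    A domain word absent from the text gets the value zero.
    """
    return [contributions.get(word, 0.0) for word in domain_words]


# Hypothetical example: the 3rd domain word does not occur in the text.
domain_words = ["w1", "w2", "w3", "w4", "w5"]
contributions = {"w1": 0.6, "w2": 0.2, "w4": 0.5, "w5": 0.3}
vector = text_vector(domain_words, contributions)
```

Keeping one fixed word order for every vector is what makes the later element-by-element comparison between vectors meaningful.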

S404: calculating the similarity between each comparison text vector and the standard text vector, and determining a text classification similarity threshold according to the N calculated similarities.

In this embodiment of the invention, the similarity is the cosine of the angle between the two vectors. The text classification similarity threshold is determined from the N similarities, for example by averaging them; it may also be determined by other calculation methods, which the invention does not limit.
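A minimal sketch of S404 follows, assuming the averaging option mentioned above is used to obtain the threshold. The function names are hypothetical, and returning 0.0 for an all-zero vector is an assumption of this sketch, since the embodiment does not address that case.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: an all-zero vector is treated as dissimilar
    return dot / (norm_u * norm_v)


def similarity_threshold(standard_vector, comparison_vectors):
    """Average of the N similarities between each comparison vector and the standard vector."""
    sims = [cosine_similarity(standard_vector, c) for c in comparison_vectors]
    return sum(sims) / len(sims)
```

Other aggregates of the N similarities could be substituted for the mean in `similarity_threshold`, since the embodiment leaves the choice open.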

S401 to S404 are the preparatory work for text classification; their purpose is to obtain the text classification similarity threshold used to classify texts in domain A.

Referring to Fig. 5, when a text to be classified is classified with respect to domain A, the method comprises:

S501: calculating the contribution degrees of the M domain words in the text to be classified;

Specifically, the contribution degrees of the M domain words in the text to be classified are calculated according to the method for determining the contribution degree of a word in a text shown in Fig. 1. The text to be classified is the first text in that method, and the M domain words are the at least one word in that method; the specific calculation may refer to the embodiment corresponding to Fig. 1 and is not repeated here.

It should be noted that when one of the M domain words is not contained in the text to be classified, the contribution degree of that domain word to the text to be classified is recorded as zero.

S502: forming a judgment text vector of the text to be classified from the contribution degrees of the M domain words;

The judgment text vector is formed in the same way as the standard text vector: it is an M-dimensional vector with M elements, and the value of each element is the contribution degree of the corresponding domain word to the text to be classified.

It should be noted that the positions of the M domain words are the same in every text vector, i.e. the same domain word corresponds to the same position in each text vector. Taking the computer domain as an example, suppose one of its domain words is "computer". If the domain word "computer" corresponds to the first position in the standard text vector, it also corresponds to the first position in the judgment text vector and in the comparison text vectors; that is, the first element of each of these three kinds of text vector records the contribution degree of the word "computer".

Preferably, to avoid the influence of differences in text length, each text vector may be normalized after it is obtained. Specifically, the Euclidean length (norm) of each vector can be used as the denominator of each of its elements. For example, for the standard text vector [0.8, 0.1, 0.4, 0.9, 0.7], the Euclidean length is

\sqrt{0.8^{2} + 0.1^{2} + 0.4^{2} + 0.9^{2} + 0.7^{2}} = \sqrt{2.11} \approx 1.45

The normalized standard text vector is then [0.8/1.45, 0.1/1.45, 0.4/1.45, 0.9/1.45, 0.7/1.45] = [0.55, 0.07, 0.28, 0.62, 0.48]. The comparison text vectors and the judgment text vector can be normalized in the same way as the standard text vector, which is not repeated here.
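The normalization described above can be sketched as follows. The function name `normalize` is hypothetical, and returning an all-zero vector unchanged is an assumption of this sketch.

```python
import math


def normalize(vector):
    """Divide every element by the Euclidean length of the vector."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        return list(vector)  # assumption: an all-zero vector is returned unchanged
    return [x / norm for x in vector]


standard = [0.8, 0.1, 0.4, 0.9, 0.7]
normalized = normalize(standard)
rounded = [round(x, 2) for x in normalized]
```

Rounded to two decimals, the result agrees with the vector [0.55, 0.07, 0.28, 0.62, 0.48] given above, and the normalized vector has unit length.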

S503: calculating the similarity between the judgment text vector and the standard text vector;

It should be noted that the similarity between the judgment text vector and the standard text vector is calculated by the same method as in S404.

S504: comparing the similarity with the text classification similarity threshold, and determining according to the comparison result whether the text to be classified belongs to domain A.

When the similarity is the cosine of the angle between the two vectors, determining according to the comparison result whether the text to be classified belongs to domain A comprises:

This embodiment of the invention takes the contribution degree of a domain word in a text as its basic element. The standard text vector and the comparison text vectors of the category are calculated, and the text classification threshold is derived from them; the judgment text vector of the text to be classified is determined from the contribution degrees of the domain words to that text; and the similarity between the judgment text vector and the standard text vector is then compared with the text classification threshold to give the classification result. Because the calculation of each text vector fully takes into account the position and number of occurrences of each domain word and the length of the word itself, the relationship between the text to be classified and the candidate domain is reflected comprehensively and faithfully.

Referring to Fig. 6, an embodiment of the invention further provides a method for automatically generating a text summary, comprising:

S601: obtaining a text T to be processed, and performing word segmentation on the text T;

S602: taking the text T as the first text and the words obtained by word segmentation as the at least one word, and calculating the contribution degree of each word in the text T to the text T;

The specific process of calculating the contribution degree of each word in the text T to the text T may refer to the embodiment corresponding to Fig. 1 and is not repeated here.

S603: sorting the words in the text T by contribution degree, and selecting the top M words as summary keywords;

S604: determining summary candidate sentences according to the summary keywords;

Specifically, determining the summary candidate sentences according to the summary keywords comprises:

S1: determining, for each summary keyword, the sentence in which it is located;

S2: when a summary keyword is located in a plurality of sentences, selecting the sentence containing the most keywords as the summary candidate sentence.

For example, if the top 5 words are selected as summary keywords, these 5 words are traced back to the article to find the sentences corresponding to them. If one of the words is found to correspond to 4 sentences, all four sentences may be selected, or a summary candidate sentence may be chosen from among them, for example the one of the four sentences that contains the most domain words.

S605: organizing the summary candidate sentences into a summary.

To form the summary from the summary candidate sentences, the candidate sentences may be concatenated in the order in which they appear in the original text; alternatively, a subset of the candidate sentences may be further selected so that they join smoothly into a summary. The invention does not limit the specific manner of formation.
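Steps S603 to S605 can be sketched as follows. This is an illustrative sketch under several assumptions: the text has already been segmented into sentences, each given as a list of words; the contribution degrees (computed by the method of Fig. 1) are supplied as a dictionary; when a keyword occurs in several sentences, the sentence containing the most summary keywords is chosen, with ties going to the earliest sentence; and the candidates are joined in original order. The function name `summarize` and the sample data are hypothetical.

```python
def summarize(sentences, contributions, top_m):
    """Pick summary candidate sentences and return them in original order.

    sentences     -- list of sentences, each a list of segmented words
    contributions -- dict mapping each word to its contribution degree in the text
    top_m         -- number of summary keywords to keep
    """
    # S603: sort words by contribution degree and keep the top M as keywords.
    keywords = sorted(contributions, key=contributions.get, reverse=True)[:top_m]
    keyword_set = set(keywords)

    chosen = set()
    for kw in keywords:
        # S604/S1: find the sentences in which the keyword is located.
        candidates = [i for i, s in enumerate(sentences) if kw in s]
        if not candidates:
            continue
        # S604/S2: among several sentences, keep the one with the most keywords.
        best = max(candidates, key=lambda i: len(keyword_set & set(sentences[i])))
        chosen.add(best)

    # S605: join the candidate sentences in the order they appear in the text.
    return [sentences[i] for i in sorted(chosen)]


sentences = [["cats", "purr"], ["dogs", "bark", "cats"], ["birds", "sing"]]
contributions = {"cats": 0.9, "dogs": 0.8, "birds": 0.3,
                 "bark": 0.2, "purr": 0.1, "sing": 0.05}
summary = summarize(sentences, contributions, top_m=2)
```

The alternative mentioned above, keeping every sentence that contains a keyword, would amount to adding all of `candidates` to `chosen` instead of only `best`.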

In the method provided by this embodiment of the invention, the contribution degree of each word to the text is calculated from the positions at which the word appears in the text, its number of occurrences and even the length of the word itself; the summary keywords of the text are then determined according to the contribution degrees; the summary candidate sentences are found according to the positions of the summary keywords in the text; and the text summary is formed from the summary candidate sentences. The summary is formed without manual participation and entirely on the basis of the contribution degrees of the words in the text. Because the contribution degree calculation takes into account the number of occurrences and positions of each word and even the length of the word itself, it reflects the contribution of the word to the text faithfully, so the summary formed by this method reflects the content of the text more faithfully.

Referring to Fig. 7, an embodiment of the invention further provides a text evaluation system, comprising:

an acquiring unit 701, configured to obtain a text to be evaluated in domain A and the Y keywords that the text defines for itself;

a domain word determining unit 702, configured to perform matching retrieval on the text to be evaluated using the domain words of domain A, and to determine the X domain words of domain A contained in the text to be evaluated;

The domain words of domain A are determined according to the method provided by the embodiment corresponding to Fig. 2, which is not repeated here.

a pre-review unit 703, configured to directly give a result that the text to be evaluated is unqualified when X=0, and to directly give a review result that the text to be evaluated is qualified when X > Y, i.e. when the X domain words are included in the Y keywords;

an evaluation unit 704, configured to, when X does not meet the conditions of the pre-review unit, take the text to be evaluated as the first text and the X domain words as the at least one word, and calculate the contribution degrees of the X domain words to the text to be evaluated; sort the X domain words by their contribution degrees to the text to be evaluated; determine the evaluation parameters according to the X domain words; compare the evaluation parameters with the Y keywords; and give a review result for the text to be evaluated according to the comparison result.

Specifically, the contribution degrees of the X domain words to the text to be evaluated can be calculated according to the method provided by the embodiment corresponding to Fig. 1.

Specifically, the evaluation unit 704 comprises:

an evaluation parameter calculation subunit, configured to take the text to be evaluated as the first text and the X domain words as the at least one word, and calculate the contribution degrees of the X domain words to the text to be evaluated; sort the X domain words by their contribution degrees to the text to be evaluated; and determine the evaluation parameters according to the X domain words;

a comparison subunit, configured to select the top Z domain words from the X domain words as the evaluation parameters; calculate the degree of conformity between the Y keywords and the evaluation parameters; when the degree of conformity reaches a preset threshold, give a review result that the text to be evaluated is qualified; and otherwise give a review result that the text to be evaluated is unqualified.

In the system provided by this embodiment of the invention, the domain words appearing in the text to be evaluated are determined according to the domain to which the text belongs; the contribution degree of each domain word to the text is determined by the contribution degree calculation method provided by the embodiments of the invention; the domain words are sorted by contribution degree; evaluation parameters are selected from them and compared with the keywords that the text defines for itself; and a review result for the text is given accordingly. The embodiment thus evaluates how well the text's own keywords capture its content, based on the domain of the text and on information in the text itself. When a large number of texts must be evaluated, the system can serve as a pre-review that performs a preliminary classification of the texts, greatly improving the efficiency of existing text evaluation.

Referring to Fig. 8, an embodiment of the invention further provides a text classification system, comprising:

a domain parameter determining unit 801, configured to select N texts in domain A and to determine M domain words of domain A and their contribution degrees in domain A;

Specifically, the M domain words of domain A and their contribution degrees in domain A can be determined with reference to the embodiments corresponding to Fig. 1 and Fig. 2.

a standard text vector generating unit 802, configured to form a standard text vector of domain A according to the contribution degrees of the M domain words of domain A;

a comparison text vector generating unit 803, configured to calculate, for each of the N texts in domain A, the contribution degrees of the M domain words in that text, and to form N comparison text vectors;

Specifically, the contribution degrees of the M domain words in each of the N texts in domain A can be calculated with reference to the embodiments corresponding to Fig. 1 and Fig. 2.

a similarity threshold determining unit 804, configured to calculate the similarity between each comparison text vector and the standard text vector, and to determine a text classification similarity threshold according to the N calculated similarities;

a judgment text vector calculating unit 805, configured to, when a text to be classified is obtained, calculate the contribution degrees of the M domain words in the text to be classified and form the judgment text vector of the text to be classified;

Specifically, the contribution degrees of the M domain words in the text to be classified can be calculated with reference to the embodiment corresponding to Fig. 1.

a similarity calculating unit 806, configured to calculate the similarity between the judgment text vector and the standard text vector;

a classification unit 807, configured to compare the similarity with the text classification similarity threshold, and to determine according to the comparison result whether the text to be classified belongs to domain A.

When the similarity that the similarity threshold determining unit calculates between a comparison text vector and the standard text vector is the cosine of the angle between the two vectors, the classification unit 807 is specifically configured as follows:

Referring to Fig. 9, an embodiment of the invention further provides a system for automatically generating a text summary, comprising:

a text acquiring unit 901, configured to obtain a text T to be processed and to perform word segmentation on the text T;

a contribution degree calculating unit 902, configured to take the text T as the first text and the words obtained by word segmentation as the at least one word, and to calculate the contribution degree of each word in the text T to the text T;

Specifically, the contribution degree of each word in the text T to the text T can be calculated according to the method provided by the embodiment corresponding to Fig. 1.

a keyword determining unit 903, configured to sort the words in the text T by contribution degree and to select the top M words as summary keywords;

a candidate sentence determining unit 904, configured to determine summary candidate sentences according to the summary keywords;

a summary forming unit 905, configured to organize the summary candidate sentences into a summary.

Specifically, the candidate sentence determining unit 904 comprises:

In the system provided by this embodiment of the invention, the contribution degree of each word to the text is calculated from the positions at which the word appears in the text, its number of occurrences and even the length of the word itself; the summary keywords of the text are then determined according to the contribution degrees; the summary candidate sentences are found according to the positions of the summary keywords in the text; and the text summary is formed from the summary candidate sentences. The summary is formed without manual participation and entirely on the basis of the contribution degrees of the words in the text. Because the contribution degree calculation takes into account the number of occurrences and positions of each word and even the length of the word itself, it reflects the contribution of the word to the text faithfully, so the summary formed by this system reflects the content of the text more faithfully.

The invention may be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures and the like that perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments, in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules may be located in both local and remote computer storage media, including storage devices.

The above is only a preferred embodiment of the invention. It should be noted that those of ordinary skill in the art may make several improvements and modifications without departing from the principle of the invention, and these improvements and modifications shall also be regarded as falling within the protection scope of the invention.

Claims

1. the method for the contribution degree of a definite word in text is characterized in that, comprising:

Obtain first text, from described first text, choose at least one word;

This first text is divided at least one text fragments;

2. method according to claim 1 is characterized in that, also comprises: the length of adding up described word; Described parameter also comprises the length of described word.

3. method according to claim 1 is characterized in that, comprises according to the contribution degree of the described word of adding up of calculation of parameter to described first text:

\mathrm{Weight}(CW_{i}) = \frac{1}{m} \sum_{q=1}^{m} w_{iq} \cdot f_{iq}

4. The method according to claim 2, characterized in that calculating the contribution degree of the word to the first text according to the counted parameters comprises:

\mathrm{Weight}(CW_{i}) = \frac{1}{m} \sum_{q=1}^{m} w_{iq} \cdot f_{iq} \cdot \log_{a}(l_{i} + \alpha),

5. The method according to claim 4, characterized in that the adjustable base a is the natural constant e.

6. The method according to any one of claims 3 to 5, characterized in that, when the positions of the text segments in the text are divided into a first region and a second region, w_{iq} is calculated by the following formula:

w_{iq} = \begin{cases} \ln(\frac{m - q}{m} + \beta) & q = y, \; y \in 1 \sim m \\ \lambda^{\sqrt{f_{iq}}} & q = -1 \end{cases},

7. A domain word sorting method, characterized by comprising:

selecting N texts in domain A as first texts;

8. The method according to claim 7, characterized by further comprising:

9. A text evaluation method, characterized by comprising:

10. The method according to claim 9, characterized in that determining the evaluation parameters according to the X domain words, comparing the evaluation parameters with the Y keywords, and giving a review result for the text to be evaluated according to the comparison result comprises:

11. A text classification method, characterized by comprising:

12. The method according to claim 11, characterized in that determining according to the comparison result whether the text to be classified belongs to domain A comprises:

13. A method for automatically generating a text summary, characterized by comprising:

obtaining a text T to be processed, and performing word segmentation on the text T;

organizing the summary candidate sentences into a summary.

14. The method according to claim 13, characterized in that determining the summary candidate sentences according to the summary keywords comprises:

determining, for each summary keyword, the sentence in which it is located;

15. A text evaluation system, characterized by comprising:

16. The system according to claim 15, characterized in that the evaluation unit comprises:

17. A text classification system, characterized by comprising:

18. The system according to claim 17, characterized in that the classification unit is configured to: when the similarity is not greater than the similarity threshold, determine that the text to be classified belongs to domain A; otherwise, determine that the text to be classified does not belong to domain A.

19. A system for automatically generating a text summary, characterized by comprising:

20. The system according to claim 19, characterized in that the candidate sentence determining unit comprises: