CN102609413A

CN102609413A - Control method and system for semantically enhanced relationship measure among word pairs

Info

Publication number: CN102609413A
Application number: CN2011100031947A
Authority: CN
Inventors: 吕钊; 曹艳娇; 蔡颂梅; 李琴; 梁璐; 俞云飞; 黄小霞; 严东宾
Original assignee: East China Normal University
Current assignee: East China Normal University
Priority date: 2011-01-09
Filing date: 2011-01-09
Publication date: 2012-07-25

Abstract

The invention provides a novel method for measuring relational similarity between word pairs. The method combines a corpus with a semantic dictionary. For English, web page content containing a seed word pair is retrieved from the corpus; phrases that begin with one seed word and end with the other, and whose length does not exceed a given threshold, are extracted from that content, and the corresponding patterns are generated. The key point is that part-of-speech conversion rules are defined, so that the converted forms of each seed word for all of its parts of speech are taken into account. When the middle words of the patterns are subsequently counted, words that do not reflect semantics, such as stop words, numerals and proper nouns, are filtered out, which reduces noise and saves computation. In addition, the semantic relatedness between the middle words of the two seed word pairs is calculated pairwise using a classic semantic relatedness measure. For Chinese, part-of-speech conversion is not considered, and the generated patterns undergo word segmentation. The method is not limited to a fixed corpus, so new semantic resources can be obtained, and with the help of the semantic dictionary the semantic relations between word pairs can be explored more thoroughly.

Description

A control method and system for semantically enhanced measurement of relations between word pairs

Technical field

The present invention relates to the technical field of natural language processing (NLP), and specifically to a method that combines a corpus with a semantic dictionary to measure the semantic relations between word pairs. In particular, it relates to a control method, and a corresponding control system, for semantically enhanced measurement of relations between word pairs.

Background art

Computer natural language processing plays an increasingly important role in China's modernization and informatization. Quantitatively measuring the relations between words, and between word pairs, is one of the research topics of natural language processing. The relational similarity between word pairs measures how similar their semantic relations are; when two word pairs have a high relational similarity, they are said to be analogous. Taking the seed word pairs A:B and C:D as an example, the analogy between two word pairs is usually written A:B::C:D, meaning that A is to B as C is to D. Research on relational similarity has broad applications in natural language processing fields such as information retrieval, information extraction, word sense disambiguation and machine translation. For this reason, and because research on semantic relatedness measures between words has achieved great success in recent years, research on relational similarity measures between word pairs is becoming a research focus in the field.

At present, methods for measuring the relational similarity of word pairs fall roughly into two categories: methods based on semantic resources and methods based on statistics. Relational similarity algorithms based on semantic resources are generally implemented with the help of a semantic dictionary. The relation types in a semantic dictionary reflect, to some extent, the semantic relations between words; existing measures assess the relational similarity between words through the path between the words in the semantic resource and the relation types defined in the resource that lie on that path (such as hypernym-hyponym relations and part-whole relations). However, such algorithms depend heavily on the system of lexical relations in the resource. The coverage of these relations is limited, which restricts practical application, and manually building a semantic resource on demand is costly and poorly portable. Statistics-based relational similarity algorithms have therefore become the mainstream of current research. Statistics-based methods are data-driven: the basic idea is to collect from a corpus the contextual information that co-occurs with a word pair and use it to calculate relational similarity. A corpus contains rich resources, and co-occurring words can indirectly reflect the semantic relation of a word pair. But this approach is somewhat one-sided: co-occurrence does not uncover the semantic relations between words, and it also suffers from heavy noise, data sparseness and long running times.

Both kinds of prior-art method therefore have problems, and no good technical solution is currently available. The present invention combines the two: it counts co-occurring words from a corpus and takes effective measures to remove words that cannot reflect semantic relations, which reduces noise in the calculation to some extent and saves computation time; at the same time it uses a semantic dictionary to uncover the semantic relations of word pairs, avoiding the defects of the traditional methods.

Summary of the invention

In view of the defects of the prior art, the object of the present invention is to provide a control method and a corresponding control device.

According to one aspect of the present invention, a control method for semantically enhanced measurement of relations between word pairs is provided, which is used to evaluate the relational similarity of English word pairs or Chinese word pairs in a corpus, characterized in that it comprises the following steps:

a. Obtain a first word pair, the word pair comprising a first word and a second word, where the first word and the second word are preferably present in the corpus;

b. Search the corpus using the word pair as a keyword of the form "first word, second word", and store the search results as a first intermediate result;

c. Extract all first-word pattern element words from the first intermediate result as a first-word pattern element set, and correspondingly build a second-word pattern element set;

d. For each first-word pattern element word in the first-word pattern element set and each second-word pattern element word in the second-word pattern element set, count the words that occur between the first-word pattern element word and the second-word pattern element word in the first intermediate result, and take the result as the middle word set of the first word pair;

e. Obtain a second word pair and perform steps a to d, finally obtaining the middle word set of the second word pair;

f. Using a semantic relatedness measure, calculate the semantic relatedness value between any middle word in the middle word set of the first word pair and any middle word in the middle word set of the second word pair, and store the semantic relatedness value of each such pair of middle words accordingly;

g. Calculate the relational similarity value of the first word pair and the second word pair using the following formula:

RS(A : B :: C : D) = \frac{\sum_{i, j}^{m, n} \left( rel(w_{AB_{i}}, w_{CD_{j}}) - \overline{rel} \right)^{2}}{m \cdot n}

where RS denotes the relational similarity between the first word pair A:B and the second word pair C:D; rel(w_{AB_{i}}, w_{CD_{j}}) is the semantic relatedness value, computed with some semantic relatedness measure, between any middle word of the first word pair and any middle word of the second word pair; \overline{rel} is the arithmetic mean of the semantic relatedness values between all middle words of the two word pairs; and the variables m and n are the numbers of middle words counted for the first word pair and the second word pair respectively.
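By way of non-limiting illustration, the formula can be sketched in Python as follows. The function name relational_similarity and the callable rel are hypothetical names introduced here; rel stands in for whichever semantic relatedness measure is chosen, which the text leaves open.

```python
from statistics import fmean
from typing import Callable, Sequence


def relational_similarity(
    middle_ab: Sequence[str],
    middle_cd: Sequence[str],
    rel: Callable[[str, str], float],
) -> float:
    """Mean squared deviation of the pairwise relatedness values from their mean."""
    if not middle_ab or not middle_cd:
        raise ValueError("both middle word sets must be non-empty")
    # rel(w_AB_i, w_CD_j) for every pair (i, j): m * n values in total.
    scores = [rel(u, v) for u in middle_ab for v in middle_cd]
    mean_rel = fmean(scores)  # the arithmetic mean, written with an overline above
    return sum((s - mean_rel) ** 2 for s in scores) / len(scores)
```

With m middle words for A:B and n for C:D, the list scores holds the m * n relatedness values, so dividing by its length matches the denominator of the formula.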

Preferably, when the word pairs are English words, the above control method is further embodied as the following steps:

Step 1: Using the word list of an existing English dictionary, add all of its words and their parts of speech to a database.

Step 2: Define the corresponding rules for part-of-speech conversion.

Step 3: For every word in the database, apply the rules from Step 2 according to its part of speech, and add the converted words to the corresponding fields of the database.

Step 4: The invention is illustrated with the seed word pairs A:B and C:D. Taking either word pair, say A:B, as a unit, search the corpus with it as the keyword, and save the returned content to a specified location in the form of an XML document.

Step 5: For the word pair A:B, from the web page content retrieved in the preceding step, extract all phrases that begin with A or one of its part-of-speech converted forms and end with B or one of its part-of-speech converted forms. Symmetry is also taken into account, so phrases that begin with B or one of its converted forms and end with A or one of its converted forms are extracted as well. A threshold k is set on the length of the phrases to be extracted.

Step 6: In the phrases extracted in Step 5, replace A and its part-of-speech converted forms with X, and B and its part-of-speech converted forms with Y. This yields all the patterns extracted for the word pair A:B.

Step 7: For the patterns obtained in Step 6, count their middle words (the words between X and Y in a pattern). Stop words, proper nouns (such as names of people, places and organizations) and numerals are not counted, because they generally denote fixed objects and do not reflect semantic relations. In the present invention, stop words specifically means words that cannot be found in the semantic dictionary. In addition, a compound word is split before it takes part in the count.

Step 8: The ultimate aim of the invention is to calculate the relational similarity of two word pairs. Perform Steps 1 to 7 for A:B and for C:D respectively, then use a classic semantic relatedness measure to calculate the semantic relatedness value between each middle word of A:B and each middle word of C:D, and store it in the database. Values that have already been calculated (as determined by querying the database) are not calculated again.

Step 9: On the basis of Step 8, use the proposed formula to calculate the relational similarity value between the word pairs A:B and C:D.
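The phrase extraction and pattern generation described in the steps above might be sketched as follows. This is only a minimal illustration under stated assumptions: the function name extract_patterns, the regular-expression tokenizer, and the choice to keep the shortest phrase from each starting position are all introduced here and are not specified by the source, and sentence boundaries are ignored for brevity.

```python
import re
from typing import List, Set


def extract_patterns(text: str, forms_a: Set[str], forms_b: Set[str], k: int) -> List[str]:
    """Return patterns such as 'X ... Y' or 'Y ... X' for phrases of at most k words."""
    # Assumption: a simple tokenizer; a real system may tokenize differently.
    tokens = re.findall(r"[A-Za-z0-9'-]+", text.lower())
    patterns = []
    for start, tok in enumerate(tokens):
        if tok in forms_a:
            first, last, enders = "X", "Y", forms_b
        elif tok in forms_b:
            # Symmetric case: the phrase begins with B and ends with A.
            first, last, enders = "Y", "X", forms_a
        else:
            continue
        # The whole phrase, including both seed words, may span at most k tokens.
        for end in range(start + 1, min(start + k, len(tokens))):
            if tokens[end] in enders:
                middle = tokens[start + 1:end]
                patterns.append(" ".join([first, *middle, last]))
                break  # assumption: keep only the shortest phrase from this start
    return patterns
```

Here forms_a and forms_b would hold a seed word together with its part-of-speech converted forms, for example {"abridge", "abridged"} and {"novel", "novels"}.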

If the research object is Chinese, the processing differs somewhat from that for English. The differences lie mainly in the first three steps above:

Step 1: Same as Step 4 above.

Step 2: Same as Step 5 above, but part-of-speech conversion need not be considered.

Step 3: Same as Step 6 above, but since part-of-speech conversion is not involved, only the seed words A and B themselves are replaced with X and Y to generate the patterns.

Step 4: Segment each pattern into words using an existing, relatively mature open-source word segmentation tool.

Step 5: Same as Step 7 above.

Step 6: Same as Step 8 above.

Step 7: Same as Step 9 above.

According to another aspect of the present invention, a control system for semantically enhanced measurement of relations between word pairs is also provided, which is used to evaluate the relational similarity of English word pairs or Chinese word pairs in a corpus, characterized in that the control system measures the relations between word pairs according to the control method described above.

Compared with the background art, the present invention has the following main advantages:

(1) Extensibility: because the corpus used changes dynamically in real time (anyone can edit it at any time), the application can analyze web pages using the most up-to-date and relatively comprehensive content.

(2) Accuracy: during phrase extraction, the part of speech of each word and its corresponding part-of-speech converted forms are taken into account. This ensures comprehensive phrase extraction and goes beyond earlier approaches, which extracted only phrases containing the seed word pair itself. In addition, the method not only counts the middle words of the generated patterns from the corpus but also uses the semantic dictionary to uncover the semantic relations of word pairs; this comprehensive approach improves the accuracy of the results.

Description of drawings

Other features, objects and advantages of the present invention will become more apparent from the detailed description of non-limiting embodiments given with reference to the following drawings:

Fig. 1 is a flow chart of a control method for semantically enhanced measurement of relations between word pairs according to a first embodiment of the present invention;

Fig. 2 is a schematic diagram of measuring the relations between English word pairs with the control method, according to a second embodiment of the present invention;

Fig. 3 is a schematic diagram of measuring the relations between Chinese word pairs with the control method, according to a third embodiment of the present invention;

Fig. 4 is a schematic diagram of how the word abridge is represented in the database in the control method, according to a fourth embodiment of the present invention; and

Fig. 5 is a schematic diagram of how two words and their semantic relatedness are represented in the database in the control method, according to a fourth embodiment of the present invention.

Detailed description of the embodiments

Most existing methods for measuring relational similarity are statistics-based. Such methods do not uncover the semantic relations of word pairs, and because they rely mainly on statistics over a huge corpus, the data are noisy and processing is time-consuming. The object of the present invention is to provide a new semantically enhanced method for measuring relations between word pairs, which combines a resource-rich corpus (such as Wikipedia) with a semantic dictionary (such as WordNet). It not only extracts the vocabulary associated with word pairs from a rich corpus, but also uses the semantic dictionary to uncover the semantic relations of word pairs; a classic semantic relatedness measure is adopted in the present invention. The method goes beyond traditional measures that are based only on statistics or only on semantic resources, and achieves a good semantic enhancement effect.

The object of the invention is achieved as follows. If the research object is English, words and their parts of speech are saved to a database with the help of an English dictionary, and all part-of-speech converted forms of each word, obtained according to the conversion rules that have been defined, are then added to the database as well. Patterns for the word pairs are extracted from the corpus; during extraction, the part of speech of each word and its part-of-speech converted forms are taken into account. The middle words in the patterns are then counted. Next, a classic semantic relatedness measure is used to calculate the pairwise semantic relatedness between the middle words extracted for the two word pairs. Finally, the proposed formula is used to calculate the relational similarity of the word pairs.

Specifically, Fig. 1 shows a flow chart of a control method for semantically enhanced measurement of relations between word pairs according to the first embodiment of the present invention. Step S101 is performed first: a first word pair is obtained, the word pair comprising a first word and a second word. The first word and the second word are preferably present in the corpus. Specifically, those skilled in the art will understand that in this embodiment a corpus is built first; in theory this corpus contains all the words needed, and on this basis the control method provided by the invention measures the relations between all word pairs in the corpus. Further, those skilled in the art will understand that the corpus can be obtained through a third-party system; for example, when the control system provided by the invention needs to compare a word pair, it sends a request for the word pair to the third-party system, which returns the word pair to the control system according to the request. In another variant, the corpus is not built in advance but is determined on the fly for a particular scope. For example, for a given web page, a corpus is formed from all the words in that page through various kinds of analysis; preferably, the web page content can be analyzed by an RSS parser to build this temporary corpus. Further, those skilled in the art will understand that when a corpus is built on the fly in this way, word pair analysis can be performed on different word content, so the method can be applied to a wider range of implementation needs. Specifically, the technical solution for building the corpus can be implemented with reference to the embodiments shown in Fig. 2 and Fig. 3; those skilled in the art can implement the above embodiment and its variants by combining the prior art with the above description, and details are not repeated here.

Next, step S102 is performed: the corpus is searched using the word pair as a keyword of the form "first word, second word", and the search results are stored as a first intermediate result. Once the first word and second word to be measured have been determined in step S101, the two words are treated as a word pair, and in this step the corpus is searched with the word pair as a keyword of the form "first word, second word". Specifically, those skilled in the art will understand that the search can be implemented by various technical solutions; for example, a search result may take the form "first word XXXX second word" or the form "XXX first word second word XXXX". This does not affect the substance of the invention and is not described further here. Further, the search results are stored as the first intermediate result, which is used in the subsequent steps. Those skilled in the art will understand that the first intermediate result can be stored in various ways, for example in XML format, in plain text format, or in a database; those skilled in the art can implement the storage of the first intermediate result by combining the prior art with the above embodiment, and details are not repeated here.
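As one possible illustration of the XML storage option mentioned above, the search results could be serialized with the Python standard library. The element names results and passage, the attribute names, and the function name results_to_xml are hypothetical; the source does not define a schema.

```python
import xml.etree.ElementTree as ET
from typing import Iterable


def results_to_xml(first: str, second: str, passages: Iterable[str]) -> str:
    """Serialize retrieved passages for one word pair as an XML string."""
    # Hypothetical schema: one <results> root carrying the word pair as attributes.
    root = ET.Element("results", {"first": first, "second": second})
    for text in passages:
        # Each retrieved passage becomes one <passage> child; text is escaped by ElementTree.
        ET.SubElement(root, "passage").text = text
    return ET.tostring(root, encoding="unicode")
```

The returned string can be written to a file at the specified location, or parsed back with ET.fromstring in a later step.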

Step S103 is then performed: all first-word pattern element words are extracted from the first intermediate result as a first-word pattern element set, and a second-word pattern element set is built correspondingly. Preferably, if the first word is a noun, the set includes its plural; if the first word is a verb, the set preferably includes its past tense, simple present, present perfect and present progressive forms; if the first word is an adjective, the set preferably includes its comparative and superlative; and if the first word is an adverb, the set preferably includes its comparative and superlative. Similarly, the second-word pattern element set can be determined by looking up the second word in the first intermediate result: preferably, if the second word is a noun, the set includes its plural; if it is a verb, its past tense, simple present, present perfect and present progressive forms; if it is an adjective, its comparative and superlative; and if it is an adverb, its comparative and superlative. Details are not repeated here. Those skilled in the art will understand that the first-word pattern element set is the set of all first-word pattern element words and contains at least the first word itself.
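The following sketch illustrates how a pattern element set might be built from a word and its part of speech. It is a deliberately naive assumption: only regular English inflection is handled, irregular forms and spelling changes such as consonant doubling or a final y are ignored, and the function name inflected_forms and the tags "n", "v", "adj" and "adv" are hypothetical. The source instead obtains such forms from conversion rules stored in a database.

```python
from typing import Set


def inflected_forms(word: str, pos: str) -> Set[str]:
    """Return the word plus naive regular inflections for the given part of speech."""
    forms = {word}  # the set always contains the word itself
    stem = word[:-1] if word.endswith("e") else word
    sibilant = word.endswith(("s", "x", "ch", "sh"))
    if pos == "n":
        forms.add(word + ("es" if sibilant else "s"))  # plural
    elif pos == "v":
        forms.add(word + ("es" if sibilant else "s"))  # third-person simple present
        forms.add(stem + "ed")   # past tense and past participle
        forms.add(stem + "ing")  # present participle
    elif pos in ("adj", "adv"):
        forms.add(stem + "er")   # comparative
        forms.add(stem + "est")  # superlative
    return forms
```

For the verb abridge this yields abridge, abridges, abridged and abridging.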

Next, step S104 is performed: for each first-word pattern element word in the first-word pattern element set and each second-word pattern element word in the second-word pattern element set, the words that occur between the first-word pattern element word and the second-word pattern element word in the first intermediate result are counted, and the result is taken as the middle word set of the first word pair. Step S103 has produced the first-word pattern element set, containing one or more first-word pattern element words, and the corresponding second-word pattern element set, containing one or more second-word pattern element words. The count is then made over each first-word pattern element word and each second-word pattern element word, and the result is taken as the middle word set of the first word pair. Those skilled in the art will understand that the words found between each combination of a first-word pattern element word and a second-word pattern element word are elements of the middle word set of the first word pair; details are not repeated here.

Step S105 is then performed: a second word pair is obtained and steps S101 to S104 are performed, finally yielding the middle word set of the second word pair. Once the second word pair has been determined, its middle word set can be determined in the same way as in steps S101 to S104; details are not repeated here. Preferably, every word in the first word pair differs from every word in the second word pair; less preferably, some words may be the same. This does not affect the substance of the invention and is not described further here.

Next, step S106 is performed: using a semantic relatedness measure, the semantic relatedness value between any middle word in the middle word set of the first word pair and any middle word in the middle word set of the second word pair is calculated, and the semantic relatedness value of each such pair of middle words is stored accordingly. Specifically, those skilled in the art can calculate these values with reference to semantic relatedness measures in the prior art; for example, at least the Gloss Vectors algorithm can be used. Details are not repeated here.
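Storing each relatedness value so that it is not recomputed could be sketched with an in-memory SQLite table, as below. The class name RelatednessCache, the table layout, and the assumption that the relatedness measure is symmetric (so the two words can be stored in sorted order) are all introduced here for illustration; the wrapped callable rel stands in for whichever measure is chosen.

```python
import sqlite3
from typing import Callable


class RelatednessCache:
    """Look up a stored relatedness value, computing and storing it only when absent."""

    def __init__(self, rel: Callable[[str, str], float]) -> None:
        self._rel = rel
        self._db = sqlite3.connect(":memory:")
        self._db.execute(
            "CREATE TABLE relatedness (w1 TEXT, w2 TEXT, score REAL, PRIMARY KEY (w1, w2))"
        )

    def get(self, u: str, v: str) -> float:
        w1, w2 = sorted((u, v))  # assumption: the measure is symmetric
        row = self._db.execute(
            "SELECT score FROM relatedness WHERE w1 = ? AND w2 = ?", (w1, w2)
        ).fetchone()
        if row is not None:
            return row[0]  # already calculated: reuse the stored value
        score = self._rel(w1, w2)
        self._db.execute("INSERT INTO relatedness VALUES (?, ?, ?)", (w1, w2, score))
        return score
```

Replacing ":memory:" with a file path would make the stored values persist between runs.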

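Finally, step S107 is performed: the relational similarity value of the first word pair and the second word pair is calculated using the formula

RS(A : B :: C : D) = \frac{\sum_{i, j}^{m, n} \left( rel(w_{AB_{i}}, w_{CD_{j}}) - \overline{rel} \right)^{2}}{m \cdot n}

Preferably, RS denotes the relational similarity between the first word pair A:B and the second word pair C:D; rel(w_{AB_{i}}, w_{CD_{j}}) is the semantic relatedness value between a middle word of the first word pair and a middle word of the second word pair; \overline{rel} is the arithmetic mean of these values; and m and n are the numbers of middle words counted for the first and second word pairs respectively.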

Further, those skilled in the art will understand that in a variant, step S104 comprises the following steps: extracting from the first intermediate result all phrases that begin with a word in the first-word pattern element set and end with a word in the second-word pattern element set, the number of words in each phrase not exceeding a first threshold k; and counting all middle words of the extracted phrases and taking them as the middle word set of the first word pair.
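Counting the middle words of the extracted phrases might look like the following sketch. It assumes each phrase has already been extracted as a whitespace-separated string whose first and last tokens are the seed words (or their X and Y placeholders); the function name count_middle_words is hypothetical.

```python
from collections import Counter
from typing import Iterable


def count_middle_words(phrases: Iterable[str]) -> Counter:
    """Count every token lying strictly between the first and last token of each phrase."""
    counts: Counter = Counter()
    for phrase in phrases:
        tokens = phrase.split()
        # Drop the seed words (or placeholders) at both ends; the rest are middle words.
        counts.update(tokens[1:-1])
    return counts
```

The keys of the returned counter form the middle word set, and its size gives the m or n used in the similarity formula.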

Further, those skilled in the art will understand that in a variant, the first-word pattern element words comprise the following:

- phrases beginning with the first word; and

- phrases beginning with a part-of-speech converted form of the first word. Specifically, those skilled in the art can implement the part-of-speech converted forms, and the phrases beginning with them, with reference to the embodiments shown in Fig. 4 and Fig. 5; details are not repeated here.

Further, those skilled in the art will understand that in a variant, the second-word pattern element words comprise the following:

- phrases ending with the second word; and

- phrases ending with a part-of-speech converted form of the second word. Correspondingly, those skilled in the art can implement the part-of-speech converted forms, and the phrases ending with them, with reference to the embodiments shown in Fig. 4 and Fig. 5; details are not repeated here.

Further, those skilled in the art will understand that in a variant, the part-of-speech converted forms are stored in advance in association with the first word or second word, and are obtained by reading the storage area in which they are stored. For example, a database providing an electronic dictionary may pre-store the part-of-speech converted forms of the relevant words; the control system then sends the database a request for the converted forms and determines all of them from the result the database returns. Details are not repeated here.

Further, those skilled in the art will understand that in a variant, the control system sends a query for the first word or second word to a third-party system and determines the part-of-speech converted forms from the information the third-party system returns. Those skilled in the art will understand how to implement this, and details are not repeated here. Specifically, those skilled in the art can also implement this process with reference to the embodiments shown in Fig. 4 and Fig. 5.

Further, those skilled in the art will understand that some words that are clearly uninformative can be filtered out of the middle words in steps S104 and S105. Preferably, the middle words exclude at least one or more of the following: stop words; proper nouns, where proper nouns include at least names of people, places and organizations; numerals; and so on. Specifically, those skilled in the art can determine the content of the middle words with reference to the embodiment shown in Fig. 2; details are not repeated here.
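A minimal sketch of such filtering is given below. It follows the definition used elsewhere in this description, under which a stop word is a word that cannot be found in the semantic dictionary; the dictionary is modeled here as a plain set of lowercase words. The capital-letter test for proper nouns is a crude assumption made for illustration (it would also drop a sentence-initial word), and the function name filter_middle_words is hypothetical.

```python
from typing import Iterable, List, Set


def filter_middle_words(tokens: Iterable[str], dictionary: Set[str]) -> List[str]:
    """Keep only tokens that are not numerals, proper nouns, or stop words."""
    kept = []
    for tok in tokens:
        if any(ch.isdigit() for ch in tok):
            continue  # numerals
        if tok[:1].isupper():
            continue  # assumption: a capitalized token is treated as a proper noun
        if tok.lower() not in dictionary:
            continue  # stop word in the sense used here: absent from the dictionary
        kept.append(tok.lower())
    return kept
```

In practice the membership test would query the semantic dictionary itself rather than an in-memory set.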

Further, those skilled in the art will understand that in a variant, the step in step S104 of counting the words between the first-word pattern element word and the second-word pattern element word in the first intermediate result further comprises the following steps:

i. determining whether a word between the first-word pattern element word and the second-word pattern element word in the first intermediate result is a compound word;

ii. if the word is a compound word, splitting it;

iii. counting the words obtained from the split.

Those skilled in the art will understand that this variant gives compound words special treatment: a compound word is split into two or more independent words, which are then counted. Those skilled in the art can implement this variant by combining the prior art with the above embodiment; details are not repeated here.
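The compound word handling can be illustrated as follows, under the assumption (not stated in the source) that a compound word is recognized by a hyphen or underscore joining its parts. The function name split_compounds is hypothetical.

```python
import re
from typing import List


def split_compounds(tokens: List[str]) -> List[str]:
    """Replace each hyphenated or underscored compound with its component words."""
    parts: List[str] = []
    for tok in tokens:
        # Tokens without a separator pass through unchanged; empty pieces are dropped.
        parts.extend(p for p in re.split(r"[-_]", tok) if p)
    return parts
```

The resulting components are then counted like any other middle words.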

Fig. 2 shows a schematic diagram of measuring the relations between English word pairs with the control method, according to the second embodiment of the present invention. Those skilled in the art will understand that the invention uses a corpus to extract phrases and generate patterns for the word pairs, and then counts the middle words in the extracted patterns. Next, a classic semantic relatedness measure is used to calculate the pairwise semantic relatedness between the middle words of the two word pairs, and finally a proposed formula is used to measure the relational similarity of the seed word pairs. The specific operating steps differ for Chinese and English. If the research object is English, the specific implementation steps are as follows:

Step 1: Add the words in the word list of an English dictionary, with their parts of speech, to a database.

Step 2: Define different conversion rules for different parts of speech.

Step 3: Obtain the converted forms of each word in the database according to its part of speech and the rules defined, and store them in the database.

Step 4: Search the corpus for web pages containing the seed word pair and save them as XML files.

Step 5: From the saved web pages, extract the phrases containing the seed word pair. Taking the seed word pair A:B as an example, an extracted phrase must begin with A or one of its part-of-speech converted forms (see Step 3) and end with B or one of its converted forms. The symmetry of the word pair is also taken into account: phrases that begin with B or one of its converted forms and end with A or one of its converted forms are extracted as well. A threshold k is set on phrase length. Once all qualifying phrases for the seed word pair have been obtained, every A or converted form of A is replaced with X, and every B or converted form of B is replaced with Y. This generates all the patterns for the seed word pair.

Step 6: Count the distinct middle words (the words between X and Y) of the patterns obtained in the previous step.

Step 7: Perform Steps 4 to 6 in the same way for the other seed word pair to obtain the middle words of all its patterns (steps 9 to 12 in Fig. 2).

Step 8: Use a classic semantic relatedness measure to calculate the pairwise semantic relatedness between the middle words of the two seed word pairs and record it in the database; word pairs whose relatedness has already been calculated are not calculated again.

Step 9: Use the proposed formula to calculate the relational similarity of the word pairs A:B and C:D.

Step 10: End.

Further, building on the embodiment shown in Fig. 2, Fig. 3 is a schematic diagram, according to a third embodiment of the present invention, of measuring the relational similarity of Chinese word pairs with the control method for semantically enhanced relation measurement between word pairs. Specifically, if the research object is Chinese, the specific implementation steps of the control method provided by the invention are as follows:

First step: same as the fourth step of the English procedure.

Fifth step: same as the seventh step of the English procedure.

Sixth step: same as the eighth step of the English procedure.

Seventh step: same as the ninth step of the English procedure.

Eighth step: end.

Further, Fig. 4 and Fig. 5 together show intermediate results of the control method for semantically enhanced relation measurement between word pairs. Fig. 4 is a schematic diagram, according to a fourth embodiment of the present invention, of how the word abridge is represented in the database; Fig. 5 is a schematic diagram, according to the fourth embodiment, of how two words and their semantic relatedness are represented in the database.

Specifically, the whole flow is illustrated with two word pairs, abridge:novel and abbreviate:word:

Step 1: The first to third steps described above are carried out for all words. Taking the word abridge as an example, its representation in the database table is shown in Fig. 4 and is not described further here. Those skilled in the art will appreciate from Fig. 4 that abridge has only the part of speech vt. (transitive verb), so its third-person singular simple present form (abridges), simple past form (abridged), and present participle (abridging) are derived. Step 2 is executed next:
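The part-of-speech conversion in Step 1 can be sketched as a small rule table keyed by part of speech. This is only a minimal illustration: the function names, the `RULES` table, and the suffix rules are assumptions, not the rules used by the invention, and they cover only regular forms.

```python
# Hypothetical, deliberately naive conversion rules (regular forms only).
# Real English morphology has many exceptions that a full rule set must handle.

def verb_forms(verb: str) -> list[str]:
    """Third-person singular, simple past, and present participle of a regular verb."""
    if verb.endswith("e"):
        return [verb + "s", verb + "d", verb[:-1] + "ing"]
    return [verb + "s", verb + "ed", verb + "ing"]


def noun_forms(noun: str) -> list[str]:
    """Plural of a regular noun."""
    if noun.endswith(("s", "x", "ch", "sh")):
        return [noun + "es"]
    return [noun + "s"]


# Part-of-speech tag -> rule; the tags mimic dictionary labels such as "vt.".
RULES = {"vt.": verb_forms, "n.": noun_forms}


def conversion_words(word: str, pos: str) -> list[str]:
    """Return the conversion words for `word`, or an empty list if no rule applies."""
    rule = RULES.get(pos)
    return rule(word) if rule else []


print(conversion_words("abridge", "vt."))  # ['abridges', 'abridged', 'abridging']
print(conversion_words("novel", "n."))     # ['novels']
```

In a fuller version the returned forms would be written to the database next to the base word, so that later steps can look them up instead of recomputing them.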

Step 2: The word pair abridge novel is entered into the Wikipedia search box and searched, and the content of the returned web pages is saved as an XML document. An excerpt is as follows:

<item id="26">

<title>List of Doctor Who audiobooks</title>

<content>There have been many readings of Doctor Who novels, mostly from the New Series Adventures range. The only Seventh Doctor audiobook to be released so far is a reading by David Banks of his New Adventures novel for the RNIB. In 2006, the BBC began production of abridged readings of their Tenth Doctor novels. In 2007, the RNIB produced unabridged versions of three of them. There are also exclusive audiobooks not published in print. Torchwood audiobooks began with abridged readings of published novels, but then switched to exclusive stories not published in print. The novel was serialised as abridged.</content></item>
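Reading such a saved document back can be sketched with Python's standard `xml.etree.ElementTree` module. The enclosing `<items>` root element and the `load_pages` helper are assumptions made for this illustration; the excerpt above shows only a single `<item>`.

```python
import xml.etree.ElementTree as ET

# A shortened copy of the excerpt, wrapped in an assumed <items> root element.
XML_TEXT = """<items>
<item id="26">
<title>List of Doctor Who audiobooks</title>
<content>In 2006, the BBC began production of abridged readings of their Tenth Doctor novels. The novel was serialised as abridged.</content>
</item>
</items>"""


def load_pages(xml_text: str) -> list[dict]:
    """Return one dict per <item>, holding its id, title, and text content."""
    root = ET.fromstring(xml_text)
    return [
        {
            "id": item.get("id"),
            "title": item.findtext("title"),
            "content": item.findtext("content"),
        }
        for item in root.iter("item")
    ]


pages = load_pages(XML_TEXT)
print(pages[0]["id"], pages[0]["title"])
```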

Step 3 follows: from the content of the XML above, extract every phrase that begins with abridge or one of its part-of-speech conversion words (abridges, abridged, abridging) and ends with novel or its part-of-speech conversion word (novels), or that begins with novel or its conversion word and ends with abridge or one of its conversion words. The extraction result is as follows:

abridged readings of their Tenth Doctor novels

abridged readings of published novels

novel was serialised as abridged
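The extraction in Step 3 can be sketched as a scan over tokens. The text does not say how sentence boundaries are treated, what value k takes, or which match is kept when several end words follow a start word. The sketch below assumes phrases stay inside one sentence, uses k = 8, and stops at the first matching end word; `extract_phrases` is a hypothetical name.

```python
import re

ABRIDGE = {"abridge", "abridges", "abridged", "abridging"}
NOVEL = {"novel", "novels"}

SAMPLE = (
    "In 2006, the BBC began production of abridged readings of their Tenth "
    "Doctor novels. Torchwood audiobooks began with abridged readings of "
    "published novels, but then switched to exclusive stories not published "
    "in print. The novel was serialised as abridged."
)


def extract_phrases(text, first_forms, second_forms, k):
    """Collect phrases of at most k words that start with a form of one seed
    word and end with a form of the other (both directions are searched)."""
    phrases = []
    for sentence in re.split(r"[.!?;]", text):  # assumption: stay within a sentence
        tokens = re.findall(r"\w+", sentence)
        lowered = [t.lower() for t in tokens]
        for i, tok in enumerate(lowered):
            for starts, ends in ((first_forms, second_forms), (second_forms, first_forms)):
                if tok not in starts:
                    continue
                # j <= i + k - 1 keeps the phrase length within k words
                for j in range(i + 1, min(i + k, len(tokens))):
                    if lowered[j] in ends:
                        phrases.append(" ".join(tokens[i:j + 1]))
                        break  # assumption: keep only the first match
    return phrases


for phrase in extract_phrases(SAMPLE, ABRIDGE, NOVEL, k=8):
    print(phrase)
```

On this sample the sketch yields the same three phrases as listed above; with a different k or a different boundary policy the result could differ.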

Step 4 is then executed: in the 3 phrases extracted in the previous step, abridge and its part-of-speech conversion words are replaced with X, and novel and its part-of-speech conversion words are replaced with Y, which yields all the patterns of the word pair abridge:novel:

X readings of their Tenth Doctor Y

X readings of published Y

Y was serialised as X
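The replacement in Step 4 amounts to a token-wise substitution. A minimal sketch follows, in which the name `to_pattern` and the whitespace tokenization are assumptions:

```python
ABRIDGE = {"abridge", "abridges", "abridged", "abridging"}
NOVEL = {"novel", "novels"}

PHRASES = [
    "abridged readings of their Tenth Doctor novels",
    "abridged readings of published novels",
    "novel was serialised as abridged",
]


def to_pattern(phrase, first_forms, second_forms):
    """Replace forms of the first seed word with X and forms of the second with Y."""
    out = []
    for token in phrase.split():
        low = token.lower()
        if low in first_forms:
            out.append("X")
        elif low in second_forms:
            out.append("Y")
        else:
            out.append(token)
    return " ".join(out)


patterns = [to_pattern(p, ABRIDGE, NOVEL) for p in PHRASES]
for pattern in patterns:
    print(pattern)
```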

Step 5 follows: from the patterns obtained in the previous step, count the middle words (the words between X and Y), while removing stop words, proper nouns (such as names of people, places, and organizations), and numerals, and reducing each counted word to its base form. The result is as follows:

reading publish be serialise as

Those skilled in the art will appreciate that of and their are stop words, Tenth is a numeral, and Doctor is a title that can be classed as a proper noun, so these words are not counted.
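The counting in Step 5 can be sketched with small hand-written lookup tables. The stop-word list, proper-noun list, numeral list, and base-form table below are toy assumptions sized for this example only. The stop-word list deliberately omits be and as, because the result above keeps them.

```python
from collections import Counter

# Toy resources, assumed for illustration only.
STOP_WORDS = {"of", "their", "the", "a", "an"}
PROPER_NOUNS = {"Doctor"}
NUMERALS = {"tenth"}
BASE_FORMS = {"readings": "reading", "published": "publish",
              "was": "be", "serialised": "serialise"}

PATTERNS = [
    "X readings of their Tenth Doctor Y",
    "X readings of published Y",
    "Y was serialised as X",
]


def middle_word_counts(patterns):
    """Count the base forms of the words between the X and Y placeholders."""
    counts = Counter()
    for pattern in patterns:
        inner = pattern.split()[1:-1]  # each pattern starts and ends with X or Y
        for token in inner:
            low = token.lower()
            if low in STOP_WORDS or token in PROPER_NOUNS:
                continue
            if low in NUMERALS or token.isdigit():
                continue
            counts[BASE_FORMS.get(low, low)] += 1
    return counts


counts = middle_word_counts(PATTERNS)
print(" ".join(counts))  # reading publish be serialise as
```

Only the kinds of middle words are used in the later steps of this example, but keeping the counts costs nothing and makes it easy to inspect how often each word appeared.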

Step 6 follows: in the same way, Steps 1 to 5 are carried out for the other word pair, abbreviate:word. Suppose the middle words obtained for it are:

be form derive indicate in as

Step 7 is then executed: a classical semantic relatedness measure (Gloss Vectors was used in the experiments) is used to compute the semantic relatedness value between every middle word of abridge:novel and every middle word of abbreviate:word, and the values are stored in the database. Their representation in the database table is shown in Fig. 5 and is not described further here.

Thus the semantic relatedness values between the 5 middle words of abridge:novel and the 6 middle words of abbreviate:word are computed, giving 5*6=30 values in total.
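The pairwise computation and the reuse of previously computed values can be sketched as follows. Gloss Vectors itself is not reproduced here: `toy_relatedness` is a stand-in (character-set overlap) so that the sketch runs on its own. A plain dictionary plays the role of the database table, and the cache key assumes the measure is symmetric.

```python
from itertools import product

MIDDLE_AB = ["reading", "publish", "be", "serialise", "as"]    # abridge:novel
MIDDLE_CD = ["be", "form", "derive", "indicate", "in", "as"]   # abbreviate:word


def toy_relatedness(w1: str, w2: str) -> float:
    """Placeholder measure, NOT Gloss Vectors: Jaccard overlap of letters."""
    a, b = set(w1), set(w2)
    return len(a & b) / len(a | b)


def pairwise_relatedness(words_ab, words_cd, measure, cache):
    """Return {(w_ab, w_cd): value}, computing each unordered pair only once."""
    scores = {}
    for w1, w2 in product(words_ab, words_cd):
        key = tuple(sorted((w1, w2)))  # assumes measure(w1, w2) == measure(w2, w1)
        if key not in cache:
            cache[key] = measure(w1, w2)
        scores[(w1, w2)] = cache[key]
    return scores


cache = {}
scores = pairwise_relatedness(MIDDLE_AB, MIDDLE_CD, toy_relatedness, cache)
print(len(scores))  # 30
```

Because be and as occur in both lists, the pairs (be, as) and (as, be) share one cache entry, so 29 values are actually computed for the 30 scores.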

Finally, Step 8 is executed: the formula provided by the invention is used to compute the relational similarity value between the word pairs abridge:novel and abbreviate:word. The formula is as follows:

RS(A:B::C:D) = \frac{\sum_{i=1}^{m} \sum_{j=1}^{n} \left( rel(w_{AB_i}, w_{CD_j}) - rel \right)^{2}}{m \cdot n}
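As a rough numerical illustration, the formula can be evaluated over the m*n relatedness values from Step 7. The text does not define the bare rel term that is subtracted inside the square. The sketch below assumes it is the mean of the m*n values, which makes RS their variance, and exposes it as a `reference` parameter in case another definition is intended. The function name and the sample values are hypothetical.

```python
def relational_similarity(values, reference=None):
    """Evaluate RS over a flat list of the m*n pairwise relatedness values.

    `reference` is the term subtracted inside the square. ASSUMPTION: when it
    is not given, the mean of the values is used; the source leaves it undefined.
    """
    values = list(values)
    if not values:
        raise ValueError("at least one relatedness value is required")
    if reference is None:
        reference = sum(values) / len(values)
    return sum((v - reference) ** 2 for v in values) / len(values)


# Made-up relatedness values for a 2 x 2 case (m = 2, n = 2).
sample_values = [0.2, 0.4, 0.6, 0.8]
print(relational_similarity(sample_values))
```

Here m and n are read as the numbers of middle words of A:B and C:D respectively (5 and 6 in the example above), so `values` would hold the 30 relatedness values computed in Step 7.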

Specific embodiments of the present invention have been described above. It should be understood that the present invention is not limited to the specific embodiments described; those skilled in the art may make various variations or modifications within the scope of the claims without affecting the substance of the invention.

Claims

1. A control method for semantically enhanced relation measurement between word pairs, used to evaluate the relational similarity of English word pairs or Chinese word pairs in a corpus, characterized by comprising the following steps:

c. determining all first word pattern element words by means of the formulated word conversion rules, taking them as the first word pattern element set, and correspondingly establishing the second word pattern element set;

RS(A:B::C:D) = \frac{\sum_{i=1}^{m} \sum_{j=1}^{n} \left( rel(w_{AB_i}, w_{CD_j}) - rel \right)^{2}}{m \cdot n}

2. The control method according to claim 1, characterized in that said first intermediate result is a document in XML format.

3. The control method according to claim 1 or 2, characterized in that said first word pattern element words comprise at least any one or more of the following:

- if the first word is a noun, its plural form;

- if the first word is a verb, its past tense, simple present tense, present perfect tense, and present progressive tense forms;

- if the first word is an adjective, its comparative and superlative forms; and

- if the first word is an adverb, its comparative and superlative forms.

4. The control method according to any one of claims 1 to 3, characterized in that said second word pattern element words comprise at least any one or more of the following:

- if the second word is a noun, its plural form;

- if the second word is a verb, its past tense, simple present tense, present perfect tense, and present progressive tense forms;

- if the second word is an adjective, its comparative and superlative forms; and

- if the second word is an adverb, its comparative and superlative forms.

5. The control method according to claim 3 or 4, characterized in that said part-of-speech conversion words are obtained in either of the following ways:

- said part-of-speech conversion words are stored in advance in correspondence with said first word or second word, and are obtained by retrieving them from the storage area in which they are stored; or

- a query request for said first word or second word is sent to a third-party system, and said part-of-speech conversion words are determined from the information returned by said third-party system.

6. The control method according to any one of claims 1 to 5, characterized in that said step d comprises the following step:

- extracting from said first intermediate result all phrases that begin with a word in the first word pattern element set and end with a word in the second word pattern element set, wherein the number of words in each phrase does not exceed a first threshold k.

7. The control method according to claim 6, characterized in that said step d further comprises the following step: counting all middle words of the extracted phrases, and taking the result as the middle word set of said first word pair.

8. The control method according to any one of claims 1 to 7, characterized in that said middle words exclude at least any one or more of the following words:

- stop words;

- proper nouns, wherein said proper nouns include at least names of people, places, and organizations;

- numerals.

9. The control method according to any one of claims 1 to 8, characterized in that the step, within said step d, of counting the words located between said first word pattern element word and said second word pattern element word in said first intermediate result further comprises the following steps:

ii. if said word is a compound word, splitting it;

iii. counting the words obtained from the splitting.

10. The control method according to any one of claims 1 to 9, characterized in that said semantic relatedness measure comprises at least the Gloss Vectors algorithm.

11. A control system for semantically enhanced relation measurement between word pairs, used to evaluate the relational similarity of English word pairs or Chinese word pairs in a corpus, characterized in that said control system carries out the word pair relation measurement process according to the control method of any one of claims 1 to 10.