CN1916889B - Language material storage preparation device and its method - Google Patents
- Publication number
- CN1916889B
- Authority
- CN
- China
- Prior art keywords
- word
- relation
- frequency
- degree
- occurrences
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Abstract
A corpus preparation device consisting of a word extraction unit, a word-frequency calculation unit, a correlation-degree calculation unit, a corpus preparation unit and a relation preparation unit. It is characterized in that the relation preparation unit builds a tree-shaped vertical inclusion-relation structure over the words obtained from the word extraction unit, based on the semantic relations between the words.
Description
Technical field
The present invention relates to a device and method for producing a corpus; more particularly, it relates to a corpus-producing device and method capable of analyzing the semantic relations, statistical correlation relations and similarity relations between words.
Background technology
Today, the blending of all kinds of information provides people with convenient, fast and effective information, but it also raises a problem: how to organize, manage and ultimately make effective use of this information. At present, commonly used information storage methods include dictionary-based methods and knowledge-base-based methods.
A corpus is a repository for storing linguistic data; the large amount of linguistic data it contains can be widely used in computer retrieval, search and analysis.
Existing corpus-making methods include the dictionary-based method. In this method, segments of text that match words in pre-existing dictionary information are cut out as words. Since most of the words present in the dictionary can be cut out correctly, the corpus rarely contains information that is not a word, so a high-precision corpus can be generated. However, the dictionary-based method requires a large amount of storage space to hold the dictionary, which makes it unsuitable for use on portable devices. Meanwhile, since only words present in the dictionary are cut out, special professional terms and newly coined words cannot be cut out and stored in the corpus as word information. In addition, in the dictionary-based method, information about the relations between words is difficult to quantize, and is therefore difficult to apply in digital devices.
Although the corpora built according to the prior art each have their own characteristics, their common shortcoming is that what is stored in the corpus is generally just words, without reflecting the relations between the words. The information they can provide is therefore limited, and the applications they can support are correspondingly restricted.
Summary of the invention
In view of the problems of the prior art, one object of the present invention is to provide a corpus-producing device that can store as many words as possible in a finite space while being able to analyze the semantic relations, statistical correlation relations and similarity relations between words.
The corpus-producing device of the present invention comprises a word extraction unit, a frequency-of-occurrence calculating part, a degree-of-association calculating part and a corpus preparation part, and is characterized in that it further comprises an inclusion-relation preparing section, wherein: the word extraction unit cuts the training samples to obtain word sequences; the inclusion-relation preparing section builds, for the words obtained by the word extraction unit, a vertical inclusion-relation structure in tree form based on the semantics between the words; the frequency-of-occurrence calculating part calculates the co-occurrence frequency and co-occurrence distance between words; the degree-of-association calculating part calculates the degree of correlation and the similarity between words from the results of the inclusion-relation preparing section and the frequency-of-occurrence calculating part; and the corpus preparation part stores the words and the vertical inclusion relations, degrees of correlation and similarities of the words into a corpus storage section. The vertical inclusion-relation structure represents the semantic superordinate-subordinate inclusion relations between the stored words.
In the corpus preparation device of the present invention, the frequency-of-occurrence calculating part can calculate the degree of correlation between the words (that is, the co-occurrence weight Weight_{w1w2}) by the following formula (1):

Weight_{w1w2} = (freq_{w1w2} / freq_{w1}) × γ / (ave_dis_{w1w2} + γ)   (1)

where freq_{w1w2}, freq_{w1} and ave_dis_{w1w2} denote, respectively, the co-occurrence frequency of word w1 and word w2, the occurrence frequency of word w1, and the average co-occurrence distance between word w1 and word w2, all calculated by the frequency-of-occurrence calculating part; γ is an adjustable parameter.

In addition, in the corpus preparation device of the present invention, the degree-of-association calculating part calculates the similarity sim(w1, w2) between the two words by the following formula (2):

sim(w1, w2) = α × sim_semantic(w1, w2) + β × sim_statistic(w1, w2)   (2)

α, β ∈ (0, 1) and α + β = 1

where sim_semantic(w1, w2) denotes the semantic similarity of word w1 and word w2, sim_statistic(w1, w2) denotes the degree of statistical correlation between word w1 and word w2, and α and β are adjustable parameters.
In addition, in the corpus preparation device of the present invention, the degree-of-association calculating part can calculate the semantic similarity sim_semantic(w1, w2) of word w1 and word w2 by the following formula (3):

sim_semantic(w1, w2) = 1 / Dis_semantic(w1, w2)   (3)

where Dis_semantic(w1, w2) denotes the shortest distance between word w1 and word w2, obtained in the vertical inclusion-relation structure constructed by the inclusion-relation preparing section.

The degree of statistical correlation sim_statistic(w1, w2) between word w1 and word w2 is the degree of correlation Weight_{w1w2} between word w1 and word w2.

The "shortest distance Dis_semantic(w1, w2) between word w1 and word w2" mentioned here means the shortest distance between word w1 and word w2 in the vertical inclusion-relation structure constructed by the inclusion-relation preparing section.
"The occurrence frequency freq_{w1} of word w1" means the total number of times that word w1 (the reference word) appears in the training sample set.

"Co-occurrence" means that, within a window of width w, taking some occurrence of word w1 in a training sample L (L being any sample in the training sample set) as the starting point, the following w words are observed to obtain a word set; if word w2 is found in that set, word w1 and word w2 are said to co-occur within the window of width w.

"Co-occurrence frequency freq_{w1w2}" means the number of times that word w1 and word w2 appear simultaneously within a certain preset window width in the training sample set.

"Co-occurrence distance dis_{w1w2}" means the positional distance of word w2 from word w1 when the two appear simultaneously within the preset window width.

"Average co-occurrence distance ave_dis_{w1w2}" means

ave_dis_{w1w2} = (1 / freq_{w1w2}) × Σ_k (dis_{w1w2})_k

where (dis_{w1w2})_k denotes the co-occurrence distance of the k-th co-occurrence of word w1 and word w2.
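As an illustration, the window-scanning definitions above can be sketched in a few lines of Python. This is an illustrative sketch only; the function and variable names are assumptions, not the patent's own. It counts, for each ordered pair of words, the co-occurrence frequency and the average co-occurrence distance within a window of the given width, together with each word's occurrence frequency:

```python
from collections import defaultdict

def cooccurrence_stats(words, window):
    """Scan a word sequence with a sliding window and collect, per the
    definitions above: freq (occurrence frequency of each word), cofreq
    (co-occurrence frequency of each ordered pair) and ave_dis (average
    co-occurrence distance of each ordered pair)."""
    freq = defaultdict(int)
    cofreq = defaultdict(int)
    dist_sum = defaultdict(int)
    for i, w1 in enumerate(words):
        freq[w1] += 1
        for d in range(1, window + 1):  # observe the next `window` words
            if i + d >= len(words):
                break
            pair = (w1, words[i + d])
            cofreq[pair] += 1
            dist_sum[pair] += d         # positional distance of this co-occurrence
    ave_dis = {p: dist_sum[p] / cofreq[p] for p in cofreq}
    return dict(freq), dict(cofreq), ave_dis
```

For example, scanning the sequence "a b a b" with window width 2 yields a co-occurrence frequency of 2 for the pair (a, b), with average co-occurrence distance 1.0.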
In addition, in the corpus preparation device of the present invention, the frequency-of-occurrence calculating part can calculate the relation number k_i of word w_i by the following formula (4):

k_i = k × wfreq_i / avg_wfreq   (4)

where avg_wfreq denotes the average occurrence frequency of all words in the corpus, wfreq_i denotes the occurrence frequency of word w_i, and k denotes the average relation number of all words in the corpus.

When the total number of relations of word w_i exceeds δ × k_i, where δ is a predefined buffering coefficient greater than 1, the relation of word w_i with the word w_j of minimum relation weight is pruned; this relation weight Weight(Relation) is calculated by the following formula (5):

Weight(Relation) = freq_{wiwj} × Weight_{wiwj}   (5)

where freq_{wiwj} denotes the co-occurrence frequency of word w_i and word w_j, and Weight_{wiwj} denotes the co-occurrence weight of word w_i and word w_j.
Another object of the present invention is to provide a corpus preparation method. This corpus preparation method comprises the following steps:
Word extraction step: cut the content of the training samples to obtain word sequences;
Inclusion-relation making step: based on the semantics between words, build a vertical inclusion-relation structure in tree form over the words obtained in the word extraction step;
Frequency-of-occurrence calculation step: calculate the occurrence frequency of each word, and the co-occurrence frequency, co-occurrence distance and average co-occurrence distance between each pair of words;
Degree-of-correlation and similarity calculation step: calculate the degree of correlation between two words from the results of the frequency-of-occurrence calculation step, and then calculate the similarity between the two words from that degree of correlation and the vertical inclusion-relation structure built in the inclusion-relation making step;
Corpus preparation step: construct the corpus from the words obtained in the above steps together with the inclusion relations, degrees of correlation and similarities between them, stored as records.
According to the corpus preparation method of the present invention, in its degree-of-correlation and similarity calculation step, the degree of correlation between two words (that is, the co-occurrence weight Weight_{w1w2}) can be calculated as follows:

Weight_{w1w2} = (freq_{w1w2} / freq_{w1}) × γ / (ave_dis_{w1w2} + γ)   (1)

where freq_{w1w2}, freq_{w1} and ave_dis_{w1w2} denote, respectively, the co-occurrence frequency of word w1 and word w2, the occurrence frequency of word w1, and the average co-occurrence distance between word w1 and word w2, all obtained from the frequency-of-occurrence calculation step; γ is an adjustable parameter.
The semantic similarity sim_semantic(w1, w2) of word w1 and word w2 can be calculated by the following formula (3):

sim_semantic(w1, w2) = 1 / Dis_semantic(w1, w2)   (3)

where Dis_semantic(w1, w2) denotes the shortest distance between word w1 and word w2, obtained in the vertical inclusion-relation structure established in the inclusion-relation making step.

The degree of statistical correlation sim_statistic(w1, w2) between word w1 and word w2 is the degree of correlation between word w1 and word w2.
In addition, the corpus preparation method of the present invention can also comprise a pruning step: calculate the relation number k_i of word w_i by the following formula (4):

k_i = k × wfreq_i / avg_wfreq   (4)

where avg_wfreq denotes the average occurrence frequency of all words in the corpus, wfreq_i denotes the occurrence frequency of word w_i, and k denotes the average relation number of all words in the corpus.

When the total number of relations of word w_i exceeds δ × k_i, where δ is a predefined buffering coefficient greater than 1, the relation of word w_i with the word w_j of minimum relation weight is pruned; this relation weight Weight(Relation) is calculated by the following formula (5):

Weight(Relation) = freq_{wiwj} × Weight_{wiwj}   (5)

where freq_{wiwj} denotes the co-occurrence frequency of word w_i and word w_j, and Weight_{wiwj} denotes the co-occurrence weight of word w_i and word w_j.
The corpus preparation device and method of the present invention do not require the large storage space needed to hold a dictionary. When storing words, they analyze not only the horizontal relations between words (the statistical correlation relations) but also, at the same time, the vertical relations between words (the semantic superordinate-subordinate inclusion relations), and on this basis analyze the similarity between words from both the horizontal and the vertical relations. That is, the corpus obtained with the preparation device and method of the present invention simultaneously contains a vertical inclusion-relation structure, a relation network and a similarity network between words. A corpus made according to the present invention can therefore not only organize various kinds of information organically, but also makes it easier to classify information according to the user's requirements and to find information of individual interest within massive data. The corpus made in this way can thus be used in applications such as information retrieval, information extraction, training-sample classification and intelligent television program selection.
In addition, according to the corpus preparation device and method of the present invention, as training samples increase and the relation network in the corpus keeps expanding, the present invention adopts a suitable pruning scheme that lightens the burden on the corpus's physical space, so as to maintain the efficiency of word storage and of the analysis of the degree of correlation and similarity between words.
In addition, according to the corpus preparation device and method of the present invention, thanks to the specific storage structure of the relation network and the use of the pruning algorithm, the words kept in the corpus can be updated dynamically. That is, when a word that already exists in the corpus appears in a new training sample, the new sample may introduce new relations for that word; when the total number of relations of the word exceeds the pruning threshold, its relations are pruned according to the scheme above. Weak relations are thus eliminated as new relations are introduced, and the corpus can be updated dynamically with the training samples while keeping its capacity within a certain range.
Description of drawings
Fig. 1 is a structural schematic diagram of an embodiment of the corpus preparation device of the present invention;
Fig. 2 is a workflow diagram of the word extraction unit of this embodiment;
Fig. 3 shows the vertical inclusion-relation structure between words constructed by the inclusion-relation preparing section of this embodiment;
Fig. 4 is a flowchart of the basic processing of the frequency-of-occurrence calculating part of this embodiment;
Fig. 5 is a flowchart of the similarity calculation performed by the degree-of-association calculating part of this embodiment;
Fig. 6 is a structural diagram of the corpus obtained in this embodiment;
Fig. 7 is an example of the vertical inclusion-relation structure obtained by the inclusion-relation preparing section of this embodiment;
Fig. 8 is an example of the structural diagram of the corpus obtained in this embodiment.
Embodiment
Hereinafter, the present invention is explained with reference to the embodiments shown in the accompanying drawings.
Fig. 1 is a structural schematic diagram of an embodiment of the corpus preparation device of the present invention, in which the corpus preparation device is denoted by reference numeral 100. This corpus preparation device 100 comprises a word extraction unit 104, an inclusion-relation preparing section 106, a frequency-of-occurrence calculating part 108, a degree-of-association calculating part 110 and a corpus preparation part 112.
Each of the above parts is described in detail below.
The word extraction unit 104 is mainly used to perform lexical analysis on the training samples 102; it cuts the content of the training samples with a natural-language-processing tool to obtain word sequences. In a Chinese information-processing system, a self-learning method can be used to cut the training samples. This method can, for example, use repeated iterations of the EM (Expectation-Maximization) algorithm to finally obtain, based on the maximum-likelihood principle, the best cutting of the training samples.
Fig. 2 shows the processing flow of this method in this embodiment. The training sample 102 that is read in passes through the illegal-character processing module 204 in the word extraction unit 104, which extracts the legal characters and stores them in a temporary training sample; then, on the one hand, the training-sample cutting module 208 cuts the training sample by looking up records in the database, and on the other hand the self-learning module 206 uses this sample to update the database appropriately.
The inclusion-relation preparing section 106 is used to build the vertical inclusion-relation structure between words. This vertical inclusion-relation structure is in fact obtained from the semantic superordinate-subordinate inclusion relations between concept words. Fig. 3 shows this vertical relation, represented as a tree structure. On such a semantic tree, the inclusion relation between nodes is represented as a parent-child relation; in other words, the word represented by a parent node (Fa_cnpt) semantically includes the word represented by a child node (Son_cnpt). The key to training the vertical inclusion-relation structure is to organize a semantic forest, which in turn contains many semantic trees. This requires linguistic knowledge; semantic trees can be obtained from a synonym dictionary or by expert classification. In this embodiment, the building of the semantic trees draws on an expert classification (HowNet) and is obtained by manual classification.
In this way, the vertical inclusion-relation structure of the corpus is constructed.
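The vertical inclusion-relation structure and the shortest distance Dis_semantic it yields for formula (3) can be sketched as follows. This is an illustrative sketch, not the patent's implementation: the child-to-parent dictionary representation and the function name are assumptions, and the convention of counting the nodes on the connecting path (so that a parent-child pair has distance 2, matching the worked example later in the description) is an assumed reading of the patent:

```python
def dis_semantic(parent, w1, w2):
    """parent: {child_word: parent_word}, describing a forest of semantic
    trees. Returns the number of nodes on the path from w1 to w2 through
    their lowest common ancestor, or None if they share no tree."""
    def chain(w):
        depth = {w: 1}      # node count from the start word, inclusive
        d = 1
        while w in parent:
            w = parent[w]
            d += 1
            depth[w] = d
        return depth
    d1, d2 = chain(w1), chain(w2)
    common = [w for w in d1 if w in d2]
    if not common:
        return None         # the two words lie in different semantic trees
    # nodes on w1 -> ancestor -> w2, counting the shared ancestor once
    return min(d1[w] + d2[w] - 1 for w in common)
```

Under this convention a parent-child pair gives distance 2 (semantic similarity 0.5 by formula (3)), and siblings under one parent give distance 3.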
The frequency-of-occurrence calculating part 108 is used to calculate the co-occurrence distance and co-occurrence frequency between words. Its basic processing flow is shown in Fig. 4. First, the frequency-of-occurrence calculating part 108 receives the result of the word extraction unit 104, i.e., the word sequences. A window of preset width w is used: if two words appear in the window at the same time, this counts as one co-occurrence of the two words, and the interval between them is the co-occurrence distance.
Based on the co-occurrence distance and co-occurrence frequency between words, the frequency-of-occurrence calculating part 108 calculates the co-occurrence weight Weight_{w1w2} between words, that is, the degree of correlation, by formula (1):

Weight_{w1w2} = (freq_{w1w2} / freq_{w1}) × γ / (ave_dis_{w1w2} + γ)   (1)

where freq_{w1w2}, freq_{w1} and ave_dis_{w1w2} denote, respectively, the co-occurrence frequency of word w1 and word w2, the occurrence frequency of word w1, and the average co-occurrence distance between word w1 and word w2, as calculated by the frequency-of-occurrence calculating part 108; γ is an adjustable parameter.
In addition, there are many relations in the corpus, and as training samples increase, the relation network in the corpus keeps expanding, making the burden on physical space quite heavy. A pruning algorithm is therefore needed to control the expansion of the space. The relation-pruning algorithm adopted in the frequency-of-occurrence calculating part 108 shown in Fig. 4 is as follows:

k_i = k × wfreq_i / avg_wfreq   (4)

where avg_wfreq is the average occurrence frequency of all words in the corpus, wfreq_i is the occurrence frequency of word w_i, and k is the average relation number of all words in the corpus. Pruning is a dynamic process: when the total number of relations of a word exceeds the threshold δ × k_i (δ being a predefined buffering coefficient greater than 1), its relations are pruned. The objects of pruning are the relations of minimum relation weight, calculated as follows:

Weight(Relation) = freq_{wiwj} × Weight_{wiwj}   (5)

where freq_{wiwj} denotes the co-occurrence frequency of word w_i and word w_j, and Weight_{wiwj} denotes the co-occurrence weight of word w_i and word w_j.
In this way, the relation network of the corpus is constructed based on the processing of each of the above parts.
Next, the degree-of-association calculating part 110 calculates the similarity between words. The similarity calculation is illustrated with reference to Fig. 5. First, the shortest distance Dis_semantic(w1, w2) between the two words is obtained from the vertical inclusion-relation structure in the corpus (502). Then, from the obtained shortest distance Dis_semantic(w1, w2), the semantic similarity sim_semantic(w1, w2) of word w1 and word w2 is calculated (504). Then, based on the relation network of the corpus, the degree of correlation sim_statistic(w1, w2) of word w1 and word w2 is calculated (506). Then, from the results of steps 504 and 506, the similarity sim(w1, w2) between the two words is calculated by formula (2):

sim(w1, w2) = α × sim_semantic(w1, w2) + β × sim_statistic(w1, w2)   (2)

α, β ∈ (0, 1) and α + β = 1

where α and β are adjustable parameters.
In this way, the similarity network of the corpus is constructed by the above processing.
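The two formulas used in this flow can be sketched directly. This is an illustrative sketch under the reconstruction of formula (1) given above (the function names are assumptions, not the patent's own):

```python
def cooccurrence_weight(cofreq, freq_w1, ave_dis, gamma):
    """Degree of correlation (co-occurrence weight) per formula (1):
    (freq_w1w2 / freq_w1) * gamma / (ave_dis_w1w2 + gamma)."""
    return cofreq / freq_w1 * gamma / (ave_dis + gamma)

def overall_similarity(sim_semantic, sim_statistic, alpha, beta):
    """Overall similarity per formula (2); alpha and beta must sum to 1."""
    assert abs(alpha + beta - 1.0) < 1e-9
    return alpha * sim_semantic + beta * sim_statistic
```

With γ = 0.5, a pair that co-occurred once at average distance 4 for a word of frequency 10 gets weight (1/10) × 0.5/4.5 ≈ 0.0111, matching the worked example below.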
The corpus preparation part 112 takes as input the vertical inclusion-relation structure, the relation network and the similarity network of the corpus, constructed and output by the inclusion-relation preparing section, the frequency-of-occurrence calculating part and the degree-of-association calculating part, and saves them into the corpus storage section 114.
Fig. 6 shows the structural diagram of the corpus finally obtained. In Fig. 6, each node represents a word; the left side shows the vertical inclusion relations and the right side the horizontal correlation relations, with dotted lines indicating that the same node is drawn separately: the nodes connected by a dotted line are in fact different components of the same node. The left part is similar to Fig. 3 and is not repeated here. In the right part of the figure, the upper connections represent correlation relations and the lower connections represent similarity relations. The upper correlation connections link related words and are labeled with the relevant frequency and distance; the lower similarity connections link similar words and are labeled with the degree of similarity.
An embodiment of the corpus preparation method of the present invention can use the above embodiment of the corpus preparation device of the present invention, realizing its word extraction step, inclusion-relation making step, frequency-of-occurrence calculation step, degree-of-correlation and similarity calculation step, and corpus preparation step in the manner shown in Fig. 2, Fig. 3, Fig. 4 and Fig. 5, to obtain a corpus that, as shown in Fig. 6, simultaneously has a vertical inclusion-relation structure, a relation network and a similarity network between words.
(Example)
The flow of corpus preparation according to the present invention is illustrated below with an example.
In this example, the training sample is the following piece of text:
European men's gymnastics championship closes
Xinhua News Agency, Lausanne, May 27 (reporter Shi Guangyao): After 3 days of contention, the 19th European men's gymnastics championship finished in Lausanne, Switzerland, on the afternoon of the 27th. The Soviet players, in strong form, seized 6 of all 8 gold medals (1 of them shared). Soviet star Mo Jilini obtained 3 gold medals, in the individual all-around, pommel horse and parallel bars (shared), while Suo Heerbo gained the floor exercise, vault and horizontal bar titles. Swiss player Ji Beier and Italian player Kai Ji obtained the parallel bars and rings titles respectively. 73 athletes from 25 European countries participated in this championship. (End)
The word extraction unit 104 uses a word-segmentation tool to cut the content of the article into independent words, mainly extracting nouns. The output result is as follows:
Europe man's gymnastics match Lausanne reporter of Xinhua News Agency shines European man's gymnastics championship and contends the Lausanne, SUI curtain Soviet Union player strong wind gold medal Soviet Union star individual all-round competition pommel horse parallel bars gold medal horse-vaulting horizontal bar champion Switzerland player Bel Italy player parallel bars champion of the swinging rings sportsman of European countries match
Based on the output of the word extraction unit, and drawing on the expert classification (HowNet), the inclusion-relation preparing section 106 outputs the result shown in Fig. 7, i.e., the vertical inclusion-relation structure.
The frequency-of-occurrence calculating part 108 receives the set of cut words and scans the training sample through a window of preset width w. If two words appear in the window at the same time, the two words are considered to have co-occurred once, and the interval between them is the co-occurrence distance. Through statistics, for each word taken as a keyword, the average co-occurrence distance and co-occurrence frequency of the other words related to that keyword are obtained.
In the following table, "KEY" denotes the keyword, "REL_NODE" denotes a related node, "frequency" denotes the co-occurrence frequency, and "ave_dis" denotes the average co-occurrence distance.
KEY: man
REL_NODE[1]=gymnastics match ave_dis=1.000000 frequency=1
REL_NODE[2]=Xinhua News Agency ave_dis=2.000000 frequency=1
REL_NODE[3]=Lausanne ave_dis=4.000000 frequency=2
REL_NODE[4]=reporter ave_dis=4.000000 frequency=1
REL_NODE[5]=shine ave_dis=5.000000 frequency=1
REL_NODE[6]=gymnastics ave_dis=1.000000 frequency=1
REL_NODE[7]=championship ave_dis=2.000000 frequency=1
REL_NODE[8]=contention ave_dis=3.000000 frequency=1
REL_NODE[9]=Switzerland ave_dis=4.000000 frequency=1
As can be seen from the above table, for example, the average co-occurrence distance of " man " and " reporter " is 4.000000, and co-occurrence frequency is 1.
The degree-of-association calculating part 110 calculates the degree of correlation between two words and the similarity between two words. First, the degree of correlation between words is calculated from the average co-occurrence distance and co-occurrence frequency between words obtained by the statistics of the frequency-of-occurrence calculating part, using formula (1):

Weight_{w1w2} = (freq_{w1w2} / freq_{w1}) × γ / (ave_dis_{w1w2} + γ)   (1)

where ave_dis_{w1w2} is the mean distance ave_dis in the table and γ is an adjustable parameter. In this way the degree of correlation between two words is obtained.

Taking γ = 0.5: as above, the mean distance of "man" and "reporter" is 4.000000 and their co-occurrence frequency is 1; if the occurrence frequency of "man" at this moment is 10, the degree of correlation of the two words is then (1/10) × 0.5/(4 + 0.5) ≈ 0.01111.
According to the similarity calculation method described above, based on the vertical inclusion-relation structure and the correlation network of the above corpus, formula (2) is used to calculate the similarity in this example:

sim(w1, w2) = α × sim_semantic(w1, w2) + β × sim_statistic(w1, w2)   (2)

α, β ∈ (0, 1) and α + β = 1
In the vertical inclusion relations, for example, if two words have a parent-child relation, the shortest distance Dis_semantic(w1, w2) between them can be taken as 2, and their semantic similarity sim_semantic(w1, w2) is then 0.5. sim_statistic(w1, w2) denotes the degree of statistical correlation of w1 and w2, which is the degree of correlation calculated by formula (1).

For example, taking α = 0.4 and β = 0.6, the similarity between "man" and "reporter" is calculated. Since "man" and "reporter" have no parent-child relation, sim_semantic(w1, w2) is 0, and sim_statistic(w1, w2) is the 0.01111 calculated by formula (1). The similarity between the two words is then obtained with formula (2): sim(w1, w2) = 0.4 × 0 + 0.6 × 0.01111 = 0.006666.
The corpus preparation part 112 then saves records conforming to the corpus structure, for example the keyword obtained from the statistics above, "man", its related word "reporter", and their similarity 0.006666, into the corpus storage section 114 in the form of database records. The corpus structure constructed in this example is shown in Fig. 8. In this way, the records of words and of keywords with their related words, together with the vertical inclusion-relation structure and the corresponding relation network and similarity network, constitute the corpus. Whenever the degree of correlation or similarity of two words is needed, it is read from this corpus.
In addition, suppose that a new relation (new relation) "man"-"football" is added at this moment and the pruning condition is triggered (that is, the total number of relations of "man" exceeds the threshold δ × k_i). With window width w = 6 and γ = 3, and with ave_dis = 1 and frequency = 1 at this moment, the weight of the new relation by formula (5) is 1 × (1/10) × (3/(1+3)) = 0.075, while the relation weight of "man" and "Switzerland" at this moment is Weight(Relation) = 1 × (1/10) × (3/(4+3)) ≈ 0.043. This relation is therefore pruned, and the new relation "man"-"football" is added to the corpus. The corpus is thus updated.
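Under the same reading of formula (5), the comparison driving this update can be checked numerically. A sketch; note that the 0.075 weight for the new relation is computed here under that reading, not stated explicitly in the text:

```python
gamma, freq_man = 3, 10
# new relation "man"-"football": co-occurred once at ave_dis = 1
w_new = 1 * (1 / freq_man) * gamma / (1 + gamma)  # 0.075
# existing relation "man"-"Switzerland": co-occurred once at ave_dis = 4
w_old = 1 * (1 / freq_man) * gamma / (4 + gamma)  # ~0.043
assert w_old < w_new  # "man"-"Switzerland" is the weaker relation and is pruned
```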
The above example merely illustrates one embodiment of the present invention; the present invention can also be carried out in other, modified implementations. The corpus preparation device can be built around a processor as its core device, and the corpus produced can be realized with commonly used storage devices such as hard disks and magnetic disks.
The corpus preparation device and method of the present invention have been explained in detail above. Modifications and improvements made by those skilled in the art within the spirit and scope of the present invention shall be included within the scope defined by the appended claims of the present invention.
Claims (12)
1. A corpus preparation device comprising a word extraction unit, a frequency-of-occurrence calculating part, a degree-of-association calculating part and a corpus preparation part, characterized in that the corpus preparation device further comprises an inclusion-relation preparing section, wherein:
the word extraction unit cuts training samples to obtain word sequences;
the inclusion-relation preparing section builds, for the words obtained by the word extraction unit, a vertical inclusion-relation structure in tree form based on the semantics between the words;
the frequency-of-occurrence calculating part calculates the co-occurrence frequency and co-occurrence distance between words;
the degree-of-association calculating part calculates the degree of correlation between words from the calculation results of the frequency-of-occurrence calculating part, and then calculates the similarity between words from the vertical inclusion-relation structure built by the inclusion-relation preparing section and the degree of correlation;
the corpus preparation part stores the words and the vertical inclusion relations, degrees of correlation and similarities of the words into a corpus storage section.
2. The corpus preparation device according to claim 1, wherein said frequency-of-occurrence calculation unit calculates the degree of correlation between said words according to formula (1).
3. The corpus preparation device according to claim 2, wherein said degree-of-association calculation unit calculates the similarity sim(w1, w2) between two words according to the following formula (2):

sim(w1, w2) = α·sim_semantic(w1, w2) + β·sim_statistic(w1, w2)    (2)

α, β ∈ (0, 1) and α + β = 1

where sim_semantic(w1, w2) denotes the semantic similarity between word w1 and word w2, sim_statistic(w1, w2) denotes the degree of statistical correlation between said word w1 and word w2, and α and β are adjustable parameters.
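Formula (2) can be sketched as follows (an illustrative sketch, not part of the claim; the function name and example inputs are invented):

```python
def combined_similarity(sim_semantic, sim_statistic, alpha=0.5, beta=0.5):
    """Weighted combination of semantic and statistical similarity per
    formula (2); alpha and beta are adjustable parameters that sum to 1."""
    assert 0 < alpha < 1 and 0 < beta < 1
    assert abs(alpha + beta - 1.0) < 1e-9
    return alpha * sim_semantic + beta * sim_statistic

# Example with invented similarity values 0.8 (semantic) and 0.4 (statistical):
print(combined_similarity(0.8, 0.4, alpha=0.7, beta=0.3))  # ~0.68
```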
4. The corpus preparation device according to claim 3, wherein said degree-of-association calculation unit calculates the semantic similarity sim_semantic(w1, w2) between said word w1 and word w2 according to the following formula (3):

sim_semantic(w1, w2) = 1/Dis_semantic(w1, w2)    (3)

where Dis_semantic(w1, w2) denotes the shortest distance between word w1 and word w2 obtained in said vertical inclusion-relation structure built by said inclusion-relation preparation unit.
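A minimal sketch of formula (3), taking the semantic distance as the shortest path length between two words in the tree-form inclusion structure (the toy tree, its words, and its edges are invented for illustration and are not part of the claim):

```python
# Toy vertical inclusion tree (child -> parent); invented for illustration.
parent = {"man": "person", "woman": "person", "person": "entity",
          "football": "sport", "sport": "entity"}

def path_to_root(word):
    """Chain of words from `word` up to the tree root."""
    path = [word]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path

def semantic_distance(w1, w2):
    """Shortest path length between w1 and w2 through their
    lowest common ancestor in the inclusion tree."""
    p1, p2 = path_to_root(w1), path_to_root(w2)
    depth_in_p1 = {w: i for i, w in enumerate(p1)}
    for j, w in enumerate(p2):
        if w in depth_in_p1:
            return depth_in_p1[w] + j
    return float("inf")  # no common ancestor

def semantic_similarity(w1, w2):
    return 1 / semantic_distance(w1, w2)  # formula (3)

print(semantic_similarity("man", "woman"))     # 0.5  (distance 2 via "person")
print(semantic_similarity("man", "football"))  # 0.25 (distance 4 via "entity")
```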
5. The corpus preparation device according to claim 3, wherein the degree of statistical correlation sim_statistic(w1, w2) between said word w1 and word w2 is the degree of correlation between said word w1 and word w2.
6. The corpus preparation device according to claim 1, wherein said frequency-of-occurrence calculation unit calculates the relation number k_i of word w_i according to formula (4), in which the average frequency of occurrence of all words in said corpus, the frequency of occurrence wfreq_i of word w_i, and the average relation number k of all words in said corpus are used;

when the total number of relations of word w_i exceeds the threshold δ·k_i, where δ is a predefined buffering coefficient greater than 1, the relation of word w_i with the word w_j of minimum relation weight is reduced, the relation weight Weight(Relation) being calculated according to formula (5) from the co-occurrence frequency of word w_i with word w_j and the co-occurrence weight of word w_i with word w_j.
7. A corpus preparation method, characterized by comprising the following steps:
a word extraction step: segmenting the content of training samples to obtain word sequences;
an inclusion-relation preparation step: building, based on the semantics between words, a vertical inclusion-relation structure in tree form for the words obtained in the word extraction step;
a frequency-of-occurrence calculation step: calculating the frequency of occurrence of each word and the co-occurrence frequency, co-occurrence distance, and average co-occurrence distance between two words;
a correlation and similarity calculation step: calculating the degree of correlation between two words from the calculation results of said frequency-of-occurrence calculation step, and then calculating the similarity between the two words according to said degree of correlation and the vertical inclusion-relation structure built in said inclusion-relation preparation step;
a corpus preparation step: constructing a corpus with the words obtained in the above steps, the vertical inclusion-relation structure, and the degrees of correlation and similarities between them as records.
8. The corpus preparation method according to claim 7, wherein in said correlation and similarity calculation step the degree of correlation between said two words is calculated according to formula (1).
9. The corpus preparation method according to claim 8, wherein in said correlation and similarity calculation step the similarity sim(w1, w2) between said two words is calculated according to the following formula (2):

sim(w1, w2) = α·sim_semantic(w1, w2) + β·sim_statistic(w1, w2)    (2)

α, β ∈ (0, 1) and α + β = 1

where sim_semantic(w1, w2) denotes the semantic similarity between word w1 and word w2, sim_statistic(w1, w2) is the degree of statistical correlation between said word w1 and word w2, and α and β are adjustable parameters.
10. The corpus preparation method according to claim 9, wherein the semantic similarity sim_semantic(w1, w2) between said word w1 and word w2 is calculated according to the following formula (3):

sim_semantic(w1, w2) = 1/Dis_semantic(w1, w2)    (3)

where Dis_semantic(w1, w2) denotes the shortest distance between word w1 and word w2 obtained in said vertical inclusion-relation structure built in said inclusion-relation preparation step.
11. The corpus preparation method according to claim 9, wherein the degree of statistical correlation sim_statistic(w1, w2) between said word w1 and word w2 is the degree of correlation between said word w1 and word w2.
12. The corpus preparation method according to claim 7, further comprising a reduction step: calculating the relation number k_i of word w_i according to formula (4), in which the average frequency of occurrence of all words in said corpus, the frequency of occurrence wfreq_i of word w_i, and the average relation number k of all words in said corpus are used;

when the total number of relations of word w_i exceeds the threshold δ·k_i, where δ is a predefined buffering coefficient greater than 1, reducing the relation of word w_i with the word w_j of minimum relation weight, the relation weight Weight(Relation) being calculated according to formula (5).
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2005100932280A CN1916889B (en) | 2005-08-19 | 2005-08-19 | Language material storage preparation device and its method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN1916889A CN1916889A (en) | 2007-02-21 |
CN1916889B true CN1916889B (en) | 2011-02-02 |
Family
ID=37737887
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN2005100932280A Expired - Fee Related CN1916889B (en) | 2005-08-19 | 2005-08-19 | Language material storage preparation device and its method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN1916889B (en) |
Families Citing this family (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP5382651B2 (en) * | 2009-09-09 | 2014-01-08 | 独立行政法人情報通信研究機構 | Word pair acquisition device, word pair acquisition method, and program |
CN102591862A (en) * | 2011-01-05 | 2012-07-18 | 华东师范大学 | Control method and device of Chinese entity relationship extraction based on word co-occurrence |
CN102609424B (en) * | 2011-01-21 | 2014-10-08 | 日电(中国)有限公司 | Method and equipment for extracting assessment information |
CN104077295A (en) * | 2013-03-27 | 2014-10-01 | 百度在线网络技术(北京)有限公司 | Data label mining method and data label mining system |
CN105608083B (en) * | 2014-11-13 | 2019-09-03 | 北京搜狗科技发展有限公司 | Obtain the method, apparatus and electronic equipment of input magazine |
US10198471B2 (en) * | 2015-05-31 | 2019-02-05 | Microsoft Technology Licensing, Llc | Joining semantically-related data using big table corpora |
CN106202311B (en) * | 2016-06-30 | 2020-03-10 | 北京奇艺世纪科技有限公司 | File clustering method and device |
CN106202380B (en) * | 2016-07-08 | 2019-12-24 | 中国科学院上海高等研究院 | Method and system for constructing classified corpus and server with system |
CN108197120A (en) * | 2017-12-28 | 2018-06-22 | 中译语通科技(青岛)有限公司 | A kind of similar sentence machining system based on bilingual teaching mode |
CN110321404B (en) * | 2019-07-10 | 2021-08-10 | 北京麒才教育科技有限公司 | Vocabulary entry selection method and device for vocabulary learning, electronic equipment and storage medium |
CN110334215B (en) * | 2019-07-10 | 2021-08-10 | 北京麒才教育科技有限公司 | Construction method and device of vocabulary learning framework, electronic equipment and storage medium |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1110882A (en) * | 1993-06-18 | 1995-10-25 | 欧洲佳能研究中心有限公司 | Methods and apparatuses for processing a bilingual database |
CN1116342A (en) * | 1994-07-08 | 1996-02-07 | 唐武 | Chinese automatic proofreading method and system thereof |
CN1387651A (en) * | 1999-11-05 | 2002-12-25 | 微软公司 | System and iterative method for lexicon, segmentation and language model joint optimization |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | ||
Granted publication date: 20110202; Termination date: 20180819