CN1916889B - Language material storage preparation device and its method - Google Patents

Language material storage preparation device and its method

Info

Publication number
CN1916889B
CN1916889B (application CN2005100932280A)
Authority
CN
China
Prior art keywords
word
relation
frequency
degree
occurrences
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN2005100932280A
Other languages
Chinese (zh)
Other versions
CN1916889A (en)
Inventor
伊藤荣朗
桑原祯司
黑田昌芳
虞立群
陈奕秋
汪更正
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Hitachi Ltd
Original Assignee
Shanghai Jiaotong University
Hitachi Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University and Hitachi Ltd
Priority to CN2005100932280A
Publication of CN1916889A
Application granted
Publication of CN1916889B
Status: Expired - Fee Related
Anticipated expiration

Landscapes

  • Machine Translation (AREA)

Abstract

A corpus (word library) preparation device consists of a word extraction unit, an occurrence-frequency calculation unit, a relatedness calculation unit, a corpus preparation unit and a relation preparation unit. It is characterized in that the relation preparation unit builds, for the words obtained from the word extraction unit, a vertical inclusion-relation structure in tree form based on the semantics of the words.

Description

Language material storage preparation device and method thereof
Technical field
The present invention relates to a device and method for producing a corpus (a language material store), and more particularly to a corpus producing device and method capable of analyzing the semantic relations, statistical relatedness relations and similarity relations between words.
Background art
Today, all kinds of information come together to give people convenient, fast and effective access to information, but this also raises a problem: how to organize and manage that information effectively, and finally put it to effective use. The information storage methods in common use at present are dictionary-based methods and knowledge-base-based methods.
A corpus is a repository for storing linguistic data; the large amount of linguistic data inside it can be widely used in computerized retrieval, search and analysis.
Existing methods for building a corpus include the dictionary-based method. In this method, the segments of text that match words in a pre-existing dictionary are cut out as words. Because most of the words present in the dictionary can be cut out correctly, the corpus seldom contains information that is not a word, so a high-precision corpus can be generated. However, the dictionary-based method needs a large amount of storage space to hold the dictionary, which makes it ill-suited to portable devices. At the same time, because only the words present in the dictionary are cut, specialized technical terms and newly coined words cannot be cut out and entered in the corpus as word information. In addition, in the dictionary-based method it is difficult to quantify information about the relations between words, and therefore difficult to apply such information in digital equipment.
Although the corpora built according to the prior art each have their own strengths, their common shortcoming is that what is stored in a corpus is generally just words, with no reflection of the relations between words; the information such a corpus can provide is therefore limited, and the applications it can support are correspondingly restricted.
Summary of the invention
In view of the problems of the prior art, one object of the present invention is to provide a corpus producing device that can store as many words as possible in a finite space and can analyze the semantic relations, statistical relatedness relations and similarity relations between words.
The corpus preparation device (language material storage preparation device) of the present invention comprises a word extraction unit, an occurrence-frequency calculation unit, a relatedness calculation unit and a corpus preparation unit, and is characterized in that it further comprises an inclusion-relation preparation unit, wherein: the word extraction unit segments the training samples to obtain word sequences; the inclusion-relation preparation unit builds, based on the semantics of the words, a vertical inclusion-relation structure in tree form over the words obtained by the word extraction unit; the occurrence-frequency calculation unit calculates the co-occurrence frequencies and co-occurrence distances between words; the relatedness calculation unit calculates the relatedness and similarity between words from the results of the inclusion-relation preparation unit and the occurrence-frequency calculation unit; and the corpus preparation unit stores the words together with their vertical inclusion relations, relatedness values and similarities in a corpus storage unit. The vertical inclusion-relation structure represents the semantic hypernym-hyponym (superordinate-subordinate concept) inclusion relations among the stored words.
In the corpus preparation device of the present invention, the occurrence-frequency calculation unit can calculate the relatedness \overline{rel}_{w_1 w_2} between words (that is, the co-occurrence weight Weight_{w_1 w_2}) by the following formula (1):

\overline{rel}_{w_1 w_2} = \frac{freq_{w_1 w_2}}{freq_{w_1}} \times \frac{\gamma}{\overline{dist}_{w_1 w_2} + \gamma}    (1)

where freq_{w_1 w_2}, freq_{w_1} and \overline{dist}_{w_1 w_2} denote, respectively, the co-occurrence frequency of word w_1 and word w_2, the occurrence frequency of word w_1, and the average co-occurrence distance between w_1 and w_2, all computed by the occurrence-frequency calculation unit; \gamma is an adjustable parameter.

In addition, in the corpus preparation device of the present invention, the relatedness calculation unit calculates the similarity sim(w_1, w_2) between two words by the following formula (2):

sim(w_1, w_2) = \alpha \cdot sim_{semantic}(w_1, w_2) + \beta \cdot sim_{statistic}(w_1, w_2)    (2)

with \alpha, \beta \in (0, 1) and \alpha + \beta = 1, where sim_{semantic}(w_1, w_2) denotes the semantic similarity of word w_1 and word w_2, sim_{statistic}(w_1, w_2) denotes the statistical relatedness between them, and \alpha and \beta are adjustable parameters.
In addition, in the corpus preparation device of the present invention, the relatedness calculation unit can calculate the semantic similarity sim_{semantic}(w_1, w_2) of word w_1 and word w_2 by the following formula (3):

sim_{semantic}(w_1, w_2) = 1 / Dis_{semantic}(w_1, w_2)    (3)

where Dis_{semantic}(w_1, w_2) denotes the shortest distance between w_1 and w_2 obtained in the vertical inclusion-relation structure built by the inclusion-relation preparation unit.

The statistical relatedness sim_{statistic}(w_1, w_2) between word w_1 and word w_2 is the relatedness \overline{rel}_{w_1 w_2} of w_1 and w_2.
The "shortest distance Dis_{semantic}(w_1, w_2) between word w_1 and word w_2" mentioned here means the shortest distance between w_1 and w_2 in the vertical inclusion-relation structure of words built by the inclusion-relation preparation unit.

The "occurrence frequency freq_{w_1} of word w_1" means the total number of times w_1 (the reference word) occurs in the training sample set.

"Co-occurrence" means the following: within a window of width w, starting from some occurrence of word w_1 in a training sample L (L being any sample in the training set), the words that follow w_1 inside the window are observed, yielding a set of words; if word w_2 is found in this set, then w_1 and w_2 are said to co-occur within the window width w.

"Co-occurrence frequency freq_{w_1 w_2}" means the number of times w_1 and w_2 occur together within a preset window width in the training sample set.

"Co-occurrence distance dist_{w_1 w_2}" means the positional distance of w_2 from w_1 when the two words occur together within the preset window width.

"Average co-occurrence distance \overline{dist}_{w_1 w_2}" means

\overline{dist}_{w_1 w_2} = \frac{1}{freq_{w_1 w_2}} \sum_{k=1}^{freq_{w_1 w_2}} (dist_{w_1 w_2})_k

where (dist_{w_1 w_2})_k denotes the co-occurrence distance of the k-th co-occurrence of w_1 and w_2.
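To make these definitions concrete, here is a minimal sketch in Python (our own illustration rather than the patent's code; the exact window convention, taking the w - 1 words that follow each occurrence, is an assumption consistent with the definitions above):

```python
from collections import defaultdict

def cooccurrence_stats(words, w):
    """Scan a word sequence with window width w and accumulate, for each
    ordered pair (w1, w2), the co-occurrence frequency and distance sum."""
    freq = defaultdict(int)        # freq_w1: occurrences of each word
    cofreq = defaultdict(int)      # freq_w1w2: co-occurrences of the pair
    dist_sum = defaultdict(int)    # summed co-occurrence distances
    for i, w1 in enumerate(words):
        freq[w1] += 1
        # observe the w - 1 words following this occurrence of w1
        for d, w2 in enumerate(words[i + 1:i + w], start=1):
            cofreq[(w1, w2)] += 1
            dist_sum[(w1, w2)] += d    # d: positional distance of w2 from w1
    return freq, cofreq, dist_sum

def relatedness(w1, w2, freq, cofreq, dist_sum, gamma):
    """Formula (1): (freq_w1w2 / freq_w1) * gamma / (avg_dist + gamma)."""
    n = cofreq[(w1, w2)]
    if n == 0:
        return 0.0
    avg_dist = dist_sum[(w1, w2)] / n
    return n / freq[w1] * gamma / (avg_dist + gamma)
```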
In addition, in the corpus preparation device of the present invention, the occurrence-frequency calculation unit can calculate the relation quota k_i (the allowed number of relations) of word w_i by the following formula (4):

k_i = \frac{\lg wfreq_i}{\lg \overline{wfreq}} \times k    (4)

where \overline{wfreq} denotes the average occurrence frequency of all words in the corpus, wfreq_i denotes the occurrence frequency of word w_i, and k denotes the average number of relations per word in the corpus.

When the total number of relations of word w_i exceeds \delta \times k_i, where \delta is a preset buffering coefficient greater than 1, the relation of w_i with the word w_j having the smallest relation weight is pruned. The relation weight Weight(Relation) is calculated by the following formula (5):

Weight(Relation) = freq_{w_i w_j} \times Weight_{w_i w_j}    (5)

where freq_{w_i w_j} denotes the co-occurrence frequency of word w_i and word w_j, and Weight_{w_i w_j} denotes their co-occurrence weight.
Another object of the present invention is to provide a corpus preparation method.
This corpus preparation method comprises the following steps:
Word extraction step: segment the content of the training samples to obtain word sequences;
Inclusion-relation preparation step: based on the semantics of the words, build a vertical inclusion-relation structure in tree form over the words obtained in the word extraction step;
Occurrence-frequency calculation step: calculate the occurrence frequency of each word, and the co-occurrence frequency, co-occurrence distance and average co-occurrence distance between pairs of words;
Relatedness/similarity calculation step: calculate the relatedness between two words from the results of the occurrence-frequency calculation step, and then calculate the similarity between the two words from that relatedness and the vertical inclusion-relation structure built in the inclusion-relation preparation step;
Corpus preparation step: construct the corpus from records containing the words obtained in the above steps together with their inclusion relations, relatedness values and similarities.
According to the corpus preparation method of the present invention, in the relatedness/similarity calculation step the relatedness \overline{rel}_{w_1 w_2} between two words (that is, the co-occurrence weight Weight_{w_1 w_2}) can be calculated as follows:

\overline{rel}_{w_1 w_2} = \frac{freq_{w_1 w_2}}{freq_{w_1}} \times \frac{\gamma}{\overline{dist}_{w_1 w_2} + \gamma}    (1)

where freq_{w_1 w_2}, freq_{w_1} and \overline{dist}_{w_1 w_2} denote, respectively, the co-occurrence frequency of word w_1 and word w_2, the occurrence frequency of word w_1, and the average co-occurrence distance between w_1 and w_2, all obtained in the occurrence-frequency calculation step; \gamma is an adjustable parameter.
The semantic similarity sim_{semantic}(w_1, w_2) of word w_1 and word w_2 can be calculated by the following formula (3):

sim_{semantic}(w_1, w_2) = 1 / Dis_{semantic}(w_1, w_2)    (3)

where Dis_{semantic}(w_1, w_2) denotes the shortest distance between w_1 and w_2 in the vertical inclusion-relation structure built in the inclusion-relation preparation step.

The statistical relatedness sim_{statistic}(w_1, w_2) between word w_1 and word w_2 is the relatedness of w_1 and w_2.
In addition, the corpus preparation method of the present invention can further comprise a pruning step: calculate the relation quota k_i of word w_i by the following formula (4):

k_i = \frac{\lg wfreq_i}{\lg \overline{wfreq}} \times k    (4)

where \overline{wfreq} denotes the average occurrence frequency of all words in the corpus, wfreq_i denotes the occurrence frequency of word w_i, and k denotes the average number of relations per word in the corpus;

when the total number of relations of word w_i exceeds \delta \times k_i, where \delta is a preset buffering coefficient greater than 1, prune the relation of w_i with the word w_j having the smallest relation weight, the relation weight Weight(Relation) being calculated by the following formula (5):

Weight(Relation) = freq_{w_i w_j} \times Weight_{w_i w_j}    (5)

where freq_{w_i w_j} denotes the co-occurrence frequency of word w_i and word w_j, and Weight_{w_i w_j} denotes their co-occurrence weight.
The corpus preparation device and method of the present invention need no large storage space for a dictionary. When storing words they analyze not only the horizontal relations between words (statistical relatedness relations) but also the vertical relations between words (semantic hypernym-hyponym inclusion relations), and on this basis the similarity between words along both the horizontal and the vertical relations. That is, a corpus obtained with the device and method of the present invention simultaneously contains a vertical inclusion-relation structure, a relation network and a similarity network over the words. A corpus made according to the present invention can therefore not only organize various kinds of information organically, but also makes it easier to classify information according to the user's requirements and to find the items of individual interest within massive data. A corpus made in this way can thus be used in applications such as information retrieval, information extraction, training-sample classification and intelligent TV-program selection.
In addition, with the corpus preparation device and method of the present invention, even as the relation network in the corpus keeps expanding with the growth of the training samples, the suitable pruning scheme adopted by the present invention lightens the burden on the corpus's physical space and maintains the efficiency of word storage and of relatedness/similarity analysis between words.
Furthermore, with the corpus preparation device and method of the present invention, the specific storage structure of the relation network and the use of the pruning algorithm give the words kept in the corpus the property of dynamic updating. That is, when a word already present in the corpus appears in a new training sample, the new sample may introduce new relations for that word; once the word's total number of relations exceeds the pruning threshold, its relations are pruned according to the scheme described above. Weak relations are thus eliminated as new relations are introduced, so the corpus that is produced can be updated dynamically from the training samples while staying within a certain capacity.
Description of drawings
Fig. 1 is a schematic structural diagram of an embodiment of the corpus preparation device of the present invention;
Fig. 2 is the workflow of the word extraction unit of this embodiment;
Fig. 3 shows the vertical inclusion-relation structure between words built by the inclusion-relation preparation unit of this embodiment;
Fig. 4 is the basic processing flowchart of the occurrence-frequency calculation unit of this embodiment;
Fig. 5 is the flowchart by which the relatedness calculation unit of this embodiment calculates similarity;
Fig. 6 shows the structure of the corpus obtained in this embodiment;
Fig. 7 is an example of the vertical inclusion-relation structure obtained by the inclusion-relation preparation unit of this embodiment;
Fig. 8 is an example of the structure of the corpus obtained in this embodiment.
Embodiment
Hereinafter, the present invention is explained with reference to the embodiments shown in the accompanying drawings.
Fig. 1 is a schematic structural diagram of an embodiment of the corpus preparation device of the present invention, in which reference numeral 100 denotes the corpus preparation device. The corpus preparation device 100 comprises a word extraction unit 104, an inclusion-relation preparation unit 106, an occurrence-frequency calculation unit 108, a relatedness calculation unit 110 and a corpus preparation unit 112.
A training sample 102 is segmented into a word sequence by the word extraction unit 104; the inclusion-relation preparation unit 106 builds the vertical inclusion relations between words according to the semantic hypernym-hyponym relations between them; the occurrence-frequency calculation unit 108 calculates the co-occurrence frequencies and co-occurrence distances between words; the relatedness calculation unit 110 calculates the relatedness and similarity between words; and the corpus preparation unit 112 then stores the words together with their vertical inclusion relations, relatedness values and similarities in the corpus storage unit 114.
Each of the above parts is described in detail below.
The training sample 102 is linguistic material used for training, for example an article. The material is used to construct the relatedness network of the corpus; it must be large, broad in coverage and reasonably authoritative, so that the various algorithms built on it can be evaluated objectively.
The word extraction unit 104 is mainly used to perform lexical analysis on the training sample 102: a natural-language-processing tool segments the content of the training sample to obtain a word sequence. In a Chinese information processing system, a self-learning method can be adopted to segment the training sample. Based on the maximum-likelihood principle, such a method can, for example, arrive at the best segmentation of the training sample through repeated iterations of the EM (Expectation-Maximization) algorithm.
Fig. 2 shows the processing flow of this method in the present embodiment. The training sample 102 that has been read in first passes through the illegal-character processing module 204 in the word extraction unit 104, which extracts the legal characters and stores them in a temporary training sample; the segmentation module 208 then segments the training sample by looking up the records in a database, while the self-learning module 206 uses the sample to update that database appropriately.
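The patent's own segmenter is the EM-based self-learning pipeline just described; purely as a stand-in for experimentation, an off-the-shelf segmenter can produce the same kind of word sequence. A minimal sketch assuming the third-party jieba library (an assumption, not the patent's tool), keeping mainly the nouns as the embodiment below does:

```python
import jieba.posseg as pseg  # third-party segmenter; stands in for modules 204-208

def extract_words(text: str) -> list:
    """Segment Chinese text and keep mainly the nouns (POS tags starting with 'n')."""
    return [word for word, flag in pseg.cut(text) if flag.startswith("n")]

print(extract_words("第十九届欧洲男子体操锦标赛在瑞士洛桑闭幕。"))
```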
The inclusion-relation preparation unit 106 is used to build the vertical inclusion-relation structure between words. This vertical inclusion-relation structure is in fact obtained from the semantic hypernym-hyponym inclusion relations between concept words. Fig. 3 shows this vertical relation, represented as a tree. On such a semantic tree, the inclusion relation between nodes is expressed as a parent-child relation: the word represented by a parent node (Fa_cnpt) semantically includes the words represented by its child nodes (Son_cnpt). The key to training the vertical inclusion-relation structure is to organize a semantic forest, which in turn contains many semantic trees. This requires linguistic knowledge; the semantic trees can be obtained from a synonym dictionary or by expert classification. In this embodiment, the semantic trees were built by manual classification, drawing on an expert classification (HowNet).
In this way the vertical inclusion-relation structure of the corpus is constructed.
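As an illustration of how the shortest distance Dis_semantic(w1, w2) of formula (3) can be read off such a semantic tree, here is a minimal sketch (the node names and the convention of counting each parent-child link as distance 2, taken from the worked example later in this description, are our only assumptions):

```python
from collections import deque

# child -> parent links of one semantic tree; each edge is a
# hypernym-hyponym inclusion relation (Fa_cnpt / Son_cnpt in Fig. 3).
PARENT = {"sportsman": "person", "reporter": "person", "person": "entity"}

def neighbors(word):
    up = [PARENT[word]] if word in PARENT else []
    down = [c for c, p in PARENT.items() if p == word]
    return up + down

def dis_semantic(w1, w2):
    """Shortest distance in the inclusion tree by breadth-first search;
    each parent-child link counts as 2, per the worked example."""
    seen, queue = {w1}, deque([(w1, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == w2:
            return dist
        for n in neighbors(node):
            if n not in seen:
                seen.add(n)
                queue.append((n, dist + 2))
    return float("inf")  # no inclusion path between the two words

def sim_semantic(w1, w2):
    d = dis_semantic(w1, w2)  # formula (3)
    if d == 0:
        return 1.0            # identical word
    return 0.0 if d == float("inf") else 1.0 / d

print(sim_semantic("sportsman", "person"))  # parent-child: distance 2 -> 0.5
```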
The occurrence-frequency calculation unit 108 is used to calculate the co-occurrence distances and co-occurrence frequencies between words. Its basic processing flow is shown in Fig. 4. First, the occurrence-frequency calculation unit 108 receives the result of the word extraction unit 104, i.e., the word sequence. A window of preset width w is used: if two words occur in the window at the same time, this counts as one co-occurrence of the two words, and the interval between them is their co-occurrence distance.
From the co-occurrence distances and co-occurrence frequencies between words, the occurrence-frequency calculation unit 108 calculates the co-occurrence weight Weight_{w_1 w_2} between words, that is, the relatedness \overline{rel}_{w_1 w_2}, by the following formula (1):

\overline{rel}_{w_1 w_2} = \frac{freq_{w_1 w_2}}{freq_{w_1}} \times \frac{\gamma}{\overline{dist}_{w_1 w_2} + \gamma}    (1)

where freq_{w_1 w_2}, freq_{w_1} and \overline{dist}_{w_1 w_2} denote, respectively, the co-occurrence frequency of word w_1 and word w_2, the occurrence frequency of word w_1, and the average co-occurrence distance between w_1 and w_2, as computed by the occurrence-frequency calculation unit 108; \gamma is an adjustable parameter.
In addition, a corpus contains a great many relations, and as training samples are added its relation network keeps expanding, so the burden on physical space becomes quite heavy. A pruning algorithm is therefore needed to control the expansion of that space. The relation-pruning algorithm adopted in the occurrence-frequency calculation unit 108 shown in Fig. 4 is as follows:
k_i = \frac{\lg wfreq_i}{\lg \overline{wfreq}} \times k    (4)

where \overline{wfreq} is the average occurrence frequency of all words in the corpus, wfreq_i is the occurrence frequency of word w_i, and k is the average number of relations per word in the corpus. Pruning is a dynamic process: when the total number of relations of a word exceeds the threshold \delta \times k_i (\delta being a preset buffering coefficient greater than 1), its relations are pruned. The relations pruned are those with the smallest relation weight, computed as follows:
Weight(Relation) = freq_{w_i w_j} \times Weight_{w_i w_j}    (5)

where freq_{w_i w_j} denotes the co-occurrence frequency of word w_i and word w_j, and Weight_{w_i w_j} denotes their co-occurrence weight.
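A minimal sketch of this pruning rule (our own illustration; the mapping layout used to hold a word's relations is an assumption):

```python
import math

def relation_quota(wfreq_i, avg_wfreq, avg_relations):
    """Formula (4): k_i = (lg wfreq_i / lg avg_wfreq) * k."""
    return math.log10(wfreq_i) / math.log10(avg_wfreq) * avg_relations

def prune_relations(relations, wfreq_i, avg_wfreq, avg_relations, delta):
    """While word w_i holds more than delta * k_i relations, drop the one with
    the smallest weight; `relations` maps w_j -> (cofreq_ij, cooc_weight_ij)."""
    k_i = relation_quota(wfreq_i, avg_wfreq, avg_relations)
    while relations and len(relations) > delta * k_i:
        # formula (5): Weight(Relation) = freq_wiwj * Weight_wiwj
        weakest = min(relations, key=lambda wj: relations[wj][0] * relations[wj][1])
        del relations[weakest]
    return relations
```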
In this way the relation network of the corpus is constructed from the processing of the above parts.
Next, the relatedness calculation unit 110 calculates the similarity between words. The similarity calculation is explained with reference to Fig. 5. First, the shortest distance Dis_{semantic}(w_1, w_2) between two words is obtained from the vertical inclusion-relation structure in the corpus (502). Then, from this shortest distance, the semantic similarity of word w_1 and word w_2, denoted sim_{semantic}(w_1, w_2), is calculated (504). Next, based on the relation network of the corpus, the relatedness of w_1 and w_2, denoted sim_{statistic}(w_1, w_2), is calculated (506). Finally, from the results of steps 504 and 506, the similarity sim(w_1, w_2) between the two words is calculated by the following formula (2):

sim(w_1, w_2) = \alpha \cdot sim_{semantic}(w_1, w_2) + \beta \cdot sim_{statistic}(w_1, w_2)    (2)

with \alpha, \beta \in (0, 1) and \alpha + \beta = 1, where \alpha and \beta are adjustable parameters.
In this way the similarity network of the corpus is constructed by the above processing.
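The blend of formula (2) is then a one-liner; a minimal sketch (alpha = 0.4 and beta = 0.6 are simply the values used in the example later in this description):

```python
def similarity(sim_sem, sim_stat, alpha=0.4, beta=0.6):
    """Formula (2): sim = alpha * sim_semantic + beta * sim_statistic,
    with alpha, beta in (0, 1) and alpha + beta = 1."""
    assert 0 < alpha < 1 and 0 < beta < 1 and abs(alpha + beta - 1) < 1e-9
    return alpha * sim_sem + beta * sim_stat

# "man" vs "reporter" from the example below: no inclusion link, so
# sim_semantic = 0; sim_statistic = 0.01111 from formula (1).
print(similarity(0.0, 0.01111))  # 0.006666
```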
The corpus preparation unit 112 takes as input the vertical inclusion-relation structure, the relation network and the similarity network of the corpus produced by the inclusion-relation preparation unit, the occurrence-frequency calculation unit and the relatedness calculation unit, and saves them in the corpus storage unit 114.
Fig. 6 shows the structure of the corpus that is finally obtained. In Fig. 6, each node represents a word; the left side shows the vertical inclusion relations and the right side the horizontal relatedness relations; a dotted line indicates that the nodes it joins are separate depictions of one and the same node, i.e., different components of a single node. The left part is similar to Fig. 3 and is not repeated here. In the right part, the upper connections represent relatedness relations and the lower connections represent similarity relations. A relatedness connection links a word to its related words and is labeled with the corresponding frequency and distance; a similarity connection links a word to its similar words and is labeled with the degree of similarity.
An embodiment of the corpus preparation method of the present invention can use the embodiment of the corpus preparation device of the present invention, realizing its word extraction step, inclusion-relation preparation step, occurrence-frequency calculation step, relatedness/similarity calculation step and corpus preparation step in the manner shown in Fig. 2, Fig. 3, Fig. 4 and Fig. 5, so as to obtain a corpus that simultaneously contains the vertical inclusion-relation structure, the relation network and the similarity network between words, as shown in Fig. 6.
(Example)
The flow of corpus preparation according to the present invention is illustrated below with an example.
In this example, the training sample is the following short article:
European men's gymnastics championships close
Xinhua News Agency, Lausanne, May 27 (reporter Shi Guangyao): After three days of competition, the 19th European men's gymnastics championships closed on the afternoon of the 27th in Lausanne, Switzerland. Sweeping in like a strong wind, the Soviet gymnasts took 6 of the 8 gold medals (1 of them shared). Soviet star Mo Jilini won 3 gold medals, the individual all-around, pommel horse and parallel bars (shared), while Suo Heerbo took the floor exercise, vault and horizontal bar titles. Swiss gymnast Ji Beier and Italian gymnast Kai Ji won the parallel bars and rings titles respectively. 73 athletes from 25 European countries took part in the championships. (End)
Using a word-segmentation tool, the word extraction unit 104 cuts the content of the article into individual independent words, mainly extracting the nouns. The output is as follows:
Europe / man / gymnastics match / Xinhua News Agency / Lausanne / reporter / shine / Europe / man / gymnastics / championship / contention / Switzerland / Lausanne / closing / Soviet Union / player / strong wind / gold medal / Soviet Union / star / individual all-around / pommel horse / parallel bars / gold medal / vault / horizontal bar / champion / Switzerland / player / Ji Beier / Italy / player / Kai Ji / parallel bars / rings / champion / European countries / athlete / match
Based on the output of the word extraction unit, and drawing on the expert classification (HowNet), the inclusion-relation preparation unit 106 outputs the result shown in Fig. 7, namely the vertical inclusion-relation structure.
The occurrence-frequency calculation unit 108 receives the set of segmented words and scans the training samples with a window of preset width w. If two words occur in the window at the same time, the two words are considered to co-occur once, and the interval between them is their co-occurrence distance. After this statistical pass, each word serves as a key word, and the average co-occurrence distance and co-occurrence frequency of the other words related to that key word are obtained.
In the table below, "KEY" denotes the key word, "REL_NODE" denotes a related node, "frequency" denotes the co-occurrence frequency, and "ave_dis" denotes the average co-occurrence distance. KEY: man
REL_NODE[1] = gymnastics match   ave_dis = 1.000000   frequency = 1
REL_NODE[2] = Xinhua News Agency   ave_dis = 2.000000   frequency = 1
REL_NODE[3] = Lausanne   ave_dis = 4.000000   frequency = 2
REL_NODE[4] = reporter   ave_dis = 4.000000   frequency = 1
REL_NODE[5] = shine   ave_dis = 5.000000   frequency = 1
REL_NODE[6] = gymnastics   ave_dis = 1.000000   frequency = 1
REL_NODE[7] = championship   ave_dis = 2.000000   frequency = 1
REL_NODE[8] = contention   ave_dis = 3.000000   frequency = 1
REL_NODE[9] = Switzerland   ave_dis = 4.000000   frequency = 1
As can be seen from the table above, for example, the average co-occurrence distance of "man" and "reporter" is 4.000000 and their co-occurrence frequency is 1.
The relatedness calculation unit 110 calculates the relatedness between pairs of words and the similarity between pairs of words. First, the relatedness between words is calculated from the average co-occurrence distances and co-occurrence frequencies collected by the occurrence-frequency calculation unit, using formula (1):

\overline{rel}_{w_1 w_2} = \frac{freq_{w_1 w_2}}{freq_{w_1}} \times \frac{\gamma}{\overline{dist}_{w_1 w_2} + \gamma}    (1)

where \overline{dist}_{w_1 w_2} is the average distance ave_dis in the table and \gamma is an adjustable parameter. This yields the relatedness between any two words.
Taking \gamma = 0.5: as shown above, the average distance of "man" and "reporter" is 4.000000 and their co-occurrence frequency is 1. If the occurrence frequency of "man" at this moment is 10, the relatedness of the two words is

\overline{rel}_{w_1 w_2} = \frac{1}{10} \times \frac{0.5}{4.0 + 0.5} = 0.01111
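This arithmetic can be checked directly (a minimal sketch of formula (1) with the numbers above; freq("man") = 10 is the value assumed in the text):

```python
def rel(cofreq, freq_w1, avg_dist, gamma):
    return cofreq / freq_w1 * gamma / (avg_dist + gamma)  # formula (1)

print(rel(1, 10, 4.0, 0.5))  # 0.01111... as computed above
```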
According to the similarity calculation method described above, and based on the vertical inclusion-relation structure and the relatedness network of the corpus built so far, the similarity in this example is calculated with formula (2):

sim(w_1, w_2) = \alpha \cdot sim_{semantic}(w_1, w_2) + \beta \cdot sim_{statistic}(w_1, w_2)    (2)

with \alpha, \beta \in (0, 1) and \alpha + \beta = 1.
In the vertical inclusion relations, for example, if two words stand in a parent-child relation, the shortest distance Dis_{semantic}(w_1, w_2) between them can be taken as 2, so their semantic similarity sim_{semantic}(w_1, w_2) is 0.5. sim_{statistic}(w_1, w_2) denotes the statistical relatedness of w_1 and w_2, i.e., the relatedness \overline{rel}_{w_1 w_2} calculated by formula (1) above.
For example, take \alpha = 0.4 and \beta = 0.6 and calculate the similarity between "man" and "reporter". Since "man" and "reporter" have no parent-child relation, sim_{semantic}(w_1, w_2) is 0, and sim_{statistic}(w_1, w_2) is the 0.01111 obtained from formula (1). The similarity of the two words then follows from formula (2): sim(w_1, w_2) = 0.4 × 0 + 0.6 × 0.01111 = 0.006666.
The corpus preparation unit 112 then saves records conforming to the corpus structure in the corpus storage unit 114: for example, the key word "man" obtained in the statistics above, the related word "reporter", and their similarity 0.006666 are saved in the form of one database record. The corpus structure built according to this example is shown in Fig. 8. In this way, the records comprising the words, the vertical inclusion-relation structure of the key words and related words, and the corresponding relation network and similarity network constitute the corpus. Whenever the relatedness or similarity of two words is needed, it is read from this corpus.
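A minimal sketch of such a database record (the table and column names are our assumptions; the patent specifies only that key word, related word and similarity form one record):

```python
import sqlite3

con = sqlite3.connect(":memory:")  # stands in for the corpus storage unit 114
con.execute("CREATE TABLE corpus (key_word TEXT, rel_word TEXT, similarity REAL)")
con.execute("INSERT INTO corpus VALUES (?, ?, ?)", ("man", "reporter", 0.006666))
print(con.execute("SELECT * FROM corpus").fetchall())
# [('man', 'reporter', 0.006666)]
```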
In addition, suppose that a new relation ("man"-"football") is to be added at a moment when the pruning condition is met (that is, \frac{\lg wfreq_i}{\lg \overline{wfreq}} \times k \times \delta < 10, the pruning threshold \delta \times k_i being below the 10 relations "man" would then hold), with window width w = 6, \gamma = 3, and, for the new pair, ave_dis = 1 and frequency = 1. Then by formula (5) the weight of the new relation is

Weight(new relation) = freq_{w_1 w_2} \times Weight_{w_1 w_2} = freq_{w_1 w_2} \times \frac{freq_{w_1 w_2}}{freq_{w_1}} \times \frac{\gamma}{\overline{dist}_{w_1 w_2} + \gamma} = 1 \times \frac{1}{10} \times \frac{3}{1 + 3} = 0.075

while the weight of the "man"-"Switzerland" relation at this moment is Weight(relation) = 1 × 1/10 × (3 / (4 + 3)) = 0.043. The "man"-"Switzerland" relation is therefore pruned, and the new relation "man"-"football" is added to the corpus. The corpus is thus kept up to date.
The above example serves merely to illustrate an embodiment of the present invention; the invention can also be carried out in other, modified implementations. The corpus preparation device can be built with a processor as its core device, and the corpus produced can be realized on common storage devices such as hard disks and magnetic disks.
The corpus preparation device and method of the present invention have been explained in detail above. Modifications and improvements made by those skilled in the art within the spirit and scope of the present invention shall fall within the scope defined by the appended claims.

Claims (12)

1. A language material storage (corpus) preparation device comprising a word extraction unit, an occurrence-frequency calculation unit, a relatedness calculation unit and a corpus preparation unit, characterized in that the device further comprises an inclusion-relation preparation unit, wherein:
said word extraction unit segments training samples to obtain word sequences;
said inclusion-relation preparation unit builds, based on the semantics of the words, a vertical inclusion-relation structure in tree form over the words obtained by said word extraction unit;
said occurrence-frequency calculation unit calculates the co-occurrence frequencies and co-occurrence distances between words;
said relatedness calculation unit calculates the relatedness between words from the calculation results of said occurrence-frequency calculation unit, and then calculates the similarity between words from that relatedness and the vertical inclusion-relation structure built by said inclusion-relation preparation unit;
said corpus preparation unit stores the words together with their vertical inclusion relations, relatedness values and similarities in a corpus storage unit.
2. The language material storage preparation device according to claim 1, characterized in that said occurrence-frequency calculation unit calculates the relatedness \overline{rel}_{w_1 w_2} between words by the following formula (1):

\overline{rel}_{w_1 w_2} = \frac{freq_{w_1 w_2}}{freq_{w_1}} \times \frac{\gamma}{\overline{dist}_{w_1 w_2} + \gamma}    (1)

where freq_{w_1 w_2}, freq_{w_1} and \overline{dist}_{w_1 w_2} denote, respectively, the co-occurrence frequency of word w_1 and word w_2, the occurrence frequency of word w_1, and the average co-occurrence distance between w_1 and w_2, as obtained by said occurrence-frequency calculation unit, and \gamma is an adjustable parameter.
3. The language material storage preparation device according to claim 2, characterized in that said relatedness calculation unit calculates the similarity sim(w_1, w_2) between two words by the following formula (2):

sim(w_1, w_2) = \alpha \cdot sim_{semantic}(w_1, w_2) + \beta \cdot sim_{statistic}(w_1, w_2)    (2)

with \alpha, \beta \in (0, 1) and \alpha + \beta = 1, where sim_{semantic}(w_1, w_2) denotes the semantic similarity of word w_1 and word w_2, sim_{statistic}(w_1, w_2) denotes the statistical relatedness between said word w_1 and word w_2, and \alpha and \beta are adjustable parameters.
4. The language material storage preparation device according to claim 3, characterized in that said relatedness calculation unit calculates the semantic similarity sim_{semantic}(w_1, w_2) of said word w_1 and word w_2 by the following formula (3):

sim_{semantic}(w_1, w_2) = 1 / Dis_{semantic}(w_1, w_2)    (3)

where Dis_{semantic}(w_1, w_2) denotes the shortest distance between word w_1 and word w_2 obtained in said vertical inclusion-relation structure built by said inclusion-relation preparation unit.
5. The language material storage preparation device according to claim 3, characterized in that the statistical relatedness sim_{statistic}(w_1, w_2) between said word w_1 and word w_2 is the relatedness \overline{rel}_{w_1 w_2} of said word w_1 and word w_2.
6. The language material storage preparation device according to claim 1, characterized in that, in said occurrence-frequency calculation unit, the relation quota k_i of word w_i is calculated by the following formula (4):

k_i = \frac{\lg wfreq_i}{\lg \overline{wfreq}} \times k    (4)

where \overline{wfreq} denotes the average occurrence frequency of all words in said corpus, wfreq_i denotes the occurrence frequency of word w_i, and k denotes the average number of relations per word in said corpus;

when the total number of relations of word w_i exceeds the threshold \delta \times k_i, where \delta is a preset buffering coefficient greater than 1, the relation of word w_i with the word w_j having the smallest relation weight is pruned, said relation weight Weight(Relation) being calculated by the following formula (5):

Weight(Relation) = freq_{w_i w_j} \times Weight_{w_i w_j}    (5)

where freq_{w_i w_j} denotes the co-occurrence frequency of word w_i and word w_j, and Weight_{w_i w_j} denotes the co-occurrence weight of word w_i and word w_j.
7. A language material storage (corpus) preparation method, characterized by comprising the following steps:
a word extraction step: segmenting the content of training samples to obtain word sequences;
an inclusion-relation preparation step: building, based on the semantics of the words, a vertical inclusion-relation structure in tree form over the words obtained in the word extraction step;
an occurrence-frequency calculation step: calculating the occurrence frequency of each word, and the co-occurrence frequency, co-occurrence distance and average co-occurrence distance between pairs of words;
a relatedness/similarity calculation step: calculating the relatedness between two words from the calculation results of said occurrence-frequency calculation step, and then calculating the similarity between the two words from that relatedness and the vertical inclusion-relation structure built in said inclusion-relation preparation step;
a corpus preparation step: constructing the corpus from records containing the words obtained in the above steps together with the vertical inclusion-relation structure, the relatedness values and the similarities between them.
8. The language material storage preparation method according to claim 7, characterized in that, in said relatedness/similarity calculation step, the relatedness \overline{rel}_{w_1 w_2} between said two words is calculated as follows:

\overline{rel}_{w_1 w_2} = \frac{freq_{w_1 w_2}}{freq_{w_1}} \times \frac{\gamma}{\overline{dist}_{w_1 w_2} + \gamma}    (1)

where freq_{w_1 w_2}, freq_{w_1} and \overline{dist}_{w_1 w_2} denote, respectively, the co-occurrence frequency of word w_1 and word w_2, the occurrence frequency of word w_1, and the average co-occurrence distance between w_1 and w_2, as obtained from said occurrence-frequency calculation step, and \gamma is an adjustable parameter.
9. The language material storage preparation method according to claim 8, characterized in that, in said relatedness/similarity calculation step, the similarity sim(w_1, w_2) between said two words is calculated by the following formula (2):

sim(w_1, w_2) = \alpha \cdot sim_{semantic}(w_1, w_2) + \beta \cdot sim_{statistic}(w_1, w_2)    (2)

with \alpha, \beta \in (0, 1) and \alpha + \beta = 1, where sim_{semantic}(w_1, w_2) denotes the semantic similarity of word w_1 and word w_2, sim_{statistic}(w_1, w_2) is the statistical relatedness between said word w_1 and word w_2, and \alpha and \beta are adjustable parameters.
10. The language material storage preparation method according to claim 9, characterized in that the semantic similarity sim_{semantic}(w_1, w_2) of said word w_1 and word w_2 is calculated by the following formula (3):

sim_{semantic}(w_1, w_2) = 1 / Dis_{semantic}(w_1, w_2)    (3)

where Dis_{semantic}(w_1, w_2) denotes the shortest distance between word w_1 and word w_2 obtained in said vertical inclusion-relation structure built in said inclusion-relation preparation step.
11. The language material storage preparation method according to claim 9, characterized in that the statistical relatedness sim_{statistic}(w_1, w_2) between said word w_1 and word w_2 is the relatedness \overline{rel}_{w_1 w_2} between said word w_1 and word w_2.
12. The language material storage preparation method according to claim 7, characterized by further comprising a pruning step: the relation quota k_i of word w_i is calculated by the following formula (4):

k_i = \frac{\lg wfreq_i}{\lg \overline{wfreq}} \times k    (4)

where \overline{wfreq} denotes the average occurrence frequency of all words in said corpus, wfreq_i denotes the occurrence frequency of word w_i, and k denotes the average number of relations per word in said corpus;

when the total number of relations of word w_i exceeds the threshold \delta \times k_i, where \delta is a preset buffering coefficient greater than 1, the relation of word w_i with the word w_j having the smallest relation weight is pruned, said relation weight Weight(Relation) being calculated by the following formula (5):

Weight(Relation) = freq_{w_i w_j} \times Weight_{w_i w_j}    (5)

where freq_{w_i w_j} denotes the co-occurrence frequency of word w_i and word w_j, and Weight_{w_i w_j} denotes the co-occurrence weight of word w_i and word w_j.
CN2005100932280A 2005-08-19 2005-08-19 Language material storage preparation device and its method Expired - Fee Related CN1916889B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2005100932280A CN1916889B (en) 2005-08-19 2005-08-19 Language material storage preparation device and its method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2005100932280A CN1916889B (en) 2005-08-19 2005-08-19 Language material storage preparation device and its method

Publications (2)

Publication Number Publication Date
CN1916889A CN1916889A (en) 2007-02-21
CN1916889B true CN1916889B (en) 2011-02-02

Family

ID=37737887

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2005100932280A Expired - Fee Related CN1916889B (en) 2005-08-19 2005-08-19 Language material storage preparation device and its method

Country Status (1)

Country Link
CN (1) CN1916889B (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5382651B2 (en) * 2009-09-09 2014-01-08 独立行政法人情報通信研究機構 Word pair acquisition device, word pair acquisition method, and program
CN102591862A (en) * 2011-01-05 2012-07-18 华东师范大学 Control method and device of Chinese entity relationship extraction based on word co-occurrence
CN102609424B (en) * 2011-01-21 2014-10-08 日电(中国)有限公司 Method and equipment for extracting assessment information
CN104077295A (en) * 2013-03-27 2014-10-01 百度在线网络技术(北京)有限公司 Data label mining method and data label mining system
CN105608083B (en) * 2014-11-13 2019-09-03 北京搜狗科技发展有限公司 Obtain the method, apparatus and electronic equipment of input magazine
US10198471B2 (en) * 2015-05-31 2019-02-05 Microsoft Technology Licensing, Llc Joining semantically-related data using big table corpora
CN106202311B (en) * 2016-06-30 2020-03-10 北京奇艺世纪科技有限公司 File clustering method and device
CN106202380B (en) * 2016-07-08 2019-12-24 中国科学院上海高等研究院 Method and system for constructing classified corpus and server with system
CN108197120A (en) * 2017-12-28 2018-06-22 中译语通科技(青岛)有限公司 A kind of similar sentence machining system based on bilingual teaching mode
CN110321404B (en) * 2019-07-10 2021-08-10 北京麒才教育科技有限公司 Vocabulary entry selection method and device for vocabulary learning, electronic equipment and storage medium
CN110334215B (en) * 2019-07-10 2021-08-10 北京麒才教育科技有限公司 Construction method and device of vocabulary learning framework, electronic equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1110882A (en) * 1993-06-18 1995-10-25 欧洲佳能研究中心有限公司 Methods and apparatuses for processing a bilingual database
CN1116342A (en) * 1994-07-08 1996-02-07 唐武 Chinese automatic proofreading method and system thereof
CN1387651A (en) * 1999-11-05 2002-12-25 微软公司 System and iterative method for lexicon, segmentation and language model joint optimization

Also Published As

Publication number Publication date
CN1916889A (en) 2007-02-21

Similar Documents

Publication Publication Date Title
CN1916889B (en) Language material storage preparation device and its method
CN104199857B (en) A kind of tax document hierarchy classification method based on multi-tag classification
CN101398814B (en) Method and system for simultaneously abstracting document summarization and key words
CN110059311A (en) A kind of keyword extracting method and system towards judicial style data
CN108197111A (en) A kind of text automatic abstracting method based on fusion Semantic Clustering
CN106257441B (en) A kind of training method of the skip language model based on word frequency
US8560485B2 (en) Generating a domain corpus and a dictionary for an automated ontology
US8200671B2 (en) Generating a dictionary and determining a co-occurrence context for an automated ontology
CN101286161A (en) Intelligent Chinese request-answering system based on concept
CN104866496A (en) Method and device for determining morpheme significance analysis model
Mackenzie et al. Efficiency implications of term weighting for passage retrieval
CN100535895C (en) Test search apparatus and method
CN104484380A (en) Personalized search method and personalized search device
CN116050397B (en) Method, system, equipment and storage medium for generating long text abstract
CN106294733A (en) Page detection method based on text analyzing
CN110083696A (en) Global quotation recommended method, recommender system based on meta structure technology
CN101187919A (en) Method and system for abstracting batch single document for document set
Trabelsi et al. Improved table retrieval using multiple context embeddings for attributes
CN1916904A (en) Method of abstracting single file based on expansion of file
CN114138931A (en) Mathematical formula perception indexing and ranking method, storage medium and equipment
El Mahdaouy et al. Semantically enhanced term frequency based on word embeddings for Arabic information retrieval
Amini Interactive learning for text summarization
Zhou et al. Query expansion for personalized cross-language information retrieval
CN117131383A (en) Method for improving search precision drainage performance of double-tower model
CN114580557A (en) Document similarity determination method and device based on semantic analysis

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20110202

Termination date: 20180819