CN1916889B - Language material storage preparation device and its method - Google Patents
- Publication number
- CN1916889B
- Authority
- CN
- China
- Prior art keywords
- word
- relation
- frequency
- degree
- occurrences
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Abstract
A corpus preparation device consisting of a word extraction unit, a word-frequency calculation unit, a correlation-degree calculation unit, a corpus preparation unit and a relation preparation unit. It is characterized in that the relation preparation unit builds a tree-shaped vertical inclusion-relation structure over the words obtained from the word extraction unit, based on the semantic relations between the words.
Description
Technical field
The present invention relates to a device and method for producing a corpus; more particularly, it relates to a corpus-producing device and method capable of analyzing the semantic relations, statistical correlation relations and similarity relations between words.
Background technology
Today, the blending of all kinds of information provides people with convenient, fast and effective information, but it also raises a problem: how to organize, manage and ultimately make effective use of this information. At present, commonly used information storage methods include dictionary-based methods and knowledge-base-based methods.
A corpus is a repository for storing linguistic data; the large amount of linguistic data it contains can be widely used in computer retrieval, search and analysis.
Existing corpus-making methods include the dictionary-based method. In this method, segments of text that match words in pre-existing dictionary information are cut out as words. Since most of the words present in the dictionary can be cut out correctly, the corpus rarely contains information that is not a word, so a high-precision corpus can be generated. However, the dictionary-based method requires a large amount of storage space to hold the dictionary, which makes it unsuitable for use on portable devices. Meanwhile, since only words present in the dictionary are cut out, special professional terms and newly coined words cannot be cut out and stored in the corpus as word information. In addition, in the dictionary-based method, information about the relations between words is difficult to quantize, and is therefore difficult to apply in digital devices.
Although the corpora built according to the prior art each have their own characteristics, their common shortcoming is that what is stored in the corpus is generally just words, without reflecting the relations between the words. The information they can provide is therefore limited, and the applications they can support are correspondingly restricted.
Summary of the invention
In view of the problems of the prior art, one object of the present invention is to provide a corpus-producing device that can store as many words as possible in a finite space while being able to analyze the semantic relations, statistical correlation relations and similarity relations between words.
The corpus-producing device of the present invention comprises a word extraction unit, a frequency-of-occurrence calculating part, a degree-of-association calculating part and a corpus preparation part, and is characterized in that it further comprises an inclusion-relation preparing section, wherein: the word extraction unit cuts the training samples to obtain word sequences; the inclusion-relation preparing section builds, for the words obtained by the word extraction unit, a vertical inclusion-relation structure in tree form based on the semantics between the words; the frequency-of-occurrence calculating part calculates the co-occurrence frequency and co-occurrence distance between words; the degree-of-association calculating part calculates the degree of correlation and the similarity between words from the results of the inclusion-relation preparing section and the frequency-of-occurrence calculating part; and the corpus preparation part stores the words and the vertical inclusion relations, degrees of correlation and similarities of the words into a corpus storage section. The vertical inclusion-relation structure represents the semantic superordinate-subordinate inclusion relations between the stored words.
In the corpus preparation device of the present invention, the frequency-of-occurrence calculating part can calculate the degree of correlation between the words (that is, the co-occurrence weight Weight_{w1w2}) by the following formula (1):

Weight_{w1w2} = (freq_{w1w2} / freq_{w1}) × γ / (ave_dis_{w1w2} + γ)   (1)

where freq_{w1w2}, freq_{w1} and ave_dis_{w1w2} denote, respectively, the co-occurrence frequency of word w1 and word w2, the occurrence frequency of word w1, and the average co-occurrence distance between word w1 and word w2, all calculated by the frequency-of-occurrence calculating part; γ is an adjustable parameter.

In addition, in the corpus preparation device of the present invention, the degree-of-association calculating part calculates the similarity sim(w1, w2) between the two words by the following formula (2):

sim(w1, w2) = α × sim_semantic(w1, w2) + β × sim_statistic(w1, w2)   (2)

α, β ∈ (0, 1) and α + β = 1

where sim_semantic(w1, w2) denotes the semantic similarity of word w1 and word w2, sim_statistic(w1, w2) denotes the degree of statistical correlation between word w1 and word w2, and α and β are adjustable parameters.
In addition, in the corpus preparation device of the present invention, the degree-of-association calculating part can calculate the semantic similarity sim_semantic(w1, w2) of word w1 and word w2 by the following formula (3):

sim_semantic(w1, w2) = 1 / Dis_semantic(w1, w2)   (3)

where Dis_semantic(w1, w2) denotes the shortest distance between word w1 and word w2, obtained in the vertical inclusion-relation structure constructed by the inclusion-relation preparing section.

The degree of statistical correlation sim_statistic(w1, w2) between word w1 and word w2 is the degree of correlation Weight_{w1w2} between word w1 and word w2.

The "shortest distance Dis_semantic(w1, w2) between word w1 and word w2" mentioned here means the shortest distance between word w1 and word w2 in the vertical inclusion-relation structure constructed by the inclusion-relation preparing section.
"The occurrence frequency freq_{w1} of word w1" means the total number of times that word w1 (the reference word) appears in the training sample set.

"Co-occurrence" means that, within a window of width w, taking some occurrence of word w1 in a training sample L (L being any sample in the training sample set) as the starting point, the following w words are observed to obtain a word set; if word w2 is found in that set, word w1 and word w2 are said to co-occur within the window of width w.

"Co-occurrence frequency freq_{w1w2}" means the number of times that word w1 and word w2 appear simultaneously within a certain preset window width in the training sample set.

"Co-occurrence distance dis_{w1w2}" means the positional distance of word w2 from word w1 when the two appear simultaneously within the preset window width.

"Average co-occurrence distance ave_dis_{w1w2}" means

ave_dis_{w1w2} = (1 / freq_{w1w2}) × Σ_k (dis_{w1w2})_k

where (dis_{w1w2})_k denotes the co-occurrence distance of the k-th co-occurrence of word w1 and word w2.
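As an illustration, the window-scanning definitions above can be sketched in a few lines of Python. This is an illustrative sketch only; the function and variable names are assumptions, not the patent's own. It counts, for each ordered pair of words, the co-occurrence frequency and the average co-occurrence distance within a window of the given width, together with each word's occurrence frequency:

```python
from collections import defaultdict

def cooccurrence_stats(words, window):
    """Scan a word sequence with a sliding window and collect, per the
    definitions above: freq (occurrence frequency of each word), cofreq
    (co-occurrence frequency of each ordered pair) and ave_dis (average
    co-occurrence distance of each ordered pair)."""
    freq = defaultdict(int)
    cofreq = defaultdict(int)
    dist_sum = defaultdict(int)
    for i, w1 in enumerate(words):
        freq[w1] += 1
        for d in range(1, window + 1):  # observe the next `window` words
            if i + d >= len(words):
                break
            pair = (w1, words[i + d])
            cofreq[pair] += 1
            dist_sum[pair] += d         # positional distance of this co-occurrence
    ave_dis = {p: dist_sum[p] / cofreq[p] for p in cofreq}
    return dict(freq), dict(cofreq), ave_dis
```

For example, scanning the sequence "a b a b" with window width 2 yields a co-occurrence frequency of 2 for the pair (a, b), with average co-occurrence distance 1.0.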
In addition, in the corpus preparation device of the present invention, the frequency-of-occurrence calculating part can calculate the relation number k_i of word w_i by the following formula (4):

k_i = k × wfreq_i / avg_wfreq   (4)

where avg_wfreq denotes the average occurrence frequency of all words in the corpus, wfreq_i denotes the occurrence frequency of word w_i, and k denotes the average relation number of all words in the corpus.

When the total number of relations of word w_i exceeds δ × k_i, where δ is a predefined buffering coefficient greater than 1, the relation of word w_i with the word w_j of minimum relation weight is pruned; this relation weight Weight(Relation) is calculated by the following formula (5):

Weight(Relation) = freq_{wiwj} × Weight_{wiwj}   (5)

where freq_{wiwj} denotes the co-occurrence frequency of word w_i and word w_j, and Weight_{wiwj} denotes the co-occurrence weight of word w_i and word w_j.
Another object of the present invention is to provide a corpus preparation method. This corpus preparation method comprises the following steps:
Word extraction step: cut the content of the training samples to obtain word sequences;
Inclusion-relation making step: based on the semantics between words, build a vertical inclusion-relation structure in tree form over the words obtained in the word extraction step;
Frequency-of-occurrence calculation step: calculate the occurrence frequency of each word, and the co-occurrence frequency, co-occurrence distance and average co-occurrence distance between each pair of words;
Degree-of-correlation and similarity calculation step: calculate the degree of correlation between two words from the results of the frequency-of-occurrence calculation step, and then calculate the similarity between the two words from that degree of correlation and the vertical inclusion-relation structure built in the inclusion-relation making step;
Corpus preparation step: construct the corpus from the words obtained in the above steps together with the inclusion relations, degrees of correlation and similarities between them, stored as records.
According to the corpus preparation method of the present invention, in its degree-of-correlation and similarity calculation step, the degree of correlation between two words (that is, the co-occurrence weight Weight_{w1w2}) can be calculated as follows:

Weight_{w1w2} = (freq_{w1w2} / freq_{w1}) × γ / (ave_dis_{w1w2} + γ)   (1)

where freq_{w1w2}, freq_{w1} and ave_dis_{w1w2} denote, respectively, the co-occurrence frequency of word w1 and word w2, the occurrence frequency of word w1, and the average co-occurrence distance between word w1 and word w2, all obtained from the frequency-of-occurrence calculation step; γ is an adjustable parameter.
The semantic similarity sim_semantic(w1, w2) of word w1 and word w2 can be calculated by the following formula (3):

sim_semantic(w1, w2) = 1 / Dis_semantic(w1, w2)   (3)

where Dis_semantic(w1, w2) denotes the shortest distance between word w1 and word w2, obtained in the vertical inclusion-relation structure established in the inclusion-relation making step.

The degree of statistical correlation sim_statistic(w1, w2) between word w1 and word w2 is the degree of correlation between word w1 and word w2.
In addition, the corpus preparation method of the present invention can also comprise a pruning step: calculate the relation number k_i of word w_i by the following formula (4):

k_i = k × wfreq_i / avg_wfreq   (4)

where avg_wfreq denotes the average occurrence frequency of all words in the corpus, wfreq_i denotes the occurrence frequency of word w_i, and k denotes the average relation number of all words in the corpus.

When the total number of relations of word w_i exceeds δ × k_i, where δ is a predefined buffering coefficient greater than 1, the relation of word w_i with the word w_j of minimum relation weight is pruned; this relation weight Weight(Relation) is calculated by the following formula (5):

Weight(Relation) = freq_{wiwj} × Weight_{wiwj}   (5)

where freq_{wiwj} denotes the co-occurrence frequency of word w_i and word w_j, and Weight_{wiwj} denotes the co-occurrence weight of word w_i and word w_j.
The corpus preparation device and method of the present invention do not require the large storage space needed to hold a dictionary. When storing words, they analyze not only the horizontal relations between words (the statistical correlation relations) but also, at the same time, the vertical relations between words (the semantic superordinate-subordinate inclusion relations), and on this basis analyze the similarity between words from both the horizontal and the vertical relations. That is, the corpus obtained with the preparation device and method of the present invention simultaneously contains a vertical inclusion-relation structure, a relation network and a similarity network between words. A corpus made according to the present invention can therefore not only organize various kinds of information organically, but also makes it easier to classify information according to the user's requirements and to find information of individual interest within massive data. The corpus made in this way can thus be used in applications such as information retrieval, information extraction, training-sample classification and intelligent television program selection.
In addition, according to the corpus preparation device and method of the present invention, as training samples increase and the relation network in the corpus keeps expanding, the present invention adopts a suitable pruning scheme that lightens the burden on the corpus's physical space, so as to maintain the efficiency of word storage and of the analysis of the degree of correlation and similarity between words.
In addition, according to the corpus preparation device and method of the present invention, thanks to the specific storage structure of the relation network and the use of the pruning algorithm, the words kept in the corpus can be updated dynamically. That is, when a word that already exists in the corpus appears in a new training sample, the new sample may introduce new relations for that word; when the total number of relations of the word exceeds the pruning threshold, its relations are pruned according to the scheme above. Weak relations are thus eliminated as new relations are introduced, and the corpus can be updated dynamically with the training samples while keeping its capacity within a certain range.
Description of drawings
Fig. 1 is a structural schematic diagram of an embodiment of the corpus preparation device of the present invention;
Fig. 2 is a workflow diagram of the word extraction unit of this embodiment;
Fig. 3 shows the vertical inclusion-relation structure between words constructed by the inclusion-relation preparing section of this embodiment;
Fig. 4 is a flowchart of the basic processing of the frequency-of-occurrence calculating part of this embodiment;
Fig. 5 is a flowchart of the similarity calculation performed by the degree-of-association calculating part of this embodiment;
Fig. 6 is a structural diagram of the corpus obtained in this embodiment;
Fig. 7 is an example of the vertical inclusion-relation structure obtained by the inclusion-relation preparing section of this embodiment;
Fig. 8 is an example of the structural diagram of the corpus obtained in this embodiment.
Embodiment
Hereinafter, the present invention is explained with reference to the embodiments shown in the accompanying drawings.
Fig. 1 is a structural schematic diagram of an embodiment of the corpus preparation device of the present invention, in which the corpus preparation device is denoted by reference numeral 100. This corpus preparation device 100 comprises a word extraction unit 104, an inclusion-relation preparing section 106, a frequency-of-occurrence calculating part 108, a degree-of-association calculating part 110 and a corpus preparation part 112.
Each of the above parts is described in detail below.
The word extraction unit 104 is mainly used to perform lexical analysis on the training samples 102; it cuts the content of the training samples with a natural-language-processing tool to obtain word sequences. In a Chinese information-processing system, a self-learning method can be used to cut the training samples. This method can, for example, use repeated iterations of the EM (Expectation-Maximization) algorithm to finally obtain, based on the maximum-likelihood principle, the best cutting of the training samples.
Fig. 2 shows the processing flow of this method in this embodiment. The training sample 102 that is read in passes through the illegal-character processing module 204 in the word extraction unit 104, which extracts the legal characters and stores them in a temporary training sample; then, on the one hand, the training-sample cutting module 208 cuts the training sample by looking up records in the database, and on the other hand the self-learning module 206 uses this sample to update the database appropriately.
The inclusion-relation preparing section 106 is used to build the vertical inclusion-relation structure between words. This vertical inclusion-relation structure is in fact obtained from the semantic superordinate-subordinate inclusion relations between concept words. Fig. 3 shows this vertical relation, represented as a tree structure. On such a semantic tree, the inclusion relation between nodes is represented as a parent-child relation; in other words, the word represented by a parent node (Fa_cnpt) semantically includes the word represented by a child node (Son_cnpt). The key to training the vertical inclusion-relation structure is to organize a semantic forest, which in turn contains many semantic trees. This requires linguistic knowledge; semantic trees can be obtained from a synonym dictionary or by expert classification. In this embodiment, the building of the semantic trees draws on an expert classification (HowNet) and is obtained by manual classification.
In this way, the vertical inclusion-relation structure of the corpus is constructed.
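The vertical inclusion-relation structure and the shortest distance Dis_semantic it yields for formula (3) can be sketched as follows. This is an illustrative sketch, not the patent's implementation: the child-to-parent dictionary representation and the function name are assumptions, and the convention of counting the nodes on the connecting path (so that a parent-child pair has distance 2, matching the worked example later in the description) is an assumed reading of the patent:

```python
def dis_semantic(parent, w1, w2):
    """parent: {child_word: parent_word}, describing a forest of semantic
    trees. Returns the number of nodes on the path from w1 to w2 through
    their lowest common ancestor, or None if they share no tree."""
    def chain(w):
        depth = {w: 1}      # node count from the start word, inclusive
        d = 1
        while w in parent:
            w = parent[w]
            d += 1
            depth[w] = d
        return depth
    d1, d2 = chain(w1), chain(w2)
    common = [w for w in d1 if w in d2]
    if not common:
        return None         # the two words lie in different semantic trees
    # nodes on w1 -> ancestor -> w2, counting the shared ancestor once
    return min(d1[w] + d2[w] - 1 for w in common)
```

Under this convention a parent-child pair gives distance 2 (semantic similarity 0.5 by formula (3)), and siblings under one parent give distance 3.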
The frequency-of-occurrence calculating part 108 is used to calculate the co-occurrence distance and co-occurrence frequency between words. Its basic processing flow is shown in Fig. 4. First, the frequency-of-occurrence calculating part 108 receives the result of the word extraction unit 104, i.e., the word sequences. A window of preset width w is used: if two words appear in the window at the same time, this counts as one co-occurrence of the two words, and the interval between them is the co-occurrence distance.
Based on the co-occurrence distance and co-occurrence frequency between words, the frequency-of-occurrence calculating part 108 calculates the co-occurrence weight Weight_{w1w2} between words, that is, the degree of correlation, by formula (1):

Weight_{w1w2} = (freq_{w1w2} / freq_{w1}) × γ / (ave_dis_{w1w2} + γ)   (1)

where freq_{w1w2}, freq_{w1} and ave_dis_{w1w2} denote, respectively, the co-occurrence frequency of word w1 and word w2, the occurrence frequency of word w1, and the average co-occurrence distance between word w1 and word w2, as calculated by the frequency-of-occurrence calculating part 108; γ is an adjustable parameter.
In addition, there are many relations in the corpus, and as training samples increase, the relation network in the corpus keeps expanding, making the burden on physical space quite heavy. A pruning algorithm is therefore needed to control the expansion of the space. The relation-pruning algorithm adopted in the frequency-of-occurrence calculating part 108 shown in Fig. 4 is as follows:

k_i = k × wfreq_i / avg_wfreq   (4)

where avg_wfreq is the average occurrence frequency of all words in the corpus, wfreq_i is the occurrence frequency of word w_i, and k is the average relation number of all words in the corpus. Pruning is a dynamic process: when the total number of relations of a word exceeds the threshold δ × k_i (δ being a predefined buffering coefficient greater than 1), its relations are pruned. The objects of pruning are the relations of minimum relation weight, calculated as follows:

Weight(Relation) = freq_{wiwj} × Weight_{wiwj}   (5)

where freq_{wiwj} denotes the co-occurrence frequency of word w_i and word w_j, and Weight_{wiwj} denotes the co-occurrence weight of word w_i and word w_j.
In this way, the relation network of the corpus is constructed based on the processing of each of the above parts.
Next, the degree-of-association calculating part 110 calculates the similarity between words. The similarity calculation is illustrated with reference to Fig. 5. First, the shortest distance Dis_semantic(w1, w2) between the two words is obtained from the vertical inclusion-relation structure in the corpus (502). Then, from the obtained shortest distance Dis_semantic(w1, w2), the semantic similarity sim_semantic(w1, w2) of word w1 and word w2 is calculated (504). Then, based on the relation network of the corpus, the degree of correlation sim_statistic(w1, w2) of word w1 and word w2 is calculated (506). Then, from the results of steps 504 and 506, the similarity sim(w1, w2) between the two words is calculated by formula (2):

sim(w1, w2) = α × sim_semantic(w1, w2) + β × sim_statistic(w1, w2)   (2)

α, β ∈ (0, 1) and α + β = 1

where α and β are adjustable parameters.
In this way, the similarity network of the corpus is constructed by the above processing.
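The two formulas used in this flow can be sketched directly. This is an illustrative sketch under the reconstruction of formula (1) given above (the function names are assumptions, not the patent's own):

```python
def cooccurrence_weight(cofreq, freq_w1, ave_dis, gamma):
    """Degree of correlation (co-occurrence weight) per formula (1):
    (freq_w1w2 / freq_w1) * gamma / (ave_dis_w1w2 + gamma)."""
    return cofreq / freq_w1 * gamma / (ave_dis + gamma)

def overall_similarity(sim_semantic, sim_statistic, alpha, beta):
    """Overall similarity per formula (2); alpha and beta must sum to 1."""
    assert abs(alpha + beta - 1.0) < 1e-9
    return alpha * sim_semantic + beta * sim_statistic
```

With γ = 0.5, a pair that co-occurred once at average distance 4 for a word of frequency 10 gets weight (1/10) × 0.5/4.5 ≈ 0.0111, matching the worked example below.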
The corpus preparation part 112 takes as input the vertical inclusion-relation structure, the relation network and the similarity network of the corpus, constructed and output by the inclusion-relation preparing section, the frequency-of-occurrence calculating part and the degree-of-association calculating part, and saves them into the corpus storage section 114.
Fig. 6 shows the structural diagram of the corpus finally obtained. In Fig. 6, each node represents a word; the left side shows the vertical inclusion relations and the right side the horizontal correlation relations, with dotted lines indicating that the same node is drawn separately: the nodes connected by a dotted line are in fact different components of the same node. The left part is similar to Fig. 3 and is not repeated here. In the right part of the figure, the upper connections represent correlation relations and the lower connections represent similarity relations. The upper correlation connections link related words and are labeled with the relevant frequency and distance; the lower similarity connections link similar words and are labeled with the degree of similarity.
An embodiment of the corpus preparation method of the present invention can use the above embodiment of the corpus preparation device of the present invention, realizing its word extraction step, inclusion-relation making step, frequency-of-occurrence calculation step, degree-of-correlation and similarity calculation step, and corpus preparation step in the manner shown in Fig. 2, Fig. 3, Fig. 4 and Fig. 5, to obtain a corpus that, as shown in Fig. 6, simultaneously has a vertical inclusion-relation structure, a relation network and a similarity network between words.
(Example)
The flow of corpus preparation according to the present invention is illustrated below with an example.
In this example, the training sample is the following piece of text:
European men's gymnastics championship closes
Xinhua News Agency, Lausanne, May 27 (reporter Shi Guangyao): After 3 days of contention, the 19th European men's gymnastics championship finished in Lausanne, Switzerland, on the afternoon of the 27th. The Soviet players, in strong form, seized 6 of all 8 gold medals (1 of them shared). Soviet star Mo Jilini obtained 3 gold medals, in the individual all-around, pommel horse and parallel bars (shared), while Suo Heerbo gained the floor exercise, vault and horizontal bar titles. Swiss player Ji Beier and Italian player Kai Ji obtained the parallel bars and rings titles respectively. 73 athletes from 25 European countries participated in this championship. (End)
The word extraction unit 104 uses a word-segmentation tool to cut the content of the article into independent words, mainly extracting nouns. The output result is as follows:
Europe man's gymnastics match Lausanne reporter of Xinhua News Agency shines European man's gymnastics championship and contends the Lausanne, SUI curtain Soviet Union player strong wind gold medal Soviet Union star individual all-round competition pommel horse parallel bars gold medal horse-vaulting horizontal bar champion Switzerland player Bel Italy player parallel bars champion of the swinging rings sportsman of European countries match
Based on the output of the word extraction unit, and drawing on the expert classification (HowNet), the inclusion-relation preparing section 106 outputs the result shown in Fig. 7, i.e., the vertical inclusion-relation structure.
The frequency-of-occurrence calculating part 108 receives the set of cut words and scans the training sample through a window of preset width w. If two words appear in the window at the same time, the two words are considered to have co-occurred once, and the interval between them is the co-occurrence distance. Through statistics, for each word taken as a keyword, the average co-occurrence distance and co-occurrence frequency of the other words related to that keyword are obtained.
In the following table, "KEY" denotes the keyword, "REL_NODE" denotes a related node, "frequency" denotes the co-occurrence frequency, and "ave_dis" denotes the average co-occurrence distance.
KEY: man
REL_NODE[1]=gymnastics match ave_dis=1.000000 frequency=1
REL_NODE[2]=Xinhua News Agency ave_dis=2.000000 frequency=1
REL_NODE[3]=Lausanne ave_dis=4.000000 frequency=2
REL_NODE[4]=reporter ave_dis=4.000000 frequency=1
REL_NODE[5]=shine ave_dis=5.000000 frequency=1
REL_NODE[6]=gymnastics ave_dis=1.000000 frequency=1
REL_NODE[7]=championship ave_dis=2.000000 frequency=1
REL_NODE[8]=contention ave_dis=3.000000 frequency=1
REL_NODE[9]=Switzerland ave_dis=4.000000 frequency=1
As can be seen from the above table, for example, the average co-occurrence distance of " man " and " reporter " is 4.000000, and co-occurrence frequency is 1.
The degree-of-association calculating part 110 calculates the degree of correlation between two words and the similarity between two words. First, the degree of correlation between words is calculated from the average co-occurrence distance and co-occurrence frequency between words obtained by the statistics of the frequency-of-occurrence calculating part, using formula (1):

Weight_{w1w2} = (freq_{w1w2} / freq_{w1}) × γ / (ave_dis_{w1w2} + γ)   (1)

where ave_dis_{w1w2} is the mean distance ave_dis in the table and γ is an adjustable parameter. In this way the degree of correlation between two words is obtained.

Taking γ = 0.5: as above, the mean distance of "man" and "reporter" is 4.000000 and their co-occurrence frequency is 1; if the occurrence frequency of "man" at this moment is 10, the degree of correlation of the two words is then (1/10) × 0.5/(4 + 0.5) ≈ 0.01111.
According to the similarity calculation method described above, based on the vertical inclusion-relation structure and the correlation network of the above corpus, formula (2) is used to calculate the similarity in this example:

sim(w1, w2) = α × sim_semantic(w1, w2) + β × sim_statistic(w1, w2)   (2)

α, β ∈ (0, 1) and α + β = 1
In the vertical inclusion relations, for example, if two words have a parent-child relation, the shortest distance Dis_semantic(w1, w2) between them can be taken as 2, and their semantic similarity sim_semantic(w1, w2) is then 0.5. sim_statistic(w1, w2) denotes the degree of statistical correlation of w1 and w2, which is the degree of correlation calculated by formula (1).

For example, taking α = 0.4 and β = 0.6, the similarity between "man" and "reporter" is calculated. Since "man" and "reporter" have no parent-child relation, sim_semantic(w1, w2) is 0, and sim_statistic(w1, w2) is the 0.01111 calculated by formula (1). The similarity between the two words is then obtained with formula (2): sim(w1, w2) = 0.4 × 0 + 0.6 × 0.01111 = 0.006666.
The corpus preparation part 112 then saves records conforming to the corpus structure, for example the keyword obtained from the statistics above, "man", its related word "reporter", and their similarity 0.006666, into the corpus storage section 114 in the form of database records. The corpus structure constructed in this example is shown in Fig. 8. In this way, the records of words and of keywords with their related words, together with the vertical inclusion-relation structure and the corresponding relation network and similarity network, constitute the corpus. Whenever the degree of correlation or similarity of two words is needed, it is read from this corpus.
In addition, suppose that a new relation (new relation) "man"-"football" is added at this moment and the pruning condition is triggered (that is, the total number of relations of "man" exceeds the threshold δ × k_i). With window width w = 6 and γ = 3, and with ave_dis = 1 and frequency = 1 at this moment, the weight of the new relation by formula (5) is 1 × (1/10) × (3/(1+3)) = 0.075, while the relation weight of "man" and "Switzerland" at this moment is Weight(Relation) = 1 × (1/10) × (3/(4+3)) ≈ 0.043. This relation is therefore pruned, and the new relation "man"-"football" is added to the corpus. The corpus is thus updated.
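Under the same reading of formula (5), the comparison driving this update can be checked numerically. A sketch; note that the 0.075 weight for the new relation is computed here under that reading, not stated explicitly in the text:

```python
gamma, freq_man = 3, 10
# new relation "man"-"football": co-occurred once at ave_dis = 1
w_new = 1 * (1 / freq_man) * gamma / (1 + gamma)  # 0.075
# existing relation "man"-"Switzerland": co-occurred once at ave_dis = 4
w_old = 1 * (1 / freq_man) * gamma / (4 + gamma)  # ~0.043
assert w_old < w_new  # "man"-"Switzerland" is the weaker relation and is pruned
```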
The above example merely illustrates one embodiment of the present invention; the present invention can also be carried out in other, modified implementations. The corpus preparation device can be built around a processor as its core device, and the corpus produced can be realized with commonly used storage devices such as hard disks and magnetic disks.
The corpus preparation device and method of the present invention have been explained in detail above. Modifications and improvements made by those skilled in the art within the spirit and scope of the present invention shall be included within the scope defined by the appended claims of the present invention.
Claims (12)
1. A corpus preparation device comprising a word extraction unit, a frequency-of-occurrence calculating part, a degree-of-association calculating part and a corpus preparation part, characterized in that the corpus preparation device further comprises an inclusion-relation preparing section, wherein:
the word extraction unit cuts training samples to obtain word sequences;
the inclusion-relation preparing section builds, for the words obtained by the word extraction unit, a vertical inclusion-relation structure in tree form based on the semantics between the words;
the frequency-of-occurrence calculating part calculates the co-occurrence frequency and co-occurrence distance between words;
the degree-of-association calculating part calculates the degree of correlation between words from the calculation results of the frequency-of-occurrence calculating part, and then calculates the similarity between words from the vertical inclusion-relation structure built by the inclusion-relation preparing section and the degree of correlation;
the corpus preparation part stores the words and the vertical inclusion relations, degrees of correlation and similarities of the words into a corpus storage section.
2. The corpus preparation device according to claim 1, wherein said frequency-of-occurrence calculation unit calculates the degree of correlation between said words according to formula (1).
3. The corpus preparation device according to claim 2, wherein said degree-of-association calculation unit calculates the similarity sim(w1, w2) between two words according to the following formula (2):

sim(w1, w2) = α·sim_semantic(w1, w2) + β·sim_statistic(w1, w2)    (2)

α, β ∈ (0, 1) and α + β = 1

where sim_semantic(w1, w2) denotes the semantic similarity between word w1 and word w2, sim_statistic(w1, w2) denotes the degree of statistical correlation between said word w1 and word w2, and α and β are adjustable parameters.
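Formula (2) can be sketched as follows (an illustrative sketch, not part of the claim; the function name and example inputs are invented):

```python
def combined_similarity(sim_semantic, sim_statistic, alpha=0.5, beta=0.5):
    """Weighted combination of semantic and statistical similarity per
    formula (2); alpha and beta are adjustable parameters that sum to 1."""
    assert 0 < alpha < 1 and 0 < beta < 1
    assert abs(alpha + beta - 1.0) < 1e-9
    return alpha * sim_semantic + beta * sim_statistic

# Example with invented similarity values 0.8 (semantic) and 0.4 (statistical):
print(combined_similarity(0.8, 0.4, alpha=0.7, beta=0.3))  # ~0.68
```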
4. The corpus preparation device according to claim 3, wherein said degree-of-association calculation unit calculates the semantic similarity sim_semantic(w1, w2) between said word w1 and word w2 according to the following formula (3):

sim_semantic(w1, w2) = 1/Dis_semantic(w1, w2)    (3)

where Dis_semantic(w1, w2) denotes the shortest distance between word w1 and word w2 obtained in said vertical inclusion-relation structure built by said inclusion-relation preparation unit.
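A minimal sketch of formula (3), taking the semantic distance as the shortest path length between two words in the tree-form inclusion structure (the toy tree, its words, and its edges are invented for illustration and are not part of the claim):

```python
# Toy vertical inclusion tree (child -> parent); invented for illustration.
parent = {"man": "person", "woman": "person", "person": "entity",
          "football": "sport", "sport": "entity"}

def path_to_root(word):
    """Chain of words from `word` up to the tree root."""
    path = [word]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path

def semantic_distance(w1, w2):
    """Shortest path length between w1 and w2 through their
    lowest common ancestor in the inclusion tree."""
    p1, p2 = path_to_root(w1), path_to_root(w2)
    depth_in_p1 = {w: i for i, w in enumerate(p1)}
    for j, w in enumerate(p2):
        if w in depth_in_p1:
            return depth_in_p1[w] + j
    return float("inf")  # no common ancestor

def semantic_similarity(w1, w2):
    return 1 / semantic_distance(w1, w2)  # formula (3)

print(semantic_similarity("man", "woman"))     # 0.5  (distance 2 via "person")
print(semantic_similarity("man", "football"))  # 0.25 (distance 4 via "entity")
```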
5. The corpus preparation device according to claim 3, wherein the degree of statistical correlation sim_statistic(w1, w2) between said word w1 and word w2 is the degree of correlation between said word w1 and word w2.
6. The corpus preparation device according to claim 1, wherein said frequency-of-occurrence calculation unit calculates the relation number k_i of word w_i according to formula (4), in which the average frequency of occurrence of all words in said corpus, the frequency of occurrence wfreq_i of word w_i, and the average relation number k of all words in said corpus are used;

when the total number of relations of word w_i exceeds the threshold δ·k_i, where δ is a predefined buffering coefficient greater than 1, the relation of word w_i with the word w_j of minimum relation weight is reduced, the relation weight Weight(Relation) being calculated according to formula (5) from the co-occurrence frequency of word w_i with word w_j and the co-occurrence weight of word w_i with word w_j.
7. A corpus preparation method, characterized by comprising the following steps:
a word extraction step: segmenting the content of training samples to obtain word sequences;
an inclusion-relation preparation step: building, based on the semantics between words, a vertical inclusion-relation structure in tree form for the words obtained in the word extraction step;
a frequency-of-occurrence calculation step: calculating the frequency of occurrence of each word and the co-occurrence frequency, co-occurrence distance, and average co-occurrence distance between two words;
a correlation and similarity calculation step: calculating the degree of correlation between two words from the calculation results of said frequency-of-occurrence calculation step, and then calculating the similarity between the two words according to said degree of correlation and the vertical inclusion-relation structure built in said inclusion-relation preparation step;
a corpus preparation step: constructing a corpus with the words obtained in the above steps, the vertical inclusion-relation structure, and the degrees of correlation and similarities between them as records.
8. The corpus preparation method according to claim 7, wherein in said correlation and similarity calculation step the degree of correlation between said two words is calculated according to formula (1).
9. The corpus preparation method according to claim 8, wherein in said correlation and similarity calculation step the similarity sim(w1, w2) between said two words is calculated according to the following formula (2):

sim(w1, w2) = α·sim_semantic(w1, w2) + β·sim_statistic(w1, w2)    (2)

α, β ∈ (0, 1) and α + β = 1

where sim_semantic(w1, w2) denotes the semantic similarity between word w1 and word w2, sim_statistic(w1, w2) is the degree of statistical correlation between said word w1 and word w2, and α and β are adjustable parameters.
10. The corpus preparation method according to claim 9, wherein the semantic similarity sim_semantic(w1, w2) between said word w1 and word w2 is calculated according to the following formula (3):

sim_semantic(w1, w2) = 1/Dis_semantic(w1, w2)    (3)

where Dis_semantic(w1, w2) denotes the shortest distance between word w1 and word w2 obtained in said vertical inclusion-relation structure built in said inclusion-relation preparation step.
11. The corpus preparation method according to claim 9, wherein the degree of statistical correlation sim_statistic(w1, w2) between said word w1 and word w2 is the degree of correlation between said word w1 and word w2.
12. The corpus preparation method according to claim 7, further comprising a reduction step: calculating the relation number k_i of word w_i according to formula (4), in which the average frequency of occurrence of all words in said corpus, the frequency of occurrence wfreq_i of word w_i, and the average relation number k of all words in said corpus are used;

when the total number of relations of word w_i exceeds the threshold δ·k_i, where δ is a predefined buffering coefficient greater than 1, reducing the relation of word w_i with the word w_j of minimum relation weight, the relation weight Weight(Relation) being calculated according to formula (5).
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2005100932280A CN1916889B (en) | 2005-08-19 | 2005-08-19 | Language material storage preparation device and its method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN1916889A CN1916889A (en) | 2007-02-21 |
CN1916889B true CN1916889B (en) | 2011-02-02 |
Family
ID=37737887
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN2005100932280A Expired - Fee Related CN1916889B (en) | 2005-08-19 | 2005-08-19 | Language material storage preparation device and its method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN1916889B (en) |
Families Citing this family (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP5382651B2 (en) * | 2009-09-09 | 2014-01-08 | 独立行政法人情報通信研究機構 | Word pair acquisition device, word pair acquisition method, and program |
CN102591862A (en) * | 2011-01-05 | 2012-07-18 | 华东师范大学 | Control method and device of Chinese entity relationship extraction based on word co-occurrence |
CN102609424B (en) * | 2011-01-21 | 2014-10-08 | 日电(中国)有限公司 | Method and equipment for extracting assessment information |
CN104077295A (en) * | 2013-03-27 | 2014-10-01 | 百度在线网络技术(北京)有限公司 | Data label mining method and data label mining system |
CN105608083B (en) * | 2014-11-13 | 2019-09-03 | 北京搜狗科技发展有限公司 | Obtain the method, apparatus and electronic equipment of input magazine |
US10198471B2 (en) * | 2015-05-31 | 2019-02-05 | Microsoft Technology Licensing, Llc | Joining semantically-related data using big table corpora |
CN106202311B (en) * | 2016-06-30 | 2020-03-10 | 北京奇艺世纪科技有限公司 | File clustering method and device |
CN106202380B (en) * | 2016-07-08 | 2019-12-24 | 中国科学院上海高等研究院 | Method and system for constructing classified corpus and server with system |
CN108197120A (en) * | 2017-12-28 | 2018-06-22 | 中译语通科技(青岛)有限公司 | A kind of similar sentence machining system based on bilingual teaching mode |
CN110321404B (en) * | 2019-07-10 | 2021-08-10 | 北京麒才教育科技有限公司 | Vocabulary entry selection method and device for vocabulary learning, electronic equipment and storage medium |
CN110334215B (en) * | 2019-07-10 | 2021-08-10 | 北京麒才教育科技有限公司 | Construction method and device of vocabulary learning framework, electronic equipment and storage medium |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1110882A (en) * | 1993-06-18 | 1995-10-25 | 欧洲佳能研究中心有限公司 | Methods and apparatuses for processing a bilingual database |
CN1116342A (en) * | 1994-07-08 | 1996-02-07 | 唐武 | Chinese automatic proofreading method and system thereof |
CN1387651A (en) * | 1999-11-05 | 2002-12-25 | 微软公司 | System and iterative method for lexicon, segmentation and language model joint optimization |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | ||
Granted publication date: 20110202; Termination date: 20180819