Summary of the invention
The invention discloses a word-vector-based English-Chinese word sense mapping method and device, so that word sense mapping can be carried out more effectively.
To this end, the invention provides the following technical solution:
A word-vector-based English-Chinese word sense mapping method comprises the following steps:
Step 1: extract the synset of the word sense to be mapped from an English knowledge base, then look up the candidate Chinese word senses of each synonym in the synset in an English-Chinese dictionary;
Step 2: extract the English annotation and example sentences of the word sense to be mapped from the English knowledge base, and look up, in the English-Chinese dictionary, the English annotation and example sentences of each candidate Chinese word sense obtained in step 1;
Step 3: train word vectors on a large-scale English corpus, then generate a sentence vector for each English annotation and example sentence obtained in step 2;
Step 4: compute the similarity between the sentence vectors, obtained in step 3, of the English annotation and example sentences of the word sense to be mapped and those of the English annotation and example sentences of each candidate Chinese word sense, then compute the comprehensive similarity between the word sense to be mapped and each candidate Chinese word sense;
Step 5: select the candidate Chinese word sense with the highest comprehensive similarity as the target word sense of the word sense to be mapped.
Further, in step 1, the synset is extracted and the candidate Chinese word senses are looked up as follows:
Step 1-1) extract the synset of the word sense to be mapped from the English knowledge base;
Step 1-2) look up the candidate Chinese word senses of each synonym in the synset in the English-Chinese dictionary.
Further, in step 2, the English annotations and example sentences are extracted as follows:
Step 2-1) extract the English annotation and example sentences of the word sense to be mapped from the English knowledge base;
Step 2-2) look up, in the English-Chinese dictionary, the English annotation and example sentences of each candidate Chinese word sense obtained in step 1-2).
Further, in step 3, the word vectors are trained and the sentence vectors are generated as follows:
Step 3-1) train word vectors on a large-scale English corpus;
Step 3-2) preprocess the English annotations and example sentences obtained in step 2, including lemmatization and extraction of content words;
Step 3-3) using the word vectors obtained in step 3-1), generate a sentence vector for each English annotation and example sentence processed in step 3-2), specifically:
Denote an English annotation or example sentence as s and a content word in the sentence as w; the sentence vector $\vec{s}$ of s is then given by formula (1):
$$\vec{s} = \sum_{k=1}^{|s|} \vec{w_k} \qquad (1)$$
where |s| is the number of content words in s and $\vec{w_k}$ is the word vector of content word $w_k$.
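As a minimal sketch of formula (1), the following Python fragment sums the word vectors of the content words of a sentence. The function name `sentence_vector` and the 3-dimensional toy vectors are assumptions made for illustration only; they are not part of the described method, which uses trained 200-dimensional word vectors.

```python
from typing import Dict, List


def sentence_vector(content_words: List[str],
                    word_vectors: Dict[str, List[float]]) -> List[float]:
    """Sum the word vectors of the content words, as in formula (1)."""
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    for word in content_words:
        # A word missing from word_vectors raises KeyError; how such
        # words are handled is not stated in the text.
        for i, x in enumerate(word_vectors[word]):
            total[i] += x
    return total


# Hypothetical 3-dimensional word vectors, for illustration only.
toy_vectors = {
    "determine": [1.0, 0.0, 2.0],
    "measurement": [0.5, 1.0, 0.0],
}
vec = sentence_vector(["determine", "measurement", "measurement"], toy_vectors)
print(vec)  # [2.0, 2.0, 2.0]
```

A content word that occurs twice in the sentence contributes its vector twice, because the sum in formula (1) runs over all |s| content words.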
Further, in step 4, the word sense similarity is computed as follows:
Step 4-1) compute the similarity between the sentence vectors, obtained in step 3, of the English annotation and example sentences of the word sense to be mapped and those of the English annotation and example sentences of each candidate Chinese word sense, specifically:
Denote an English annotation or example sentence as s;
The sentence vector similarity of any two sentences $s_i$ and $s_j$ is given by formula (2):
$$sim(s_i, s_j) = \frac{\vec{s_i} \cdot \vec{s_j}}{|\vec{s_i}|\,|\vec{s_j}|} \qquad (2)$$
where $\vec{s_i}$ and $\vec{s_j}$ are the sentence vectors of $s_i$ and $s_j$, and $|\vec{s_i}|$ and $|\vec{s_j}|$ are the magnitudes of those vectors. Substituting formula (1) into formula (2) gives formula (3):
$$sim(s_i, s_j) = \frac{\left(\sum_{k=1}^{|s_i|} \vec{w_k}\right) \cdot \left(\sum_{l=1}^{|s_j|} \vec{w_l}\right)}{\left|\sum_{k=1}^{|s_i|} \vec{w_k}\right|\,\left|\sum_{l=1}^{|s_j|} \vec{w_l}\right|} \qquad (3)$$
So that the similarity scores lie between 0 and 1 and can be compared later, the sentence vectors in formula (3) are normalized with a normalization function, written here as $norm(\cdot)$, and formula (3) becomes formula (4):
$$sim(s_i, s_j) = norm\left(\sum_{k=1}^{|s_i|} \vec{w_k}\right) \cdot norm\left(\sum_{l=1}^{|s_j|} \vec{w_l}\right) \qquad (4)$$
where $norm(\vec{v}) = \vec{v} / |\vec{v}|$ converts a vector into a unit vector. This operation changes only the magnitude of the vector, not its direction, and therefore does not affect the cosine similarity of the vectors.
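The relationship between formulas (2) and (4) can be sketched as follows: the cosine similarity of two vectors equals the dot product of their unit vectors. The helper names `dot`, `magnitude`, `unit` and `cosine` and the toy vectors are assumptions for illustration, not names used by the described method.

```python
import math
from typing import List


def dot(a: List[float], b: List[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def magnitude(v: List[float]) -> float:
    """Euclidean length of a vector."""
    return math.sqrt(dot(v, v))


def unit(v: List[float]) -> List[float]:
    """Scale a non-zero vector to length 1 without changing its direction."""
    n = magnitude(v)
    return [x / n for x in v]


def cosine(a: List[float], b: List[float]) -> float:
    """Formula (2): dot product divided by the product of the magnitudes."""
    return dot(a, b) / (magnitude(a) * magnitude(b))


# Toy sentence vectors, for illustration only.
a = [3.0, 4.0, 0.0]
b = [4.0, 3.0, 0.0]

via_formula_2 = cosine(a, b)
via_formula_4 = dot(unit(a), unit(b))
print(via_formula_2, via_formula_4)
```

Both routes give the same value up to floating-point rounding, which is why normalizing to unit vectors leaves the cosine similarity unchanged.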
Step 4-2) from the sentence vector similarities of the English annotations and example sentences obtained in step 4-1), compute the comprehensive similarity between the word sense to be mapped and each candidate Chinese word sense, specifically:
Denote the word sense to be mapped in the English knowledge base as bs and a candidate Chinese word sense as ds; their comprehensive similarity is given by formula (5):
$$score(bs, ds) = \alpha \cdot sim(bs_{gl}, ds_{gl}) + (1 - \alpha) \cdot \max_{bs_{ex} \in bs_{exs},\; ds_{ex} \in ds_{exs}} sim(bs_{ex}, ds_{ex}) \qquad (5)$$
where $bs_{gl}$ is the English annotation of bs, $ds_{gl}$ is the English annotation of ds, $bs_{exs}$ is the set of English example sentences of bs, $ds_{exs}$ is the set of English example sentences of ds, $bs_{ex}$ is an example sentence in $bs_{exs}$, $ds_{ex}$ is an example sentence in $ds_{exs}$, $\alpha$ and $(1 - \alpha)$ are the weights of the annotation and of the example sentences respectively, and $sim(bs_{gl}, ds_{gl})$ and $sim(bs_{ex}, ds_{ex})$ are computed by formula (4).
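A minimal sketch of the weighted combination in formula (5) is given below. The function name `comprehensive_score`, the toy similarity values and the choice alpha = 0.5 are assumptions for illustration; the text here does not fix a value for the weight.

```python
import math
from typing import Sequence


def comprehensive_score(annotation_sim: float,
                        example_sims: Sequence[float],
                        alpha: float) -> float:
    """Weighted sum of the annotation similarity and the best example pair.

    example_sims holds the similarity of every (bs_ex, ds_ex) pair; it is
    assumed to be non-empty, since the text does not say what happens
    when a sense has no example sentence.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    best_example = max(example_sims)
    return alpha * annotation_sim + (1.0 - alpha) * best_example


# Toy values, for illustration only.
toy_score = comprehensive_score(0.4, [0.2, 0.6, 0.5], alpha=0.5)
print(toy_score)
```

With alpha = 1 the score reduces to the annotation similarity alone, and with alpha = 0 to the best example sentence similarity alone.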
Further, in step 5, the candidate Chinese word sense with the highest comprehensive similarity is selected as the target word sense of the word sense to be mapped as follows:
Denote the word sense to be mapped in the English knowledge base as bs and a candidate Chinese word sense as ds; the target word sense ts to which bs is mapped is given by formula (6):
$$ts = \arg\max_{ds_i \in dss} score(bs, ds_i) \qquad (6)$$
where dss is the set of candidate Chinese word senses of bs, $ds_i$ is the i-th candidate Chinese word sense in dss, and $score(bs, ds_i)$ is computed by formula (5).
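The selection in formula (6) amounts to an arg-max over the candidate set, which can be sketched as follows. The function name `select_target_sense`, the sense labels and the score values are hypothetical and serve only to illustrate the selection step.

```python
from typing import Dict


def select_target_sense(scores: Dict[str, float]) -> str:
    """Return the candidate whose comprehensive similarity is highest.

    On a tie, Python's max() keeps the first candidate encountered;
    the text does not specify a tie-breaking rule.
    """
    return max(scores, key=lambda sense: scores[sense])


# Hypothetical comprehensive similarities for three candidate senses.
toy_scores = {"ds1": 0.37, "ds2": 0.61, "ds3": 0.47}
target = select_target_sense(toy_scores)
print(target)  # ds2
```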
A word-vector-based English-Chinese word sense mapping device comprises:
a candidate word sense query unit, configured to extract the synset of the word sense to be mapped from an English knowledge base and then look up the candidate Chinese word senses of each synonym in the synset in an English-Chinese dictionary;
an annotation and example sentence extraction unit, configured to extract the English annotation and example sentences of the word sense to be mapped from the English knowledge base, and to look up, in the English-Chinese dictionary, the English annotation and example sentences of each candidate Chinese word sense obtained by the candidate word sense query unit;
a sentence vector generation unit, configured to train word vectors on a large-scale English corpus and then generate a sentence vector for each English annotation and example sentence obtained by the annotation and example sentence extraction unit;
a word sense similarity computation unit, configured to compute the similarity between the sentence vectors, obtained by the sentence vector generation unit, of the English annotation and example sentences of the word sense to be mapped and those of the English annotation and example sentences of each candidate Chinese word sense, and then compute the comprehensive similarity between the word sense to be mapped and each candidate Chinese word sense;
a target word sense selection unit, configured to select the candidate Chinese word sense with the highest comprehensive similarity as the target word sense of the word sense to be mapped.
Further, the candidate word sense query unit also includes:
a synset extraction unit, configured to extract the synset of the word sense to be mapped;
a candidate Chinese word sense query unit, configured to look up the candidate Chinese word senses of each synonym in the synset.
Further, the annotation and example sentence extraction unit also includes:
an information extraction unit for the word sense to be mapped, configured to extract the English annotation and example sentences of the word sense to be mapped;
a candidate word sense information extraction unit, configured to extract the English annotation and example sentences of each candidate Chinese word sense obtained by the candidate Chinese word sense query unit.
Further, the sentence vector generation unit also includes:
a word vector training unit, configured to train word vectors on a large-scale English corpus;
a word sense information preprocessing unit, configured to preprocess the English annotations and example sentences obtained by the annotation and example sentence extraction unit, including lemmatization and extraction of content words;
a sentence vector generation unit, configured to generate, from the word vectors obtained by the word vector training unit, a sentence vector for each English annotation and example sentence produced by the word sense information preprocessing unit.
Further, the word sense similarity computation unit also includes:
a sentence vector similarity computation unit, configured to compute the similarity between the sentence vectors, obtained by the sentence vector generation unit, of the English annotation and example sentences of the word sense to be mapped and those of the English annotation and example sentences of each candidate Chinese word sense;
a comprehensive similarity computation unit, configured to compute the comprehensive similarity between the word sense to be mapped and each candidate Chinese word sense from the sentence vector similarities of the English annotations and example sentences obtained by the sentence vector similarity computation unit.
Beneficial effects of the invention:
1. The word-vector-based English-Chinese word sense mapping method and device proposed by the invention provide a fully automatic word sense mapping method, which avoids the tedious manual labor of traditional manual mapping.
2. The word-vector-based English-Chinese word sense mapping method and device proposed by the invention make full use of the advantages of deep learning. Sentence vectors are generated with word vector technology, so the target word sense can be selected relatively accurately, avoiding the low accuracy of traditional machine-translation-based mapping methods.
3. The word-vector-based English-Chinese word sense mapping method and device proposed by the invention take into account both the annotation and the example sentence information of a word sense. The similarities of annotations and of example sentences are computed with deep-learning word vector technology and combined in a weighted sum to obtain the comprehensive similarity used to select the target word sense, which gives a higher mapping accuracy.
4. The word-vector-based English-Chinese word sense mapping method and device proposed by the invention retain only the content words of a sentence when computing sentence similarity, which avoids interference from irrelevant function words and improves word sense mapping accuracy.
Specific embodiments:
To help those skilled in the art better understand the solutions of the embodiments of the invention, the embodiments are described in further detail below with reference to the accompanying drawings and implementations.
BabelNet is a multilingual knowledge base. Its English word sense knowledge base is already fairly complete, but its Chinese word sense knowledge base is not, and an effective method for automatically mapping English word senses to Chinese word senses is still lacking. This patent provides a word-vector-based English-Chinese word sense mapping method and device to solve English-Chinese word sense mapping problems of this kind.
A word sense to be mapped, "measure; mensurate; measure_out", is extracted from BabelNet; its semantic description is shown in Table 1. The specific embodiment of the invention is described using this word sense as an example.
Table 1
The flow of the word-vector-based English-Chinese word sense mapping method of the embodiment of the invention is shown in Figure 1 and includes the following steps.
Step 101: query the candidate word senses.
The synset of the word sense to be mapped is extracted from the English knowledge base, and the candidate Chinese word senses of each synonym in the synset are then looked up in the English-Chinese dictionary, specifically:
Step 1-1) extract the synset of the word sense to be mapped from the English knowledge base, as follows:
This embodiment addresses English-Chinese word sense mapping for BabelNet, so the English knowledge base used is the BabelNet knowledge base. As in WordNet, a word sense in BabelNet is represented as a synset. As shown in Table 1, the synset of the current word sense to be mapped is {measure, mensurate, measure_out}.
Step 1-2) look up the candidate Chinese word senses of each synonym in the synset in the English-Chinese dictionary, as follows:
In the embodiment of the invention, the English-Chinese dictionary is the Collins Advanced English-Chinese Dictionary. This dictionary gives a detailed English-Chinese description of every word sense and provides both an annotation and example sentences for it. In the Collins Advanced English-Chinese Dictionary, each English word has one or more corresponding Chinese word senses, and each Chinese word sense has one English annotation and one or more English-Chinese example sentences. This English information provides good resource support for implementing this patent.
In the embodiment of the invention, each synonym in the synset {measure, mensurate, measure_out} is looked up in the Collins Advanced English-Chinese Dictionary to obtain the candidate Chinese word senses, as shown in Table 2.
Table 2
No. | Chinese word sense description
1 | weigh; estimate; assess; judge
2 | measure; gauge; meter
3 | (of a distance, length, width, quantity, etc.) to be ...
4 | measure out (as required)
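The lookup in step 1-2) can be sketched as collecting, without duplicates, the sense numbers that a dictionary lists for each synonym of the synset. The function name `candidate_senses` and the contents of `toy_dictionary` are hypothetical: the text does not state which synonym yields which of the four candidate senses.

```python
from typing import Dict, List


def candidate_senses(synset: List[str],
                     dictionary: Dict[str, List[int]]) -> List[int]:
    """Collect the candidate sense numbers of every synonym, in order,
    keeping each sense number only once."""
    seen: List[int] = []
    for word in synset:
        # A synonym absent from the dictionary contributes no candidates.
        for sense_id in dictionary.get(word, []):
            if sense_id not in seen:
                seen.append(sense_id)
    return seen


# Hypothetical dictionary entries, for illustration only.
toy_dictionary = {"measure": [1, 2, 3, 4], "measure_out": [4]}
synset = ["measure", "mensurate", "measure_out"]
candidates = candidate_senses(synset, toy_dictionary)
print(candidates)  # [1, 2, 3, 4]
```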
Step 102: extract the annotations and example sentences.
The English annotation and example sentences of the word sense to be mapped are extracted from the English knowledge base, and the English annotation and example sentences of each candidate Chinese word sense obtained in step 101 are looked up in the English-Chinese dictionary, specifically:
Step 2-1) extract the English annotation and example sentences of the word sense to be mapped from the English knowledge base, as follows:
In the embodiment of the invention, the English annotation and example sentences of the word sense to be mapped are extracted from the English knowledge base BabelNet. Based on the information in Table 1, the English annotation and example sentence information of the word sense to be mapped is shown in Table 3.
Table 3
Step 2-2) look up, in the English-Chinese dictionary, the English annotation and example sentences of each candidate Chinese word sense obtained in step 1-2), as follows:
In the embodiment of the invention, the English annotations and example sentences of the candidate Chinese word senses numbered 1, 2, 3 and 4 obtained in step 1-2) are extracted in turn from the Collins Advanced English-Chinese Dictionary, as shown in Table 4. For ease of understanding, Table 4 also gives the corresponding Chinese word senses.
Table 4
Step 103: generate the sentence vectors.
Word vectors are trained on a large-scale English corpus, and a sentence vector is then generated for each English annotation and example sentence obtained in step 102, specifically:
Step 3-1) train word vectors on a large-scale English corpus.
The embodiment of the invention uses Google's word vector tool, the word2vec toolkit, to train word vectors on the English Gigaword Fifth Edition data set provided by the University of Pennsylvania. The vector dimension is 200, and all other training parameters are left at their default values. English Gigaword is an archive of English newswire text covering seven different international English news sources, with 9,876,086 documents and 26,348 MB in total, compiled over several years by the Linguistic Data Consortium of the University of Pennsylvania.
Step 3-2) preprocess the English annotations and example sentences obtained in step 102, including lemmatization and extraction of content words;
In the embodiment of the invention, the Stanford CoreNLP toolkit of Stanford University is used to lemmatize the English sentences, after which the content words are extracted. The procedure is illustrated with the annotation of the word sense to be mapped.
First, the annotation of the word sense to be mapped, "determine the measurements of something or somebody, take measurements of", is lemmatized, giving "determine the measurement of something or somebody, take measurement of". Then the content words are extracted from it, giving "determine measurement something somebody take measurement".
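The content word extraction in this example can be imitated with a simple function-word filter, sketched below. This is only a stand-in: the embodiment uses the Stanford CoreNLP toolkit, and the criterion it applies to decide what counts as a content word is not stated here. The `FUNCTION_WORDS` list and the function name `extract_content_words` are assumptions chosen so that the sketch reproduces the example above.

```python
import re
from typing import List

# Hypothetical function-word list, for illustration only.
FUNCTION_WORDS = {"the", "of", "or", "a", "an", "and", "to", "in"}


def extract_content_words(lemmatized_sentence: str) -> List[str]:
    """Tokenize an already lemmatized sentence and drop function words."""
    tokens = re.findall(r"[a-z_]+", lemmatized_sentence.lower())
    return [token for token in tokens if token not in FUNCTION_WORDS]


lemmatized = ("determine the measurement of something or somebody, "
              "take measurement of")
content_words = extract_content_words(lemmatized)
print(" ".join(content_words))
# determine measurement something somebody take measurement
```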
Step 3-3) using the word vectors obtained in step 3-1), generate a sentence vector for each English annotation and example sentence processed in step 3-2), specifically:
Denote an English annotation or example sentence as s and a content word in the sentence as w; the sentence vector $\vec{s}$ of s is then given by formula (1):
$$\vec{s} = \sum_{k=1}^{|s|} \vec{w_k} \qquad (1)$$
where |s| is the number of content words in s and $\vec{w_k}$ is the word vector of content word $w_k$.
In this embodiment, the generation of a sentence vector is illustrated with the annotation of the word sense to be mapped as processed in step 3-2), "determine measurement something somebody take measurement".
First, the word vector of each content word in the sentence is looked up among the word vectors trained in step 3-1). For example, the word vector of "determine" is:
[-0.060966704,-0.06865787,-0.13976261,0.052583452,0.02309357,-
0.015850635,0.0057524024,0.004298664,0.07135361,-0.004907789,-0.0073844297,-
0.0660588,-0.09741554,-0.0826721,0.0020558392,0.0019447851,-0.044812344,
0.1433886,0.107519455,-0.013067925,0.055411655,0.098691314,-0.11813014,
0.028893137,-0.10136866,0.024213811,-0.021338113,-0.006830832,-0.01115726,
0.023671253,0.022735655,-0.106075086,-0.0060708467,-0.06795107,-0.024008093,-
0.10278628,0.110742025,0.06967174,-0.026281023,-0.1304829,-0.18443915,-
0.01603829,0.024118813,-0.02448944,0.08606661,0.04368876,-0.027071448,
0.06927168,-0.16086423,-0.09339183,0.048664782,-0.0037259995,-0.19597004,-
0.05804217,-0.042547442,-0.105807476,0.013699462,0.09974968,-0.038489617,-
0.0507417,0.08751733,0.03520148,0.062430475,0.011540262,-0.12392134,
0.10225074,-0.04389849,-0.053057443,-0.014595923,0.15838726,-0.036213677,-
0.022729969,0.12135271,0.053754877,0.0653142,-0.11217302,-0.032784045,-
0.02645095,-0.0058537563,-0.037233904,-0.091778874,-0.017529158,0.03335303,-
0.11941094,0.12519278,0.045954995,-0.07207713,-0.040876612,-0.093257025,
0.06504259,0.005461387,0.06069275,0.030098341,-0.007988872,-0.027645452,-
0.032660615,-0.062259212,-0.020880515,0.076618314,0.046356063,-0.07308063,
0.03509143,-0.08876938,-0.02635127,-0.012593604,0.14288785,0.045763995,-
0.024156947,0.04318199,-0.012540084,-0.10338905,-0.031343687,-0.04143757,-
0.024850031,0.12515464,0.13902804,0.045706462,0.094424434,0.06911446,-
0.042245053,-0.01119372,0.07074649,-0.06615113,0.059482194,0.06079544,-
0.0073646945,0.05371373,0.07749403,0.09774167,0.04614667,0.080500856,
0.06686461,-0.1371806,0.059351735,-0.11971834,-0.024769751,0.005559396,-
0.004569609,0.025109604,-0.010085186,0.06588754,-0.021475257,-0.12877394,-
0.011472024,0.019178912,-0.022502841,0.049072206,-0.07339941,-0.06519345,-
0.023635125,0.05878342,-0.041036837,0.016565796,0.13539337,-0.024638291,-
0.08239346,-0.00374239,0.0033550384,0.01374094,0.0065936707,-0.030307738,
0.009063287,-0.021692682,-0.09899706,0.04887318,0.037609883,-0.045150857,-
0.09769283,-0.06568951,-0.13722141,0.018394174,0.03404645,-0.08603616,-
0.07023705,0.14471957,-0.059314273,0.0674724,-0.07376034,0.041695137,-
0.03897431,-0.12877795,-0.057006553,-0.018086433,0.022128537,-0.08181979,-
0.08615692,0.029183147,-0.090377316,0.069178686,-0.015696429,-0.0043464974,
0.0035500522,0.1526469,0.09442544,0.012619695,0.09376681,0.06574002,
0.032735877,-0.06054757,0.031108197]. The word vector of "measurement" is: [-0.030921048,
0.040468287,0.07367502,-0.036431145,0.09001577,-0.10851831,0.031571753,-
0.0076946556,-0.025466012,0.08239048,-0.033852145,0.023865981,-0.06640976,
0.09898748,-0.060916066,-0.12299272,-0.10123717,0.018511012,-0.017379025,
0.11183538,-0.032644443,0.061155915,-0.046167403,-0.02107625,-0.054799207,-
0.003215416,-0.022842003,-0.07484936,-0.016040549,-6.718859e-4,0.09849985,
0.10686533,-0.027949711,-0.014089485,0.08666428,-0.055681817,0.12596299,-
0.081768885,-0.023240687,-0.040215734,0.009278273,-0.072330184,0.011064145,-
0.046390835,0.009363516,0.07663736,-0.046891708,0.120461896,-0.024577046,-
0.065430254,-0.060996015,-0.031411856,-0.024597166,-0.022857357,-
0.019988738,-0.02650852,-0.046675686,-0.072701864,-0.06415478,-0.012159599,-
0.019452924,-0.007099012,-0.035306044,-0.046926122,-0.060533796,-0.069201075,
0.029004399,-0.024853425,-0.08013603,-0.040774312,0.10615162,0.036688466,
0.0055641048,-0.005188717,0.0027881414,0.061590068,-0.057311498,-
0.0018721737,0.032288115,-0.12578985,-0.1902009,-0.056136098,-4.728086e-4,-
0.061017197,0.04288104,0.01388723,-0.038211193,-0.043795947,-0.04814441,
0.1526314,0.033593766,0.078088604,0.005799715,0.03464157,-0.0035865682,-
0.20270306,-0.111725785,-0.09797781,-0.09489581,-0.054468293,-0.0015290832,-
0.16072103,0.056969997,0.013535669,-0.17215633,0.20882045,0.04354922,-
0.0025980647,0.08676594,0.0429361,0.029175945,-0.039518964,0.03309713,
0.027989952,-0.029852066,0.028658131,0.037572138,-0.064470336,0.0275685,-
0.094821155,0.14544079,-0.049508303,0.05595343,0.04108511,0.022339016,-
0.007031241,0.06387787,-0.051717743,0.035961512,0.0034367307,0.073031195,-
0.097252965,-0.060861535,0.12593704,-0.024983672,0.07234978,-0.04727927,-
0.19234574,0.11479137,0.013784515,-0.012358148,0.02151782,0.014949858,
0.03911975,-0.01054792,-0.07922059,0.036444385,0.025766745,-0.12601435,
0.047032543,-0.02278641,-0.13189878,0.111353576,-0.06969082,0.020863937,
0.01676644,0.009361927,0.039854113,-0.060249478,0.027769696,-0.27008596,
0.05944734,0.039832402,-0.026858494,-0.020013094,0.025406713,-2.128433e-4,-
0.05612445,0.04703572,-0.024139712,0.06555838,0.07517604,0.09585466,-
0.005991909,-0.0397101,-0.042226095,0.06041255,0.02176508,-0.027269356,-
0.038427215,-0.09381253,0.22008736,0.105541155,0.071456574,-0.016034195,
0.02069451,0.017009461,-0.07982682,-0.010532036,0.08931265,0.042708967,
0.018712737,-0.07463705,0.052128073,0.06920637,0.022202944,0.022940483,
0.05133759,-0.038717363,-0.013162929]. The word vectors of the other content words in the sentence are obtained in the same way.
Then the word vectors of all content words in the sentence are added according to formula (1), giving the sentence vector of the annotation of this word sense:
[-0.12244331,0.23284505,-0.125848,-0.09857595,0.15176383,-
0.21165508,-0.06935414,0.17774323,-0.0481385,0.27167976,-0.23219745,-
0.31177434,-0.237795,0.20023781,-0.2208232,-0.25496095,-0.050965287,
0.19869018,0.14223932,0.054064974,0.14445543,0.3649017,-0.06972199,-
0.0942207,-0.4732177,-0.002447103,-0.11354132,-0.23180336,-0.032030072,
0.11646948,0.068802774,0.24477573,0.074090265,-0.30747676,0.28410295,-
0.3153889,0.48259473,0.0018074736,-0.2570166,-0.065705955,-0.29293522,
0.1187244,0.08923024,-0.023698367,0.078454815,0.2028578,-0.36501467,
0.40085053,-0.0051737167,-0.25175425,-0.11989543,-0.09693016,-0.095989406,
0.0065662824,0.01091335,-0.03598065,-0.12002948,-0.10372059,-0.28191066,
0.033649035,0.3604529,-0.047989205,-0.1641263,-0.21081169,-0.13621823,
0.33522972,-0.050793078,-0.0373758,-0.22907057,0.109199345,0.37030825,-
0.11889391,0.24283075,0.07673705,0.318008,-0.22766817,-0.42850304,-
0.071055345,0.1914971,-0.28046763,-0.6080315,-0.017843004,0.2313133,-
0.2477001,0.26103482,0.14874645,-0.09291037,-0.0409794,-0.23852225,
0.41014478,-0.17998967,0.31087965,0.11493398,-0.0023042597,-0.09591526,-
0.28730935,-0.49623907,-0.30990297,-0.22764425,-0.06879938,-6.009942e-4,-
0.25748277,0.00649539,0.21129256,-0.4945098,0.82365096,0.3147551,
0.0121324705,0.29460865,-0.13176502,-0.1077477,-0.19233456,0.08242655,
0.16084583,-0.13618916,0.11765827,0.23201033,-0.14476305,0.3566257,-
0.33154497,0.32010967,0.017003909,0.0983599,0.28363377,0.17411232,-
0.31067532,0.21472177,-0.18492793,0.09781431,0.060426474,0.3050918,-
0.12334619,-0.23786914,0.27095866,0.023499401,-0.07610657,-0.0463394,-
0.48189855,0.44204056,-0.030785767,0.046995677,-0.11442133,-0.32249418,-
0.13742244,-0.1368755,-0.21778521,0.061512135,-0.31345803,-0.19940937,
0.09265008,-0.02924196,-0.15277626,0.30612707,0.41078234,0.099931955,-
0.14431237,0.16773543,-0.14954714,-0.044322092,-0.020516273,-0.52509534,
0.10045516,0.13150021,-0.1684227,0.059403583,0.3293987,0.24298555,-
0.3315874,-0.057996165,-0.34279677,0.24292094,0.2758336,-0.16648525,-
0.13480023,-0.18450123,-0.1112635,0.15073343,0.20073035,-0.097931616,-
0.2827055,-0.24364212,0.17794128,0.35367286,-0.012077071,-0.17940772,
0.08209381,0.08326046,-0.12982222,0.35156035,0.11034558,-0.0971424,
0.01952859,-0.070994884,0.22338426,0.10498668,-0.22422943,-0.04826733,
0.046616875,-0.326965,0.05593993].
In the same way, the sentence vector of every English annotation and example sentence can be obtained.
Step 104: compute the word sense similarity.
The similarity between the sentence vectors, obtained in step 103, of the English annotation and example sentences of the word sense to be mapped and those of the English annotation and example sentences of each candidate Chinese word sense is computed, and the comprehensive similarity between the word sense to be mapped and each candidate Chinese word sense is then computed, specifically:
Step 4-1) compute the similarity between the sentence vectors, obtained in step 103, of the English annotation and example sentences of the word sense to be mapped and those of the English annotation and example sentences of each candidate Chinese word sense, specifically:
Denote an English annotation or example sentence as s;
The sentence vector similarity of any two sentences $s_i$ and $s_j$ is given by formula (2):
$$sim(s_i, s_j) = \frac{\vec{s_i} \cdot \vec{s_j}}{|\vec{s_i}|\,|\vec{s_j}|} \qquad (2)$$
where $\vec{s_i}$ and $\vec{s_j}$ are the sentence vectors of $s_i$ and $s_j$, and $|\vec{s_i}|$ and $|\vec{s_j}|$ are the magnitudes of those vectors. Substituting formula (1) into formula (2) gives formula (3):
$$sim(s_i, s_j) = \frac{\left(\sum_{k=1}^{|s_i|} \vec{w_k}\right) \cdot \left(\sum_{l=1}^{|s_j|} \vec{w_l}\right)}{\left|\sum_{k=1}^{|s_i|} \vec{w_k}\right|\,\left|\sum_{l=1}^{|s_j|} \vec{w_l}\right|} \qquad (3)$$
So that the similarity scores lie between 0 and 1 and can be compared later, the sentence vectors in formula (3) are normalized with a normalization function, written here as $norm(\cdot)$, and formula (3) becomes formula (4):
$$sim(s_i, s_j) = norm\left(\sum_{k=1}^{|s_i|} \vec{w_k}\right) \cdot norm\left(\sum_{l=1}^{|s_j|} \vec{w_l}\right) \qquad (4)$$
where $norm(\vec{v}) = \vec{v} / |\vec{v}|$ converts a vector into a unit vector. This operation changes only the magnitude of the vector, not its direction, and therefore does not affect the cosine similarity of the vectors.
In the embodiment of the invention, the computation of the similarity of two sentence vectors is illustrated with the similarity between the unit sentence vector of the annotation of the word sense to be mapped, "determine the measurements of something or somebody, take measurements of", and that of the English annotation of the candidate Chinese word sense numbered 1 in Table 4, "if you measure the quality, value, or effect of something, you discover or judge how great it is."
First, the sentence vectors are normalized; the sentence vector of the annotation of the word sense to be mapped obtained in step 103 is used as an example.
The sentence vector, obtained in step 103, of the annotation of the word sense to be mapped, "determine the measurements of something or somebody, take measurements of", is converted into a unit vector. The resulting unit vector is:
[-0.03826203,0.072761215,-0.03932595,-0.030803772,0.047424328,-
0.06613961,-0.021672316,0.055542573,-0.015042689,0.08489659,-0.07255885,-
0.09742565,-0.07430801,0.06257185,-0.069004536,-0.079672165,-0.015926026,
0.062088236,0.044448037,0.016894639,0.045140546,0.11402729,-0.021787263,-
0.02944281,-0.14787471,-7.646896e-4,-0.035480265,-0.0724357,-0.010009004,
0.03639528,0.021500021,0.076489404,0.023152297,-0.0960827,0.088778675,-
0.09855515,0.15080492,5.6481326e-4,-0.080314524,-0.020532303,-0.09153865,
0.037099913,0.027883353,-0.0074054482,0.024516165,0.06339057,-0.11406259,
0.12526086,-0.0016167228,-0.07867011,-0.037465848,-0.030289482,-0.029995508,
0.0020518824,0.0034102874,-0.01124351,-0.037507735,-0.032411408,-0.088093616,
0.010514909,0.112637095,-0.014996036,-0.051287454,-0.06587606,-0.042566523,
0.10475516,-0.015872212,-0.011679478,-0.071581736,0.03412345,0.11571677,-
0.037152883,0.07588162,0.023979386,0.099373594,-0.0711435,-0.13390192,-
0.022203922,0.05984049,-0.087642685,-0.19000237,-0.0055757193,0.07228256,-
0.07740323,0.08157017,0.046481434,-0.029033348,-0.012805559,-0.074535266,
0.1281652,-0.056244556,0.097146064,0.035915457,-7.200528e-4,-0.029972339,-
0.089780636,-0.1550686,-0.096840866,-0.07113603,-0.02149896,-1.878033e-4,-
0.0804602,0.0020297295,0.06602633,-0.15452823,0.25738078,0.098357104,
0.0037912477,0.09206158,-0.04117495,-0.03366983,-0.060102183,0.025757283,
0.050262343,-0.042557437,0.036766764,0.07250038,-0.045236673,0.11144115,-
0.10360372,0.10003033,0.0053135124,0.030736258,0.08863206,0.054407958,-
0.09708222,0.06709791,-0.0577877,0.030565768,0.01888253,0.095337436,-
0.038544167,-0.07433118,0.08467125,0.007343274,-0.023782367,-0.014480493,-
0.15058737,0.13813224,-0.009620174,0.014685571,-0.035755258,-0.100775465,-
0.042942822,-0.04277191,-0.0680552,0.019221786,-0.09795178,-0.062312976,
0.02895201,-0.009137753,-0.0477407,0.09566095,0.12836443,0.031227507,-
0.04509584,0.05241526,-0.046731643,-0.013850109,-0.006411083,-0.16408584,
0.031391002,0.0410922,-0.052630022,0.018562889,0.10293304,0.07592999,-
0.10361698,-0.018123088,-0.10711978,0.0759098,0.08619461,-0.05202459,-
0.04212341,-0.057654366,-0.034768473,0.047102343,0.06272577,-0.030602425,-
0.08834199,-0.076135166,0.05560446,0.11051842,-0.003773936,-0.056062706,
0.025653306,0.02601787,-0.04056785,0.10985829,0.034481637,-0.030355806,
0.006102444,-0.022185028,0.06980483,0.03280705,-0.07006894,-0.015082947,
0.0145672,-0.10217254,0.017480541].
In the same way, the unit vectors of the sentence vectors of all other annotations and example sentences can be obtained.
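As a quick consistency check on the figures listed above: a unit vector is the original vector divided by its magnitude, so dividing each component of the sentence vector by the matching component of the unit vector should give the same constant. The sketch below does this for the first three components of the two vectors; the variable names are assumptions for illustration, and only those three components are used.

```python
# First three components of the sentence vector from step 103 and of
# its unit vector, copied from the listings above.
sentence_head = [-0.12244331, 0.23284505, -0.125848]
unit_head = [-0.03826203, 0.072761215, -0.03932595]

# Each ratio estimates the magnitude of the full sentence vector.
ratios = [s / u for s, u in zip(sentence_head, unit_head)]
spread = max(ratios) - min(ratios)
print(ratios, spread)
```

The three ratios agree closely (each is about 3.2), which is consistent with the unit vector being a pure rescaling of the sentence vector.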
The similarity between the annotation of the word sense to be mapped and the English annotation of the candidate Chinese word sense numbered 1 in Table 4 is given by formula (4); the computed similarity is 0.3879761.
In the same way, the similarities between the English annotation of the word sense to be mapped and the English annotations of the candidate Chinese word senses numbered 2, 3 and 4 are computed in turn; they are 0.4196734, 0.3625376 and 0.41536587 respectively.
In the same way, the similarities between the example sentence of the word sense to be mapped and the example sentences of the candidate Chinese word senses are computed in turn, as shown in Table 5. In Table 5, the word sense to be mapped has only one example sentence, numbered ex. For the candidate Chinese word senses, the first example sentence of the first word sense is numbered 1_ex1, the second example sentence of the first word sense is numbered 1_ex2, and the example sentences of the other word senses are numbered in the same manner.
Table 5
Example sentence of the word sense to be mapped | Example sentence of the candidate word sense | Example sentence similarity
ex | 1_ex1 | 0.33322173
ex | 1_ex2 | 0.3466332
ex | 1_ex3 | 0.34800234
ex | 2_ex1 | 0.7905501
ex | 2_ex2 | 0.40629613
ex | 3_ex1 | 0.5284378
ex | 3_ex2 | 0.5624604
ex | 3_ex3 | 0.5684977
ex | 4_ex1 | 0.35761255
ex | 4_ex2 | 0.3466332
Step 4-2) Using the sentence-vector similarities of the English annotations and example sentences obtained in
step 4-1), compute the comprehensive similarity between the word sense to be mapped and each candidate Chinese
word sense, specifically:
Denote the word sense to be mapped in the English knowledge base as bs and a candidate Chinese word sense as ds;
their comprehensive similarity is given by formula (5):
$$score(bs, ds) = \alpha \cdot sim(bs_{gl}, ds_{gl}) + (1 - \alpha) \cdot \max_{bs_{ex} \in bs_{exs},\ ds_{ex} \in ds_{exs}} sim(bs_{ex}, ds_{ex}) \tag{5}$$
where $bs_{gl}$ is the English annotation of bs, $ds_{gl}$ is the English annotation of ds, $bs_{exs}$ is the set of
English example sentences of bs, $ds_{exs}$ is the set of English example sentences of ds, $bs_{ex}$ is one example
sentence in $bs_{exs}$, $ds_{ex}$ is one example sentence in $ds_{exs}$, $\alpha$ and $(1 - \alpha)$ are the weights
of the annotation and the example sentences respectively, and $sim(bs_{gl}, ds_{gl})$ and $sim(bs_{ex}, ds_{ex})$
are computed by formula (4).
In this embodiment, the computation of the comprehensive similarity between the word sense to be mapped bs and a
candidate Chinese word sense ds is illustrated with the word sense to be mapped in Table 1 and candidate Chinese
word sense No. 1 in Table 4.
From step 4-1), the similarity between the English annotation of the word sense to be mapped and the English
annotation of candidate Chinese word sense No. 1 in Table 4 is $sim(bs_{gl}, ds_{gl}) = 0.3879761$. In formula (5),
the term $\max sim(bs_{ex}, ds_{ex})$ denotes the largest similarity over all pairs of an example sentence $bs_{ex}$
and an example sentence $ds_{ex}$. From step 4-1), the similarities between the English example sentence of the word
sense to be mapped and the example sentences of candidate Chinese word sense No. 1 are 0.33322173, 0.3466332, and
0.34800234, of which 0.34800234 is the largest, so $\max sim(bs_{ex}, ds_{ex}) = 0.34800234$. Following extensive
experimental validation, this embodiment sets the weight $\alpha$ in formula (5) to 0.4. By formula (5), the
comprehensive similarity between the word sense to be mapped bs and candidate Chinese word sense No. 1 ds is
score(bs, ds) = 0.4 × 0.3879761 + (1 - 0.4) × 0.34800234 = 0.363991844.
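The weighted combination described above can be checked with a short sketch. It takes the annotation similarity and the list of example-sentence similarities as inputs and returns the weighted sum of the annotation similarity and the largest example-sentence similarity. The function name `composite_score` and the decision to raise an error when no example-sentence similarity is available are assumptions made for illustration; the numeric inputs are the values reported for candidate No. 1.

```python
from typing import Sequence


def composite_score(annotation_sim: float,
                    example_sims: Sequence[float],
                    alpha: float = 0.4) -> float:
    """alpha * annotation similarity + (1 - alpha) * best example similarity."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if not example_sims:
        # The handling of senses without example sentences is an assumption.
        raise ValueError("at least one example-sentence similarity is required")
    return alpha * annotation_sim + (1.0 - alpha) * max(example_sims)


# Similarities for candidate Chinese word sense No. 1, taken from the embodiment.
score_1 = composite_score(0.3879761, [0.33322173, 0.3466332, 0.34800234])
print(score_1)
```

Keeping `alpha` as a parameter makes it easy to re-run the comparison with a different balance between annotation and example-sentence evidence.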
Likewise, the comprehensive similarity between the word sense to be mapped and each of the other candidate Chinese word senses in Table 4 can be obtained, as shown in Table 6.
Table 6
Step 105: select the target word sense according to the word sense similarity.
The candidate Chinese word sense with the largest comprehensive similarity is selected as the target word sense of
the word sense to be mapped, specifically:
Denote the word sense to be mapped in the English knowledge base as bs and a candidate Chinese word sense as ds;
the target word sense ts to which bs is mapped is given by formula (6):
$$ts = \arg\max_{ds_i \in dss} score(bs, ds_i) \tag{6}$$
where dss is the set of candidate Chinese word senses of bs, $ds_i$ is the i-th candidate Chinese word sense in dss,
and $score(bs, ds_i)$ is computed by formula (5).
In this example, as Table 6 shows, candidate Chinese word sense No. 2 has the highest comprehensive similarity
score, so it is taken as the target word sense, that is, the mapping result for the word sense to be mapped.
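The selection step can be sketched as an arg-max over the candidates. The sketch below recomputes the comprehensive similarity of each candidate from the annotation and example-sentence similarities reported in the embodiment (with the weight 0.4) and picks the candidate with the largest score. The names `candidates`, `composite_score`, and `target` are illustrative assumptions, not part of the described method.

```python
from typing import Dict, List, Tuple

ALPHA = 0.4  # weight of the annotation similarity, as set in the embodiment

# Candidate number -> (annotation similarity, example-sentence similarities),
# copied from the values reported in the embodiment.
candidates: Dict[int, Tuple[float, List[float]]] = {
    1: (0.3879761, [0.33322173, 0.3466332, 0.34800234]),
    2: (0.4196734, [0.7905501, 0.40629613]),
    3: (0.3625376, [0.5284378, 0.5624604, 0.5684977]),
    4: (0.41536587, [0.35761255, 0.3466332]),
}


def composite_score(annotation_sim: float, example_sims: List[float],
                    alpha: float = ALPHA) -> float:
    """Weighted sum of annotation similarity and best example similarity."""
    return alpha * annotation_sim + (1.0 - alpha) * max(example_sims)


# Score every candidate, then select the one with the largest score.
scores = {num: composite_score(ann, exs) for num, (ann, exs) in candidates.items()}
target = max(scores, key=lambda num: scores[num])
print(target, scores[target])
```

With these inputs the arg-max falls on candidate No. 2, in agreement with the result stated in the text.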
The above steps complete the mapping of the word sense to be mapped.
Correspondingly, the embodiments of the present invention also provide an English-Chinese word sense mapping device
based on word vectors, whose structure is shown schematically in Fig. 2.
In this embodiment, the device includes:
A candidate word sense query unit 201, for extracting the synonym set of the word sense to be mapped from the
English knowledge base and then querying, in the English-Chinese dictionary, the candidate Chinese word senses of
each synonym in the synonym set;
An annotation and example sentence extraction unit 202, for extracting the English annotation and example
sentences of the word sense to be mapped from the English knowledge base, and for querying, in the English-Chinese
dictionary, the English annotation and example sentences of each candidate Chinese word sense obtained by the
candidate word sense query unit;
A sentence vector generation unit 203, for training word vectors on a large-scale English corpus and then
generating a sentence vector for each English annotation and example sentence obtained by the annotation and
example sentence extraction unit;
A word sense similarity computing unit 204, for computing the similarity between the sentence vectors of the
English annotation and example sentences of the word sense to be mapped and the sentence vectors of the English
annotation and example sentences of each candidate Chinese word sense, as obtained by the sentence vector
generation unit, and then computing the comprehensive similarity between the word sense to be mapped and each
candidate Chinese word sense;
A target word sense selection unit 205, for selecting the candidate Chinese word sense with the largest
comprehensive similarity as the target word sense of the word sense to be mapped.
The structure of the candidate word sense query unit 201 of the device shown in Fig. 2 is shown schematically in Fig. 3 and comprises:
A synonym set extraction unit 301, for extracting the synonym set of the word sense to be mapped;
A candidate Chinese word sense query unit 302, for querying the candidate Chinese word senses of each synonym in the synonym set.
The structure of the annotation and example sentence extraction unit 202 of the device shown in Fig. 2 is shown schematically in Fig. 4 and comprises:
A to-be-mapped word sense information extraction unit 401, for extracting the English annotation and example sentences of the word sense to be mapped;
A candidate word sense information extraction unit 402, for extracting the English annotation and example sentences
of each candidate Chinese word sense obtained by the candidate Chinese word sense query unit.
The structure of the sentence vector generation unit 203 of the device shown in Fig. 2 is shown schematically in Fig. 5 and comprises:
A word vector training unit 501, for training word vectors on a large-scale English corpus;
A word sense information preprocessing unit 502, for preprocessing the English annotations and example sentences
obtained by the annotation and example sentence extraction unit, including lemmatization and content word extraction;
A sentence vector generation unit 503, for generating, from the word vectors obtained by the word vector training
unit, a sentence vector for each English annotation and example sentence produced by the word sense information
preprocessing unit.
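The role of units 502 and 503 can be illustrated with a small sketch. It assumes that a sentence vector is formed by summing the word vectors of the content words that survive preprocessing and scaling the sum to unit length; the summation is an assumption made for this example (the sentence vector formula is defined elsewhere in the specification), while the scaling follows the unit vectors used in the embodiment. The toy 3-dimensional `WORD_VECTORS` table and the function name `sentence_vector` are hypothetical stand-ins for vectors trained on a large corpus.

```python
import math
from typing import Dict, List, Sequence

# Toy word vectors standing in for vectors trained on a large English corpus.
WORD_VECTORS: Dict[str, List[float]] = {
    "financial": [0.9, 0.1, 0.0],
    "institution": [0.7, 0.3, 0.1],
    "river": [0.0, 0.2, 0.9],
}


def sentence_vector(content_words: Sequence[str],
                    word_vectors: Dict[str, List[float]]) -> List[float]:
    """Sum the vectors of the known content words and scale to unit length."""
    known = [word_vectors[w] for w in content_words if w in word_vectors]
    if not known:
        raise ValueError("no content word has a trained vector")
    # zip(*known) groups the vectors component by component.
    summed = [sum(component) for component in zip(*known)]
    norm = math.sqrt(sum(x * x for x in summed))
    if norm == 0.0:
        return summed  # an all-zero sum cannot be normalized
    return [x / norm for x in summed]


# Content words after lemmatization; "deposit" has no vector and is skipped.
vec = sentence_vector(["financial", "institution", "deposit"], WORD_VECTORS)
print(vec)
```

Because the result is scaled to unit length, summing and averaging the word vectors give the same sentence vector.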
The structure of the word sense similarity computing unit 204 of the device shown in Fig. 2 is shown schematically in Fig. 6 and comprises:
A sentence vector similarity computing unit 601, for computing the similarity between the sentence vectors of the
English annotation and example sentences of the word sense to be mapped and the sentence vectors of the English
annotation and example sentences of each candidate Chinese word sense, as obtained by the sentence vector generation unit;
A comprehensive similarity computing unit 602, for computing the comprehensive similarity between the word sense to
be mapped and each candidate Chinese word sense from the annotation and example sentence similarities obtained by
the sentence vector similarity computing unit.
The English-Chinese word sense mapping device based on word vectors shown in Figs. 2 to 6 can be integrated into
various hardware devices, for example PCs, smartphones, and workstations.
The English-Chinese word sense mapping method based on word vectors proposed by the embodiments of the present
invention can be stored on various storage media in the form of instructions or instruction sets. These storage
media include, but are not limited to, optical discs, hard disks, memory, and USB flash drives.
In summary, in the embodiments of the present invention, the synonym set of the word sense to be mapped is extracted
from the English knowledge base, and the candidate Chinese word senses of each synonym in the synonym set are
queried in the English-Chinese dictionary; the English annotation and example sentences of the word sense to be
mapped are extracted from the English knowledge base, and the English annotation and example sentences of each
candidate Chinese word sense are queried in the English-Chinese dictionary; word vectors are trained on a
large-scale English corpus, and a sentence vector is generated for each English annotation and example sentence;
the similarity between the sentence vectors of the English annotation and example sentences of the word sense to be
mapped and those of each candidate Chinese word sense is computed, followed by the comprehensive similarity between
the word sense to be mapped and each candidate Chinese word sense; and the candidate Chinese word sense with the
largest comprehensive similarity is selected as the target word sense of the word sense to be mapped. Applying the
embodiments of the present invention thus achieves English-Chinese word sense mapping based on word vectors.
The embodiments perform word sense mapping with the word vector technique from deep learning, which takes into
account the semantic relations between the words of a sentence. To suit the characteristics of English sentences,
the invention extracts content words, which removes the interference of function words in the sentence. The
proposed sentence similarity computation makes use of both the annotation and the example sentence information of
the word sense to be mapped and of the candidate Chinese word senses. The proposed method and device map the word
senses of a knowledge base fully automatically and with relatively high accuracy, which avoids the tedious manual
labor of traditional manual mapping. By generating sentence vectors from word vectors, they select the target word
sense more accurately and avoid the low accuracy of conventional mapping methods based on machine translation.
The embodiments in this specification are described progressively; for parts that are the same or similar across
embodiments, the embodiments may be referred to one another. Because the device embodiment is substantially similar
to the method embodiment, it is described more briefly; for the relevant details, refer to the corresponding parts
of the method embodiment.
The embodiments of the present invention have been described in detail above, and specific embodiments have been
used herein to illustrate the invention. The description of the above embodiments is intended only to help in
understanding the method and device of the present invention. Persons of ordinary skill in the art may, following
the idea of the present invention, make changes to the specific embodiments and the scope of application.
Therefore, the content of this specification should not be construed as limiting the present invention.