CN107894982A - Method for constructing a Khmer dependency treebank based on a Khmer-Chinese word-aligned corpus
- Publication number
- CN107894982A (application CN201711005546.6A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/40—Processing or translation of natural language
- G06F40/42—Data-driven translation
- G06F40/45—Example-based machine translation; Alignment
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
- G06F40/242—Dictionaries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/40—Processing or translation of natural language
- G06F40/42—Data-driven translation
- G06F40/49—Data-driven translation using very large corpora, e.g. the web
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Document Processing Apparatus (AREA)
- Machine Translation (AREA)
Abstract
The present invention relates to a method for constructing a Khmer dependency treebank based on a Khmer-Chinese word-aligned corpus, and belongs to the field of natural language processing. The invention first builds a Khmer-Chinese word-aligned parallel corpus: word alignment is initially performed with GIZA++, and because GIZA++ suffers from data sparsity, fuzzy matching against a bilingual dictionary and word-vector similarity comparison are then applied to improve word alignment accuracy. Once the Khmer-Chinese word-aligned corpus is complete, a Chinese dependency treebank is built. The word-aligned corpus and the Chinese dependency treebank are then combined to build a Khmer dependency treebank, which is manually adjusted to obtain the final Khmer dependency treebank. The treebank construction method of the invention simplifies the manual annotation of dependency relations in Khmer sentences and saves a great deal of time, and building the bilingual word-aligned corpus with bilingual dictionary matching and word-vector similarity effectively improves the accuracy of the resulting dependency treebank.
Description
Technical field
The present invention relates to a method for constructing a Khmer dependency treebank based on a Khmer-Chinese word-aligned corpus, and belongs to the technical field of natural language processing.
Background art
Constructing a Khmer dependency treebank is an important step in Khmer-Chinese translation work and is also vital to research on Khmer. Political and economic exchanges between China and Southeast Asia are increasingly frequent, and Cambodia, an important country in the region, has close relations with China, so research on Khmer matters greatly for exchanges between the two countries. Khmer syntactic analysis and the construction of a Khmer dependency treebank occupy an important place in that research. A good Khmer dependency annotation scheme and a good Khmer dependency treebank can substantially improve Khmer-Chinese translation as well as higher-level Khmer applications such as lexical analysis, syntactic analysis, semantic analysis and machine translation.
Summary of the invention
The invention provides a method for constructing a Khmer dependency treebank based on a Khmer-Chinese word-aligned corpus, to address problems such as the incompleteness of existing Khmer dependency treebanks and the difficulty of analysing dependency relations in Khmer sentences.
The technical scheme of the invention is a method for constructing a Khmer dependency treebank based on a Khmer-Chinese word-aligned corpus, comprising the following specific steps:
Step1, build the Khmer-Chinese word-aligned parallel corpus;
Step1.1, collect Khmer-Chinese parallel sentence pairs;
Step1.2, perform word alignment training on the Khmer-Chinese parallel sentence pairs using GIZA++;
Step1.3, apply dictionary fuzzy matching to the sparse data using a bilingual dictionary;
Step1.4, for Khmer words that still cannot be aligned after dictionary fuzzy matching, apply word-vector similarity comparison to improve word alignment accuracy; here, word-vector similarity comparison means comparing the word vector of an unaligned Chinese word in the original sentence pair with the word vectors of the Chinese translations of an unaligned Khmer word in the same sentence pair;
Step2, build the Chinese dependency treebank;
Step2.1, perform Chinese word segmentation on the Chinese sentences of the Khmer-Chinese word-aligned parallel sentence pair corpus;
Step2.2, perform part-of-speech tagging on the segmented Chinese corpus;
Step2.3, build the Chinese dependency treebank from the part-of-speech-tagged Chinese corpus using the LTP language processing platform, obtaining the Chinese dependency relations at the same time;
Step3, combine the Khmer-Chinese word-aligned parallel corpus with the Chinese dependency treebank to build the Khmer dependency treebank;
Step3.1, map the Chinese dependency relations onto the Khmer sentences through the Khmer-Chinese word-aligned parallel sentence pair corpus, thereby obtaining a Khmer dependency treebank;
Step3.2, build the dependency relations of the Khmer sentences from the Khmer dependency treebank, adjust them according to the changes in left and right adjunct relations within those dependency relations, and then apply manual correction to obtain the final Khmer dependency treebank.
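The mapping in Step3.1 can be illustrated with a minimal sketch. It assumes, for illustration only, that each Chinese token is aligned to at most one Khmer token and that arcs are stored as (head index, dependent index, label) triples with -1 as the head of the root; the function name `project_dependencies` and this data layout are hypothetical and are not taken from the patent.

```python
def project_dependencies(zh_arcs, alignment):
    """Project Chinese dependency arcs onto Khmer token positions.

    zh_arcs:   list of (head_idx, dep_idx, label) over Chinese tokens,
               with head_idx == -1 for the root (assumed layout).
    alignment: dict mapping a Chinese token index to a Khmer token index
               (one-to-one alignment is an assumption of this sketch).
    Arcs with an unaligned head or dependent are skipped and left for
    the later adjustment and manual correction steps.
    """
    km_arcs = []
    for head, dep, label in zh_arcs:
        if dep not in alignment:
            continue  # dependent has no Khmer counterpart
        if head == -1:
            km_arcs.append((-1, alignment[dep], label))
        elif head in alignment:
            km_arcs.append((alignment[head], alignment[dep], label))
    return km_arcs


# Toy example: three Chinese tokens, the third aligned to Khmer position 3.
zh_arcs = [(-1, 1, "ROOT"), (1, 0, "SBV"), (1, 2, "VOB")]
print(project_dependencies(zh_arcs, {0: 0, 1: 1, 2: 3}))
```

Because Khmer token positions replace Chinese ones, the direction of an arc (dependent before or after its head) can change after projection, which is what Step3.2 then has to repair.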
The specific steps of the dictionary fuzzy matching applied to the sparse data with a bilingual dictionary in Step1.3 are as follows:
Step1.3.1, find the sparse data remaining after word alignment, i.e. every Chinese word that has no alignment relation with any Khmer word;
Step1.3.2, perform dictionary fuzzy matching word alignment using a Khmer-Chinese bilingual dictionary: within the set of translations of a Khmer word, find the translation with the greatest similarity to the unaligned Chinese word in the original sentence pair, where the similarity is expressed as:
$$\mathrm{Sim}(c_1, c_2) = \frac{2 \cdot |c_1 \cap c_2|}{|c_1| + |c_2|}$$
In this formula, $c_1$ and $c_2$ denote the Chinese word in the original sentence pair and the Chinese word in the dictionary translation, respectively; $|c_1 \cap c_2|$ is the number of characters shared by $c_1$ and $c_2$; $|c_1|$ and $|c_2|$ are the numbers of characters in $c_1$ and $c_2$; and $\mathrm{Sim}(c_1, c_2)$ is the fuzzy matching similarity of the Chinese words $c_1$ and $c_2$. The matching similarity between a Khmer word $k$ and a Chinese word $c$ in the original sentence pair can thus be defined as:
$$\mathrm{Sim}(k, c) = \max_{d \in DT_k} \mathrm{Sim}(d, c)$$
where $d \in DT_k$, $DT_k$ is the set of all Chinese translations of the Khmer word $k$, $\mathrm{Sim}(d, c)$ is the similarity between each Chinese translation of $k$ and the Chinese word $c$, $\max$ takes the maximum, and $\mathrm{Sim}(k, c)$ is the matching similarity between $k$ and $c$. To obtain the Khmer words whose matching similarity meets the alignment condition, a threshold $\theta$ is set, and
$$\mathrm{align}(k, c) = \begin{cases} 1, & \mathrm{Sim}(k, c) \ge \theta \\ 0, & \mathrm{Sim}(k, c) < \theta \end{cases}$$
The left-hand side of this formula is the alignment function of the Khmer word $k$ and the Chinese word $c$, which takes the value 1 or 0: 1 means that $k$ is semantically similar to the Chinese word $c$ in the original sentence pair and the two can be aligned; 0 means that $k$ is semantically unrelated to $c$ and the two cannot be aligned.
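The dictionary fuzzy matching of Step1.3.2 can be sketched as follows. This is a minimal illustration, not the patent's implementation: counting shared characters as a multiset intersection, the function names, and the example threshold of 0.5 are all assumptions, since the text does not specify how repeated characters are counted or what value θ takes.

```python
from collections import Counter


def char_overlap_sim(c1: str, c2: str) -> float:
    """Sim(c1, c2) = 2 * |c1 ∩ c2| / (|c1| + |c2|), counted over characters."""
    if not c1 or not c2:
        return 0.0
    # Multiset intersection of characters (an assumption of this sketch).
    shared = sum((Counter(c1) & Counter(c2)).values())
    return 2.0 * shared / (len(c1) + len(c2))


def dict_sim(translations, c: str) -> float:
    """Sim(k, c): best score between c and any dictionary translation d of k."""
    return max((char_overlap_sim(d, c) for d in translations), default=0.0)


def align(translations, c: str, theta: float = 0.5) -> int:
    """align(k, c): 1 if Sim(k, c) >= theta, otherwise 0 (theta is illustrative)."""
    return 1 if dict_sim(translations, c) >= theta else 0


# A Khmer word is represented here only by its Chinese dictionary translations.
print(align(["味", "滋味"], "味道"))  # one shared character in each translation
print(align(["坚果"], "味道"))        # no shared characters
```

Taking the maximum over all translations means a single close dictionary entry is enough to align the pair, which suits dictionaries that list several near-synonymous translations.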
The specific steps of Step1.4 are as follows:
Step1.4.1, train word2vec on Chinese data to obtain word vectors for Chinese words;
Step1.4.2, after training, compute the similarity between the word vector $w_1$ of an unaligned Chinese word in the original sentence pair and the word vector $w_2$ of a Chinese translation of an unaligned Khmer word in the same sentence pair; the similarity of the two word vectors $w_1, w_2$ is expressed as:
$$\mathrm{Sim}(w_1, w_2) = \frac{w_1 \cdot w_2}{\|w_1\| \cdot \|w_2\|} = \frac{\sum_{i=1}^{n} (w_{1i} \cdot w_{2i})}{\sqrt{\sum_{i=1}^{n} w_{1i}^{2}} \cdot \sqrt{\sum_{i=1}^{n} w_{2i}^{2}}}$$
where the word vectors $w_1, w_2$ are multi-dimensional vectors with $n$ dimensions, and the index $i$ in $w_{1i}, w_{2i}$ is the vector dimension, with $i \in \{1, 2, \ldots, n\}$. The matching similarity between an unaligned Khmer word $k$ and an unaligned Chinese word $c$ in the original sentence pair is:
$$\mathrm{Sim}(k, c) = \max \mathrm{Sim}(w_1, w_2)$$
where $w_1$ is the word vector of the Chinese word $c$ and $w_2$ is the word vector of a Chinese translation of the Khmer word $k$. The maximum is taken over all Chinese translations of the unaligned Khmer word $k$, i.e. it finds the translation that is semantically closest to the unaligned Chinese word $c$ in the original sentence pair; this maximum similarity is $\mathrm{Sim}(k, c)$, the matching similarity between $k$ and $c$;
To obtain the pairs of word vectors whose similarity meets the alignment condition, a threshold $\alpha$ is set:
$$\mathrm{align}(k, c) = \begin{cases} 1, & \mathrm{Sim}(k, c) \ge \alpha \\ 0, & \mathrm{Sim}(k, c) < \alpha \end{cases}$$
The left-hand side of this formula is the alignment function of the Khmer word $k$ and the Chinese word $c$, which takes the value 1 or 0: 1 means that $k$ is semantically similar to the Chinese word $c$ in the original sentence pair and the two can be aligned; 0 means that $k$ is semantically unrelated to $c$ and the two cannot be aligned;
If the matching similarities between one unaligned Chinese word in the original sentence pair and several unaligned Khmer words in that sentence pair meet the threshold condition simultaneously, i.e.
$$\begin{cases} \mathrm{Sim}(k_1, c_1) \ge \alpha \\ \mathrm{Sim}(k_2, c_1) \ge \alpha \\ \quad\vdots \\ \mathrm{Sim}(k_n, c_1) \ge \alpha \end{cases}$$
then the Chinese word $c_1$ is aligned with each of the Khmer words $k_1, k_2, \ldots, k_n$.
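The word-vector comparison of Step1.4.2 can be sketched in plain Python. The vectors below are made-up toy values rather than word2vec output, and the function names and the threshold value 0.8 are assumptions chosen for illustration; the patent does not state a value for α.

```python
import math


def cosine(w1, w2):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(w1, w2))
    n1 = math.sqrt(sum(a * a for a in w1))
    n2 = math.sqrt(sum(b * b for b in w2))
    return dot / (n1 * n2) if n1 and n2 else 0.0


def vector_sim(c_vec, translation_vecs):
    """Sim(k, c): best cosine between c and any Chinese translation of k."""
    return max((cosine(c_vec, v) for v in translation_vecs), default=0.0)


def align_one_to_many(c_vec, candidates, alpha=0.8):
    """Return every unaligned Khmer word whose Sim(k, c) reaches alpha.

    candidates maps a Khmer word to the vectors of its Chinese translations
    (a hypothetical layout for this sketch).
    """
    return [k for k, vecs in candidates.items() if vector_sim(c_vec, vecs) >= alpha]


c_vec = [1.0, 0.0]
candidates = {
    "k1": [[0.9, 0.1]],
    "k2": [[0.0, 1.0]],
    "k3": [[1.0, 0.2], [0.0, 1.0]],
}
print(align_one_to_many(c_vec, candidates))
```

Returning a list rather than a single best match mirrors the one-to-many case in the text, where one Chinese word is aligned with every Khmer word that clears the threshold.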
The beneficial effects of the invention are as follows: on top of GIZA++, the invention innovatively introduces dictionary fuzzy matching and word-vector similarity matching, and combines these methods to build a high-accuracy Khmer-Chinese bilingual parallel word-aligned corpus. The proposed treebank construction method simplifies the manual annotation of dependency relations in Khmer sentences and saves a great deal of time. Ultimately, the accuracy of the constructed Khmer dependency treebank is effectively improved.
Brief description of the drawings
Fig. 1 is the overall flow chart for building the Khmer dependency treebank in the present invention;
Fig. 2 is a schematic diagram of the Chinese dependency relations in the present invention;
Fig. 3 is a schematic diagram of the Khmer dependency relation construction process of the present invention.
Embodiment
Embodiment 1: As shown in Figs. 1 to 3, a method for constructing a Khmer dependency treebank based on a Khmer-Chinese word-aligned corpus. The method follows Step1 to Step3 exactly as set out above under the technical scheme of the invention, including the sub-steps Step1.3.1 to Step1.3.2 and Step1.4.1 to Step1.4.2 with their similarity formulas and the thresholds θ and α. The Chinese dependency relations obtained in Step2.3 are shown in Fig. 2.
In Step3.2, the ways in which Khmer sentence dependency relations change can be summarised from the grammatical differences between Khmer and Chinese. Khmer and Chinese differ to some extent in word order, so a left (right) adjunct relation in a Chinese sentence may, once mapped onto the Khmer sentence, no longer be a left (right) adjunct dependency but become a right (left) adjunct dependency. The Chinese adjunct relation can then no longer be carried over directly; instead, the dependency between the words in the Khmer sentence must be corrected to the proper adjunct relation according to the adjustment rules. Finally, manual correction is applied to obtain the final Khmer dependency treebank. The Khmer sentence dependency adjustment algorithm can be:
Input: a Khmer sentence annotated with dependency relations obtained by mapping
Output: the Khmer sentence revised according to the dependency adjustment rules
If the input sentence contains a RAD (LAD) dependency relation Then
    Do adjust the left and right adjunct relations of the sentence
Else
    Do not adjust
Endif
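One possible reading of this adjustment algorithm is sketched below. The patent does not spell out its adjustment rules, so the rule used here is an assumption: an adjunct arc is relabelled RAD when the dependent follows its head in the Khmer sentence and LAD when it precedes it. The function name and the (head index, dependent index, label) layout are likewise hypothetical.

```python
ADJUNCT_LABELS = {"LAD", "RAD"}


def adjust_adjuncts(km_arcs):
    """Relabel adjunct arcs according to Khmer word order.

    km_arcs: list of (head_idx, dep_idx, label) over Khmer token positions.
    Non-adjunct arcs are returned unchanged. The positional rule below is
    an assumed adjustment criterion, not one quoted from the patent.
    """
    adjusted = []
    for head, dep, label in km_arcs:
        if label in ADJUNCT_LABELS:
            # Dependent to the right of its head: right adjunct; otherwise left.
            label = "RAD" if dep > head else "LAD"
        adjusted.append((head, dep, label))
    return adjusted


# A RAD arc whose dependent now precedes its head becomes LAD.
print(adjust_adjuncts([(3, 2, "RAD"), (1, 0, "ATT"), (0, 1, "LAD")]))
```

The output of such a pass would still go through the manual correction described in Step3.2.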
As shown in Fig. 3, in the sentence "this nut is hard, and without any taste", "hard" is the core of the whole sentence and is marked "ROOT"; "nut" and "this" are linked by the attributive relation, marked "ATT"; "hard" and "nut" are linked by the subject-verb relation, marked "SBV"; "hard" and "not having" are linked by the coordinate relation, marked "COO"; "not having" and "and" are linked by the adverbial relation, marked "ADV"; "not having" and "taste" are linked by the verb-object relation, marked "VOB". In the Chinese dependency relations, "any" and "" are linked by the right adjunct relation, marked "RAD", but the word order of the Khmer sentence obtained through the bilingual word alignment mapping has changed. For example, the order of "(taste)" and "()" changes in Khmer, so the Chinese right adjunct relation (RAD) can no longer be carried over; instead, according to the specified modification rules, the dependency between "taste" and "" in Khmer is corrected to the left adjunct relation (LAD). The dependency syntactic relations are listed in Table 1:
Table 1: Dependency syntactic relations
In the present invention, Khmer-Chinese parallel sentence pairs were collected from the Internet, and after the three alignment stages described above about 10,000 Khmer-Chinese parallel sentence pairs were obtained, forming the corresponding Chinese-Khmer parallel sentence pair corpus. The dependency relations of the Chinese sentences were analysed with the natural language processing cloud platform of Harbin Institute of Technology. To make better use of this tool, its annotation label set was modified in the present invention to reflect the characteristics of Khmer, and, based on the Chinese-Khmer alignment relations, a Khmer dependency corpus of 10,000 sentences was generated.
The invention innovatively introduces two methods, dictionary fuzzy matching and word-vector similarity matching, to improve the construction of the Khmer-Chinese word-aligned parallel corpus. The bilingual sentences are first word-aligned with the conventional GIZA++ alignment method; because this method suffers from data sparsity, the resulting word-aligned parallel sentence pairs are not fully correct, so dictionary fuzzy matching is applied as a further correction. Since the dictionary may lack a word identical to the translation in the original sentence pair, word-vector similarity matching is applied as the final refinement, yielding an accurate Khmer-Chinese word-aligned parallel corpus. Compared with the prior art, because the bilingual word-aligned parallel corpus is improved before it is used, the accuracy of mapping the Chinese dependency relations onto the Khmer sentences through the Khmer-Chinese word-aligned parallel sentence pair corpus is effectively improved, and the accuracy of the Khmer dependency treebank obtained through the mapping improves accordingly.
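The three-stage fallback described in this paragraph can be sketched as a small cascade. Everything named here is hypothetical: `giza_links` stands in for links already produced by GIZA++, `dict_score` and `vec_score` stand in for the dictionary and word-vector similarity functions, and the thresholds 0.5 and 0.8 are placeholder values for θ and α.

```python
def cascade_align(c, khmer_words, giza_links, dict_score, vec_score,
                  theta=0.5, alpha=0.8):
    """Align Chinese word c, falling back stage by stage.

    Returns (list of aligned Khmer words, name of the stage that produced them).
    """
    # Stage 1: keep any link already produced by the statistical aligner.
    if c in giza_links:
        return giza_links[c], "giza"
    # Stage 2: dictionary fuzzy matching for words left unaligned.
    hits = [k for k in khmer_words if dict_score(k, c) >= theta]
    if hits:
        return hits, "dictionary"
    # Stage 3: word-vector similarity for words still unaligned.
    hits = [k for k in khmer_words if vec_score(k, c) >= alpha]
    if hits:
        return hits, "vector"
    return [], "unaligned"


# Stub scoring functions with made-up values, for demonstration only.
giza = {"硬": ["k_hard"]}
dict_score = lambda k, c: 0.7 if (k, c) == ("k_taste", "味道") else 0.0
vec_score = lambda k, c: 0.9 if (k, c) == ("k_nut", "坚果") else 0.1
words = ["k_hard", "k_taste", "k_nut"]
for zh in ["硬", "味道", "坚果", "而且"]:
    print(zh, cascade_align(zh, words, giza, dict_score, vec_score))
```

Recording which stage produced each link is a design choice of this sketch; it makes it easy to direct manual checking at the links that came from the weaker fallback stages.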
The specific embodiments of the invention have been explained in detail above with reference to the accompanying drawings, but the invention is not limited to these embodiments; within the knowledge possessed by a person of ordinary skill in the art, various changes may also be made without departing from the concept of the invention.
Claims (3)
- 1. A method for constructing a Khmer dependency treebank based on a Khmer-Chinese word-aligned corpus, characterised in that the method comprises the following specific steps:
Step1, build the Khmer-Chinese word-aligned parallel corpus;
Step1.1, collect Khmer-Chinese parallel sentence pairs;
Step1.2, perform word alignment training on the Khmer-Chinese parallel sentence pairs using GIZA++;
Step1.3, apply dictionary fuzzy matching to the sparse data using a bilingual dictionary;
Step1.4, for Khmer words that still cannot be aligned after dictionary fuzzy matching, apply word-vector similarity comparison to improve word alignment accuracy; here, word-vector similarity comparison means comparing the word vector of an unaligned Chinese word in the original sentence pair with the word vectors of the Chinese translations of an unaligned Khmer word in the same sentence pair;
Step2, build the Chinese dependency treebank;
Step2.1, perform Chinese word segmentation on the Chinese sentences of the Khmer-Chinese word-aligned parallel sentence pair corpus;
Step2.2, perform part-of-speech tagging on the segmented Chinese corpus;
Step2.3, build the Chinese dependency treebank from the part-of-speech-tagged Chinese corpus using the LTP language processing platform, obtaining the Chinese dependency relations at the same time;
Step3, combine the Khmer-Chinese word-aligned parallel corpus with the Chinese dependency treebank to build the Khmer dependency treebank;
Step3.1, map the Chinese dependency relations onto the Khmer sentences through the Khmer-Chinese word-aligned parallel sentence pair corpus, thereby obtaining a Khmer dependency treebank;
Step3.2, build the dependency relations of the Khmer sentences from the Khmer dependency treebank, adjust them according to the changes in left and right adjunct relations within those dependency relations, and then apply manual correction to obtain the final Khmer dependency treebank.
- 2. The method for constructing a Khmer dependency treebank based on a Khmer-Chinese word-aligned corpus according to claim 1, characterised in that the specific steps of the dictionary fuzzy matching applied to the sparse data with a bilingual dictionary in Step1.3 are as follows:
Step1.3.1, find the sparse data remaining after word alignment, i.e. every Chinese word that has no alignment relation with any Khmer word;
Step1.3.2, perform dictionary fuzzy matching word alignment using a Khmer-Chinese bilingual dictionary: within the set of translations of a Khmer word, find the translation with the greatest similarity to the unaligned Chinese word in the original sentence pair, where the similarity is expressed as:
$$\mathrm{Sim}(c_1, c_2) = \frac{2 \cdot |c_1 \cap c_2|}{|c_1| + |c_2|}$$
In this formula, $c_1$ and $c_2$ denote the Chinese word in the original sentence pair and the Chinese word in the dictionary translation, respectively; $|c_1 \cap c_2|$ is the number of characters shared by $c_1$ and $c_2$; $|c_1|$ and $|c_2|$ are the numbers of characters in $c_1$ and $c_2$; and $\mathrm{Sim}(c_1, c_2)$ is the fuzzy matching similarity of the Chinese words $c_1$ and $c_2$. The matching similarity between a Khmer word $k$ and a Chinese word $c$ in the original sentence pair can thus be defined as:
$$\mathrm{Sim}(k, c) = \max_{d \in DT_k} \mathrm{Sim}(d, c)$$
where $d \in DT_k$, $DT_k$ is the set of all Chinese translations of the Khmer word $k$, $\mathrm{Sim}(d, c)$ is the similarity between each Chinese translation of $k$ and the Chinese word $c$, $\max$ takes the maximum, and $\mathrm{Sim}(k, c)$ is the matching similarity between $k$ and $c$. To obtain the Khmer words whose matching similarity meets the alignment condition, a threshold $\theta$ is set, and
$$\mathrm{align}(k, c) = \begin{cases} 1, & \mathrm{Sim}(k, c) \ge \theta \\ 0, & \mathrm{Sim}(k, c) < \theta \end{cases}$$
The left-hand side of this formula is the alignment function of the Khmer word $k$ and the Chinese word $c$, which takes the value 1 or 0: 1 means that $k$ is semantically similar to the Chinese word $c$ in the original sentence pair and the two can be aligned; 0 means that $k$ is semantically unrelated to $c$ and the two cannot be aligned.
- 3. The method for constructing a Khmer dependency treebank based on a Khmer-Chinese word-aligned corpus according to claim 1, characterized in that step Step1.4 comprises the following specific steps: Step1.4.1, Chinese data is trained with word2vec to obtain Chinese word vectors; Step1.4.2, after training is complete, similarity is computed between the word vector $w_1$ of a Chinese word that cannot be aligned in the original sentence pair and the word vector $w_2$ of a Chinese translation of a Khmer word that cannot be aligned in the original sentence pair. The similarity of the two word vectors $w_1, w_2$ is expressed as:
$$\mathrm{Sim}(w_1,w_2)=\frac{w_1\cdot w_2}{\|w_1\|\cdot\|w_2\|}=\frac{\sum_{i=1}^{n}(w_{1i}\cdot w_{2i})}{\sqrt{\sum_{i=1}^{n}w_{1i}^{2}}\cdot\sqrt{\sum_{i=1}^{n}w_{2i}^{2}}}$$
where the word vectors $w_1, w_2$ are multidimensional vectors with n dimensions in total, and the index i in $w_{1i}, w_{2i}$ is the vector dimension, $i\in\{1,2,\ldots,n\}$. The matching similarity between a Khmer word k that cannot be aligned and a Chinese word c that cannot be aligned in the original sentence pair is:
$$\mathrm{Sim}(k,c)=\max \mathrm{Sim}(w_1,w_2)$$
where $w_1$ is the word vector of Chinese word c, $w_2$ is the word vector of a Chinese translation of Khmer word k, and $\max \mathrm{Sim}(w_1,w_2)$ takes the maximum: among all Chinese translations of the unaligned Khmer word k, it finds the translation that is semantically most similar to the unaligned Chinese word c in the original sentence pair. That maximum similarity is Sim(k, c), the matching similarity between Khmer word k and Chinese word c. To obtain the word-vector pairs whose similarity satisfies the alignment condition, a threshold α is set:
$$\mathrm{align}(k,c)=\begin{cases}1, & \mathrm{Sim}(k,c)\ge\alpha\\ 0, & \mathrm{Sim}(k,c)<\alpha\end{cases}$$
where the left-hand side is the alignment function of Khmer word k and Chinese word c, which takes the value 1 or 0. A value of 1 indicates that Khmer word k is semantically similar to Chinese word c in the original sentence pair and the two can be matched and aligned; a value of 0 indicates that Khmer word k is semantically unrelated to Chinese word c in the original sentence pair and the two cannot be matched and aligned. If the matching similarities between one unaligned Chinese word in the original sentence pair and several unaligned Khmer words in the original sentence pair satisfy the threshold condition simultaneously, i.e.
$$\begin{cases}\mathrm{Sim}(k_1,c_1)\ge\alpha\\ \mathrm{Sim}(k_2,c_1)\ge\alpha\\ \quad\vdots\\ \mathrm{Sim}(k_n,c_1)\ge\alpha\end{cases}$$
then Chinese word $c_1$ is aligned with each of the Khmer words $k_1,k_2,\ldots,k_n$.
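
The word-vector alignment rule in claim 3 can be illustrated with a minimal Python sketch. It is not the patent's implementation: the function names, the dictionary layout, and the toy vectors are hypothetical, the word2vec training of Step1.4.1 is assumed to have already produced the vectors, and the claim does not state a value for the threshold α, so the caller must supply one. Returning 0 for a zero-length vector is also an assumption made here to avoid division by zero.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(w1: Vector, w2: Vector) -> float:
    """Sim(w1, w2) = (w1 . w2) / (||w1|| * ||w2||) for two n-dimensional vectors."""
    if len(w1) != len(w2):
        raise ValueError("both vectors must have the same dimension n")
    dot = sum(a * b for a, b in zip(w1, w2))
    norm1 = math.sqrt(sum(a * a for a in w1))
    norm2 = math.sqrt(sum(b * b for b in w2))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0  # assumption: a zero vector is treated as dissimilar
    return dot / (norm1 * norm2)


def matching_similarity(c_vec: Vector, translation_vecs: List[Vector]) -> float:
    """Sim(k, c): maximum cosine similarity between the vector of Chinese word c
    and the vectors of all Chinese translations of Khmer word k."""
    return max((cosine_similarity(c_vec, t) for t in translation_vecs), default=0.0)


def align(sim_kc: float, alpha: float) -> int:
    """align(k, c): 1 if Sim(k, c) >= alpha, otherwise 0."""
    return 1 if sim_kc >= alpha else 0


def align_unmatched(
    chinese_vecs: Dict[str, Vector],
    khmer_translation_vecs: Dict[str, List[Vector]],
    alpha: float,
) -> List[Tuple[str, str]]:
    """Return every (Khmer word, Chinese word) pair that passes the threshold.
    One Chinese word may be paired with several Khmer words, as in the claim."""
    pairs = []
    for c, c_vec in chinese_vecs.items():
        for k, t_vecs in khmer_translation_vecs.items():
            if align(matching_similarity(c_vec, t_vecs), alpha) == 1:
                pairs.append((k, c))
    return pairs


if __name__ == "__main__":
    # Toy 2-dimensional vectors; real word2vec vectors have many more dimensions.
    chinese = {"c1": [1.0, 0.0]}
    khmer = {
        "k1": [[1.0, 0.0], [0.0, 1.0]],  # two candidate Chinese translations
        "k2": [[0.6, 0.8]],
        "k3": [[0.0, 1.0]],
    }
    print(align_unmatched(chinese, khmer, alpha=0.5))
```

In this toy input, `c1` passes the threshold with both `k1` and `k2`, which corresponds to the one-to-many case at the end of the claim, while `k3` is orthogonal to `c1` and is left unaligned.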
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711005546.6A CN107894982A (en) | 2017-10-25 | 2017-10-25 | A kind of method based on the card Chinese word alignment language material structure interdependent treebank of Kampuchean |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711005546.6A CN107894982A (en) | 2017-10-25 | 2017-10-25 | A kind of method based on the card Chinese word alignment language material structure interdependent treebank of Kampuchean |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107894982A true CN107894982A (en) | 2018-04-10 |
Family
ID=61803738
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711005546.6A Pending CN107894982A (en) | 2017-10-25 | 2017-10-25 | A kind of method based on the card Chinese word alignment language material structure interdependent treebank of Kampuchean |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107894982A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109582951A (en) * | 2018-10-19 | 2019-04-05 | 昆明理工大学 | A kind of bilingual term vector model building method of card Chinese based on multiple CCA algorithm |
CN110008467A (en) * | 2019-03-04 | 2019-07-12 | 昆明理工大学 | A kind of interdependent syntactic analysis method of Burmese based on transfer learning |
CN110175585A (en) * | 2019-05-30 | 2019-08-27 | 北京林业大学 | It is a kind of letter answer correct system and method automatically |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104991890A (en) * | 2015-07-15 | 2015-10-21 | 昆明理工大学 | Method for constructing Vietnamese dependency tree bank on basis of Chinese-Vietnamese vocabulary alignment corpora |
CN105446958A (en) * | 2014-07-18 | 2016-03-30 | 富士通株式会社 | Word aligning method and device |
- 2017
  - 2017-10-25 CN CN201711005546.6A patent/CN107894982A/en active Pending
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105446958A (en) * | 2014-07-18 | 2016-03-30 | 富士通株式会社 | Word aligning method and device |
CN104991890A (en) * | 2015-07-15 | 2015-10-21 | 昆明理工大学 | Method for constructing Vietnamese dependency tree bank on basis of Chinese-Vietnamese vocabulary alignment corpora |
Non-Patent Citations (2)
Title |
---|
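Wang Mingwen et al.: "Construction of a Greater China word alignment database based on word2vec", Journal of Chinese Information Processing * |
Deng Dan et al.: "Research on a Chinese-English word alignment algorithm based on a bilingual dictionary", Computer Engineering * |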
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109582951A (en) * | 2018-10-19 | 2019-04-05 | 昆明理工大学 | A kind of bilingual term vector model building method of card Chinese based on multiple CCA algorithm |
CN110008467A (en) * | 2019-03-04 | 2019-07-12 | 昆明理工大学 | A kind of interdependent syntactic analysis method of Burmese based on transfer learning |
CN110175585A (en) * | 2019-05-30 | 2019-08-27 | 北京林业大学 | It is a kind of letter answer correct system and method automatically |
CN110175585B (en) * | 2019-05-30 | 2024-01-23 | 北京林业大学 | Automatic correcting system and method for simple answer questions |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2021008180A1 (en) | Software defect knowledge-oriented knowledge search method | |
Pouget-Abadie et al. | Overcoming the curse of sentence length for neural machine translation using automatic segmentation | |
CN105975625A (en) | Chinglish inquiring correcting method and system oriented to English search engine | |
CN111310480B (en) | Weakly supervised Hanyue bilingual dictionary construction method based on English pivot | |
CN104615724B (en) | The foundation of knowledge base and the information search method and device in knowledge based storehouse | |
CN101676898B (en) | Method and device for translating Chinese organization name into English with the aid of network knowledge | |
CN107894982A (en) | A kind of method based on the card Chinese word alignment language material structure interdependent treebank of Kampuchean | |
CN109284352A (en) | A kind of querying method of the assessment class document random length words and phrases based on inverted index | |
CN102693279B (en) | Method, device and system for fast calculating comment similarity | |
CN107463553A (en) | For the text semantic extraction, expression and modeling method and system of elementary mathematics topic | |
CN104268133B (en) | machine translation method and system | |
CN104268132A (en) | Machine translation method and system | |
CN107329961A (en) | A kind of method of cloud translation memory library Fast incremental formula fuzzy matching | |
KR20120089502A (en) | Method of generating translation knowledge server and apparatus for the same | |
CN104102630A (en) | Method for standardizing Chinese and English hybrid texts in Chinese social networks | |
CN105740218A (en) | Post-editing processing method for mechanical translation | |
CN107402916A (en) | The segmenting method and device of Chinese text | |
CN108287825A (en) | A kind of term identification abstracting method and system | |
CN106445911A (en) | Anaphora resolution method and system based on microscopic topic structure | |
CN106156013A (en) | The two-part machine translation method that a kind of regular collocation type phrase is preferential | |
CN108491399A (en) | Chinese to English machine translation method based on context iterative analysis | |
Dandapat et al. | Using example-based MT to support statistical MT when translating homogeneous data in a resource-poor setting | |
Ferreira et al. | Surface realization shared task 2018 (sr18): The tilburg university approach | |
CN104239292B (en) | A kind of method for obtaining specialized vocabulary translation | |
CN107229613A (en) | A kind of English-Chinese corpus extraction method based on vector space model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20180410 |