CN107894982A - A method for constructing a Khmer dependency treebank based on a Khmer-Chinese word-aligned corpus - Google Patents

Info

Publication number
CN107894982A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201711005546.6A
Other languages
Chinese (zh)
Inventor
严馨
李思远
郭剑毅
周枫
王红斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology
Application filed by Kunming University of Science and Technology
Priority to CN201711005546.6A
Publication of CN107894982A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/42 Data-driven translation
    • G06F40/45 Example-based machine translation; Alignment
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/42 Data-driven translation
    • G06F40/49 Data-driven translation using very large corpora, e.g. the web

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Document Processing Apparatus (AREA)
  • Machine Translation (AREA)

Abstract

The present invention relates to a method for constructing a Khmer dependency treebank based on a Khmer-Chinese word-aligned corpus, and belongs to the field of natural language processing. The invention first builds a Khmer-Chinese word-aligned parallel corpus: word alignment is first performed with GIZA++, but because GIZA++ suffers from data sparsity, bilingual-dictionary fuzzy matching and word-vector similarity comparison are then applied to improve the accuracy of the word alignment. Once the Khmer-Chinese word-aligned corpus is complete, a Chinese dependency treebank corpus is built. The Khmer-Chinese word-aligned corpus and the Chinese dependency treebank corpus are then combined to build a Khmer dependency treebank corpus, which is adjusted manually to obtain the final Khmer dependency treebank corpus. The treebank construction method of the invention simplifies the manual annotation of dependency relations in Khmer sentences and saves a great deal of time, and building the bilingual word-aligned corpus with dictionary matching and word-vector similarity effectively improves the accuracy of the dependency treebank.

Description

A method for constructing a Khmer dependency treebank based on a Khmer-Chinese word-aligned corpus
Technical field
The present invention relates to a method for constructing a Khmer dependency treebank based on a Khmer-Chinese word-aligned corpus, and belongs to the technical field of natural language processing.
Background
Constructing a Khmer dependency treebank is an important step in Khmer-Chinese translation work and also plays a vital role in research on Khmer. Political and economic exchanges between China and Southeast Asia are currently becoming ever more frequent, and Cambodia, as an important country in Southeast Asia, has a close relationship with China, so research on Khmer is very important for exchanges between the two countries. Syntactic analysis of Khmer and construction of a Khmer dependency treebank occupy a major place in Khmer research. A good Khmer dependency annotation scheme and a good Khmer dependency treebank can greatly improve Khmer-Chinese translation work and upper-layer Khmer applications such as lexical analysis, syntactic analysis, semantic analysis and machine translation.
Summary of the invention
The invention provides a method for constructing a Khmer dependency treebank based on a Khmer-Chinese word-aligned corpus, in order to solve problems such as the incompleteness of existing Khmer dependency treebanks and the difficulty of analysing dependency relations in Khmer sentences.
The technical solution of the invention is a method for constructing a Khmer dependency treebank based on a Khmer-Chinese word-aligned corpus, the specific steps of which are as follows:
Step1: build a Khmer-Chinese word-aligned parallel corpus;
Step1.1: collect Khmer-Chinese parallel sentence pairs;
Step1.2: perform word-alignment training on the Khmer-Chinese parallel sentence pairs using GIZA++;
Step1.3: apply dictionary fuzzy matching to the sparse data using a bilingual dictionary;
Step1.4: for Khmer words that still cannot be aligned after dictionary fuzzy matching, apply word-vector similarity comparison to improve word-alignment accuracy; here, word-vector similarity comparison means comparing the word vector of an unaligned Chinese word in the original sentence pair with the word vector of the Chinese translation of an unaligned Khmer word in the original sentence pair;
Step2: build a Chinese dependency treebank corpus;
Step2.1: perform Chinese word segmentation on the Khmer-Chinese word-aligned parallel sentence-pair corpus;
Step2.2: perform part-of-speech tagging on the segmented Chinese corpus;
Step2.3: build the Chinese dependency treebank from the part-of-speech-tagged Chinese corpus using the LTP language processing platform, obtaining the Chinese dependency relations at the same time;
Step3: combine the Khmer-Chinese word-aligned parallel corpus and the Chinese dependency treebank corpus to build a Khmer dependency treebank corpus;
Step3.1: map the Chinese dependency relations onto the Khmer sentences through the Khmer-Chinese word-aligned parallel sentence-pair corpus, thereby obtaining a Khmer dependency treebank;
Step3.2: build the Khmer sentence dependency relations from the Khmer dependency treebank, adjust them according to the changes in the left and right adjunct relations in the Khmer sentence dependency relations, and then apply manual correction to obtain the final Khmer dependency treebank.
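The projection in Step3.1 can be sketched as follows. This is only a minimal illustration under stated assumptions: alignments are taken to be one-to-one and stored as a dictionary from Chinese token index to Khmer token index, and Chinese arcs are stored as (dependent, head, label) triples with head -1 marking the root. The function name `project_dependencies` and this data layout are hypothetical and do not come from the patent.

```python
from typing import Dict, List, Tuple

# (dependent index, head index, relation label); head == -1 marks the root
Arc = Tuple[int, int, str]


def project_dependencies(zh_arcs: List[Arc], zh_to_km: Dict[int, int]) -> List[Arc]:
    """Copy each Chinese arc onto the Khmer tokens aligned to its two ends."""
    km_arcs = []
    for dep, head, label in zh_arcs:
        if dep not in zh_to_km:
            continue  # dependent has no Khmer counterpart; left for manual correction
        if head == -1:
            km_arcs.append((zh_to_km[dep], -1, label))
        elif head in zh_to_km:
            km_arcs.append((zh_to_km[dep], zh_to_km[head], label))
    return sorted(km_arcs)


# Toy input: three Chinese tokens whose first two swap order in Khmer (assumed data).
zh_arcs = [(0, 1, "ATT"), (1, 2, "SBV"), (2, -1, "ROOT")]
zh_to_km = {0: 1, 1: 0, 2: 2}
km_arcs = project_dependencies(zh_arcs, zh_to_km)
print(km_arcs)
```

Arcs whose endpoints have no aligned Khmer token are simply dropped here, which is one possible choice; the patent leaves such cases to the adjustment and manual correction of Step3.2.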
The specific steps of Step1.3, in which dictionary fuzzy matching is applied to the sparse data using a bilingual dictionary, are as follows:
Step1.3.1: find the sparse data remaining after word alignment, i.e. every Chinese word that has no alignment relation with any Khmer word;
Step1.3.2: perform fuzzy-matching word alignment based on a Khmer-Chinese bilingual dictionary: within the translation set of a Khmer word, find the translation that has the greatest similarity to the unaligned Chinese word in the original sentence pair. The similarity is expressed as follows:
$$\mathrm{Sim}(c_1, c_2) = \frac{2 \cdot |c_1 \cap c_2|}{|c_1| + |c_2|}$$
In this formula, $c_1$ and $c_2$ denote the Chinese word in the original sentence pair and the Chinese word in the dictionary translation respectively, $|c_1 \cap c_2|$ is the number of characters that $c_1$ and $c_2$ have in common, $|c_1|$ and $|c_2|$ are the numbers of characters in $c_1$ and $c_2$ respectively, and $\mathrm{Sim}(c_1, c_2)$ is the fuzzy-matching similarity of the Chinese words $c_1$ and $c_2$. The matching similarity between a Khmer word $k$ and a Chinese word $c$ in the original sentence pair can then be defined as:
$$\mathrm{Sim}(k, c) = \max_{d \in DT_k} \mathrm{Sim}(d, c)$$
where $DT_k$ is the set of all Chinese translations of the Khmer word $k$, $\mathrm{Sim}(d, c)$ is the similarity between each Chinese translation $d$ of $k$ and the Chinese word $c$, $\max$ takes the maximum, and $\mathrm{Sim}(k, c)$ is the matching similarity between the Khmer word $k$ and the Chinese word $c$. To obtain the Khmer words whose matching similarity satisfies the alignment condition, a threshold $\theta$ is set, and
$$\mathrm{align}(k, c) = \begin{cases} 1, & \mathrm{Sim}(k, c) \ge \theta \\ 0, & \mathrm{Sim}(k, c) < \theta \end{cases}$$
The left-hand side of this formula is the alignment function of the Khmer word $k$ and the Chinese word $c$, which takes the value 1 or 0. A value of 1 means that the Khmer word $k$ is semantically similar to the Chinese word $c$ in the original sentence pair and the two can be aligned; a value of 0 means that $k$ is semantically unrelated to $c$ and the two cannot be aligned.
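The dictionary fuzzy matching of Step1.3.2 can be sketched in a few lines. This is a minimal illustration, not the patent's implementation: it assumes that the "common characters" of two words are counted as a multiset intersection, and the function names, the sample dictionary entries and the threshold value 0.6 are all hypothetical, since the patent gives no value for θ.

```python
from collections import Counter
from typing import Iterable


def char_overlap(c1: str, c2: str) -> int:
    """Number of characters shared by c1 and c2 (multiset intersection, an assumption)."""
    return sum((Counter(c1) & Counter(c2)).values())


def sim_words(c1: str, c2: str) -> float:
    """Sim(c1, c2) = 2 * |c1 ∩ c2| / (|c1| + |c2|)."""
    if not c1 and not c2:
        return 0.0
    return 2 * char_overlap(c1, c2) / (len(c1) + len(c2))


def sim_dict(translations: Iterable[str], c: str) -> float:
    """Sim(k, c): best score over the Chinese translations DT_k of a Khmer word k."""
    return max((sim_words(d, c) for d in translations), default=0.0)


def align(translations: Iterable[str], c: str, theta: float) -> int:
    """Alignment function: 1 if Sim(k, c) >= theta, otherwise 0."""
    return 1 if sim_dict(translations, c) >= theta else 0


THETA = 0.6  # hypothetical threshold, chosen only for this example
dt_k = ["味道", "滋味"]  # assumed dictionary translations of one Khmer word
print(sim_dict(dt_k, "气味"), align(dt_k, "气味", THETA), align(dt_k, "味道", THETA))
```

In this toy case "气味" shares one character with each translation, so its score is 2·1/(2+2) = 0.5 and it falls below the assumed threshold, whereas the exact match "味道" scores 1.0 and is aligned.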
The specific steps of Step1.4 are as follows:
Step1.4.1: train on Chinese data with word2vec to obtain word vectors for Chinese words;
Step1.4.2: after training is complete, compute the similarity between the word vector $w_1$ of an unaligned Chinese word in the original sentence pair and the word vector $w_2$ of the Chinese translation of an unaligned Khmer word in the original sentence pair. The similarity of the two word vectors $w_1, w_2$ is expressed as follows:
$$\mathrm{Sim}(w_1, w_2) = \frac{w_1 \cdot w_2}{\|w_1\| \cdot \|w_2\|} = \frac{\sum_{i=1}^{n} (w_{1i} \cdot w_{2i})}{\sqrt{\sum_{i=1}^{n} w_{1i}^2} \cdot \sqrt{\sum_{i=1}^{n} w_{2i}^2}}$$
where the word vectors $w_1, w_2$ are multi-dimensional vectors with $n$ dimensions in total, and the $i$ in $w_{1i}, w_{2i}$ is the vector dimension, with $i \in \{1, 2, \ldots, n\}$. The matching similarity between an unaligned Khmer word $k$ and an unaligned Chinese word $c$ in the original sentence pair is as follows:
$$\mathrm{Sim}(k, c) = \max \mathrm{Sim}(w_1, w_2)$$
where $w_1$ is the word vector of the Chinese word $c$ and $w_2$ is the word vector of a Chinese translation of the Khmer word $k$. $\max \mathrm{Sim}(w_1, w_2)$ takes the maximum: among all Chinese translations of the unaligned Khmer word $k$, it finds the translation that is semantically closest to the unaligned Chinese word $c$ in the original sentence pair. This maximum similarity is $\mathrm{Sim}(k, c)$, the matching similarity between the Khmer word $k$ and the Chinese word $c$;
To obtain the pairs of word vectors whose similarity satisfies the alignment condition, a threshold $\alpha$ is set, and
$$\mathrm{align}(k, c) = \begin{cases} 1, & \mathrm{Sim}(k, c) \ge \alpha \\ 0, & \mathrm{Sim}(k, c) < \alpha \end{cases}$$
The left-hand side of this formula is the alignment function of the Khmer word $k$ and the Chinese word $c$, which takes the value 1 or 0. A value of 1 means that the Khmer word $k$ is semantically similar to the Chinese word $c$ in the original sentence pair and the two can be aligned; a value of 0 means that $k$ is semantically unrelated to $c$ and the two cannot be aligned;
If the matching similarities between one unaligned Chinese word $c_1$ in the original sentence pair and several unaligned Khmer words $k_1, k_2, \ldots, k_n$ in the original sentence pair satisfy the threshold condition at the same time, i.e.
$$\mathrm{Sim}(k_i, c_1) \ge \alpha, \quad i = 1, 2, \ldots, n$$
then the Chinese word $c_1$ is aligned with each of the Khmer words $k_1, k_2, \ldots, k_n$.
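The word-vector comparison of Step1.4.2 can be sketched with plain Python lists in place of trained word2vec vectors. This is a minimal sketch under stated assumptions: the two-dimensional vectors, the candidate names `k1` to `k3` and the threshold 0.7 are invented for illustration, and in practice the vectors would come from the word2vec model trained in Step1.4.1.

```python
import math
from typing import Dict, List, Sequence


def cosine(w1: Sequence[float], w2: Sequence[float]) -> float:
    """Sim(w1, w2): dot product divided by the product of the vector norms."""
    dot = sum(a * b for a, b in zip(w1, w2))
    norm = math.sqrt(sum(a * a for a in w1)) * math.sqrt(sum(b * b for b in w2))
    return dot / norm if norm else 0.0


def sim_vec(c_vec: Sequence[float], translation_vecs: List[Sequence[float]]) -> float:
    """Sim(k, c): best cosine over the vectors of the Chinese translations of k."""
    return max((cosine(c_vec, w2) for w2 in translation_vecs), default=0.0)


def aligned_khmer_words(
    c_vec: Sequence[float],
    candidates: Dict[str, List[Sequence[float]]],
    alpha: float,
) -> List[str]:
    """Every unaligned Khmer word whose Sim(k, c) reaches the threshold alpha."""
    return [k for k, vecs in candidates.items() if sim_vec(c_vec, vecs) >= alpha]


ALPHA = 0.7  # hypothetical threshold, chosen only for this example
c_vec = [1.0, 0.0]  # toy vector of one unaligned Chinese word
candidates = {
    "k1": [[1.0, 0.0]],
    "k2": [[0.0, 1.0]],
    "k3": [[3.0, 4.0], [4.0, 3.0]],
}
matches = aligned_khmer_words(c_vec, candidates, ALPHA)
print(matches)
```

Returning a list rather than a single best match mirrors the final rule of Step1.4, under which one Chinese word is aligned with every Khmer word that meets the threshold.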
The beneficial effects of the invention are as follows: on the basis of GIZA++, the invention innovatively introduces dictionary fuzzy matching and word-vector similarity matching and combines these methods to construct a high-accuracy Khmer-Chinese bilingual word-aligned parallel corpus. The proposed treebank construction method simplifies the manual annotation of dependency relations in Khmer sentences and saves a great deal of time. Ultimately, the accuracy of the constructed Khmer dependency treebank is effectively improved.
Brief description of the drawings
Fig. 1 is the overall flow chart for constructing the Khmer dependency treebank in the present invention;
Fig. 2 is a schematic diagram of the Chinese dependency relations in the present invention;
Fig. 3 is a schematic diagram of the Khmer dependency relation construction process in the present invention.
Embodiment
Embodiment 1: As shown in Figs. 1-3, a method for constructing a Khmer dependency treebank based on a Khmer-Chinese word-aligned corpus, the specific steps of which are as follows:
Step1: build a Khmer-Chinese word-aligned parallel corpus;
Step1.1: collect Khmer-Chinese parallel sentence pairs;
Step1.2: perform word-alignment training on the Khmer-Chinese parallel sentence pairs using GIZA++;
Step1.3: apply dictionary fuzzy matching to the sparse data using a bilingual dictionary;
Step1.4: for Khmer words that still cannot be aligned after dictionary fuzzy matching, apply word-vector similarity comparison to improve word-alignment accuracy; here, word-vector similarity comparison means comparing the word vector of an unaligned Chinese word in the original sentence pair with the word vector of the Chinese translation of an unaligned Khmer word in the original sentence pair;
Step2: build a Chinese dependency treebank corpus;
Step2.1: perform Chinese word segmentation on the Khmer-Chinese word-aligned parallel sentence-pair corpus;
Step2.2: perform part-of-speech tagging on the segmented Chinese corpus;
Step2.3: build the Chinese dependency treebank from the part-of-speech-tagged Chinese corpus using the LTP language processing platform, obtaining the Chinese dependency relations at the same time, as shown in Fig. 2;
Step3: combine the Khmer-Chinese word-aligned parallel corpus and the Chinese dependency treebank corpus to build a Khmer dependency treebank corpus;
Step3.1: map the Chinese dependency relations onto the Khmer sentences through the Khmer-Chinese word-aligned parallel sentence-pair corpus, thereby obtaining a Khmer dependency treebank;
Step3.2: build the Khmer sentence dependency relations from the Khmer dependency treebank, adjust them according to the changes in the left and right adjunct relations in the Khmer sentence dependency relations, and then apply manual correction to obtain the final Khmer dependency treebank.
The specific steps of Step1.3, in which dictionary fuzzy matching is applied to the sparse data using a bilingual dictionary, are as follows:
Step1.3.1: find the sparse data remaining after word alignment, i.e. every Chinese word that has no alignment relation with any Khmer word;
Step1.3.2: perform fuzzy-matching word alignment based on a Khmer-Chinese bilingual dictionary: within the translation set of a Khmer word, find the translation that has the greatest similarity to the unaligned Chinese word in the original sentence pair. The similarity is expressed as follows:
$$\mathrm{Sim}(c_1, c_2) = \frac{2 \cdot |c_1 \cap c_2|}{|c_1| + |c_2|}$$
In this formula, $c_1$ and $c_2$ denote the Chinese word in the original sentence pair and the Chinese word in the dictionary translation respectively, $|c_1 \cap c_2|$ is the number of characters that $c_1$ and $c_2$ have in common, $|c_1|$ and $|c_2|$ are the numbers of characters in $c_1$ and $c_2$ respectively, and $\mathrm{Sim}(c_1, c_2)$ is the fuzzy-matching similarity of the Chinese words $c_1$ and $c_2$. The matching similarity between a Khmer word $k$ and a Chinese word $c$ in the original sentence pair can then be defined as:
$$\mathrm{Sim}(k, c) = \max_{d \in DT_k} \mathrm{Sim}(d, c)$$
where $DT_k$ is the set of all Chinese translations of the Khmer word $k$, $\mathrm{Sim}(d, c)$ is the similarity between each Chinese translation $d$ of $k$ and the Chinese word $c$, $\max$ takes the maximum, and $\mathrm{Sim}(k, c)$ is the matching similarity between the Khmer word $k$ and the Chinese word $c$. To obtain the Khmer words whose matching similarity satisfies the alignment condition, a threshold $\theta$ is set, and
$$\mathrm{align}(k, c) = \begin{cases} 1, & \mathrm{Sim}(k, c) \ge \theta \\ 0, & \mathrm{Sim}(k, c) < \theta \end{cases}$$
The left-hand side of this formula is the alignment function of the Khmer word $k$ and the Chinese word $c$, which takes the value 1 or 0. A value of 1 means that the Khmer word $k$ is semantically similar to the Chinese word $c$ in the original sentence pair and the two can be aligned; a value of 0 means that $k$ is semantically unrelated to $c$ and the two cannot be aligned.
The specific steps of Step1.4 are as follows:
Step1.4.1: train on Chinese data with word2vec to obtain word vectors for Chinese words;
Step1.4.2: after training is complete, compute the similarity between the word vector $w_1$ of an unaligned Chinese word in the original sentence pair and the word vector $w_2$ of the Chinese translation of an unaligned Khmer word in the original sentence pair. The similarity of the two word vectors $w_1, w_2$ is expressed as follows:
$$\mathrm{Sim}(w_1, w_2) = \frac{w_1 \cdot w_2}{\|w_1\| \cdot \|w_2\|} = \frac{\sum_{i=1}^{n} (w_{1i} \cdot w_{2i})}{\sqrt{\sum_{i=1}^{n} w_{1i}^2} \cdot \sqrt{\sum_{i=1}^{n} w_{2i}^2}}$$
where the word vectors $w_1, w_2$ are multi-dimensional vectors with $n$ dimensions in total, and the $i$ in $w_{1i}, w_{2i}$ is the vector dimension, with $i \in \{1, 2, \ldots, n\}$. The matching similarity between an unaligned Khmer word $k$ and an unaligned Chinese word $c$ in the original sentence pair is as follows:
$$\mathrm{Sim}(k, c) = \max \mathrm{Sim}(w_1, w_2)$$
where $w_1$ is the word vector of the Chinese word $c$ and $w_2$ is the word vector of a Chinese translation of the Khmer word $k$. $\max \mathrm{Sim}(w_1, w_2)$ takes the maximum: among all Chinese translations of the unaligned Khmer word $k$, it finds the translation that is semantically closest to the unaligned Chinese word $c$ in the original sentence pair. This maximum similarity is $\mathrm{Sim}(k, c)$, the matching similarity between the Khmer word $k$ and the Chinese word $c$;
To obtain the pairs of word vectors whose similarity satisfies the alignment condition, a threshold $\alpha$ is set, and
$$\mathrm{align}(k, c) = \begin{cases} 1, & \mathrm{Sim}(k, c) \ge \alpha \\ 0, & \mathrm{Sim}(k, c) < \alpha \end{cases}$$
The left-hand side of this formula is the alignment function of the Khmer word $k$ and the Chinese word $c$, which takes the value 1 or 0. A value of 1 means that the Khmer word $k$ is semantically similar to the Chinese word $c$ in the original sentence pair and the two can be aligned; a value of 0 means that $k$ is semantically unrelated to $c$ and the two cannot be aligned;
If the matching similarities between one unaligned Chinese word $c_1$ in the original sentence pair and several unaligned Khmer words $k_1, k_2, \ldots, k_n$ in the original sentence pair satisfy the threshold condition at the same time, i.e.
$$\mathrm{Sim}(k_i, c_1) \ge \alpha, \quad i = 1, 2, \ldots, n$$
then the Chinese word $c_1$ is aligned with each of the Khmer words $k_1, k_2, \ldots, k_n$.
In Step3.2, the ways in which Khmer sentence dependency relations change can be summarised from the grammatical differences between Khmer and Chinese. Because the word order of Khmer sentences differs from that of Chinese in certain respects, a left (right) adjunct relation in a Chinese sentence may no longer be a left (right) adjunct dependency once it is mapped into the Khmer sentence, and may become a right (left) adjunct dependency instead. In that case the Chinese adjunct relation can no longer be applied directly, and the dependency between the words in the Khmer sentence must be corrected to the proper adjunct relation according to the adjustment criteria. Finally, manual correction is applied to obtain the final Khmer dependency treebank. The Khmer sentence dependency adjustment algorithm can be:
Input: a Khmer sentence annotated with dependency relations obtained by mapping
Output: the Khmer sentence after modification according to the dependency adjustment criteria
If the input sentence contains an RAD (LAD) dependency Then
Do adjust the left and right adjunct relations of the sentence
Else do not adjust
Endif
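The adjustment algorithm above can be sketched in Python. This is a simplified illustration, not the patent's adjustment criteria: it assumes that an adjunct arc is relabelled purely from the relative positions of dependent and head in the Khmer sentence (dependent to the right of its head gives RAD, to the left gives LAD), and the function name `adjust_adjuncts` and the (dependent, head, label) arc layout are hypothetical.

```python
from typing import List, Tuple

# (dependent index, head index, relation label) over Khmer token positions
Arc = Tuple[int, int, str]

ADJUNCT_LABELS = ("RAD", "LAD")


def adjust_adjuncts(km_arcs: List[Arc]) -> List[Arc]:
    """Relabel projected adjunct arcs so the label matches the Khmer word order."""
    adjusted = []
    for dep, head, label in km_arcs:
        if label in ADJUNCT_LABELS:
            # Assumed rule: dependent right of head is RAD, left of head is LAD.
            label = "RAD" if dep > head else "LAD"
        adjusted.append((dep, head, label))
    return adjusted


# Toy projected arcs: the first RAD arc now has its dependent left of its head.
projected = [(3, 4, "RAD"), (0, 1, "ATT"), (6, 5, "RAD")]
adjusted = adjust_adjuncts(projected)
print(adjusted)
```

Arcs with other labels pass through unchanged, matching the "Else do not adjust" branch; the output would still go to manual correction as the text describes.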
As shown in Fig. 3, in the sentence "this nut is hard, and without any taste", "hard" is the core of the whole sentence and is marked "ROOT"; the relation between "nut" and "this" is an attributive relation, marked "ATT"; the relation between "hard" and "nut" is a subject-verb relation, marked "SBV"; the relation between "hard" and "not having" is a coordinate relation, marked "COO"; the relation between "not having" and "and" is an adverbial relation, marked "ADV"; the relation between "not having" and "taste" is a verb-object relation, marked "VOB". In the Chinese dependency relations, the relation between "any" and " " is a right adjunct relation, marked "RAD", but the word order of the Khmer sentence obtained afterwards through bilingual word-alignment mapping has changed. For example, the order of "(taste)" and "()" changes in the Khmer sentence, so the Chinese right adjunct relation (RAD) can no longer be applied; instead, according to the specified modification rules, the dependency between "taste" and " " in Khmer is corrected to a left adjunct relation (LAD). The dependency syntactic relations are shown in Table 1:
Table 1: Dependency syntactic relations
In the present invention, Khmer-Chinese parallel sentence pairs are collected from the internet, and after the three alignment procedures above about 10,000 Khmer-Chinese parallel sentence pairs are obtained, forming the corresponding Chinese-Khmer parallel sentence-pair corpus. The tool used to analyse the dependency relations of the Chinese sentences is the Harbin Institute of Technology natural language processing cloud platform. To make better use of this tool, the invention modifies its annotation set in light of the characteristics of Khmer and, based on the Chinese-Khmer alignment relations, generates a Khmer dependency corpus of 10,000 sentences.
The present invention innovatively introduces two methods, dictionary fuzzy matching and word-vector similarity matching, to improve the construction of the Khmer-Chinese word-aligned parallel corpus. The bilingual sentence pairs are first word-aligned with the conventional GIZA++ alignment method. Because this method suffers from data sparsity, the resulting word-aligned parallel sentence pairs are not fully correct, so dictionary fuzzy matching is then used for further correction. Because the dictionary may lack words identical to the translation in the original sentence pair, word-vector similarity matching is applied as a final refinement, yielding an accurate Khmer-Chinese word-aligned parallel corpus. Compared with the prior art, when the constructed Chinese dependency relations are mapped into Khmer sentences through the improved Khmer-Chinese word-aligned parallel sentence-pair corpus, the accuracy of the mapping is effectively improved, and the accuracy of the Khmer dependency treebank obtained by mapping improves accordingly.
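The three-stage fallback described above (GIZA++ first, then dictionary fuzzy matching, then word-vector similarity) can be pictured as a cascade in which each candidate pair is accepted by the first stage that can align it. The sketch below is illustrative only: the stage names, the placeholder words `k1` to `k4` and `c1` to `c4`, and the set-membership predicates standing in for the real aligners are all assumptions.

```python
from typing import Callable, Dict, List, Optional, Tuple

Pair = Tuple[str, str]  # (Khmer word, Chinese word)
Stage = Tuple[str, Callable[[str, str], bool]]  # (stage name, acceptance test)


def cascade_align(pairs: List[Pair], stages: List[Stage]) -> Dict[Pair, Optional[str]]:
    """Record, for each candidate pair, the first stage that accepts it (None if none)."""
    result: Dict[Pair, Optional[str]] = {}
    for k, c in pairs:
        result[(k, c)] = next((name for name, accepts in stages if accepts(k, c)), None)
    return result


# Stand-ins for the three aligners: each accepts a fixed set of pairs (assumed data).
giza_links = {("k1", "c1")}
dictionary_links = {("k2", "c2")}
vector_links = {("k3", "c3")}
stages: List[Stage] = [
    ("giza", lambda k, c: (k, c) in giza_links),
    ("dictionary", lambda k, c: (k, c) in dictionary_links),
    ("vector", lambda k, c: (k, c) in vector_links),
]
pairs = [("k1", "c1"), ("k2", "c2"), ("k3", "c3"), ("k4", "c4")]
outcome = cascade_align(pairs, stages)
print(outcome)
```

Pairs that no stage accepts are reported as `None`; in the patent's workflow these would remain unaligned.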
The embodiments of the present invention have been explained in detail above with reference to the accompanying drawings, but the invention is not limited to the above embodiments; within the knowledge possessed by a person of ordinary skill in the art, various changes can also be made without departing from the inventive concept.

Claims (3)

  1. A method for constructing a Khmer dependency treebank based on a Khmer-Chinese word-aligned corpus, characterised in that the specific steps of the method are as follows:
    Step1: build a Khmer-Chinese word-aligned parallel corpus;
    Step1.1: collect Khmer-Chinese parallel sentence pairs;
    Step1.2: perform word-alignment training on the Khmer-Chinese parallel sentence pairs using GIZA++;
    Step1.3: apply dictionary fuzzy matching to the sparse data using a bilingual dictionary;
    Step1.4: for Khmer words that still cannot be aligned after dictionary fuzzy matching, apply word-vector similarity comparison to improve word-alignment accuracy; here, word-vector similarity comparison means comparing the word vector of an unaligned Chinese word in the original sentence pair with the word vector of the Chinese translation of an unaligned Khmer word in the original sentence pair;
    Step2: build a Chinese dependency treebank corpus;
    Step2.1: perform Chinese word segmentation on the Khmer-Chinese word-aligned parallel sentence-pair corpus;
    Step2.2: perform part-of-speech tagging on the segmented Chinese corpus;
    Step2.3: build the Chinese dependency treebank from the part-of-speech-tagged Chinese corpus using the LTP language processing platform, obtaining the Chinese dependency relations at the same time;
    Step3: combine the Khmer-Chinese word-aligned parallel corpus and the Chinese dependency treebank corpus to build a Khmer dependency treebank corpus;
    Step3.1: map the Chinese dependency relations onto the Khmer sentences through the Khmer-Chinese word-aligned parallel sentence-pair corpus, thereby obtaining a Khmer dependency treebank;
    Step3.2: build the Khmer sentence dependency relations from the Khmer dependency treebank, adjust them according to the changes in the left and right adjunct relations in the Khmer sentence dependency relations, and then apply manual correction to obtain the final Khmer dependency treebank.
  2. The method for constructing a Khmer dependency treebank based on a Khmer-Chinese word-aligned corpus according to claim 1, characterised in that the specific steps of Step1.3, in which dictionary fuzzy matching is applied to the sparse data using a bilingual dictionary, are as follows:
    Step1.3.1: find the sparse data remaining after word alignment, i.e. every Chinese word that has no alignment relation with any Khmer word;
    Step1.3.2: perform fuzzy-matching word alignment based on a Khmer-Chinese bilingual dictionary: within the translation set of a Khmer word, find the translation that has the greatest similarity to the unaligned Chinese word in the original sentence pair, the similarity being expressed as follows:
    $$\mathrm{Sim}(c_1, c_2) = \frac{2 \cdot |c_1 \cap c_2|}{|c_1| + |c_2|}$$
    In this formula, $c_1$ and $c_2$ denote the Chinese word in the original sentence pair and the Chinese word in the dictionary translation respectively, $|c_1 \cap c_2|$ is the number of characters that $c_1$ and $c_2$ have in common, $|c_1|$ and $|c_2|$ are the numbers of characters in $c_1$ and $c_2$ respectively, and $\mathrm{Sim}(c_1, c_2)$ is the fuzzy-matching similarity of the Chinese words $c_1$ and $c_2$; the matching similarity between a Khmer word $k$ and a Chinese word $c$ in the original sentence pair can then be defined as:
    $$\mathrm{Sim}(k, c) = \max_{d \in DT_k} \mathrm{Sim}(d, c)$$
    where $DT_k$ is the set of all Chinese translations of the Khmer word $k$, $\mathrm{Sim}(d, c)$ is the similarity between each Chinese translation $d$ of $k$ and the Chinese word $c$, $\max$ takes the maximum, and $\mathrm{Sim}(k, c)$ is the matching similarity between the Khmer word $k$ and the Chinese word $c$; to obtain the Khmer words whose matching similarity satisfies the alignment condition, a threshold $\theta$ is set, and
    $$\mathrm{align}(k, c) = \begin{cases} 1, & \mathrm{Sim}(k, c) \ge \theta \\ 0, & \mathrm{Sim}(k, c) < \theta \end{cases}$$
    The left-hand side of this formula is the alignment function of the Khmer word $k$ and the Chinese word $c$, which takes the value 1 or 0; a value of 1 means that the Khmer word $k$ is semantically similar to the Chinese word $c$ in the original sentence pair and the two can be aligned; a value of 0 means that $k$ is semantically unrelated to $c$ and the two cannot be aligned.
  3. 3. the method according to claim 1 based on the card Chinese word alignment language material structure interdependent treebank of Kampuchean, its feature It is:The step Step1.4's comprises the following steps that:
    Step1.4.1, by word2vec carry out Chinese data training, obtain Chinese language words term vector;
    After the completion of Step1.4.2, training, the term vector w for the Chinese word that former sentence centering can not be alignd1Can not be right with former sentence centering Term vector w corresponding to the Chinese translation of neat card words and phrases2Carry out Similarity Measure, two term vector w1,w2Similarity it is as follows Represent:
    <mrow> <mi>S</mi> <mi>i</mi> <mi>m</mi> <mrow> <mo>(</mo> <msub> <mi>w</mi> <mn>1</mn> </msub> <mo>,</mo> <msub> <mi>w</mi> <mn>2</mn> </msub> <mo>)</mo> </mrow> <mo>=</mo> <mfrac> <mrow> <msub> <mi>w</mi> <mn>1</mn> </msub> <mo>&amp;CenterDot;</mo> <msub> <mi>w</mi> <mn>2</mn> </msub> </mrow> <mrow> <mo>|</mo> <mo>|</mo> <msub> <mi>w</mi> <mn>1</mn> </msub> <mo>|</mo> <mo>|</mo> <mo>&amp;CenterDot;</mo> <mo>|</mo> <mo>|</mo> <msub> <mi>w</mi> <mn>2</mn> </msub> <mo>|</mo> <mo>|</mo> </mrow> </mfrac> <mo>=</mo> <mfrac> <mrow> <msubsup> <mo>&amp;Sigma;</mo> <mrow> <mi>i</mi> <mo>=</mo> <mn>1</mn> </mrow> <mi>n</mi> </msubsup> <mrow> <mo>(</mo> <msub> <mi>w</mi> <mrow> <mn>1</mn> <mi>i</mi> </mrow> </msub> <mo>&amp;CenterDot;</mo> <msub> <mi>w</mi> <mrow> <mn>2</mn> <mi>i</mi> </mrow> </msub> <mo>)</mo> </mrow> </mrow> <mrow> <msqrt> <mrow> <msubsup> <mo>&amp;Sigma;</mo> <mrow> <mi>i</mi> <mo>=</mo> <mn>1</mn> </mrow> <mi>n</mi> </msubsup> <msup> <msub> <mi>w</mi> <mrow> <mn>1</mn> <mi>i</mi> </mrow> </msub> <mn>2</mn> </msup> </mrow> </msqrt> <mo>&amp;CenterDot;</mo> <msqrt> <mrow> <msubsup> <mo>&amp;Sigma;</mo> <mrow> <mi>i</mi> <mo>=</mo> <mn>1</mn> </mrow> <mi>n</mi> </msubsup> <msup> <msub> <mi>w</mi> <mrow> <mn>2</mn> <mi>i</mi> </mrow> </msub> <mn>2</mn> </msup> </mrow> </msqrt> </mrow> </mfrac> </mrow>
    where the word vectors $w_1, w_2$ are n-dimensional, and the index i in $w_{1i}, w_{2i}$ denotes the vector dimension, with $i \in \{1, 2, \ldots, n\}$. The matching similarity between a Khmer word k that could not be aligned in the original sentence pair and a Chinese word c that could not be aligned in the original sentence pair is as follows:
    $$\mathrm{Sim}(k, c) = \max \mathrm{Sim}(w_1, w_2)$$
    where $w_1$ is the word vector of the Chinese word c, $w_2$ is the word vector of a Chinese translation of the Khmer word k, and $\max \mathrm{Sim}(w_1, w_2)$ takes the maximum over all Chinese translations of the unaligned Khmer word k; that is, it finds the Chinese translation that is semantically most similar to the unaligned Chinese word c in the original sentence pair. This maximum similarity is Sim(k, c), the matching similarity between the Khmer word k and the Chinese word c;
    To select the pairs of word vectors whose similarity satisfies the alignment condition, a threshold α is set:
    $$\mathrm{align}(k, c) = \begin{cases} 1, & \mathrm{Sim}(k, c) \geq \alpha \\ 0, & \mathrm{Sim}(k, c) < \alpha \end{cases}$$
    The left-hand side of the formula is the alignment function of the Khmer word k and the Chinese word c, which takes the value 1 or 0: 1 indicates that the Khmer word k and the Chinese word c in the original sentence pair are semantically similar and can be aligned; 0 indicates that the Khmer word k and the Chinese word c in the original sentence pair are semantically unrelated and cannot be aligned;
    If the matching similarities between one unaligned Chinese word in the original sentence pair and several unaligned Khmer words in the original sentence pair all satisfy the threshold condition simultaneously, i.e.,
    $$\begin{cases} \mathrm{Sim}(k_1, c_1) \geq \alpha \\ \mathrm{Sim}(k_2, c_1) \geq \alpha \\ \quad\vdots \\ \mathrm{Sim}(k_n, c_1) \geq \alpha \end{cases}$$
    then the Chinese word $c_1$ is aligned with each of the Khmer words $k_1, k_2, \ldots, k_n$.
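
The similarity and thresholding steps of Step1.4.2 can be illustrated with a minimal Python sketch. It is not the applicant's implementation: the function names (`cosine_similarity`, `matching_similarity`, `align`), the toy two-dimensional vectors, and the threshold value 0.7 are all assumptions chosen for illustration. In practice the vectors would come from the word2vec model trained in Step1.4.1, and the claim does not specify a value for α.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def cosine_similarity(w1: Vector, w2: Vector) -> float:
    """Sim(w1, w2) = (w1 . w2) / (||w1|| * ||w2||) for two n-dimensional vectors."""
    if len(w1) != len(w2):
        raise ValueError("both vectors must have the same dimension n")
    dot = sum(a * b for a, b in zip(w1, w2))
    norm1 = math.sqrt(sum(a * a for a in w1))
    norm2 = math.sqrt(sum(b * b for b in w2))
    if norm1 == 0.0 or norm2 == 0.0:
        # Assumption: a zero vector is treated as dissimilar to everything.
        return 0.0
    return dot / (norm1 * norm2)


def matching_similarity(c_vector: Vector, translation_vectors: List[Vector]) -> float:
    """Sim(k, c): the best cosine similarity between the Chinese word c and
    any Chinese translation of the Khmer word k."""
    return max(
        (cosine_similarity(c_vector, t) for t in translation_vectors),
        default=0.0,  # assumption: no known translation means no match
    )


def align(c_vector: Vector, khmer_words: Dict[str, List[Vector]], alpha: float) -> List[str]:
    """Return every unaligned Khmer word k with Sim(k, c) >= alpha.
    Several words may pass at once, giving a one-to-many alignment."""
    return [
        k for k, translations in khmer_words.items()
        if matching_similarity(c_vector, translations) >= alpha
    ]


# Toy data (hypothetical): one unaligned Chinese word c1 and three unaligned
# Khmer words, each mapped to the vectors of its Chinese translations.
c1 = [1.0, 0.0]
khmer_words = {
    "k1": [[1.0, 0.0], [0.0, 1.0]],  # best translation is identical to c1
    "k2": [[1.0, 1.0]],              # cosine similarity 1/sqrt(2)
    "k3": [[0.0, 1.0]],              # orthogonal to c1
}
aligned = align(c1, khmer_words, alpha=0.7)  # alpha = 0.7 is an arbitrary choice
print(aligned)
```

With these toy values, `k1` and `k2` both meet the threshold and are aligned with `c1`, which corresponds to the one-to-many case described at the end of the claim, while `k3` is rejected.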
CN201711005546.6A 2017-10-25 2017-10-25 A kind of method based on the card Chinese word alignment language material structure interdependent treebank of Kampuchean Pending CN107894982A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711005546.6A CN107894982A (en) 2017-10-25 2017-10-25 A kind of method based on the card Chinese word alignment language material structure interdependent treebank of Kampuchean

Publications (1)

Publication Number Publication Date
CN107894982A true CN107894982A (en) 2018-04-10

Family

ID=61803738

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711005546.6A Pending CN107894982A (en) 2017-10-25 2017-10-25 A kind of method based on the card Chinese word alignment language material structure interdependent treebank of Kampuchean

Country Status (1)

Country Link
CN (1) CN107894982A (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104991890A (en) * 2015-07-15 2015-10-21 昆明理工大学 Method for constructing Vietnamese dependency tree bank on basis of Chinese-Vietnamese vocabulary alignment corpora
CN105446958A (en) * 2014-07-18 2016-03-30 富士通株式会社 Word aligning method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
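Wang Mingwen (王明文) et al.: "Construction of a Greater China Region word alignment database based on word2vec" (基于word2vec的大中华区词对齐库的构建), Journal of Chinese Information Processing (《中文信息学报》) *
Deng Dan (邓丹) et al.: "Research on a Chinese-English word alignment algorithm based on a bilingual dictionary" (基于双语词典的汉英词语对齐算法研究), Computer Engineering (《计算机工程》) *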

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109582951A (en) * 2018-10-19 2019-04-05 昆明理工大学 A kind of bilingual term vector model building method of card Chinese based on multiple CCA algorithm
CN110008467A (en) * 2019-03-04 2019-07-12 昆明理工大学 A kind of interdependent syntactic analysis method of Burmese based on transfer learning
CN110175585A (en) * 2019-05-30 2019-08-27 北京林业大学 It is a kind of letter answer correct system and method automatically
CN110175585B (en) * 2019-05-30 2024-01-23 北京林业大学 Automatic correcting system and method for simple answer questions

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180410