CN110414009A - Method and device for extracting English-Burmese bilingual parallel sentence pairs based on BiLSTM-CNN - Google Patents

Method and device for extracting English-Burmese bilingual parallel sentence pairs based on BiLSTM-CNN

Info

Publication number
CN110414009A
Authority
CN
China
Prior art keywords
sentence
english
word
burmese
cnn
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910873805.XA
Other languages
Chinese (zh)
Other versions
CN110414009B (en)
Inventor
毛存礼
梁昊远
余正涛
张少宁
张亚飞
朱浩东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kunming University of Science and Technology
Publication of CN110414009A
Application granted
Publication of CN110414009B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The present invention relates to a method and device for extracting English-Burmese bilingual parallel sentence pairs based on BiLSTM-CNN, and belongs to the technical field of natural language processing. The invention first pre-trains bilingual word vectors with the MUSE tool. It then assigns function labels to each sentence, exploiting the fact that Burmese function words and particles identify the subject, predicate and object, and concatenates the syntactic structure information of each word to its word vector. A BiLSTM-CNN then encodes the sentences, and the output probability is used as the criterion for deciding whether a pair is a parallel sentence pair. An English-Burmese bilingual parallel sentence pair extraction device based on BiLSTM-CNN is built by modularizing these steps. The invention is simpler than traditional bilingual parallel sentence pair identification systems. Experimental results show that the method and device outperform the baseline systems on metrics such as accuracy and recall, and that precision improves across the board.

Description

Method and device for extracting English-Burmese bilingual parallel sentence pairs based on BiLSTM-CNN
Technical field
The present invention relates to a method and device for extracting English-Burmese bilingual parallel sentence pairs based on BiLSTM-CNN, and belongs to the technical field of natural language processing.
Background
In natural language processing, the scale of a parallel corpus plays an important role in improving machine translation performance. Burmese is a low-resource language: English-Burmese parallel corpus resources are severely lacking, and machine translation quality has not yet reached a practical level. The traditional ways to obtain parallel corpora are human translation and machine translation. The former is costly and inefficient, while the latter depends on machine translation performance and yields poor quality. Parallel corpora are relatively scarce on the web, whereas comparable corpora are far more plentiful, so using the massive English-Burmese comparable corpora on the Internet to obtain English-Burmese bilingual parallel sentence pairs is of great significance.
In recent years, many methods have been proposed for extracting parallel sentence pairs from comparable corpora. One example builds a classifier with the maximum entropy method to extract parallel sentence pairs from large comparable corpora and uses them to build a Chinese-English translation system. That approach relies heavily on feature engineering and requires a large amount of parallel data, so it is ill suited to low-resource languages. Traditional methods often consider only the semantic information of each language itself, yet different languages also have corresponding functional structures. Because the semantic expression of a sentence is closely tied to its syntactic structure, existing representation methods, although they retain word-order information to some extent, cannot avoid losing syntactic structure information and therefore struggle to learn accurate sentence representations.
Summary of the invention
To solve the above problems, the present invention provides a method and device for extracting English-Burmese bilingual parallel sentence pairs based on BiLSTM-CNN. The invention outperforms the baseline systems on metrics such as accuracy and recall, and precision improves across the board.
The technical solution of the invention is a method for extracting English-Burmese bilingual parallel sentence pairs based on BiLSTM-CNN, with the following specific steps:
Step1: Articles crawled from English-Burmese mutual-translation websites are manually screened and aligned to obtain 20,000 parallel sentence pairs. Bilingual word vectors are then pre-trained to obtain a cross-lingual word embedding space shared by English and Burmese, yielding English-Burmese bilingual word vectors, so that the semantic vectors representing sentences are correlated across the English-Burmese semantic space;
Step2: Function labels are assigned to each sentence, and the syntactic structure information of each word is concatenated to its word vector, capturing the syntactic differences between English and Burmese;
Step3: A BiLSTM passes the information of each word in the sentence in the forward and backward directions to obtain the feature states produced at the different time steps, which contain contextual information. A CNN then represents the syntactic features of the sentence, giving the semantic features of the sentence;
Step4: The element-wise product and element-wise absolute difference of the semantic features obtained above capture the matching information between the semantic features of the source sentence and the target sentence. The matching information is fed to a fully connected layer, and the output probability is used as the criterion for deciding whether a pair is parallel, so that parallel sentence pairs are extracted.
Further, the specific steps of Step1 are as follows:
Articles crawled from English-Burmese mutual-translation websites are manually screened and aligned to obtain 20,000 parallel sentence pairs;
Burmese is segmented with the Burmese word segmentation system developed by Kunming University of Science and Technology (address 222.197.219.24:8099). Using an English-Burmese seed dictionary, supervised training is performed with iterative Procrustes alignment to learn the mapping from the source language to the target language. This gives a cross-lingual word embedding space shared by English and Burmese and thus the English-Burmese bilingual word vectors. Facebook's MUSE system can be used for the implementation;
The cross-lingual word embedding space shared by English and Burmese is pre-trained with Facebook's MUSE system so that semantically similar words of different languages lie close together in the word vector space, and the semantic vectors representing sentences are correlated across the English-Burmese semantic space;
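As a sketch of the alignment step, assume the seed dictionary yields n word pairs whose source and target vectors are stacked as the columns of two matrices $X$ and $Y$ (this notation is introduced here for illustration and is not taken from the patent). The orthogonal Procrustes problem then has the well-known closed-form solution below, and each refinement iteration would presumably rebuild the dictionary from the mapped vectors and solve the same problem again.

```latex
W^{*} = \operatorname*{arg\,min}_{W \in O_d(\mathbb{R})} \lVert W X - Y \rVert_F = U V^{\top},
\qquad \text{where } U \Sigma V^{\top} = \operatorname{SVD}\left(Y X^{\top}\right)
```

Constraining $W$ to be orthogonal preserves distances and angles within the source embedding space, which is one reason this form is commonly chosen for mapping monolingual embeddings into a shared space.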
Further, in Step2, when function labels are assigned to a sentence, the subject, predicate and object of the sentence are identified through Burmese function words and particles, as follows:
Burmese is segmented into words with the Burmese word segmentation system;
Burmese is split into syllables with the Burmese word segmentation system;
The function words and particles in the sentence are identified using the Burmese function words placed after nouns, the prepositional function words, and the particles that mark subject, predicate and object constituents;
The function words and particles obtained in the above steps are used to label the subject, predicate and object;
When function labels are assigned to a sentence, English is labeled with the Stanford tool, and only the syntactic structure corresponding to the Burmese sentence is retained.
Exploiting the fact that Burmese function words and particles identify the subject, predicate and object, the Burmese word segmentation and syllable segmentation tools are used to locate the function words and particles in the sentence, function labels are assigned to the sentence, and the syntactic structure information of each word is concatenated to its word vector;
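The labeling and concatenation just described can be sketched as follows. This is a minimal illustration, not the patent's implementation: the marker tokens `<SUBJ>`, `<OBJ>` and `<PRED>` are hypothetical placeholders for the Burmese function words and particles, and both the one-hot role encoding and the choice to drop markers from the output are assumptions made for the example.

```python
# Hypothetical marker inventory: placeholders standing in for the Burmese
# function words and particles, which are not listed in the text here.
MARKER_TO_ROLE = {"<SUBJ>": "S", "<OBJ>": "O", "<PRED>": "P"}
ROLES = ["S", "P", "O", "N"]  # N = no function label


def label_tokens(tokens):
    """Give each content word the role of the marker that directly follows it."""
    labeled = []
    for i, tok in enumerate(tokens):
        if tok in MARKER_TO_ROLE:
            continue  # assumption: markers themselves are not kept as content words
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        labeled.append((tok, MARKER_TO_ROLE.get(nxt, "N")))
    return labeled


def splice(word_vec, role):
    """Concatenate a one-hot role vector to the word vector."""
    one_hot = [1.0 if r == role else 0.0 for r in ROLES]
    return list(word_vec) + one_hot


tokens = ["w1", "<SUBJ>", "w2", "<OBJ>", "w3", "<PRED>"]
labels = label_tokens(tokens)
vec = splice([0.1, 0.2, 0.3], labels[0][1])
```

In this toy sentence the word before each marker receives that marker's role, and the spliced vector is the original word vector followed by four role dimensions.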
Further, the specific steps of Step3 and Step4 are as follows:
A BiLSTM passes the information of each word in the sentence in the forward and backward directions, using the context of each word to obtain the feature states produced at the different time steps, which contain contextual information;
A CNN represents the syntactic features of the sentence: convolution and pooling are applied to the obtained feature states to extract the key semantic features of the sentence, giving its deep semantic representation;
Adam is used as the model optimizer;
Cross entropy is used as the loss function to evaluate the model;
The element-wise product and element-wise absolute difference of the semantic features obtained above capture the matching information between the semantic features of the source sentence and the target sentence. The matching information is fed to a fully connected layer to estimate the probability that the sentences are translations of each other, and this output probability is used as the criterion for extracting parallel sentence pairs. Specifically, the output probability is compared with a threshold, and a pair whose probability exceeds the threshold is judged to be a parallel sentence pair; the threshold is 0.8 or 0.9.
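The threshold decision at the end of this step can be illustrated with a short sketch. The function name `extract_parallel` and the sentence identifiers are hypothetical; only the comparison against a threshold of 0.8 or 0.9 comes from the text.

```python
def extract_parallel(scored_pairs, threshold=0.9):
    """Keep candidate pairs whose predicted probability exceeds the threshold."""
    return [(src, tgt) for src, tgt, prob in scored_pairs if prob > threshold]


# Each candidate is (source id, target id, model output probability).
candidates = [("en_1", "my_1", 0.95), ("en_1", "my_2", 0.40), ("en_2", "my_2", 0.85)]
strict = extract_parallel(candidates, threshold=0.9)
loose = extract_parallel(candidates, threshold=0.8)
```

A higher threshold keeps fewer pairs and so trades recall for precision, which is why the choice between 0.8 and 0.9 matters.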
In accordance with the concept of the invention, the invention also provides a device for extracting English-Burmese bilingual parallel sentence pairs based on BiLSTM-CNN. As shown in Fig. 6, the device comprises:
A data acquisition module, which uses web crawler technology to obtain English-Burmese mutual-translation articles from the web and cleans the data;
A word vector module, which trains bilingual word vectors on the acquired data with Facebook's MUSE tool;
A function labeling module, which uses the Burmese word segmentation system to segment Burmese into words and syllables, then uses Burmese function words and particles to assign function labels to Burmese, and uses the Stanford tool to assign function labels to English;
A sentence representation module, which extracts the semantic features of sentences with the BiLSTM-CNN;
An output module, which, after the source and target sentences have been encoded (that is, after their semantic features have been extracted), uses the element-wise product and element-wise absolute difference of the semantic features to capture the matching information between the semantic features of the source and target sentences, feeds the matching information to a fully connected layer, and uses the output probability as the criterion for extracting parallel sentence pairs.
The beneficial effects of the invention are as follows:
The invention fuses a CNN with a Bi-LSTM, combining the CNN's strength in extracting local features with the BiLSTM's strength in capturing global features of text sequences. The BiLSTM addresses the CNN's neglect of word context, and the proposed BiLSTM-CNN sentence representation with function labeling reflects the influence of language differences on sentence representation, makes effective use of external linguistic knowledge, and improves the accuracy of the parallel sentence pair extraction model;
The proposed method and device are simpler than traditional bilingual parallel sentence pair identification systems. Experimental results show that they outperform the baseline systems on metrics such as accuracy and recall, and that precision improves across the board.
Brief description of the drawings
Fig. 1 is a detailed flow block diagram of the invention;
Fig. 2 is a schematic diagram of the English-Burmese bilingual word vector space in the invention;
Fig. 3 is a flow chart of Burmese function labeling of sentences in the invention;
Fig. 4 is a line chart of the accuracy obtained by the three models in the invention;
Fig. 5 is a line chart of the recall obtained by the three models in the invention;
Fig. 6 is a structural block diagram of the device in the invention;
Fig. 7 is a flow chart of the invention.
Detailed description of the embodiments
Embodiment 1: As shown in Figs. 1-7, the method for extracting English-Burmese bilingual parallel sentence pairs based on BiLSTM-CNN comprises the following steps; Fig. 7 is the flow chart of the invention.
Step A: A cross-lingual word embedding space shared by English and Burmese is pre-trained so that semantically similar words of different languages lie close together in the word vector space and the semantic vectors representing sentences are correlated across the English-Burmese semantic space.
Step B: Function labels are assigned to each sentence, and the syntactic structure information of each word is concatenated to its word vector, which captures the syntactic differences between English and Burmese.
Step C: A BiLSTM passes the information of each word in the sentence in the forward and backward directions to obtain the feature states produced at the different time steps, which contain contextual information. The feature-extraction ability of a CNN is then used to obtain the deep semantic features of the sentence.
Step D: After the source and target sentences have been encoded, that is, after their semantic features have been extracted, the element-wise product and element-wise absolute difference of the semantic features capture the matching information between the semantic features of the source and target sentences. The matching information is fed to a fully connected layer, and the output probability is used as the criterion for extracting parallel sentence pairs.
Fig. 1 gives the detailed flow block diagram of the method.
In Step A, the experimental data set of the invention comes mainly from English-Burmese mutual-translation articles crawled from English-Burmese translation websites; manual screening and alignment yield 20,000 parallel sentence pairs. Burmese is segmented with the Burmese word segmentation system developed by Kunming University of Science and Technology (address 222.197.219.24:8099). Using an English-Burmese seed dictionary, supervised training is performed with iterative Procrustes alignment to learn the mapping from the source language to the target language. This is implemented mainly with Facebook's MUSE system; the resulting English-Burmese word vector space is shown in Fig. 2, and the word vector dimension is set to 300. In addition, assuming 7 negative samples per parallel sentence pair, 140,000 non-parallel pairs are constructed at random. To measure the performance of the English-Burmese parallel classifier properly, 2,000 English-Burmese parallel pairs and 4,000 English-Burmese non-parallel pairs are selected as the test set.
In Step B, because Burmese is a low-resource language, natural language processing research on Burmese has progressed slowly, and there are not yet mature tools for part-of-speech tagging, syntactic analysis and similar tasks. However, Burmese function words and particles act as markers that identify the subject, predicate, object and other constituents of a sentence; see Table 1.
Table 1: Example sentences with Burmese particles and function words
Based on a review of related materials, the specific function words and particles are listed in Tables 2 and 3.
Table 2: List of Burmese function words
Table 3: List of Burmese structural particles
The invention therefore exploits the fact that Burmese function words and particles identify the subject, predicate and object: Burmese word segmentation and syllable segmentation tools are used to locate the function words and particles in a sentence, and function labels are then assigned to the sentence. The detailed process is shown in Fig. 3. For English, the invention uses the Stanford tool to assign function labels to English sentences and retains only the syntactic structure corresponding to the Burmese sentence. These steps yield the syntactic structure of the sentence, mainly its subject, predicate and object.
Step C comprises the following steps. Step C01: A BiLSTM passes the information of each word in the sentence in the forward and backward directions to obtain the feature states produced at the different time steps, which contain contextual information. Step C02: The feature-extraction ability of a CNN is used to obtain the deep semantic representation of the sentence.
In Step C01, recurrent neural networks (RNNs) are widely used to process variable-length sequence input, and the LSTM is a popular RNN variant that alleviates the vanishing and exploding gradient problems of RNNs. Given a sentence $x = \{x(1), x(2), \ldots, x(n)\}$, where $x(t)$ with $t \in \{1, \ldots, n\}$ is a k-dimensional word vector, the hidden vector $h(t)$ at time step t is updated as follows:
$$i_t = \sigma\left(W_i x(t) + U_i h(t-1) + b_i\right) \tag{1}$$
$$f_t = \sigma\left(W_f x(t) + U_f h(t-1) + b_f\right) \tag{2}$$
$$o_t = \sigma\left(W_o x(t) + U_o h(t-1) + b_o\right) \tag{3}$$
$$h_t = o_t * \tanh(C_t) \tag{6}$$
$$h_t^{g} = o_t * \tanh(C_t + G) \tag{7}$$
Here $i_t$ denotes the input gate, $f_t$ the forget gate, $o_t$ the output gate, and $\sigma$ the sigmoid function. The weight matrices $W$ and $U$ and the biases $b$ are the parameters of the network. After syntactic information is added to the hidden-layer vector, the update becomes Equation (7), where $G$ denotes the function-labeling operation and represents the subject, predicate and object vector.
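Equations (4) and (5) do not appear in the text above. In the standard LSTM formulation, which the surrounding equations appear to follow, the candidate cell state and the cell state update would read as below. The symbols $\tilde{C}_t$, $W_c$, $U_c$ and $b_c$ are assumed names used for illustration, not the patent's own notation.

```latex
\tilde{C}_t = \tanh\left(W_c\, x(t) + U_c\, h(t-1) + b_c\right) \\
C_t = f_t * C_{t-1} + i_t * \tilde{C}_t
```

With these two updates, the cell state $C_t$ used in Equations (6) and (7) is fully determined by the gates and the previous state.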
A unidirectional LSTM cannot use the reverse context. A bidirectional LSTM exploits context by processing the sequence in both directions, producing two independent sequences of LSTM output vectors: one processes the input in the forward direction and the other in the reverse direction. The output at each time step is the concatenation of the two output vectors from the two directions, i.e., $h_t = \overrightarrow{h_t} \oplus \overleftarrow{h_t}$.
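The bidirectional pass can be sketched in plain Python as below. This is an illustrative toy under stated assumptions, not the patent's implementation: the parameters are random, the candidate and cell-state updates use the standard LSTM form, the label vector `G` is assumed to have the same dimension as the hidden state, and the label-augmented hidden state is assumed to be the one fed back into the recurrence.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, U, b, x, h):
    """Compute W x + U h + b for list-based matrices and vectors."""
    return [
        sum(wi * xi for wi, xi in zip(W[r], x))
        + sum(ui * hi for ui, hi in zip(U[r], h))
        + b[r]
        for r in range(len(b))
    ]


def init_params(k, d, rng):
    """Random parameters (W, U, b) for the i, f, o and candidate transforms."""
    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    return {name: (mat(d, k), mat(d, d), [0.0] * d) for name in "ifoc"}


def lstm_pass(xs, gs, params, d):
    """Run one direction; gs[t] is the label vector G added inside tanh."""
    h, c, outputs = [0.0] * d, [0.0] * d, []
    for x, g in zip(xs, gs):
        i = [sigmoid(z) for z in affine(*params["i"], x, h)]
        f = [sigmoid(z) for z in affine(*params["f"], x, h)]
        o = [sigmoid(z) for z in affine(*params["o"], x, h)]
        cand = [math.tanh(z) for z in affine(*params["c"], x, h)]
        c = [fj * cj + ij * aj for fj, cj, ij, aj in zip(f, c, i, cand)]
        h = [oj * math.tanh(cj + gj) for oj, cj, gj in zip(o, c, g)]
        outputs.append(h)
    return outputs


def bilstm(xs, gs, fwd, bwd, d):
    """Concatenate forward and backward hidden states at each time step."""
    forward = lstm_pass(xs, gs, fwd, d)
    backward = lstm_pass(xs[::-1], gs[::-1], bwd, d)[::-1]
    return [hf + hb for hf, hb in zip(forward, backward)]


rng = random.Random(0)
k, d, n = 4, 3, 5  # word vector size, hidden size, sentence length
xs = [[rng.uniform(-1, 1) for _ in range(k)] for _ in range(n)]
gs = [[0.0] * d for _ in range(n)]  # no label information in this toy run
states = bilstm(xs, gs, init_params(k, d, rng), init_params(k, d, rng), d)
```

Each of the n output states has dimension 2d because the forward and backward vectors are concatenated, which matches the 600-dimensional CNN input one would expect from 300-dimensional hidden states.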
In Step C02, the original CNN consists of convolutional layers, pooling layers and fully connected layers. A sentence of length n can be represented as $x_{1:n} = x_1 \oplus x_2 \oplus \cdots \oplus x_n$, where $\oplus$ is the concatenation operation, $x_i$ is the i-th word vector, and d is the dimension of the word vectors. The core of the convolution operation is to apply a filter $W$ to a word window of size h to produce a new feature $c_i$, as shown in the equation:
$$c_i = f\left(W \cdot x_{i:i+h-1} + b\right) \tag{8}$$
Here $b$ is a bias and $f$ is a nonlinear function (for example, Sigmoid or ReLU). After a sentence of length n passes through the convolutional layer, the deep semantic feature $c$ over every window of consecutive words $\{x_{1:h}, x_{2:h+1}, \ldots, x_{n-h+1:n}\}$ is obtained, as shown in the equation:
$$c = [c_1, c_2, \ldots, c_{n-h+1}] \tag{9}$$
In the invention, a convolution kernel $F = [F(0), \ldots, F(m-1)]$ with window size m is convolved over the output vectors of the Bi-LSTM to obtain the feature map, as shown in the equation:
Here $b$ is the bias term, and $F$ and $b$ are the parameters of this single filter.
In a typical CNN structure, the pooling layer is built on top of the convolutional layer. Here, K-Max Pooling retains the k largest values for each filter, i.e.,
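The convolution over a window of hidden states and the K-Max Pooling step can be sketched as follows. This is a minimal illustration with a single hand-written filter; the use of ReLU as the nonlinearity and the choice to preserve the original order of the k retained values are assumptions, since the text does not fix either.

```python
def relu(z):
    return z if z > 0.0 else 0.0


def conv_features(states, filt, bias):
    """Slide a filter of window size m = len(filt) over the sequence of states."""
    m = len(filt)
    feats = []
    for t in range(len(states) - m + 1):
        total = bias
        for j in range(m):
            total += sum(w * s for w, s in zip(filt[j], states[t + j]))
        feats.append(relu(total))
    return feats


def k_max_pooling(feats, k):
    """Keep the k largest values, preserving their original order."""
    top = sorted(range(len(feats)), key=lambda i: feats[i], reverse=True)[:k]
    return [feats[i] for i in sorted(top)]


states = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0], [0.5, 0.5]]
filt = [[1.0, 1.0], [1.0, -1.0]]  # window size m = 2
feats = conv_features(states, filt, bias=0.0)
pooled = k_max_pooling(feats, k=2)
```

A sequence of 4 states and a window of 2 gives 4 - 2 + 1 = 3 features, in line with the length of $c$ in Equation (9).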
In Step D, the semantic features of the source sentence and the target sentence are extracted by the above steps. Their matching information is then captured through the element-wise product and the element-wise absolute difference and fed to a fully connected layer to estimate the probability that the sentences are translations of each other. The specific formula is as follows:
$$p(y_i \mid c_i) = \sigma\left(W_c c_i + c\right) \tag{14}$$
Here $\sigma$ denotes the activation function, and the weight matrices $W$ together with $b$ and $c$ are the parameters of the model. The model uses cross entropy as its objective function:
Here n is the number of source sentences and m is the number of candidate target sentences.
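The matching and scoring step can be sketched as below. This is a simplified stand-in rather than the patent's network: a single linear layer followed by a sigmoid replaces the fully connected layers, the weights are hand-picked for illustration, and the loss is the ordinary binary cross-entropy for one pair.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def matching_features(src, tgt):
    """Element-wise product and absolute difference of two sentence vectors."""
    product = [s * t for s, t in zip(src, tgt)]
    abs_diff = [abs(s - t) for s, t in zip(src, tgt)]
    return product, abs_diff


def parallel_probability(src, tgt, w_prod, w_diff, bias):
    """Single-layer stand-in for the fully connected layers: sigmoid of a linear score."""
    product, abs_diff = matching_features(src, tgt)
    score = bias
    score += sum(w * v for w, v in zip(w_prod, product))
    score += sum(w * v for w, v in zip(w_diff, abs_diff))
    return sigmoid(score)


def cross_entropy(y, p):
    """Binary cross-entropy for one pair with label y in {0, 1}."""
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


src = [0.5, -1.0, 2.0]
w_prod, w_diff, bias = [1.0, 1.0, 1.0], [-1.0, -1.0, -1.0], 0.0
p_same = parallel_probability(src, src, w_prod, w_diff, bias)
p_far = parallel_probability(src, [-0.5, 1.0, -2.0], w_prod, w_diff, bias)
```

With these illustrative weights, identical vectors score high because the product term is large and the difference term is zero, while opposite vectors score low; a trained model would learn the weights instead.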
The experimental data set of the invention comes mainly from English-Burmese mutual-translation articles crawled from English-Burmese translation websites; manual screening and alignment yield 20,000 parallel sentence pairs. High-quality bilingual word vectors are pre-trained with the MUSE tool, with the word vector dimension set to 300. Details are given in Table 4.
Table 4: Corpus size
Corpus                                 Sentence pairs (×10,000)
English-Burmese parallel corpus        2.0
English-Burmese non-parallel corpus    14.0
In addition, assuming 7 negative samples per parallel sentence pair, 140,000 non-parallel pairs are constructed at random. To measure the performance of the English-Burmese parallel classifier properly, 2,000 English-Burmese parallel pairs and 4,000 English-Burmese non-parallel pairs are selected as the test set; see Table 5.
Table 5: Experimental data
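The random construction of negative samples (7 per parallel pair, so 20,000 pairs give 140,000 non-parallel pairs) can be sketched as follows. The function name, the sentence identifiers and the sampling scheme (pairing each source sentence with targets drawn from other pairs) are assumptions for illustration; the text only states that the negatives are built at random.

```python
import random


def build_negatives(parallel_pairs, per_pair=7, seed=0):
    """Pair each source sentence with `per_pair` randomly chosen non-matching targets."""
    rng = random.Random(seed)
    n = len(parallel_pairs)
    negatives = []
    for i, (src, _) in enumerate(parallel_pairs):
        others = [j for j in range(n) if j != i]  # every index except the true partner
        for j in rng.sample(others, per_pair):
            negatives.append((src, parallel_pairs[j][1]))
    return negatives


pairs = [(f"en_{i}", f"my_{i}") for i in range(20)]
negs = build_negatives(pairs)
```

Rebuilding the `others` list for every source sentence is quadratic in the corpus size; that is acceptable for a sketch, and a real pipeline would likely sample indices directly and reject the true partner.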
For evaluation, accuracy (Accuracy), precision (Precision), recall (Recall) and F-measure (F1-Measure) are used to measure whether the model classifies English-Burmese parallel sentences correctly.
The specific formulas are as follows:
$$\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}$$
$$\text{Precision} = \frac{TP}{TP + FP}$$
$$\text{Recall} = \frac{TP}{TP + FN}$$
$$F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
Here TP is the number of true positives, FN the number of false negatives, FP the number of false positives, and TN the number of true negatives.
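These four metrics can be computed directly from the confusion counts. The sketch below uses the standard definitions; the function name and the example counts are made up for illustration and are not results from the patent.

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts, chosen only to exercise the formulas.
m = classification_metrics(tp=60, fp=20, tn=100, fn=20)
```

The guards against zero denominators matter at high thresholds, where a model may predict no positive pairs at all.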
To demonstrate the effectiveness of the proposed method, the invention takes as its benchmark model the method proposed by Grégoire, Francis et al., which extracts parallel sentence pairs with bidirectional recurrent neural networks. In addition, to show that deep learning builds a more accurate classifier than conventional machine learning, the maximum entropy model proposed by Munteanu D. S. et al. is used as a comparison experiment.
The choice of experimental parameters directly affects the final results. The tables below list the BiLSTM, CNN and general experimental parameter settings; see Tables 6, 7 and 8.
Table 6: BiLSTM parameter settings
Parameter                 Value
Embedding dimension       300
Hidden state dimension    300
Number of layers          3
Table 7: CNN parameter settings
Parameter                 Value
Sliding window size       2, 3, 4
Number of sliding windows 300
Hidden state dimension    600
Pooling layer             Max pooling
Table 8: Experimental parameter settings
Parameter                 Value
Batch size                128
Learning rate             0.001
Threshold                 0.90
Dropout                   0.8
Loss function             Cross entropy
Optimizer                 Adam
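For reference, the settings in Tables 6, 7 and 8 can be gathered into one configuration object. The class and field names below are hypothetical; the values are those listed in the tables. Whether the dropout value of 0.8 denotes a keep probability or a drop rate is not stated in the text, so it is recorded as given.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentConfig:
    # BiLSTM settings (Table 6)
    embedding_dim: int = 300
    lstm_hidden_dim: int = 300
    lstm_layers: int = 3
    # CNN settings (Table 7)
    cnn_window_sizes: tuple = (2, 3, 4)
    cnn_num_filters: int = 300
    cnn_hidden_dim: int = 600
    # Training settings (Table 8)
    batch_size: int = 128
    learning_rate: float = 0.001
    threshold: float = 0.90
    dropout: float = 0.8  # meaning (keep vs. drop) not specified in the text
    loss: str = "cross_entropy"
    optimizer: str = "adam"


config = ExperimentConfig()
```

Note that the CNN hidden dimension of 600 is twice the BiLSTM hidden dimension of 300, consistent with concatenating the forward and backward states.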
To verify the performance of the proposed BiLSTM-CNN English-Burmese bilingual parallel sentence pair extraction model, BiLSTM and maximum entropy are used as baselines and compared with the proposed BiLSTM-CNN method under different thresholds; see Table 9.
Table 9: Experimental comparison
The results show that the classifier trained with deep-learning feature extraction outperforms the classifier trained with the maximum entropy model, mainly because neural networks can learn good features automatically. The combination of LSTM and CNN also works better than LSTM alone. The main reason is that the LSTM extracts global semantics and the CNN on top of it then extracts the more important semantic features, which matches the semantics of two sentences more accurately than a plain LSTM. Model performance varies with the chosen threshold; to obtain a good parallel sentence pair extraction model, the threshold is usually set to 0.9 or higher. Line charts of accuracy and recall are shown in Figs. 4 and 5.
The invention also proposes that function labeling has an important effect on the sentence pair extraction model. The following groups of experiments therefore compare models with and without function labeling at a threshold of 0.9; see Table 10.
Table 10: Comparison of models with and without function labeling
Model                                        Precision (%)  Recall (%)  F-Measure (%)
CNN                                          62.4           60.3        61.5
CNN + function labeling                      63.5           61.2        62.3
Bi-LSTM                                      68.3           62.4        65.9
LSTM + function labeling                     69.4           62.8        56.5
BiLSTM + mean_pooling                        68.6           62.6        66.0
BiLSTM + mean_pooling + function labeling    69.5           63.1        66.6
BiLSTM + max_pooling                         67.9           62.2        65.1
BiLSTM + max_pooling + function labeling     68.1           62.4        65.5
BiLSTM-CNN                                   71.6           64.3        67.2
BiLSTM-CNN + function labeling               72.2           65.1        68.3
Table 10 shows that models with function labeling achieve a small improvement over models without it, mainly because function labeling acts as external knowledge that guides the representation of the sentence to some extent. The invention also gives three groups of comparison experiments on the operation applied after the BiLSTM: mean_pooling, max_pooling and CNN. The experimental results show that the CNN still performs best, mainly because of its ability to extract important features, and that the BiLSTM-CNN with function labeling proposed by the invention gives the best result overall.
In accordance with the concept of the invention, the invention also provides a device for extracting English-Burmese bilingual parallel sentence pairs based on BiLSTM-CNN. As shown in Fig. 6, the device comprises:
A data acquisition module, which uses web crawler technology to obtain English-Burmese mutual-translation articles from the web and cleans the data; a word vector module, which trains bilingual word vectors on the acquired data with Facebook's MUSE tool; a function labeling module, which uses the Kunming University of Science and Technology Burmese word segmentation system to segment Burmese into words and syllables, then uses Burmese function words and particles to assign function labels to Burmese, and uses the Stanford tool to assign function labels to English; a sentence representation module, which extracts the semantic features of sentences with the BiLSTM-CNN; and an output module, which, after the source and target sentences have been encoded (that is, after their semantic features have been extracted), uses the element-wise product and element-wise absolute difference of the semantic features to capture the matching information between the semantic features of the source and target sentences, feeds the matching information to a fully connected layer, and uses the output probability as the criterion for extracting parallel sentence pairs.
The embodiments of the invention have been explained in detail above with reference to the drawings, but the invention is not limited to these embodiments. Within the knowledge of a person skilled in the art, various changes can be made without departing from the concept of the invention.

Claims (5)

1. A method for extracting English-Burmese bilingual parallel sentence pairs based on BiLSTM-CNN, characterized in that:
The specific steps of the method are as follows:
Step1: Articles crawled from English-Burmese mutual-translation websites are manually screened and aligned to obtain 20,000 parallel sentence pairs. Bilingual word vectors are then pre-trained to obtain a cross-lingual word embedding space shared by English and Burmese, yielding English-Burmese bilingual word vectors, so that the semantic vectors representing sentences are correlated across the English-Burmese semantic space;
Step2: Function labels are assigned to each sentence, and the syntactic structure information of each word is concatenated to its word vector, capturing the syntactic differences between English and Burmese;
Step3: A BiLSTM passes the information of each word in the sentence in the forward and backward directions to obtain the feature states produced at the different time steps, which contain contextual information; a CNN then represents the syntactic features of the sentence, giving the semantic features of the sentence;
Step4: The element-wise product and element-wise absolute difference of the semantic features obtained above capture the matching information between the semantic features of the source sentence and the target sentence; the matching information is fed to a fully connected layer, and the output probability is used as the criterion for deciding whether a pair is parallel, so that parallel sentence pairs are extracted.
2. The English-Burmese bilingual parallel sentence pair extraction method based on BiLSTM-CNN according to claim 1, characterized in that the specific steps of Step1 are as follows:
From the English-Burmese mutually translated articles crawled from English-Burmese translation websites, 20,000 parallel sentence pairs are obtained through manual screening and alignment;
The Burmese text is segmented into words using the Burmese word segmentation system developed by Kunming University of Science and Technology (address 222.197.219.24:8099); supervised training is then carried out using an English-Burmese seed dictionary, with iterative Procrustes alignment, to learn the mapping from the source language to the target language and obtain the cross-lingual word embedding space shared by English and Burmese, thereby obtaining the English-Burmese bilingual word vectors; this can be implemented using Facebook's MUSE system.
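As an illustration only, the Procrustes alignment named in this claim has a well-known closed-form solution. The symbols below are assumptions introduced for this sketch and do not appear in the claim: X and Y are d x n matrices whose columns hold the source-language and target-language embeddings of the n seed-dictionary pairs, and W is constrained to be orthogonal.

```latex
W^{\star} \;=\; \operatorname*{arg\,min}_{W \in O_d(\mathbb{R})} \; \lVert W X - Y \rVert_F \;=\; U V^{\top},
\qquad \text{where } U \Sigma V^{\top} = \operatorname{SVD}\!\left( Y X^{\top} \right).
```

In the iterative variant, the dictionary induced by the current W would be used to rebuild X and Y before solving again; how many refinement rounds the patented method uses is not stated in the claim.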
3. The English-Burmese bilingual parallel sentence pair extraction method based on BiLSTM-CNN according to claim 1, characterized in that in Step2, when function labeling is performed on a sentence, the subject, predicate, and object of the sentence are identified by means of Burmese function words and auxiliary words, including:
Segmenting the Burmese text into words using the Burmese word segmentation system;
Segmenting the Burmese text into syllables using the Burmese word segmentation system;
Identifying the function words and auxiliary words in the sentence by means of the Burmese function words placed after nouns, the prepositional function words, and the auxiliary words that mark the subject, predicate, and object components;
Labeling the subject, predicate, and object using the function words and auxiliary words obtained in the above steps;
When function labeling is performed on a sentence, the English text is labeled using the Stanford tool.
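The splicing of functional-role information into word vectors, as described in Step2, can be sketched as follows. This is a minimal illustration under stated assumptions: the tag inventory `ROLE_TAGS`, the one-hot encoding, and the toy 3-dimensional embeddings are all hypothetical, since the claim does not say how the syntactic information is encoded.

```python
# Hypothetical tag inventory: the claim names only subject, predicate and object,
# so "OTHER" is an assumed catch-all for every remaining word.
ROLE_TAGS = ["SUBJ", "PRED", "OBJ", "OTHER"]


def role_one_hot(tag):
    """Return a one-hot list for a functional-role tag (unknown tags map to OTHER)."""
    index = ROLE_TAGS.index(tag) if tag in ROLE_TAGS else ROLE_TAGS.index("OTHER")
    return [1.0 if i == index else 0.0 for i in range(len(ROLE_TAGS))]


def splice_role_features(word_vectors, tags):
    """Concatenate each word vector with the one-hot encoding of its role tag."""
    if len(word_vectors) != len(tags):
        raise ValueError("each word needs exactly one role tag")
    return [list(vec) + role_one_hot(tag) for vec, tag in zip(word_vectors, tags)]


# Toy 3-dimensional embeddings for a three-word sentence (illustrative values only).
sentence_vectors = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9]]
sentence_tags = ["SUBJ", "OBJ", "PRED"]
spliced = splice_role_features(sentence_vectors, sentence_tags)
```

Each spliced vector here has 3 + 4 = 7 dimensions; in a real system the embedding part would come from the shared English-Burmese embedding space of Step1.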
4. The English-Burmese bilingual parallel sentence pair extraction method based on BiLSTM-CNN according to claim 1, characterized in that the specific steps of Step3 and Step4 are as follows:
A BiLSTM is used to propagate the information of each word in the sentence in the forward and backward directions, making use of the context of each word, so as to obtain the feature states produced at the different time steps, which contain contextual information;
A CNN is used to represent the syntactic features of the sentence: convolution and pooling are applied to the obtained feature states to extract the key semantic features of the sentence, obtaining the deep semantic representation of the sentence;
Adam is used as the model optimizer;
Cross entropy is used as the loss function to evaluate the model;
The element-wise product and the element-wise absolute difference of the semantic features obtained in the above steps are used to capture the matching information between the semantic features of the source sentence and the target sentence; the matching information is fed to a fully connected layer to estimate the probability that the sentences are translations of each other; the output probability is then used as the condition for measuring whether a pair is a parallel sentence pair, and parallel sentence pairs are extracted accordingly; specifically, the output probability is compared with a threshold, a pair whose probability is greater than the threshold is judged to be a parallel sentence pair, and the threshold is 0.8 or 0.9.
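The matching and decision steps described above can be sketched in plain Python. This is a simplified illustration, not the patented network: the single linear layer with a sigmoid stands in for the fully connected layer, and the sentence vectors, `weights`, and `bias` are made-up values rather than trained parameters.

```python
import math


def matching_features(source_vec, target_vec):
    """Concatenate the element-wise product and element-wise absolute difference."""
    if len(source_vec) != len(target_vec):
        raise ValueError("sentence vectors must have the same dimension")
    product = [s * t for s, t in zip(source_vec, target_vec)]
    abs_diff = [abs(s - t) for s, t in zip(source_vec, target_vec)]
    return product + abs_diff


def parallel_probability(features, weights, bias):
    """Assumed stand-in for the fully connected layer: one linear unit plus sigmoid."""
    score = sum(w * f for w, f in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-score))


def binary_cross_entropy(probability, label):
    """Cross-entropy loss for one pair; label is 1 (parallel) or 0 (not parallel)."""
    return -(label * math.log(probability) + (1 - label) * math.log(1.0 - probability))


def is_parallel(probability, threshold=0.8):
    """A pair is accepted when its probability is greater than the threshold."""
    return probability > threshold


# Illustrative sentence representations (in the method these come from BiLSTM-CNN).
source_vec = [1.0, 2.0, -1.0]
target_vec = [1.0, 1.5, -1.0]
features = matching_features(source_vec, target_vec)

# Made-up parameters: reward agreement (product), penalize distance (abs diff).
weights = [1.0, 1.0, 1.0, -2.0, -2.0, -2.0]
bias = -2.0
probability = parallel_probability(features, weights, bias)
```

With these toy numbers the pair passes the 0.8 threshold but not the 0.9 threshold, which shows how the choice between the two values named in the claim trades recall for precision.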
5. An English-Burmese bilingual parallel sentence pair extraction device based on BiLSTM-CNN, characterized by comprising:
A data acquisition module, used to obtain English-Burmese mutually translated articles from the network using web crawler technology and to clean the data;
A word vector module, used to train bilingual word vectors on the acquired data using Facebook's MUSE tool;
A function labeling module, used to segment the Burmese text into words and syllables using the Burmese word segmentation system, then to label the Burmese text using Burmese function words and auxiliary words, and to label the English text using the Stanford tool;
A sentence representation module, used to extract the semantic features of sentences using BiLSTM-CNN;
An output module, used, after the source sentence and the target sentence have been encoded, that is, after the semantic features of the sentences have been extracted, to capture the matching information between the semantic features of the source sentence and the target sentence using the element-wise product and the element-wise absolute difference of the semantic features, to feed the matching information to a fully connected layer, and to use the output probability as the condition for measuring whether a pair is a parallel sentence pair, so as to extract parallel sentence pairs.
CN201910873805.XA 2019-07-09 2019-09-17 Burma bilingual parallel sentence pair extraction method and device based on BilSTM-CNN Active CN110414009B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910613175 2019-07-09
CN2019106131752 2019-07-09

Publications (2)

Publication Number Publication Date
CN110414009A (en) 2019-11-05
CN110414009B (en) 2021-02-05

Family

ID=68370528

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910873805.XA Active CN110414009B (en) 2019-07-09 2019-09-17 Burma bilingual parallel sentence pair extraction method and device based on BilSTM-CNN

Country Status (1)

Country Link
CN (1) CN110414009B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107491444A (en) * 2017-08-18 2017-12-19 南京大学 Parallelization word alignment method based on bilingual word embedded technology
CN109117472A (en) * 2018-11-12 2019-01-01 新疆大学 A kind of Uighur name entity recognition method based on deep learning
CN109213995A (en) * 2018-08-02 2019-01-15 哈尔滨工程大学 A kind of across language text similarity assessment technology based on the insertion of bilingual word
CN109559781A (en) * 2018-10-24 2019-04-02 成都信息工程大学 A kind of two-way LSTM and CNN model that prediction DNA- protein combines
CN109783817A (en) * 2019-01-15 2019-05-21 浙江大学城市学院 A kind of text semantic similarity calculation model based on deeply study
CN109783809A (en) * 2018-12-22 2019-05-21 昆明理工大学 A method of alignment sentence is extracted from Laos-Chinese chapter grade alignment corpus
CN109885686A (en) * 2019-02-20 2019-06-14 延边大学 A kind of multilingual file classification method merging subject information and BiLSTM-CNN

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
MING TAN等: "LSTM-BASED DEEP LEARNING MODELS FOR NON-FACTOID ANSWER SELECTION", 《ARXIV:1511.04108V4》 *
WEIXIN_30686845: "是时候给你的产品配一个AI问答助手了!", 《HTTPS//BLOG.CSDN.NET/WEIXIN_30686845/ARTICLE/DETAILS/98848455》 *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111046946A (en) * 2019-12-10 2020-04-21 昆明理工大学 Burma language image text recognition method based on CRNN
CN111046946B (en) * 2019-12-10 2021-03-02 昆明理工大学 Burma language image text recognition method based on CRNN
CN111310480A (en) * 2020-01-20 2020-06-19 昆明理工大学 Weakly supervised Hanyue bilingual dictionary construction method based on English pivot
CN111310480B (en) * 2020-01-20 2021-12-28 昆明理工大学 Weakly supervised Hanyue bilingual dictionary construction method based on English pivot
CN111460830A (en) * 2020-03-11 2020-07-28 北京交通大学 Method and system for extracting economic events in judicial texts
CN111460830B (en) * 2020-03-11 2022-04-12 北京交通大学 Method and system for extracting economic events in judicial texts
CN111709245A (en) * 2020-04-30 2020-09-25 昆明理工大学 Chinese-Yuan pseudo parallel sentence pair extraction method based on semantic self-adaptive coding
CN112232090A (en) * 2020-09-17 2021-01-15 昆明理工大学 Chinese-crossing parallel sentence pair extraction method fusing syntactic structure and Tree-LSTM
CN112287688A (en) * 2020-09-17 2021-01-29 昆明理工大学 English-Burmese bilingual parallel sentence pair extraction method and device integrating pre-training language model and structural features
CN112287695A (en) * 2020-09-18 2021-01-29 昆明理工大学 Cross-language bilingual pre-training and Bi-LSTM-based Chinese-character-cross parallel sentence pair extraction method
CN112800779A (en) * 2021-03-29 2021-05-14 智慧芽信息科技(苏州)有限公司 Text processing method and device and model training method and device

Also Published As

Publication number Publication date
CN110414009B (en) 2021-02-05

Similar Documents

Publication Publication Date Title
CN110414009A (en) English-Burmese bilingual parallel sentence pair extraction method and device based on BiLSTM-CNN
CN107992597B (en) Text structuring method for power grid fault case
CN110334213B (en) Method for identifying time sequence relation of Hanyue news events based on bidirectional cross attention mechanism
Belinkov et al. Arabic diacritization with recurrent neural networks
CN109871535A (en) A kind of French name entity recognition method based on deep neural network
CN112541343B (en) Semi-supervised counterstudy cross-language abstract generation method based on word alignment
CN110532557B (en) Unsupervised text similarity calculation method
CN112231472B (en) Judicial public opinion sensitive information identification method integrated with domain term dictionary
CN110287323B (en) Target-oriented emotion classification method
CN110717324B (en) Judgment document answer information extraction method, device, extractor, medium and equipment
CN110489750A (en) Burmese participle and part-of-speech tagging method and device based on two-way LSTM-CRF
CN110532549A (en) A kind of text emotion analysis method based on binary channels deep learning model
CN109766544A (en) Document keyword abstraction method and device based on LDA and term vector
CN112257441B (en) Named entity recognition enhancement method based on counterfactual generation
CN110717341B (en) Method and device for constructing old-Chinese bilingual corpus with Thai as pivot
CN108255813A (en) A kind of text matching technique based on term frequency-inverse document and CRF
CN110162592A (en) A kind of news keyword extracting method based on the improved TextRank of gravitation
CN112966525B (en) Law field event extraction method based on pre-training model and convolutional neural network algorithm
CN112580330B (en) Vietnam news event detection method based on Chinese trigger word guidance
Li et al. Publication date estimation for printed historical documents using convolutional neural networks
Chen et al. Extractive text-image summarization using multi-modal RNN
CN110427458A (en) Five bilingual classification sentiment analysis methods of social networks based on two-door LSTM
CN110110116A (en) A kind of trademark image retrieval method for integrating depth convolutional network and semantic analysis
CN111967267B (en) XLNET-based news text region extraction method and system
CN114818717A (en) Chinese named entity recognition method and system fusing vocabulary and syntax information

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant