CN110414009A

CN110414009A - English-Burmese bilingual parallel sentence pair extraction method and device based on BiLSTM-CNN

Info

Publication number: CN110414009A
Application number: CN201910873805.XA
Authority: CN
Inventors: 毛存礼; 梁昊远; 余正涛; 张少宁; 张亚飞; 朱浩东
Original assignee: Kunming University of Science and Technology
Current assignee: Kunming University of Science and Technology
Priority date: 2019-07-09
Filing date: 2019-09-17
Publication date: 2019-11-05
Anticipated expiration: 2039-09-17
Also published as: CN110414009B

Abstract

The present invention relates to an English-Burmese bilingual parallel sentence pair extraction method and device based on BiLSTM-CNN, and belongs to the technical field of natural language processing. The invention first pre-trains bilingual word vectors with the MUSE tool. It then uses the fact that Burmese function words and particles identify the subject, predicate and object of a Burmese sentence to assign function labels to each sentence, and concatenates the syntactic structure information of each word onto its word vector. A BiLSTM-CNN then encodes the sentences, and the output probability is used as the criterion for deciding whether a pair is a parallel sentence pair. An English-Burmese bilingual parallel sentence pair extraction device based on BiLSTM-CNN is built by modularizing the functions of these steps. The invention is simpler than traditional bilingual parallel sentence pair identification systems. Experimental results show that the method and device outperform the baseline systems on metrics such as accuracy and recall, and precision also generally improves.

Description

English-Burmese bilingual parallel sentence pair extraction method and device based on BiLSTM-CNN

Technical field

The present invention relates to an English-Burmese bilingual parallel sentence pair extraction method and device based on BiLSTM-CNN, and belongs to the technical field of natural language processing.

Background technique

In natural language processing, the scale of a parallel corpus plays an important role in improving machine translation performance. For a low-resource language such as Burmese, English-Burmese parallel corpus resources are severely lacking, and machine translation quality has not yet reached a practical level. The traditional ways of obtaining parallel corpora are human translation and machine translation: the former is costly and inefficient, while the latter depends on machine translation performance and yields poor quality. Parallel corpora are relatively scarce on the web, whereas comparable corpora are far more plentiful. How to use the massive English-Burmese comparable corpora on the Internet to obtain English-Burmese bilingual parallel sentence pairs is therefore of great significance.

In recent years, many methods have been proposed for extracting parallel sentence pairs from comparable corpora. One example builds a classifier with the maximum entropy method to extract parallel sentence pairs from large comparable corpora in order to build a Chinese-English translation system. This method relies heavily on feature engineering and needs a large parallel corpus, so it is poorly suited to low-resource languages. Traditional methods often consider only the semantic information of each language itself, although different languages in fact contain corresponding functional structures. Because the semantic expression of a sentence is closely tied to its syntactic structure, existing representation methods, although they retain word order information to some extent, cannot avoid losing syntactic structure information and have difficulty learning accurate sentence representations.

Summary of the invention

To solve the above problems, the present invention provides an English-Burmese bilingual parallel sentence pair extraction method and device based on BiLSTM-CNN. The invention outperforms the baseline systems on metrics such as accuracy and recall, and precision also generally improves.

The technical solution of the invention is an English-Burmese bilingual parallel sentence pair extraction method based on BiLSTM-CNN, whose specific steps are as follows:

Step1: English-Burmese mutually translated articles crawled from English-Burmese translation websites are manually screened and aligned to obtain 20,000 parallel sentence pairs. Bilingual word vector pre-training is then carried out to obtain a cross-lingual word embedding space shared by English and Burmese, and thus the English-Burmese bilingual word vectors, so that the semantic vectors of sentence representations are correlated across the English-Burmese semantic space;

Step2: function labels are assigned to each sentence, and the syntactic structure information of each word is concatenated onto its word vector, which captures the English-Burmese syntactic differences;

Step3: a BiLSTM passes the information of each word in the sentence in the forward and backward directions, yielding the feature states produced at different time steps that contain contextual information; a CNN is used to represent the syntactic features of the sentence, yielding the semantic features of the sentence;

Step4: the element-wise product and the element-wise absolute difference of the semantic features obtained in the steps above are used to capture the matching information between the semantic features of the source sentence and the target sentence. The matching information is fed to a fully connected layer, and the output probability is used as the criterion for deciding whether a pair is a parallel sentence pair, so that parallel sentence pairs can be extracted.

Further, the specific steps of Step1 are as follows:

English-Burmese mutually translated articles crawled from English-Burmese translation websites are manually screened and aligned to obtain 20,000 parallel sentence pairs;

Burmese is segmented with the Burmese word segmentation system developed by Kunming University of Science and Technology (address 222.197.219.24:8099). Training is carried out in a supervised manner using an English-Burmese seed dictionary, and Procrustes alignment is applied iteratively to learn the mapping from the source language to the target language. This gives the cross-lingual word embedding space shared by English and Burmese, and hence the English-Burmese bilingual word vectors. This step can be implemented with Facebook's MUSE system.

The cross-lingual word embedding space shared by English and Burmese is pre-trained with Facebook's MUSE system, so that semantically similar words of the two languages lie close together in the word vector space and the semantic vectors of sentence representations are correlated across the English-Burmese semantic space;
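The Procrustes alignment step mentioned above can be sketched with NumPy. This is only an illustration of the orthogonal Procrustes solution via SVD, not the MUSE implementation; the function name `procrustes_map` and the random toy vectors standing in for seed-dictionary word pairs are assumptions made for this example.

```python
import numpy as np

def procrustes_map(src: np.ndarray, tgt: np.ndarray) -> np.ndarray:
    """Return the orthogonal matrix W that minimises ||src @ W - tgt||_F.

    Row i of src and row i of tgt are assumed to be the vectors of one
    seed-dictionary entry (an English word and its Burmese translation).
    """
    # SVD of the cross-covariance between the paired vectors
    u, _, vt = np.linalg.svd(src.T @ tgt)
    return u @ vt

# Toy data: the "target" space is an exact rotation of the "source" space.
rng = np.random.default_rng(0)
dim = 5
src_vecs = rng.normal(size=(50, dim))
true_rot, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
tgt_vecs = src_vecs @ true_rot

w = procrustes_map(src_vecs, tgt_vecs)
mapped = src_vecs @ w  # source vectors expressed in the target space
```

On real embeddings the two spaces are not exact rotations of each other, so the mapping is approximate, which is why the alignment is refined iteratively.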

Further, in Step2, when function labels are assigned to a sentence, the subject, predicate and object of the sentence are identified from Burmese function words and particles, which comprises:

segmenting Burmese into words with the Burmese word segmentation system;

splitting Burmese into syllables with the Burmese word segmentation system;

identifying the function words and particles in the sentence from the Burmese function words placed after nouns, the function words placed before nouns, and the particles that mark the subject, predicate and object components;

labeling subject, predicate and object using the function words and particles obtained in the steps above;

when function labels are assigned, English is labeled with the Stanford tool, and only the syntactic structure corresponding to that of the Burmese sentence is retained.

Because Burmese function words and particles identify the subject, predicate and object of a Burmese sentence, the Burmese word segmentation and syllable splitting tools are used to locate the function words and particles in the sentence. Function labels are then assigned to the sentence, and the syntactic structure information of each word is concatenated onto its word vector;
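Concatenating a function label onto a word vector can be sketched as follows. The four-way tag set (`S`, `P`, `O`, plus `X` for every other word) and the one-hot encoding are assumptions chosen for illustration; the text does not state how the syntactic structure information is encoded.

```python
import numpy as np

# Hypothetical tag set: subject, predicate, object, other
LABELS = ["S", "P", "O", "X"]

def attach_function_labels(word_vectors: np.ndarray, labels: list) -> np.ndarray:
    """Append a one-hot function label to each word vector of a sentence."""
    one_hot = np.zeros((len(labels), len(LABELS)))
    for row, label in enumerate(labels):
        one_hot[row, LABELS.index(label)] = 1.0
    # Each output row is [word vector ; one-hot label]
    return np.concatenate([word_vectors, one_hot], axis=1)

# A three-word sentence with 300-dimensional word vectors
sentence_vectors = np.ones((3, 300))
augmented = attach_function_labels(sentence_vectors, ["S", "O", "P"])
```

With this encoding each 300-dimensional word vector grows to 304 dimensions.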

Further, the specific steps of Step3 and Step4 are as follows:

a BiLSTM passes the information of each word in the sentence in the forward and backward directions and uses the context of each word to obtain the feature states produced at different time steps that contain contextual information;

a CNN is used to represent the syntactic features of the sentence: convolution and pooling are applied to the obtained feature states to extract the key semantic features of the sentence, giving its deep semantic representation;

Adam is used as the model optimizer;

cross entropy is used as the loss function to evaluate the model;

the element-wise product and the element-wise absolute difference of the semantic features obtained in the steps above are used to capture the matching information between the semantic features of the source sentence and the target sentence. The matching information is fed to a fully connected layer to estimate the probability that the sentences are translations of each other, and the output probability is used as the criterion for deciding whether a pair is a parallel sentence pair. The output probability is compared with a threshold, and a pair whose probability exceeds the threshold is judged to be a parallel sentence pair; the threshold is 0.8 or 0.9.
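The final thresholding decision can be illustrated with a few lines of Python. The function name `extract_parallel_pairs`, the sentence identifiers and the probabilities below are made up for this sketch; in practice the probabilities would come from the trained model.

```python
def extract_parallel_pairs(scored_pairs, threshold=0.9):
    """Keep the (source, target) pairs whose probability exceeds the threshold."""
    return [(src, tgt) for src, tgt, prob in scored_pairs if prob > threshold]

# Hypothetical model outputs: (English sentence id, Burmese sentence id, probability)
candidates = [
    ("en_1", "my_1", 0.95),
    ("en_2", "my_7", 0.42),
    ("en_3", "my_3", 0.85),
]

kept_strict = extract_parallel_pairs(candidates, threshold=0.9)
kept_loose = extract_parallel_pairs(candidates, threshold=0.8)
```

A higher threshold keeps fewer pairs but should admit fewer false positives, which is the trade-off between the two threshold values named in the text.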

According to the concept of the invention, the invention also provides an English-Burmese bilingual parallel sentence pair extraction device based on BiLSTM-CNN. As shown in Fig. 6, the device comprises:

a data acquisition module, which uses web crawler technology to obtain English-Burmese mutually translated articles from the web and cleans the data;

a word vector module, which uses Facebook's MUSE tool to train bilingual word vectors on the acquired data;

a function labeling module, which uses the Burmese word segmentation system to segment Burmese into words and syllables, then uses Burmese function words and particles to assign function labels to Burmese, and uses the Stanford tool to assign function labels to English;

a sentence representation module, which uses the BiLSTM-CNN to extract the semantic features of sentences;

an output module, which, after the source sentence and target sentence have been encoded (that is, after their semantic features have been extracted), uses the element-wise product and the element-wise absolute difference of the semantic features to capture the matching information between the semantic features of the source sentence and the target sentence, feeds the matching information to a fully connected layer, and uses the output probability as the criterion for deciding whether a pair is a parallel sentence pair, so that parallel sentence pairs can be extracted.

The beneficial effects of the present invention are:

The invention fuses CNN and Bi-LSTM, exploiting the CNN's strength in extracting local features and the BiLSTM's strength in capturing the global features of text sequences. The BiLSTM addresses the CNN's neglect of word context. The proposed BiLSTM-CNN sentence representation with fused function labels reflects the influence of language differences on sentence representation, makes effective use of external linguistic knowledge, and improves the accuracy of the parallel sentence pair extraction model;

The proposed method and device are simpler than traditional bilingual parallel sentence pair identification systems. Experimental results show that they outperform the baseline systems on metrics such as accuracy and recall, and precision also generally improves.

Brief description of the drawings

Fig. 1 is the detailed flow block diagram of the invention;

Fig. 2 is a schematic diagram of the English-Burmese bilingual word vector space in the invention;

Fig. 3 is the flow chart of Burmese sentence function labeling in the invention;

Fig. 4 is the accuracy line chart obtained for the three models in the invention;

Fig. 5 is the recall line chart obtained for the three models in the invention;

Fig. 6 is the structural block diagram of the device in the invention;

Fig. 7 is the flow chart of the invention.

Specific embodiment

Embodiment 1: As shown in Figs. 1-7, an English-Burmese bilingual parallel sentence pair extraction method based on BiLSTM-CNN is provided; Fig. 7 is the flow chart of the invention. The method comprises the following steps. Step A: a cross-lingual word embedding space shared by English and Burmese is pre-trained, so that semantically similar words of the two languages lie close together in the word vector space and the semantic vectors of sentence representations are correlated across the English-Burmese semantic space. Step B: function labels are assigned to each sentence and the syntactic structure information of each word is concatenated onto its word vector, which captures the English-Burmese syntactic differences. Step C: a BiLSTM passes the information of each word in the sentence in the forward and backward directions, yielding the feature states produced at different time steps that contain contextual information, and the feature extraction ability of a CNN is then used to obtain the deep semantic features of the sentence. Step D: after the source sentence and target sentence have been encoded, that is, after their semantic features have been extracted, the element-wise product and the element-wise absolute difference of the semantic features are used to capture the matching information between the semantic features of the source sentence and the target sentence; the matching information is fed to a fully connected layer, and the output probability is used as the criterion for deciding whether a pair is a parallel sentence pair, so that parallel sentence pairs can be extracted. Fig. 1 gives the detailed flow block diagram of the English-Burmese bilingual parallel sentence pair extraction method based on BiLSTM-CNN.

In step A, the experimental data set of the invention comes mainly from English-Burmese mutually translated articles crawled from English-Burmese translation websites, which are manually screened and aligned to obtain 20,000 parallel sentence pairs. Burmese is segmented with the Burmese word segmentation system developed by Kunming University of Science and Technology (address 222.197.219.24:8099). Training is carried out in a supervised manner using an English-Burmese seed dictionary, and Procrustes alignment is applied iteratively to learn the mapping from the source language to the target language. This is implemented mainly with Facebook's MUSE system; the trained English-Burmese word vector space is shown in Fig. 2, and the word vector dimension is set to 300. In addition, assuming 7 negative samples for each parallel sentence pair, 140,000 non-parallel pairs are constructed at random. To measure the performance of the English-Burmese parallel classifier properly, 2,000 English-Burmese parallel pairs and 4,000 English-Burmese non-parallel pairs are chosen as the test set for the experiments.
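The random construction of negative samples can be sketched as below. The text only says that seven non-parallel pairs are built at random per parallel pair; pairing each source sentence with the target sides of seven other pairs, as done here, is one plausible reading and is an assumption of this example, as are the function and variable names.

```python
import random

def build_negative_pairs(parallel_pairs, negatives_per_pair=7, seed=0):
    """Pair each source sentence with target sentences taken from other pairs."""
    rng = random.Random(seed)
    n = len(parallel_pairs)
    negatives = []
    for i, (src, _) in enumerate(parallel_pairs):
        # Draw distinct indices from 0..n-2, then shift those >= i by one
        # so that the sentence's own translation is never selected.
        for j in rng.sample(range(n - 1), negatives_per_pair):
            j = j if j < i else j + 1
            negatives.append((src, parallel_pairs[j][1]))
    return negatives

# Toy corpus of 20 aligned pairs; the real corpus has 20,000
toy_pairs = [(f"en_{k}", f"my_{k}") for k in range(20)]
toy_negatives = build_negative_pairs(toy_pairs)
```

With 20,000 parallel pairs and 7 negatives each, this yields the 140,000 non-parallel pairs stated in the text.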

In step B, because Burmese is a low-resource language, natural language processing research on Burmese has progressed slowly, and there are not yet mature tools for tasks such as part-of-speech tagging and syntactic parsing. However, the markers formed by Burmese function words and particles make it possible to identify the subject, predicate, object and so on in a sentence; see Table 1;

Table 1. Example sentences with Burmese particles and function words

Based on the relevant literature, the specific function words and particles are listed in Tables 2 and 3.

Table 2. List of Burmese function words

Table 3. List of Burmese structural particles

The invention therefore uses the fact that Burmese function words and particles identify the subject, predicate and object of a Burmese sentence, and uses the Burmese word segmentation and syllable splitting tools to locate the function words and particles in the sentence. Function labels are then assigned to the sentence; the detailed procedure is shown in Fig. 3. For English, the invention uses the Stanford tool to assign function labels to English sentences and then retains only the syntactic structure corresponding to that of the Burmese sentence. The syntactic structure obtained from these steps consists mainly of subject, predicate and object.
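Retaining only the subject, predicate and object on the English side might look like the sketch below. It assumes the parser output is already available as (token, relation) pairs that use the Universal Dependencies relation names `nsubj`, `root` and `obj`; the text does not say which Stanford tool or label set is used, so the mapping and the `X` label for all other words are illustrative.

```python
# Assumed mapping from dependency relations to the three retained functions
UD_TO_FUNCTION = {"nsubj": "S", "root": "P", "obj": "O"}

def keep_core_functions(parsed_tokens):
    """Map each (token, relation) pair to S, P, O, or X for everything else."""
    return [(word, UD_TO_FUNCTION.get(rel, "X")) for word, rel in parsed_tokens]

# Hand-written parse of "She reads a book", standing in for parser output
parsed = [("She", "nsubj"), ("reads", "root"), ("a", "det"), ("book", "obj")]
labels = keep_core_functions(parsed)
```

Discarding the finer-grained relations keeps the English labels comparable with the subject, predicate and object labels that the function words and particles give for Burmese.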

Step C comprises the following steps. Step C01: a BiLSTM passes the information of each word in the sentence in the forward and backward directions, yielding the feature states produced at different time steps that contain contextual information. Step C02: the feature extraction ability of a CNN is used to obtain the deep semantic representation of the sentence.

In step C01, recurrent neural networks (RNNs) are widely used to process variable-length sequence input, and the LSTM is a popular RNN variant that alleviates the vanishing gradient and exploding gradient problems of RNNs. Given a sentence $x = \{x(1), x(2), \ldots, x(n)\}$, where $x(t)$, $t \in \{1, \ldots, n\}$, is a k-dimensional word vector, the hidden vector h(t) at time step t is updated as follows:

$i_t = \sigma(W_i x(t) + U_i h(t-1) + b_i)$ (1)

$f_t = \sigma(W_f x(t) + U_f h(t-1) + b_f)$ (2)

$o_t = \sigma(W_o x(t) + U_o h(t-1) + b_o)$ (3)

$h_t = o_t * \tanh(C_t)$ (6)

$h_{tg} = o_t * \tanh(C_t + G)$ (7)

where $i_t$ denotes the input gate, $f_t$ the forget gate, $o_t$ the output gate, and σ the sigmoid function. The weight matrices W and U and the bias vectors b of each gate are the parameters of the network. After syntactic information is added, the update of the hidden-layer vector becomes equation (7), where G denotes the function labeling operation and represents the vector of subject, predicate and object.
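A single LSTM time step with the function label vector G can be sketched in NumPy. Equations (4) and (5) do not appear in this text, so the candidate and cell-state update below follow the standard LSTM formulation, which is an assumption; the dictionary layout of the parameters, the dimensions and the random values are also illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params, g=None):
    """One LSTM step; when g is given, the hidden state follows equation (7)."""
    W, U, b = params["W"], params["U"], params["b"]
    i_t = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])  # input gate, eq. (1)
    f_t = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])  # forget gate, eq. (2)
    o_t = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])  # output gate, eq. (3)
    # Assumed standard LSTM cell update (not given in the text)
    c_cand = np.tanh(W["c"] @ x_t + U["c"] @ h_prev + b["c"])
    c_t = f_t * c_prev + i_t * c_cand
    shift = 0.0 if g is None else g
    h_t = o_t * np.tanh(c_t + shift)  # eq. (6) without g, eq. (7) with g
    return h_t, c_t

rng = np.random.default_rng(0)
k, hidden = 4, 3  # toy word vector and hidden sizes
params = {
    "W": {name: rng.normal(size=(hidden, k)) for name in "ifoc"},
    "U": {name: rng.normal(size=(hidden, hidden)) for name in "ifoc"},
    "b": {name: np.zeros(hidden) for name in "ifoc"},
}
x_t = rng.normal(size=k)
h0, c0 = np.zeros(hidden), np.zeros(hidden)

h_plain, c_plain = lstm_step(x_t, h0, c0, params)
h_tagged, c_tagged = lstm_step(x_t, h0, c0, params, g=np.array([1.0, 0.0, 0.0]))
```

In this reading G shifts the cell state only inside the output nonlinearity, so the stored cell state is identical with and without the label.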

A unidirectional LSTM cannot use the backward context. A bidirectional LSTM uses the context by processing the sequence in both directions and produces two independent sequences of LSTM output vectors: one LSTM processes the input sequence in the forward direction and the other processes it in reverse. The output at each time step is the concatenation of the two output vectors from the two directions, i.e., $h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]$.

In step C02, the original CNN consists of a convolutional layer, a pooling layer and a fully connected layer. A sentence of length n can be represented as $x_{1:n} = x_1 \oplus x_2 \oplus \cdots \oplus x_n$, where $\oplus$ is the concatenation operation, $x_i \in \mathbb{R}^d$ is the i-th word vector and d is the word vector dimension. The core of the convolution operation is to apply a filter $W \in \mathbb{R}^{hd}$ to a word sequence of window size h to produce a new feature $c_i$, as shown in the equation:

$c_i = f(W \cdot x_{i:i+h-1} + b)$ (8)

where b is a bias term and f is a nonlinear function (for example, sigmoid or ReLU). After a sentence of length n passes through the convolutional layer, the filter has been applied to every window of consecutive words $\{x_{1:h}, x_{2:h+1}, \ldots, x_{n-h+1:n}\}$ in the sentence, giving the deep semantic features c, as shown in the equation:

$c = [c_1, c_2, \ldots, c_{n-h+1}]$ (9)
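Equations (8) and (9) can be checked with a small NumPy example. The filter is stored here as an h×d array, which is equivalent to a vector of length hd applied to the concatenated window; the toy sentence, the all-ones filter and the identity nonlinearity are assumptions chosen so the arithmetic is easy to follow.

```python
import numpy as np

def window_features(x, W, b, f=np.tanh):
    """Apply one filter to every window of h consecutive word vectors (eq. 8)."""
    n, h = x.shape[0], W.shape[0]
    # One feature per window position, n - h + 1 in total (eq. 9)
    return np.array([f(np.sum(W * x[i:i + h]) + b) for i in range(n - h + 1)])

x = np.arange(12, dtype=float).reshape(6, 2)  # n = 6 words, d = 2
W = np.ones((3, 2))                            # window size h = 3
c = window_features(x, W, b=0.0, f=lambda z: z)
# c is [15., 27., 39., 51.]: each entry is the sum of one 3-word window
```

A sentence of 6 words with a window of 3 gives 6 - 3 + 1 = 4 features, matching the length of c in equation (9).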

In the present invention, a convolution kernel F = [F(0), ..., F(m-1)] with window size m is convolved over the output vectors of the Bi-LSTM to obtain the feature map, as shown in the equation:

where b is the bias term, and F and b are the parameters of this single filter.

As in a typical CNN structure, the pooling layer is built on top of the convolutional layer. Here K-Max Pooling is used to retain the k largest values for each filter, i.e.,
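K-max pooling over the feature map of one filter can be sketched as follows. Keeping the selected values in their original sequence order is a common convention for k-max pooling and is assumed here; the text only states that the k largest values are retained.

```python
import numpy as np

def k_max_pooling(features: np.ndarray, k: int) -> np.ndarray:
    """Return the k largest values of a 1-D feature map, in original order."""
    # argsort gives ascending order; take the last k indices, then restore order
    top_idx = np.sort(np.argsort(features)[-k:])
    return features[top_idx]

pooled = k_max_pooling(np.array([0.1, 0.9, 0.3, 0.7, 0.5]), k=3)
# pooled is [0.9, 0.7, 0.5]
```

With k = 1 this reduces to ordinary max pooling.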

In step D, the semantic features of the source sentence and the target sentence are extracted by the steps above. Their matching information is then captured by the element-wise product and the element-wise absolute difference and fed to a fully connected layer to estimate the probability that the sentences are translations of each other. The specific formulas are as follows:

$p(y_i \mid c_i) = \sigma(W^c c_i + c)$ (14)

where σ denotes the activation function and $W^a$, $W^b$, $W^c$, b and c are the parameters of the model. The model uses cross entropy as its objective function:

where n is the number of source sentences and m is the number of candidate target sentences.
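The matching layer and the cross-entropy objective can be sketched in NumPy. Because the intermediate formulas are not present in this text, the tanh hidden layer that combines the two matching vectors is an assumption, as are the parameter names, the dimensions and the random values; only the element-wise product, the absolute difference, the sigmoid output and the cross-entropy loss come from the description.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def match_probability(h_src, h_tgt, Wa, Wb, b, Wc, c):
    """Estimate the probability that two sentence vectors are translations."""
    prod = h_src * h_tgt             # element-wise product
    diff = np.abs(h_src - h_tgt)     # element-wise absolute difference
    hidden = np.tanh(Wa @ prod + Wb @ diff + b)  # assumed hidden layer
    return sigmoid(Wc @ hidden + c)  # output layer, as in eq. (14)

def cross_entropy(probs, labels):
    """Mean binary cross entropy over candidate sentence pairs."""
    probs = np.clip(probs, 1e-12, 1 - 1e-12)  # avoid log(0)
    return -np.mean(labels * np.log(probs) + (1 - labels) * np.log(1 - probs))

rng = np.random.default_rng(0)
dim, hid = 6, 4  # toy sentence-feature and hidden sizes
Wa, Wb = rng.normal(size=(hid, dim)), rng.normal(size=(hid, dim))
b, Wc, c = np.zeros(hid), rng.normal(size=hid), 0.0
h_src, h_tgt = rng.normal(size=dim), rng.normal(size=dim)

p = match_probability(h_src, h_tgt, Wa, Wb, b, Wc, c)
loss = cross_entropy(np.array([0.9, 0.2]), np.array([1.0, 0.0]))
```

Both matching vectors are symmetric in the two sentences, so the score does not depend on which sentence is treated as the source.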

The experimental data set of the invention comes mainly from English-Burmese mutually translated articles crawled from English-Burmese translation websites, which are manually screened and aligned to obtain 20,000 parallel sentence pairs. High-quality bilingual word vectors are pre-trained with the MUSE tool, and the word vector dimension is set to 300. The details are shown in Table 4;

Table 4. Corpus size

Corpus	Number of sentences (×10,000)
English-Burmese parallel corpus	2.0
English-Burmese non-parallel corpus	14.0

In addition, assuming 7 negative samples for each parallel sentence pair, 140,000 non-parallel pairs are constructed at random. To measure the performance of the English-Burmese parallel classifier properly, 2,000 English-Burmese parallel pairs and 4,000 English-Burmese non-parallel pairs are chosen as the test set for the experiments; see Table 5.

Table 5. Experimental data

As evaluation metrics, accuracy, precision, recall and F value (F1-measure) are used to measure whether the model classifies English-Burmese parallel sentences correctly.

Specific formula is as follows:

where TP is the number of true positives, FN is the number of false negatives, FP is the number of false positives, and TN is the number of true negatives.
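The four metrics follow their standard definitions in terms of TP, FP, FN and TN, which can be written out as below. The confusion-matrix counts in the example are invented purely to show the arithmetic and are not results from the experiments.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Illustrative counts only
acc, prec, rec, f1 = classification_metrics(tp=8, fp=2, fn=4, tn=6)
```

For these counts the accuracy is 14/20, the precision 8/10, the recall 8/12 and the F1 score 16/22.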

To demonstrate the effectiveness of the proposed method, the invention takes as its baseline model the method proposed by Grégoire, Francis et al., which extracts parallel sentence pairs with bidirectional recurrent neural networks. In addition, to show that deep learning builds more accurate classifiers than conventional machine learning, the invention also uses the maximum entropy model proposed by Munteanu D S et al. as a comparison experiment.

The choice of experimental parameters directly affects the final results. The following tables list the BiLSTM, CNN and experimental parameter settings; see Tables 6, 7 and 8.

Table 6. BiLSTM parameter settings

Parameter	Value
Embedding layer dimension	300
Hidden state dimension	300
Number of layers	3

Table 7. CNN parameter settings

Parameter	Value
Sliding window size	2, 3, 4
Number of sliding windows	300
Hidden state dimension	600
Pooling layer	Max pooling

Table 8. Experimental parameter settings

Parameter	Value
Batch size	128
Learning rate	0.001
Threshold	0.90
Dropout	0.8
Loss function	Cross entropy
Optimizer	Adam

To verify the performance of the proposed English-Burmese bilingual parallel sentence pair extraction model based on BiLSTM-CNN, BiLSTM and maximum entropy are used as baseline experiments and compared with the proposed BiLSTM-CNN method under different thresholds; see Table 9.

Table 9. Experimental comparison

The results show that classifiers trained with deep-learning feature extraction outperform the classifier trained with the maximum entropy model, mainly because neural networks can learn good features automatically. The combination of LSTM and CNN also works better than LSTM alone. The main reason is that the LSTM extracts global semantics and the CNN on top of it obtains the more important semantic features, which matches the semantics of two sentences more accurately than a plain LSTM. The performance of the model also varies with the chosen threshold; to obtain a good parallel sentence pair extraction model, the threshold is usually set to 0.9 or above. The accuracy and recall line charts are shown in Figs. 4 and 5.

The invention also proposes that function labeling has an important effect on the sentence pair extraction model. Therefore, with the threshold set to 0.9, the following groups of experiments compare models that incorporate function labels with models that do not; see Table 10.

Table 10. Comparison of models with and without function labels

Model	Precision (%)	Recall (%)	F-Measure (%)
CNN	62.4	60.3	61.5
CNN + function labels	63.5	61.2	62.3
Bi-LSTM	68.3	62.4	65.9
LSTM + function labels	69.4	62.8	56.5
BiLSTM + mean_pooling	68.6	62.6	66.0
BiLSTM + mean_pooling + function labels	69.5	63.1	66.6
BiLSTM + max_pooling	67.9	62.2	65.1
BiLSTM + max_pooling + function labels	68.1	62.4	65.5
BiLSTM-CNN	71.6	64.3	67.2
BiLSTM-CNN + function labels	72.2	65.1	68.3

Table 10 shows that models with fused function labels improve slightly over the corresponding models without them, mainly because function labels act as external knowledge that gives some guidance to the sentence representation. The invention also compares three operations applied after the BiLSTM: mean_pooling, max_pooling and CNN. The experimental results show that the CNN performs best, mainly because of its ability to extract important features. The proposed BiLSTM-CNN with function labels gives the best results overall.

The device comprises: a data acquisition module, which uses web crawler technology to obtain English-Burmese mutually translated articles from the web and cleans the data; a word vector module, which uses Facebook's MUSE tool to train bilingual word vectors on the acquired data; a function labeling module, which uses the Burmese word segmentation system of Kunming University of Science and Technology to segment Burmese into words and syllables, then uses Burmese function words and particles to assign function labels to Burmese, and uses the Stanford tool to assign function labels to English; a sentence representation module, which uses the BiLSTM-CNN to extract the semantic features of sentences; and an output module, which, after the source sentence and target sentence have been encoded (that is, after their semantic features have been extracted), uses the element-wise product and the element-wise absolute difference of the semantic features to capture the matching information between the semantic features of the source sentence and the target sentence, feeds the matching information to a fully connected layer, and uses the output probability as the criterion for deciding whether a pair is a parallel sentence pair, so that parallel sentence pairs can be extracted.

The embodiments of the invention have been explained in detail above with reference to the drawings, but the invention is not limited to these embodiments. Within the knowledge of a person skilled in the art, various changes can be made without departing from the concept of the invention.

Claims

1. An English-Burmese bilingual parallel sentence pair extraction method based on BiLSTM-CNN, characterized in that:

the specific steps of the method are as follows:

Step1: English-Burmese mutually translated articles crawled from English-Burmese translation websites are manually screened and aligned to obtain 20,000 parallel sentence pairs; bilingual word vector pre-training is then carried out to obtain a cross-lingual word embedding space shared by English and Burmese, and thus the English-Burmese bilingual word vectors, so that the semantic vectors of sentence representations are correlated across the English-Burmese semantic space;

Step2: function labels are assigned to each sentence, and the syntactic structure information of each word is concatenated onto its word vector, which captures the English-Burmese syntactic differences;

Step3: a BiLSTM passes the information of each word in the sentence in the forward and backward directions, yielding the feature states produced at different time steps that contain contextual information; a CNN is used to represent the syntactic features of the sentence, yielding the semantic features of the sentence;

Step4: the element-wise product and the element-wise absolute difference of the semantic features obtained in the steps above are used to capture the matching information between the semantic features of the source sentence and the target sentence; the matching information is fed to a fully connected layer, and the output probability is used as the criterion for deciding whether a pair is a parallel sentence pair, so that parallel sentence pairs are extracted.

2. The English-Burmese bilingual parallel sentence pair extraction method based on BiLSTM-CNN according to claim 1, characterized in that the specific steps of Step1 are as follows:

English-Burmese mutually translated articles crawled from English-Burmese translation websites are manually screened and aligned to obtain 20,000 parallel sentence pairs;

Burmese is segmented with the Burmese word segmentation system developed by Kunming University of Science and Technology (address 222.197.219.24:8099); training is carried out in a supervised manner using an English-Burmese seed dictionary, and Procrustes alignment is applied iteratively to learn the mapping from the source language to the target language, giving the cross-lingual word embedding space shared by English and Burmese and hence the English-Burmese bilingual word vectors; this step can be implemented with Facebook's MUSE system.

3. The English-Burmese bilingual parallel sentence pair extraction method based on BiLSTM-CNN according to claim 1, characterized in that in Step2, when function labels are assigned to a sentence, the subject, predicate and object of the sentence are identified from Burmese function words and particles, which comprises:

segmenting Burmese into words with the Burmese word segmentation system;

identifying the function words and particles in the sentence from the Burmese function words placed after nouns, the function words placed before nouns, and the particles that mark the subject, predicate and object components;

when function labels are assigned to a sentence, labeling English with the Stanford tool.

4. The English-Burmese bilingual parallel sentence pair extraction method based on BiLSTM-CNN according to claim 1, characterized in that the specific steps of Step3 and Step4 are as follows:

a BiLSTM passes the information of each word in the sentence in the forward and backward directions and uses the context of each word to obtain the feature states produced at different time steps that contain contextual information;

a CNN is used to represent the syntactic features of the sentence: convolution and pooling are applied to the obtained feature states to extract the key semantic features of the sentence, giving its deep semantic representation;

Adam is used as the model optimizer;

cross entropy is used as the loss function to evaluate the model;

the element-wise product and the element-wise absolute difference of the semantic features obtained in the steps above are used to capture the matching information between the semantic features of the source sentence and the target sentence; the matching information is fed to a fully connected layer to estimate the probability that the sentences are translations of each other, and the output probability is used as the criterion for deciding whether a pair is a parallel sentence pair; the output probability is compared with a threshold, and a pair whose probability exceeds the threshold is judged to be a parallel sentence pair; the threshold is 0.8 or 0.9.

5. An English-Burmese bilingual parallel sentence pair extraction device based on BiLSTM-CNN, characterized by comprising:

a data acquisition module, which uses web crawler technology to obtain English-Burmese mutually translated articles from the web and cleans the data;

a word vector module, which uses Facebook's MUSE tool to train bilingual word vectors on the acquired data;

a function labeling module, which uses the Burmese word segmentation system to segment Burmese into words and syllables, then uses Burmese function words and particles to assign function labels to Burmese, and uses the Stanford tool to assign function labels to English;

an output module, which, after the source sentence and target sentence have been encoded (that is, after their semantic features have been extracted), uses the element-wise product and the element-wise absolute difference of the semantic features to capture the matching information between the semantic features of the source sentence and the target sentence, feeds the matching information to a fully connected layer, and uses the output probability as the criterion for deciding whether a pair is a parallel sentence pair, so that parallel sentence pairs are extracted.