CN109657244B - English long sentence automatic segmentation method and system

Info

Publication number: CN109657244B
Application number: CN201811549280.6A
Authority: CN
Inventors: 张睦
Original assignee: Iol Wuhan Information Technology Co ltd
Current assignee: Iol Wuhan Information Technology Co ltd
Priority date: 2018-12-18
Filing date: 2018-12-18
Publication date: 2023-04-18
Anticipated expiration: 2038-12-18
Also published as: CN109657244A

Abstract

An embodiment of the invention provides an automatic English long sentence segmentation method and system. The method comprises: obtaining an English long sentence to be segmented; and inputting the English long sentence to be segmented into a trained sequence-to-sequence framework neural network model, which outputs two English short sentences. The method and system perform pattern recognition with the sequence-to-sequence neural network model and automatically segment the English long sentence into two short sentences, which greatly saves human resources.

Description

English long sentence automatic segmentation method and system

Technical Field

The embodiment of the invention relates to the technical field of translation, in particular to an automatic English long sentence segmentation method and system.

Background

After a translator whose native language is Chinese completes a Chinese-to-English translation, a translation company often invites a language expert who is a native English speaker to review the translation in order to further ensure its quality. Comparing a batch of translators' translations with the versions reviewed by experts shows that, apart from simple corrections of grammar, spelling and editing errors, the most common type of revision made by foreign experts is replacing one long English sentence with two short sentences that have the same meaning and are logically coherent.

Because of the linguistic differences between Chinese and English, a short Chinese text often requires a longer English text to convey the complete information when translated. The version reviewed by the foreign expert is more reasonable in its language organization, and its style gives the reader a better reading experience. However, inviting foreign experts to do the review work is very costly and consumes considerable human resources.

Therefore, there is a need for an automatic English long sentence segmentation method to solve the above problems.

Disclosure of Invention

In order to solve the above problems, embodiments of the present invention provide an automatic English long sentence segmentation method and system that overcome the above problems or at least partially solve the above problems.

The first aspect of the present invention provides an automatic English long sentence segmentation method, including:

obtaining an English long sentence to be segmented;

and inputting the English long sentence to be segmented into the trained sequence-to-sequence framework neural network model, and outputting two English short sentences.

In a second aspect, an embodiment of the present invention provides an automatic English long sentence segmentation system, including:

the obtaining module is used for obtaining an English long sentence to be segmented;

and the automatic segmentation module is used for inputting the English long sentence to be segmented into the trained sequence-to-sequence framework neural network model and outputting two English short sentences.

In a third aspect, an embodiment of the present invention provides an electronic device, including:

a processor, a memory, a communication interface, and a bus; the processor, the memory and the communication interface complete mutual communication through the bus; the memory stores program instructions which can be executed by the processor, and the processor calls the program instructions to execute the English long sentence automatic segmentation method.

In a fourth aspect, an embodiment of the present invention provides a non-transitory computer-readable storage medium, which stores computer instructions, where the computer instructions cause the computer to execute the method for automatically segmenting long english sentences as described above.

The method and system for automatically segmenting English long sentences provided by the embodiment of the invention perform pattern recognition with the sequence-to-sequence neural network model and automatically segment an English long sentence into two short sentences, which greatly saves human resources.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.

FIG. 1 is a schematic flow chart of an automatic English long sentence segmentation method according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of an encoder structure provided in an embodiment of the present invention;

FIG. 3 is a schematic diagram of a first short sentence decoder according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of a second short sentence decoder according to an embodiment of the present invention;

FIG. 5 is a schematic structural diagram of an automatic English long sentence segmentation system according to an embodiment of the present invention;

FIG. 6 is a block diagram of an electronic device according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some embodiments, but not all embodiments, of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

At present, after a translator whose native language is Chinese completes a Chinese-to-English translation, a translation company usually invites a language expert who is a native English speaker to review the translation in order to further ensure its quality. Comparing a batch of translators' translations with the versions reviewed by experts shows that, apart from simple corrections of grammar, spelling and editing errors, the most common type of revision made by foreign experts is replacing one long English sentence with two short sentences that have the same meaning and are logically consecutive. The main reason is not the translator's level of expertise but the linguistic differences between Chinese and English: when a short piece of Chinese text is translated into English, a longer English text is often needed to convey the complete information.

Table 1. Typical translation example provided by an embodiment of the present invention

Table 1 shows a typical translation example provided by the embodiment of the present invention. As shown in Table 1, the two translations differ only slightly in terms of string edit distance. Nevertheless, the version reviewed by the foreign expert is clearly more reasonable in its language organization and gives the reader a better reading experience. However, inviting foreign experts to do the review work is very costly.

To solve the above problem, FIG. 1 is a schematic flow chart of an automatic English long sentence segmentation method according to an embodiment of the present invention. As shown in FIG. 1, the method includes:

101. obtaining an English long sentence to be segmented;

102. inputting the English long sentence to be segmented into the trained sequence-to-sequence framework neural network model, and outputting two English short sentences.

It can be understood that, in order to segment English long sentences automatically, the embodiment of the present invention provides a trained sequence-to-sequence framework neural network model that can segment any English long sentence. The segmentation is completed automatically simply by inputting the English long sentence to be segmented into the trained model. The output is two English short sentences, referred to in the embodiment of the present invention as the first short sentence and the second short sentence.

Specifically, in step 101, one or more English long sentences to be segmented are obtained. It should be noted that the embodiment of the present invention places no limitation on the specific length or type of the English long sentence.

Then, in step 102, the English long sentence to be segmented is input into the trained sequence-to-sequence framework neural network model. This model is trained in advance and can automatically perform the segmentation. For training, historical original texts, translator translations and reviewed translations need to be accumulated as a training set and labeled respectively, so that the neural network model can learn the translation pattern of the reviewed translations and carry out the final segmentation.

The method and system for automatically segmenting English long sentences perform pattern recognition with the sequence-to-sequence neural network model and automatically segment an English long sentence into two short sentences, which greatly saves human resources.

On the basis of the above embodiment, before the English long sentence to be segmented is input into the trained sequence-to-sequence framework neural network model and two English short sentences are output, the method further includes:

obtaining a corpus data set, wherein the corpus data set comprises original texts, translator translations and reviewed translations;

and training a preset sequence-to-sequence framework neural network model by using the corpus data set as a training sample set to obtain the trained sequence-to-sequence framework neural network model.

As described in the above embodiment, the embodiment of the present invention provides a trained sequence-to-sequence framework neural network model; the model must first be trained on a training sample set before it can perform automatic segmentation.

In the embodiment of the invention, the original texts, translator translations and reviewed translations form the corpus data set and are used, in one-to-one correspondence, as the training set for the sequence-to-sequence framework neural network model. It should be noted that the selected corpus data set consists of historically translated data. In addition, during training the latest Chinese and English Wikipedia monolingual corpora are downloaded and word-segmented. Chinese and English word vectors are then trained with the Skip-Gram algorithm, where the main training hyper-parameters are preferably set as follows: the word vector dimension is 300 and the context window is 5.
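The context window setting can be illustrated by the (center word, context word) pairs that Skip-Gram training consumes. The following is a minimal sketch in Python and not the patent's implementation: the function name `skipgram_pairs` and the toy sentence are illustrative assumptions, and a real run would hand the word-segmented corpus to a word-embedding trainer configured with 300-dimensional vectors.

```python
from typing import Iterator, List, Tuple

VECTOR_DIM = 300      # word-vector dimension stated in the description
CONTEXT_WINDOW = 5    # context window stated in the description


def skipgram_pairs(tokens: List[str], window: int = CONTEXT_WINDOW) -> Iterator[Tuple[str, str]]:
    """Yield (center, context) pairs for every token within `window` positions of the center."""
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, tokens[j]


# Hypothetical tokenized sentence, used only for illustration.
tokens = ["the", "model", "splits", "a", "long", "sentence", "into", "two", "short", "sentences"]
pairs = list(skipgram_pairs(tokens))
```

Words more than five positions apart, such as "the" and "into" in the toy sentence, never form a pair, which is the effect of the window hyper-parameter.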

Finally, the sequence-to-sequence framework neural network model is trained based on the training sample set and the previously trained Chinese and English word vectors.

On the basis of the above embodiment, before the preset sequence-to-sequence framework neural network model is trained with the corpus data set as a training sample set, the method further includes:

performing word segmentation and sentence segmentation preprocessing on the text in the corpus data set.

As described in the above embodiment, the original texts, translator translations and reviewed translations are accumulated as the corpus data set; the embodiment of the present invention then performs word segmentation and sentence segmentation preprocessing on the text in the corpus data set.
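The description does not name the tools used for this preprocessing. As a rough sketch for the English side only, the two steps could look as follows in Python; the regular expressions and the function names `split_sentences` and `tokenize` are assumptions made for illustration, and Chinese text would need a dedicated word segmenter that is not shown here.

```python
import re
from typing import List

# Naive rules assumed for illustration; the description does not specify a tokenizer.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")
_TOKEN = re.compile(r"\w+(?:'\w+)?|[^\w\s]")


def split_sentences(text: str) -> List[str]:
    """Split English text into sentences at whitespace that follows '.', '!' or '?'."""
    return [s for s in _SENTENCE_END.split(text.strip()) if s]


def tokenize(sentence: str) -> List[str]:
    """Split a sentence into word tokens and single punctuation tokens."""
    return _TOKEN.findall(sentence)


text = "I did not mean that. Rather, I wish to help you escape!"
sentences = split_sentences(text)
tokens = [tokenize(s) for s in sentences]
```

A rule this simple mishandles abbreviations such as "Dr." or "e.g.", so a production pipeline would likely rely on an established tokenizer instead.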

Specifically, N triples of the form (original sentence, translator's translated sentence, (first reviewed sentence, second reviewed sentence)) are selected from the preprocessed corpus for model training and validation testing.

The data set is denoted as $D = \{D_1, D_2, D_3, \ldots, D_N\}$, where $D_i = (SRC_i, TRANS_i, (REVIEW_{i1}, REVIEW_{i2}))$. 20% of $D$ is randomly extracted as the validation test set $D_{test}$, and the remaining 80% is used as the training set $D_{train}$.
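The triple structure and the random 80/20 split can be sketched as follows. This is an illustrative Python sketch: the `Sample` type, the `split_dataset` helper, the fixed seed and the placeholder strings are assumptions and do not come from the patent.

```python
import random
from typing import List, NamedTuple, Tuple


class Sample(NamedTuple):
    src: str                  # SRC_i: original sentence
    trans: str                # TRANS_i: translator's translation
    review: Tuple[str, str]   # (REVIEW_i1, REVIEW_i2): the two reviewed short sentences


def split_dataset(data: List[Sample], test_ratio: float = 0.2, seed: int = 0):
    """Randomly hold out `test_ratio` of the samples; return (train, test)."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


# Placeholder samples; real data would come from the accumulated corpus.
D = [Sample(f"src {i}", f"trans {i}", (f"review {i}a", f"review {i}b")) for i in range(10)]
D_train, D_test = split_dataset(D)
```

Fixing the seed keeps the split reproducible between runs, which makes validation results comparable across training experiments.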

On the basis of the above embodiment, the sequence-to-sequence framework neural network model includes:

the system comprises an original text encoder, a translated text encoder, a first short sentence decoder and a second short sentence decoder.

Training the preset sequence-to-sequence framework neural network model with the corpus data set as a training sample set comprises the following steps:

combining the original text vector and the translated text vector in the training sample set into a first vector based on the original text encoder and the translated text encoder;

generating a first short sentence and a second vector based on the first short sentence decoder and the first vector;

generating a second short sentence based on the second short sentence decoder and the second vector.

The sequence-to-sequence framework neural network model provided by the embodiment of the invention thus mainly comprises four components: an original text encoder, a translated text encoder, a first short sentence decoder and a second short sentence decoder.

FIG. 2 is a schematic diagram of the encoder structure provided in an embodiment of the present invention. As shown in FIG. 2, there are two encoders, an original text encoder and a translated text encoder, each of which uses an LSTM recurrent neural network: the original text is encoded into an original text vector $C_{src}$ and the translator's translation into a translation vector $C_{trans}$. The two vectors are combined by concatenation into a new vector $C$, which is the first vector in the embodiment of the present invention.

For example, the original text "I did not want to preserve the incenses before, but rather wishes to drag the tiger with the old and weak, and create a chance for you to escape." and the translator's translation "What I ear before do not mean narrowing the translation using my old and week body so all have a change to escape" are each encoded into a vector by the corresponding encoder, and the two vectors are concatenated into a new vector $C$.
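To make the encoding and concatenation step concrete, the following Python sketch runs two small, untrained LSTM encoders over toy word vectors and concatenates their outputs. It is a sketch under stated assumptions rather than the patent's implementation: the class name `TinyLSTMEncoder`, the random weights, the 4-dimensional inputs, and the choice of the final hidden state as the sentence vector are all illustrative.

```python
import math
import random
from typing import Dict, List, Sequence

Vector = List[float]


def _sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


class TinyLSTMEncoder:
    """Single-layer LSTM that returns its final hidden state as the sentence vector."""

    def __init__(self, input_dim: int, hidden_dim: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.hidden_dim = hidden_dim
        cols = input_dim + hidden_dim
        # One weight matrix and bias per gate: input (i), forget (f), output (o), candidate (g).
        self.W: Dict[str, List[Vector]] = {
            gate: [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(hidden_dim)]
            for gate in "ifog"
        }
        self.b: Dict[str, Vector] = {gate: [0.0] * hidden_dim for gate in "ifog"}

    def _gate(self, gate: str, xh: Vector, activation) -> Vector:
        return [
            activation(sum(w * v for w, v in zip(row, xh)) + bias)
            for row, bias in zip(self.W[gate], self.b[gate])
        ]

    def encode(self, word_vectors: Sequence[Vector]) -> Vector:
        h = [0.0] * self.hidden_dim  # hidden state
        c = [0.0] * self.hidden_dim  # cell state
        for x in word_vectors:
            xh = list(x) + h  # current input joined with the previous hidden state
            i = self._gate("i", xh, _sigmoid)
            f = self._gate("f", xh, _sigmoid)
            o = self._gate("o", xh, _sigmoid)
            g = self._gate("g", xh, math.tanh)
            c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
            h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
        return h


# Toy 4-dimensional word vectors stand in for the 300-dimensional ones.
src_words = [[0.1, 0.2, 0.3, 0.4], [0.0, -0.1, 0.2, 0.1]]
trans_words = [[0.3, 0.1, 0.0, -0.2], [0.2, 0.2, 0.1, 0.0], [0.1, 0.0, 0.0, 0.3]]

src_encoder = TinyLSTMEncoder(input_dim=4, hidden_dim=3, seed=1)
trans_encoder = TinyLSTMEncoder(input_dim=4, hidden_dim=3, seed=2)

C_src = src_encoder.encode(src_words)
C_trans = trans_encoder.encode(trans_words)
C = C_src + C_trans  # list concatenation yields the first vector
```

The two encoders keep separate weights because the source and the translation are in different languages; the concatenated vector has twice the hidden size regardless of the sentence lengths.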

Next, FIG. 3 is a schematic diagram of the first short sentence decoder according to an embodiment of the present invention. As shown in FIG. 3, word vectors are used as the input of the first decoder and are combined with the first vector to generate the first short sentence and to obtain a new vector $C_{review1}$, which is the second vector in the embodiment of the present invention.

For example, the first short sentence decoder generates the first short sentence "What I ear before do not mean condensing the differences protection".

Finally, FIG. 4 is a schematic diagram of the second short sentence decoder according to an embodiment of the present invention. As shown in FIG. 4, the vector $C$ and the vector $C_{review1}$ are combined as the input of the second short sentence decoder, which generates the second short sentence.

For example, using the vector $C$ produced by the encoders and the vector $C_{review1}$ produced by the decoder that generated the first short sentence, the second decoder generates the second short sentence "Rather, I'm with to use my old and free body to discrete Tiger so you all have a hand to escape".
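The way the two decoders are chained can be sketched as plain control flow. In the Python sketch below, the `Decoder` protocol, the `CannedDecoder` stand-in (which returns fixed tokens instead of running a trained network), and the use of concatenation to combine C with C_review1 are all assumptions made for illustration; the description only states that the two vectors are combined.

```python
from typing import List, Protocol, Tuple

Vector = List[float]


class Decoder(Protocol):
    def decode(self, context: Vector) -> Tuple[List[str], Vector]:
        """Return the generated tokens and the decoder's final state vector."""
        ...


class CannedDecoder:
    """Stand-in for a trained LSTM decoder: returns fixed tokens and a dummy state."""

    def __init__(self, tokens: List[str], state_dim: int = 3) -> None:
        self.tokens = tokens
        self.state_dim = state_dim
        self.last_context_len = 0

    def decode(self, context: Vector) -> Tuple[List[str], Vector]:
        self.last_context_len = len(context)
        state = [sum(context) / len(context)] * self.state_dim  # placeholder state
        return list(self.tokens), state


def segment(C: Vector, first: Decoder, second: Decoder) -> Tuple[str, str]:
    """Generate the first short sentence from C, then the second from C and C_review1."""
    first_tokens, C_review1 = first.decode(C)
    second_tokens, _ = second.decode(C + C_review1)  # assumed: combine by concatenation
    return " ".join(first_tokens), " ".join(second_tokens)


C = [0.1, -0.2, 0.3, 0.0, 0.2, -0.1]  # toy first vector from the two encoders
first_decoder = CannedDecoder(["first", "short", "sentence", "."])
second_decoder = CannedDecoder(["second", "short", "sentence", "."])
first_sentence, second_sentence = segment(C, first_decoder, second_decoder)
```

Feeding C_review1 into the second decoder lets it condition on what the first short sentence already said, which is what keeps the two outputs logically consecutive.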

FIG. 5 is a schematic structural diagram of an automatic English long sentence segmentation system according to an embodiment of the present invention. As shown in FIG. 5, the system includes an obtaining module 501 and an automatic segmentation module 502, wherein:

the obtaining module 501 is configured to obtain an English long sentence to be segmented;

the automatic segmentation module 502 is configured to input the English long sentence to be segmented into the trained sequence-to-sequence framework neural network model and output two English short sentences.

Specifically, the obtaining module 501 and the automatic segmentation module 502 may be used to execute the technical solution of the method embodiment shown in FIG. 1; the implementation principles and technical effects are similar and are not described here again.
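The division of work between the two modules could be mirrored in code roughly as follows. This Python sketch is illustrative only: the class names, `run_system`, and in particular `placeholder_model` are assumptions, and the placeholder splits at the first comma purely so that the example runs; it is not the neural network model.

```python
from typing import Callable, Iterable, List, Tuple

# Any callable that maps one long sentence to two short sentences.
SplitModel = Callable[[str], Tuple[str, str]]


class AcquisitionModule:
    """Counterpart of module 501: collects the long sentences to be segmented."""

    def acquire(self, source: Iterable[str]) -> List[str]:
        return [line.strip() for line in source if line.strip()]


class SegmentationModule:
    """Counterpart of module 502: delegates each sentence to a trained model."""

    def __init__(self, model: SplitModel) -> None:
        self.model = model

    def segment(self, sentence: str) -> Tuple[str, str]:
        return self.model(sentence)


def run_system(source: Iterable[str], model: SplitModel) -> List[Tuple[str, str]]:
    sentences = AcquisitionModule().acquire(source)
    segmenter = SegmentationModule(model)
    return [segmenter.segment(s) for s in sentences]


def placeholder_model(sentence: str) -> Tuple[str, str]:
    """NOT the neural model: splits at the first comma so the sketch can run."""
    head, _, tail = sentence.partition(",")
    return head.strip() + ".", tail.strip()


results = run_system(["I did not mean that, rather I wish to help.\n", "  \n"], placeholder_model)
```

Passing the model in as a callable keeps the segmentation module independent of how the model is trained or loaded.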

On the basis of the above embodiment, the system further includes:

the training module is used for obtaining a corpus data set, wherein the corpus data set comprises original texts, translator translations and reviewed translations;

and for training a preset sequence-to-sequence framework neural network model by using the corpus data set as a training sample set to obtain the trained sequence-to-sequence framework neural network model.

On the basis of the above embodiment, the system further includes:

the preprocessing module is used for performing word segmentation and sentence segmentation preprocessing on the text in the corpus data set.

On the basis of the above embodiment, the training module includes:

the encoding unit is used for combining the original text vectors and the translated text vectors in the training sample set into a first vector based on the original text encoder and the translated text encoder;

a first decoding unit configured to generate a first short sentence and a second vector based on the first short sentence decoder and the first vector;

a second decoding unit configured to generate a second short sentence based on the second short sentence decoder and the second vector.

An embodiment of the present invention provides an electronic device, including: at least one processor; and at least one memory communicatively coupled to the processor, wherein:

FIG. 6 is a block diagram of an electronic device according to an embodiment of the present invention. Referring to FIG. 6, the electronic device includes: a processor 601, a communication interface 602, a memory 603 and a bus 604, wherein the processor 601, the communication interface 602 and the memory 603 communicate with each other through the bus 604. The processor 601 may call logic instructions in the memory 603 to perform the following method: obtaining an English long sentence to be segmented; and inputting the English long sentence to be segmented into the trained sequence-to-sequence framework neural network model, and outputting two English short sentences.

An embodiment of the present invention discloses a computer program product, which includes a computer program stored on a non-transitory computer-readable storage medium. The computer program includes program instructions which, when executed by a computer, enable the computer to execute the methods provided by the above method embodiments, for example: obtaining an English long sentence to be segmented; and inputting the English long sentence to be segmented into the trained sequence-to-sequence framework neural network model, and outputting two English short sentences.

Embodiments of the present invention provide a non-transitory computer-readable storage medium storing computer instructions which cause a computer to execute the methods provided by the above method embodiments, for example: obtaining an English long sentence to be segmented; and inputting the English long sentence to be segmented into the trained sequence-to-sequence framework neural network model, and outputting two English short sentences.

Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment may be implemented by software plus a necessary general hardware platform, and may also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in each embodiment or some portions of the embodiments.

Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. An automatic English long sentence segmentation method is characterized by comprising the following steps:

obtaining an English long sentence to be segmented;

inputting the English long sentence to be segmented into the trained sequence-to-sequence framework neural network model, and outputting two English short sentences;

before the English long sentence to be divided is input into the neural network of the trained sequence-to-sequence framework and two English short sentences are output, the method further comprises the following steps:

training a preset sequence-to-sequence framework neural network model by using the corpus data set as a training sample set to obtain the trained sequence-to-sequence framework neural network model;

the sequence-to-sequence framework neural network model comprises:

the system comprises an original text encoder, a translated text encoder, a first short sentence decoder and a second short sentence decoder;

combining the original text vector and the translation vector in the training sample set into a first vector based on the original text encoder and the translation encoder;

2. The method according to claim 1, wherein before said training the corpus data set as a training sample set on a preset sequence-to-sequence framework neural network model, the method further comprises:

3. An automatic English long sentence segmentation system is characterized by comprising:

the automatic segmentation module is used for inputting the English long sentence to be segmented into the trained sequence-to-sequence framework neural network model and outputting two English short sentences;

before inputting the English long sentence to be segmented into the trained sequence-to-sequence framework neural network model and outputting two English short sentences, the method further comprises:

obtaining a corpus data set, wherein the corpus data set comprises original texts, translator translations and reviewed translations;

training a preset sequence-to-sequence framework neural network model by using the corpus data set as a training sample set to obtain the trained sequence-to-sequence framework neural network model;

the sequence-to-sequence framework neural network model comprises:

4. An electronic device, comprising a memory and a processor, wherein the processor and the memory communicate with each other via a bus; the memory stores program instructions executable by the processor, the processor invoking the program instructions to be capable of performing the method of claim 1 or 2.

5. A non-transitory computer-readable storage medium storing computer instructions that cause a computer to perform the method of claim 1 or 2.