CN112446224A

CN112446224A - Parallel corpus processing method, device and equipment and computer readable storage medium

Info

Publication number: CN112446224A
Application number: CN202011415780.8A
Authority: CN
Inventors: 方恺齐; 崔春来
Original assignee: Guangzhou Caicheng Ming Technology Co ltd; Beijing Caiyun Ring Pacific Technology Co ltd
Current assignee: Guangzhou Caicheng Ming Technology Co ltd; Beijing Caiyun Ring Pacific Technology Co ltd
Priority date: 2020-12-07
Filing date: 2020-12-07
Publication date: 2021-03-05

Abstract

The application provides a parallel corpus processing method, a device, equipment and a computer readable storage medium, wherein the method carries out sentence splitting operation on a target parallel corpus to obtain M sentences of original texts of original text documents and N sentences of translated texts of translated text documents in the target parallel corpus; encoding the M sentences of original texts and the N sentences of translated texts to obtain a vector corresponding to each sentence of original text and a vector corresponding to each sentence of translated text; according to the obtained vector, performing segmentation operation on the target parallel corpus to obtain a plurality of bilingual inter-translation segments; and carrying out alignment operation on each bilingual inter-translation segment to obtain an alignment result of the target parallel corpus, so that the workload is reduced, and the accuracy and the efficiency are improved.

Description

Parallel corpus processing method, device and equipment and computer readable storage medium

Technical Field

The present invention relates to the field of translation technologies, and in particular, to a parallel corpus processing method, apparatus, device, and computer-readable storage medium.

Background

A parallel corpus is a bilingual corpus consisting of original texts and their corresponding translated texts arranged in parallel. It is an important resource for training machine translation models, and plays an indispensable role in translation conversion research, translation style research, and filling missing sense entries in bilingual dictionaries in the translation field.

In the related technology, parallel corpus processing is performed on a long document, and a common means is to use a manually formulated rule to screen and construct feature words and phrases which can represent similarity in two sentences, use the word features to calculate the similarity between the two sentences, and determine the alignment result of the parallel corpus according to the similarity.

However, this approach can only align a single sentence to a single sentence and cannot accurately handle the alignment of multiple sentences to multiple sentences; when processing long documents, its accuracy is low, its workload is large, and its efficiency is low.

Disclosure of Invention

The present application provides a parallel corpus processing method, apparatus, device and computer-readable storage medium, so as to solve the technical problems that the prior art can only align a single sentence to a single sentence, cannot accurately handle the alignment of multiple sentences to multiple sentences, and has low accuracy, a large workload and low efficiency when processing long documents.

In a first aspect, the present application provides a parallel corpus processing method, including:

performing sentence splitting operation on a target parallel corpus to obtain M sentences of original texts of original text documents and N sentences of translated texts of translated text documents in the target parallel corpus;

encoding the M sentences of original texts and the N sentences of translated texts to obtain a vector corresponding to each sentence of original text and a vector corresponding to each sentence of translated text;

according to the obtained vector, performing segmentation operation on the target parallel corpus to obtain a plurality of bilingual inter-translation segments;

and carrying out alignment operation on each bilingual inter-translation segment to obtain an alignment result of the target parallel corpus.

Here, the method first performs sentence splitting and encoding on the target parallel corpus, so that the similarity of each short sentence can be calculated accurately and sentence pairs that can be used to segment the document can be found. The whole document is then segmented into a plurality of short documents, and the alignment result is obtained by splicing sentences within the short documents and calculating their similarity.

Optionally, the performing a segmentation operation on the target parallel corpus according to the obtained vector to obtain a plurality of bilingual inter-translation segments includes:

calculating, according to the obtained vectors, a first similarity of the M × N sentence pairs formed by any one of the M sentences of original texts and any one of the N sentences of translated texts;

determining a sentence pair for segmenting the document according to a first preset rule and the first similarity;

and carrying out segmentation operation on the target parallel corpus according to the sentence pairs of the segmented document to obtain the multiple bilingual inter-translation segments.

Here, the embodiment of the present application determines the sentence pairs for segmenting the document, namely the segmentation points, by calculating the similarity between all original sentences and all translated sentences, and segments the target parallel corpus according to those sentence pairs.

Optionally, the aligning operation is performed on each bilingual inter-translation segment to obtain an alignment result of the target parallel corpus, including:

splicing the bilingual inter-translation segments into a plurality of sentence combinations according to the order of the target parallel corpus;

coding the sentence combinations to obtain a vector corresponding to each combination;

calculating a second similarity of the original text and the translated text in the plurality of sentence combinations according to the obtained vector;

and obtaining an alignment result of the target parallel corpus according to a second preset rule and the second similarity.

Here, in the embodiment of the present application, the sentences of the segmented short documents are spliced in the original order of the target parallel corpus to obtain a plurality of sentence combinations, and the similarity of all spliced sentence combinations is then calculated, so that the sentence combination with the highest similarity between the original text and the translated text can be obtained. This achieves accurate alignment and further improves the accuracy of the target parallel corpus alignment.
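
The splicing and second-similarity comparison described above can be illustrated with a short Python sketch. It is only one possible reading: the greedy left-to-right strategy, the `max_span` limit, and the `similarity` callback (which stands in for encoding the spliced text and comparing vectors) are all assumptions introduced here, not details stated in the application.

```python
from typing import Callable, List, Tuple

def align_segment(
    src: List[str],
    tgt: List[str],
    similarity: Callable[[str, str], float],
    max_span: int = 2,
) -> List[Tuple[List[str], List[str]]]:
    """Greedily align one bilingual segment by splicing consecutive sentences.

    At each position, every combination of 1..max_span source sentences and
    1..max_span target sentences is spliced and scored; the best-scoring
    combination is emitted and the scan moves past it.
    Trailing sentences left on only one side are not aligned in this sketch.
    """
    alignments = []
    i = j = 0
    while i < len(src) and j < len(tgt):
        best = None  # (score, number of source sentences, number of target sentences)
        for a in range(1, max_span + 1):
            for b in range(1, max_span + 1):
                if i + a > len(src) or j + b > len(tgt):
                    continue
                score = similarity(" ".join(src[i:i + a]), " ".join(tgt[j:j + b]))
                if best is None or score > best[0]:
                    best = (score, a, b)
        _, a, b = best
        alignments.append((src[i:i + a], tgt[j:j + b]))
        i += a
        j += b
    return alignments
```

In practice the `similarity` argument would wrap a sentence encoder and a vector similarity; any function taking two strings and returning a float can be plugged in.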

Optionally, the determining a sentence pair for segmenting the document according to the first preset rule and the first similarity includes:

deleting the sentence pairs with the first similarity lower than a first preset similarity threshold from the M × N sentence pairs, and screening the remaining sentence pairs in the M × N sentence pairs according to the first preset rule;

and determining sentence pairs of the segmented document according to the screening result.

Here, the embodiment of the present application deletes the sentence pairs whose first similarity is lower than the first preset similarity threshold, and screens out, according to the first preset rule, the sentence pairs suitable for use as anchor points. In this way the segmentation points of the document can be determined accurately and the target parallel corpus can be segmented accurately and finely, which further improves the accuracy of the target parallel corpus alignment.

Optionally, before the screening of the remaining sentence pairs in the M × N sentence pairs, the method further includes:

determining the relation between sentences in the remaining sentence pairs;

and if, according to the relation, the similarity of a sentence pair to be processed among the remaining sentence pairs is lower than the similarity of another sentence pair containing the same sentence, deleting the sentence pair to be processed.

Here, in the embodiment of the present application, before the remaining sentence pairs in the M × N sentence pairs are screened, the relation between the sentences in the remaining sentence pairs is determined and the sentence pairs are filtered, so that the sentence pairs used for segmenting the document are determined among the remaining sentence pairs. In addition, if any sentence in a sentence pair also occurs in another sentence pair with a higher score, it can be determined that the matching degree of the other sentence pair is higher than that of the current sentence pair, and the current sentence pair is deleted.

if the ratio of the numbers of characters or the numbers of words of the two sentences in a sentence pair exceeds a first preset ratio threshold, deleting the sentence pair;

or,

if the ratio of the segment lengths of a sentence pair after division according to the label exceeds a second preset ratio threshold, deleting the sentence pair;

or,

splicing any sentence in a sentence pair with its previous sentence or next sentence, recalculating the similarity score of the modified sentence pair, and deleting the sentence pair if the similarity score of the new sentence pair is higher;

or,

if the GLEU score obtained after the sentence pair is translated into the same language is lower than a preset score threshold, deleting the sentence pair;

or,

if the number of intersection points generated by a sentence pair and other sentence pairs is larger than a preset intersection number, deleting the sentence pair.

Here, the embodiments of the present application provide several ways of filtering sentence pairs before the remaining sentence pairs in the M × N sentence pairs are screened. Sentence pairs are filtered by the ratio of their numbers of characters or words, the ratio of their segment lengths after division according to the label, the similarity score, the GLEU score, and the number of intersections between the original text and the translated text, so that sentence pairs unsuitable for segmenting the target parallel corpus are deleted in advance. This reduces the workload of subsequently determining the sentence pairs for segmenting the document, and further improves the efficiency and accuracy of segmentation and alignment.
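
Two of the filters listed above, the length-ratio check and the intersection count, can be sketched in Python as follows. This is an illustrative sketch under stated assumptions: lengths are measured in characters, an "intersection" is read as two sentence pairs whose order in the original text is the reverse of their order in the translated text, and the function names and the default threshold are hypothetical.

```python
from typing import List, Tuple

Pair = Tuple[int, int]  # (index of original sentence, index of translated sentence)

def length_ratio_ok(src: str, tgt: str, max_ratio: float = 3.0) -> bool:
    """Return True if the longer sentence is at most max_ratio times the shorter one."""
    a, b = len(src), len(tgt)
    if min(a, b) == 0:
        return False
    return max(a, b) / min(a, b) <= max_ratio

def count_crossings(pair: Pair, others: List[Pair]) -> int:
    """Count the pairs whose source order and target order disagree with `pair`."""
    i, j = pair
    return sum(1 for (p, q) in others if (p - i) * (q - j) < 0)

def drop_crossing_pairs(pairs: List[Pair], max_crossings: int) -> List[Pair]:
    """Keep only the pairs that cross at most max_crossings other pairs."""
    return [p for p in pairs if count_crossings(p, pairs) <= max_crossings]
```

For example, with the pairs (1, 1), (2, 4), (3, 2), (4, 3) and (5, 5), the pair (2, 4) crosses both (3, 2) and (4, 3), so a limit of one crossing removes it and keeps the rest.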

In a second aspect, an embodiment of the present application provides a parallel corpus processing apparatus, including:

the sentence splitting module is used for performing sentence splitting operation on the target parallel corpus to obtain M sentences of original texts of the original text documents and N sentences of translated texts of the translated text documents in the target parallel corpus;

the coding module is used for coding the M sentences of original texts and the N sentences of translated texts to obtain a vector corresponding to each sentence of original text and a vector corresponding to each sentence of translated text;

the segmentation module is used for carrying out segmentation operation on the target parallel corpus according to the obtained vector to obtain a plurality of bilingual inter-translation segments;

and the alignment module is used for performing alignment operation on each bilingual inter-translation segment to obtain an alignment result of the target parallel corpus.

Optionally, the segmentation module is specifically configured to:

Optionally, the alignment module is specifically configured to:

Optionally, the segmentation module is specifically configured to:

Optionally, before the filtering of the remaining sentence pairs of the M × N sentence pairs, the segmenting module is further configured to:

determining the relation between sentences in the remaining sentence pairs;

or,

if, after any sentence in a sentence pair is spliced with its previous sentence or next sentence in the original text, the similarity score of the new sentence pair is higher, deleting the sentence pair (the similarity score may be the GLEU score after translation into the same language, or the cosine similarity after encoding by an encoder, and the like);

or,

if the GLEU score obtained after the sentence pair is translated into the same language is lower than a first preset score, deleting the sentence pair;

or,

In a third aspect, an embodiment of the present application provides a parallel corpus processing apparatus, including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of parallel corpus processing according to the first aspect or the alternatives thereof.

In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium, where computer-executable instructions are stored in the computer-readable storage medium, and when executed by a processor, the computer-executable instructions implement the parallel corpus processing method according to the first aspect or the optional manners of the first aspect.

The method first performs sentence splitting and encoding on the target parallel corpus, so that the similarity of each short sentence can be calculated accurately and sentence pairs that can be used to segment the document can be found. The whole document is then segmented into a plurality of short documents, and the alignment result of the long document is obtained by splicing the short documents and calculating similarity.

Drawings

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without inventive exercise.

FIG. 1 is a diagram illustrating a parallel corpus processing system according to an embodiment of the present application;

FIG. 2 is a flowchart illustrating a parallel corpus processing method according to an embodiment of the present application;

FIG. 3 is a flowchart of another parallel corpus processing method according to an embodiment of the present application;

FIG. 4 is a schematic diagram of a segmentation process provided by an embodiment of the present application;

FIG. 5 is a schematic diagram of an alignment process provided by an embodiment of the present application;

FIG. 6 is a schematic structural diagram of a parallel corpus processing apparatus according to an embodiment of the present application;

FIG. 7 is a schematic structural diagram of a parallel corpus processing apparatus according to an embodiment of the present application;

With the foregoing drawings in mind, certain embodiments of the disclosure are shown and described in more detail below. These drawings and written description are not intended to limit the scope of the disclosed concepts in any way, but rather to illustrate the concepts of the disclosure to those skilled in the art by reference to specific embodiments.

Detailed Description

Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.

The terms "first," "second," "third," and "fourth," if any, in the description and claims of this application and the above-described figures are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

A parallel corpus is a bilingual or multilingual corpus formed by original texts and their corresponding translated texts arranged in parallel. It is an important resource for training machine translation models, and plays an indispensable role in improving machine translation quality, studying translation style, filling missing sense entries in bilingual dictionaries, and the like in the translation field.

In the related art, the alignment of parallel corpora for a long document is generally divided into two steps. The first step is to calculate the similarity between two sentences of text by a designed method, and the second step is to align the texts according to a specific algorithm using the calculated similarity.

A common means of calculating the similarity is to use manually formulated rules to screen and construct feature words and phrases that can represent similarity in the two sentences, and to calculate the similarity between the two sentences using these word features and some rules. However, this approach requires manually constructing one-to-one corresponding word lists, and no complete rule for calculating the similarity currently exists, so patches have to be added to the similarity rules for different situations, which consumes considerable effort. Among second-step alignment algorithms, the currently widely used alignment algorithm based on dynamic programming is slow when processing long documents. Moreover, because it theoretically needs to calculate the similarity of every sentence combination (including wrong ones) and take those similarities into account when calculating the alignment result, it is difficult for it to ignore poor-quality sentence pairs and it easily aligns wrong sentence pairs. As a result, sentences that are misaligned because of their own poor quality, imperfect rules, or other error factors may also reduce the alignment quality of neighboring sentences.

An improved alignment scheme calculates the similarity using a neural network: an encoder trained on multilingual data for tasks such as translation or classification encodes each sentence into a vector, the similarity of the two sentence vectors is calculated by a certain algorithm, and the final alignment is carried out according to some algorithm. This avoids, to a certain extent, the wrong scores caused by imperfect manual rules in the first step. The drawback of this scheme is that if a dynamic programming algorithm is used in the second-step alignment, the quality problems mentioned above cannot be circumvented. Other alignment schemes that do not adopt dynamic programming, however, cannot conveniently handle the alignment of multiple sentences to multiple sentences: they can only align a single sentence to a single sentence and then extend the single-sentence alignment to a multi-sentence alignment, so the accuracy is not high enough.

In order to solve the above problems, embodiments of the present application provide a parallel corpus processing method, apparatus, device, and computer-readable storage medium. When processing bilingual inter-translated documents, sentence splitting and encoding are first performed on the target parallel corpus, so that the similarity of each short sentence can be calculated accurately and sentence pairs that can be used to segment the document can be found. The entire document is then segmented into a plurality of short documents, and the alignment result of the long document is obtained by splicing the short documents and calculating similarity. Because the embodiments of the present application segment the long document into short documents and thereby reduce the alignment of a complex long document to the spliced alignment of short documents, they solve the problem that existing schemes not based on dynamic programming can only align a single sentence to a single sentence and cannot accurately handle multiple sentences to multiple sentences. In addition, because the similarity scores are sorted and filtered, the local errors produced when a dynamic programming algorithm aligns error-prone sentences can be avoided, and the accuracy and efficiency of target parallel corpus alignment are improved.

Optionally, FIG. 1 is a schematic diagram of a parallel corpus processing system according to an embodiment of the present application. In FIG. 1, the architecture includes at least one of a receiving device 101, a processor 102, and a display device 103.

It is understood that the illustrated structure of the embodiment of the present application does not form a specific limitation to the architecture of the parallel corpus processing system. In other possible embodiments of the present application, the foregoing architecture may include more or less components than those shown in the drawings, or combine some components, or split some components, or arrange different components, which may be determined according to practical application scenarios, and is not limited herein. The components shown in fig. 1 may be implemented in hardware, software, or a combination of software and hardware.

In a specific implementation process, the receiving device 101 may be an input/output interface or a communication interface.

When processing bilingual inter-translation documents, the processor 102 can segment a long document into short documents, and thereby reduce the alignment of a complex long document to the spliced alignment of short documents. It can accurately handle the alignment of multiple sentences to multiple sentences, is not limited to single-sentence-to-single-sentence alignment, reduces the alignment workload, and improves the accuracy and efficiency of target parallel corpus alignment. The aligned bilingual inter-translation documents can then be used in the translation field for translation conversion research, translation style research, and filling missing sense entries in bilingual dictionaries.

The display device 103 may be used to display the above results and the like.

The display device may also be a touch display screen for receiving user instructions while displaying the above-mentioned content to enable interaction with a user.

It should be understood that the processor may be implemented by reading instructions in the memory and executing the instructions, or may be implemented by a chip circuit.

In addition, the network architecture and the service scenario described in the embodiment of the present application are for more clearly illustrating the technical solution of the embodiment of the present application, and do not constitute a limitation to the technical solution provided in the embodiment of the present application, and it can be known by a person skilled in the art that along with the evolution of the network architecture and the appearance of a new service scenario, the technical solution provided in the embodiment of the present application is also applicable to similar technical problems.

The technical scheme of the application is described in detail by combining specific embodiments as follows:

FIG. 2 is a flowchart of a parallel corpus processing method according to an embodiment of the present application. The execution subject of the embodiment of the present application may be the processor 102 in FIG. 1, and the specific execution subject may be determined according to an actual application scenario. As shown in FIG. 2, the method comprises the steps of:

S201: performing a sentence splitting operation on the target parallel corpus to obtain M sentences of original texts of the original text documents and N sentences of translated texts of the translated text documents in the target parallel corpus.

Wherein, M and N may be the same or different.

Optionally, the sentence splitting operation on the target parallel corpus may be performed based on punctuation marks in the target parallel corpus, based on a preset number of words per sentence, or based on certain specific words or phrases, which is not specifically limited in the present application.
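
As an illustration of the punctuation-based option, a minimal Python sketch is given below. The set of sentence-ending punctuation marks and the function name are assumptions chosen for this example; the application does not prescribe a particular set.

```python
import re
from typing import List

# Hypothetical terminator set: common Chinese and English sentence-ending marks.
_SENTENCE = re.compile(r"[^。！？!?.]+[。！？!?.]*")

def split_sentences(text: str) -> List[str]:
    """Split text into sentences, keeping each terminator with its sentence."""
    return [s.strip() for s in _SENTENCE.findall(text) if s.strip()]
```

Applying the same function to the original text document and to the translated text document yields the M and N sentences used in the later steps. A real system would need extra handling for abbreviations and decimal numbers, which this sketch splits on.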

S202: encoding the M sentences of original texts and the N sentences of translated texts to obtain a vector corresponding to each sentence of original text and a vector corresponding to each sentence of translated text.

Here, encoding refers to converting a sentence into a vector.

Optionally, M sentences of original text and N sentences of translated text may be encoded in a sparse word vector manner.

Optionally, the encoder of an existing neural network translation or classification model may be used: the input sentence is passed through the matrix operations of the neural network, and the output is the vector of the sentence.
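
The sparse word vector option can be illustrated with a simple bag-of-words encoding in Python. This is a sketch only: whitespace tokenization, the vocabulary construction, and the function names are assumptions made for the example (Chinese text would first need a word segmenter, and comparing sentences across languages would additionally require a shared or mapped vocabulary, which is not shown).

```python
from collections import Counter
from typing import Dict, List

def build_vocab(sentences: List[str]) -> Dict[str, int]:
    """Assign each distinct token an index in order of first appearance."""
    vocab: Dict[str, int] = {}
    for sentence in sentences:
        for token in sentence.split():
            vocab.setdefault(token, len(vocab))
    return vocab

def encode_sparse(sentence: str, vocab: Dict[str, int]) -> Dict[int, int]:
    """Encode a sentence as a sparse vector {token index: count}; unknown tokens are skipped."""
    counts = Counter(tok for tok in sentence.split() if tok in vocab)
    return {vocab[tok]: n for tok, n in counts.items()}
```

Only non-zero dimensions are stored, which is what makes the representation sparse.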

S203: performing a segmentation operation on the target parallel corpus according to the obtained vectors to obtain a plurality of bilingual inter-translation segments.

S204: performing an alignment operation on each bilingual inter-translation segment to obtain an alignment result of the target parallel corpus.

The embodiment of the present application solves the problems in the prior art that only a single sentence can be aligned to a single sentence when sentence vector technology is used, or that multiple sentences cannot be processed accurately when a dynamic programming algorithm is used, and improves the accuracy and efficiency of target parallel corpus alignment.

Optionally, in the embodiment of the present application, the segmentation operation may be performed on the target parallel corpus according to a first similarity of the M × N sentence pairs formed by any one of the M original sentences and any one of the N translated sentences. Accordingly, FIG. 3 is a flowchart of another parallel corpus processing method provided in the embodiment of the present application. As shown in FIG. 3, the method includes:

S301: performing a sentence splitting operation on the target parallel corpus to obtain M sentences of original texts of the original text documents and N sentences of translated texts of the translated text documents in the target parallel corpus.

S302: encoding the M sentences of original texts and the N sentences of translated texts to obtain a vector corresponding to each sentence of original text and a vector corresponding to each sentence of translated text.

Steps S301 and S302 are the same as steps S201 and S202, and are not described herein again.

S303: calculating, according to the obtained vectors, a first similarity of the M × N sentence pairs formed by any one of the M sentences of original texts and any one of the N sentences of translated texts.

S304: determining sentence pairs for segmenting the document according to a first preset rule and the first similarity.

Optionally, determining a sentence pair for segmenting the document according to a first preset rule and the first similarity includes:

deleting sentence pairs with the first similarity lower than a first preset similarity threshold from the M × N sentence pairs, and screening the remaining sentence pairs in the M × N sentence pairs according to the first preset rule; and determining sentence pairs for segmenting the document according to the screening result.

Optionally, the first similarities are sorted from high to low, and if the rank of the first similarity of a sentence pair among all the first similarities is ahead of a preset rank, the sentence pair is determined as a sentence pair for segmenting the document.

S305: performing a segmentation operation on the target parallel corpus according to the sentence pairs for segmenting the document to obtain a plurality of bilingual inter-translation segments.
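
The threshold-based and rank-based selection of candidate sentence pairs in S304 can be sketched in Python as follows. The function name, the dictionary layout of the scores, and the parameter names are hypothetical; the sketch covers only the threshold and rank steps, not the further screening by the first preset rule.

```python
from typing import Dict, List, Optional, Tuple

Pair = Tuple[int, int]  # (index of original sentence, index of translated sentence)

def select_candidate_pairs(
    scores: Dict[Pair, float],
    threshold: Optional[float] = None,
    top_n: Optional[int] = None,
) -> List[Pair]:
    """Sort pairs by first similarity (high to low), drop those below
    `threshold`, and optionally keep only the `top_n` highest-ranked pairs."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    if threshold is not None:
        ranked = [(pair, s) for pair, s in ranked if s >= threshold]
    if top_n is not None:
        ranked = ranked[:top_n]
    return [pair for pair, _ in ranked]
```

Either criterion can be used alone or both together, depending on how the preset threshold and preset rank are configured.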

Optionally, the first similarity may be calculated using the bilingual evaluation understudy (BLEU) score.

Optionally, the first similarity may be determined by calculating a cosine similarity score between any sentence of the original text and any sentence of the translated text. Optionally, the first similarity may be obtained by normalizing the cosine similarity between the two sentence vectors, and the calculation formula is as follows:

$$\mathrm{score}(x, y) = \frac{\cos(x, y)}{\frac{1}{2}\left(\frac{1}{k}\sum_{z \in NN_k(x)} \cos(x, z) + \frac{1}{k}\sum_{z \in NN_k(y)} \cos(y, z)\right)}$$

In the above formula, score(x, y) is the relative similarity between two sentences (used herein as an example of the first similarity calculation), cos(x, y) is the cosine similarity of the vectors obtained by the encoder for the two sentences, x represents the vector encoded from a sentence in language 1, and y represents the vector encoded from a sentence in language 2.

The term $\frac{1}{k}\sum_{z \in NN_k(x)} \cos(x, z)$ means that, for x, the vectors z of the k sentences closest to x are found in language 2, the cosine similarity between x and each of them is calculated, and the k similarities are then averaged.

The term $\frac{1}{k}\sum_{z \in NN_k(y)} \cos(y, z)$ has the same meaning: for y, the k vectors z most similar to y are found in language 1, the cosine similarity between y and each of the k vectors is calculated, and the similarities are then averaged.
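
The relative similarity described above can be sketched in Python as follows. This is a minimal sketch that assumes the normalization divides the cosine similarity by the mean of the two neighborhood averages; the function names and the default value of k are hypothetical, and vectors are plain lists of numbers rather than the output of a real encoder.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]

def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def mean_top_k(vec: Vector, others: List[Vector], k: int) -> float:
    """Average cosine similarity between `vec` and its k most similar vectors in `others`."""
    sims = sorted((cosine(vec, o) for o in others), reverse=True)[:k]
    return sum(sims) / len(sims)

def relative_score(x: Vector, y: Vector, src_vecs: List[Vector],
                   tgt_vecs: List[Vector], k: int = 2) -> float:
    """Cosine of (x, y) normalized by the neighborhood averages of x and y.

    x is compared against its nearest vectors in the other language (tgt_vecs),
    and y against its nearest vectors in src_vecs.
    """
    denom = 0.5 * (mean_top_k(x, tgt_vecs, k) + mean_top_k(y, src_vecs, k))
    return cosine(x, y) / denom if denom else 0.0
```

Normalizing by the neighborhood averages lowers the score of sentences that are moderately similar to many candidates, so a pair stands out only when it is clearly closer than its neighbors.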

Exemplarily, FIG. 4 is a schematic diagram of a segmentation process provided in an embodiment of the present application. As shown in FIG. 4, document 1 contains 10 sentences after sentence splitting and document 2 contains 13 sentences, and Srci and Tgtj (i ∈ [1, 10], j ∈ [1, 13]) respectively represent the i-th sentence in document 1 and the j-th sentence in document 2. Assuming that, after the relative similarity is calculated and filtered according to the first preset rule, Src4, Tgt3, Src8 and Tgt10 are selected as segmentation points, the bilingual documents can be divided into 3 mutually aligned segments: document 1 is divided into the three segments Src1-Src4, Src5-Src8 and Src9-Src10, and document 2 is divided into the three segments Tgt1-Tgt3, Tgt4-Tgt9 and Tgt10-Tgt13, so that alignment can next be performed within each parallel segment.
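
Cutting two documents at selected sentence pairs can be sketched in Python as follows. The boundary convention used here, in which each selected pair closes its segment on both sides, is an assumption made for this sketch, as are the function name and the 1-based inclusive ranges; the toy sizes below are not those of FIG. 4.

```python
from typing import List, Tuple

Range = Tuple[int, int]  # inclusive, 1-based (first sentence, last sentence)

def split_by_anchors(m: int, n: int,
                     anchors: List[Tuple[int, int]]) -> List[Tuple[Range, Range]]:
    """Split documents of m and n sentences at the given (i, j) anchor pairs.

    Anchors are assumed not to cross each other; each anchor ends a segment
    in both documents, and any remaining sentences form a final segment.
    """
    segments = []
    src_start, tgt_start = 1, 1
    for i, j in sorted(anchors):
        segments.append(((src_start, i), (tgt_start, j)))
        src_start, tgt_start = i + 1, j + 1
    if src_start <= m or tgt_start <= n:
        segments.append(((src_start, m), (tgt_start, n)))
    return segments
```

For a 6-sentence document and a 7-sentence document with anchors (2, 3) and (4, 5), the result is three parallel segments: sentences 1-2 with 1-3, sentences 3-4 with 4-5, and sentences 5-6 with 6-7.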

Optionally, in order to ensure accuracy of the segmentation result and the determination of the segmentation point, the embodiment of the present application may filter sentence pairs in advance.

Optionally, before screening the remaining sentence pairs in the M × N sentence pairs, the method further includes:

determining the relation between sentences in the remaining sentence pairs; and if, according to the relation, the similarity of a to-be-processed sentence pair among the remaining sentence pairs is lower than the similarity of another sentence pair that contains the same sentence, deleting the to-be-processed sentence pair.

Here, in the embodiment of the present application, before the remaining sentence pairs in the M × N sentence pairs are screened, the relationship between the sentences in the remaining sentence pairs is determined, so that the sentence pairs used for segmenting the document can be determined among the remaining sentence pairs. In addition, the remaining sentence pairs can be ranked by similarity from high to low. If any sentence in a sentence pair also appears in another sentence pair with a higher score, the other sentence pair is determined to be the better match, and the current sentence pair is deleted.

Optionally, when calculating the similarity after combining, the similarity between every sentence in the original text and every sentence in the translated text is calculated. That is, if there are N sentences in language 1 and M sentences in language 2, a total of N × M similarities is obtained. The N × M similarities are then ranked from high to low and, proceeding from high to low, any pair containing a sentence that has already been matched is removed. For example, if score(first sentence in language 1, second sentence in language 2) > score(first sentence in language 1, first sentence in language 2), the alignment result (first sentence in language 1, first sentence in language 2) is removed, because a sentence cannot be aligned to two sentences at the same time and only one of them can be selected.
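The ranking-and-removal step can be sketched as a greedy pass over the scored pairs. The function name `keep_best_pairs` and the scores below are hypothetical. The sketch also assumes that only pairs that were themselves kept block later pairs, which is one possible reading of the rule.

```python
def keep_best_pairs(scored_pairs):
    """Keep each sentence in at most one pair, preferring higher scores.

    scored_pairs: iterable of (src_index, tgt_index, score).
    """
    kept, used_src, used_tgt = [], set(), set()
    for i, j, score in sorted(scored_pairs, key=lambda p: p[2], reverse=True):
        if i in used_src or j in used_tgt:
            continue  # a higher-scoring kept pair already uses one of these sentences
        kept.append((i, j, score))
        used_src.add(i)
        used_tgt.add(j)
    return kept


# Sentence 0 of language 1 matches sentence 1 of language 2 better than sentence 0.
pairs = [(0, 0, 0.6), (0, 1, 0.9), (1, 0, 0.7)]
best = keep_best_pairs(pairs)
```

Here the pair (0, 0) is dropped because sentence 0 of language 1 is already matched by the higher-scoring pair (0, 1).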

Optionally, the first preset rule includes:

if the word-count ratio of a sentence pair exceeds a first preset ratio threshold, deleting the sentence pair; or

if the ratio of the segment lengths obtained after a sentence pair is segmented according to the label exceeds a second preset ratio threshold, deleting the sentence pair; or

after any sentence in a sentence pair is spliced with its previous sentence or next sentence, recalculating the similarity score of the modified sentence pair, and deleting the sentence pair if the similarity score of the new sentence pair is higher; or

if the GLEU score of a sentence pair after both sentences are translated into the same language is lower than a preset score threshold, deleting the sentence pair; or

if the number of intersection points generated by a sentence pair with other sentence pairs is larger than a preset intersection number, deleting the sentence pair.
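Two of these filters, the length-ratio check and the intersection-count check, can be sketched as follows. The sketch makes several assumptions that are not stated in the application: sentence length is counted in whitespace-separated words, two pairs (i, j) and (p, q) are taken to intersect when their order differs between the two documents, and the thresholds `max_ratio` and `max_crossings` are arbitrary example values.

```python
def length_ratio(src_sentence, tgt_sentence):
    """Ratio of the longer to the shorter sentence, in whitespace-separated words."""
    a, b = len(src_sentence.split()), len(tgt_sentence.split())
    return max(a, b) / max(1, min(a, b))


def crossing_count(pair, pairs):
    """Number of pairs (p, q) whose order relative to (i, j) differs between documents."""
    i, j = pair
    return sum(1 for p, q in pairs if (i - p) * (j - q) < 0)


def filter_pairs(pairs, src, tgt, max_ratio=3.0, max_crossings=1):
    """Drop pairs with an extreme length ratio or too many crossings."""
    kept = []
    for i, j in pairs:
        if length_ratio(src[i], tgt[j]) > max_ratio:
            continue
        if crossing_count((i, j), pairs) > max_crossings:
            continue
        kept.append((i, j))
    return kept


src = ["a b", "c d e", "f", "g h"]
tgt = ["x y", "z w v", "u"]
candidates = [(0, 0), (1, 1), (2, 2), (3, 0)]
kept = filter_pairs(candidates, src, tgt)
```

In this toy input the pair (3, 0) crosses both (1, 1) and (2, 2), so it exceeds the assumed crossing limit and is removed, while the three monotone pairs are kept.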

S306: and carrying out alignment operation on each bilingual inter-translation segment to obtain an alignment result of the target parallel corpus.

Optionally, the method for aligning the bilingual inter-translation segments to obtain the spliced sentences is as follows:

Splicing the sentences of each bilingual inter-translation segment into a plurality of sentence combinations according to their order in the target parallel corpus. Encoding the sentence combinations to obtain a vector corresponding to each combination. Calculating a second similarity between the original text and the translated text in the plurality of sentence combinations according to the obtained vectors. Obtaining an alignment result of the target parallel corpus according to a second preset rule and the second similarity.

It can be understood that, when the sentences of a bilingual inter-translation segment are combined, each combination may contain a single sentence, which is equivalent to no combination, or two, three, four or more consecutive sentences.

Optionally, the second similarity may be calculated using the bilingual evaluation understudy (BLEU) score.

Optionally, the second similarity may be determined by calculating a normalized cosine similarity score between any sentence combination of the original text and any sentence combination of the translated text, using the same formula as given above for the first similarity.

Optionally, the second preset rule may be that a preset second similarity threshold is determined, and the sentence combinations whose second similarity is greater than the second similarity threshold are determined as the alignment result. Alternatively, the second preset rule may be that the second similarities are sorted, and a certain percentage of the sentence combinations with the highest similarities are determined as the alignment result.
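The two variants of the second preset rule, a fixed threshold and a top percentage, can be sketched as follows. The function names and the example scores are hypothetical, and each scored item is assumed to be a tuple whose last element is the second similarity.

```python
def select_by_threshold(scored, threshold):
    """Keep items whose score (last tuple element) is greater than the threshold."""
    return [item for item in scored if item[-1] > threshold]


def select_top_fraction(scored, fraction):
    """Keep the given fraction of items with the highest scores."""
    ranked = sorted(scored, key=lambda item: item[-1], reverse=True)
    return ranked[:int(len(ranked) * fraction)]


scored = [("a", 0.9), ("b", 0.4), ("c", 0.7), ("d", 0.2)]
above = select_by_threshold(scored, 0.5)
top_half = select_top_fraction(scored, 0.5)
```

A fixed threshold keeps a variable number of combinations depending on the data, whereas a top percentage always keeps the same share regardless of how high the scores are.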

Exemplarily, fig. 5 is a schematic diagram of an alignment process provided in an embodiment of the present application. As shown in fig. 5, assume that document 1 has m = 4 sentences and document 2 has n = 3 sentences. After consecutive sentences in document 1 are combined in groups of up to x = 4 sentences and consecutive sentences in document 2 are combined in groups of up to y = 3 sentences, document 1 yields S1 = m + (m-1) + (m-2) + … + (m-x+1) = 10 sentence combinations and document 2 yields S2 = n + (n-1) + (n-2) + … + (n-y+1) = 6 sentence combinations, as shown in the upper right corner of fig. 5. The relative similarity score between any two bilingual combinations in the segment is then calculated as score([Srci, Srci+1, Srci+2, …, Srci+k], [Tgtj, Tgtj+1, …, Tgtj+o]), where Srci is the i-th sentence in document 1, Tgtj is the j-th sentence in document 2, and k and o are integers bounded by the respective combined-sentence limits x and y. This gives S1 × S2 = 10 × 6 = 60 scores. The scores are sorted from high to low, a combination is deleted if any sentence in it is already included in a combination with a higher score, and combinations whose scores are lower than a certain threshold are rejected. The result thus obtained is the final alignment result: sentence 1 in document 1 is aligned with sentences 1 and 2 in document 2, and sentence 3 in document 1 is aligned with sentence 3 in document 2.
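The counting in this example and the subsequent selection can be checked with a short Python sketch. Spans are represented as (start, end) index pairs with an exclusive end; the function names and the scores passed to `greedy_align` are made up for illustration, and the greedy reading (only kept combinations block later ones) is an assumption.

```python
def contiguous_spans(n_sentences, max_len):
    """All (start, end) spans, end exclusive, of 1..max_len consecutive sentences."""
    return [
        (start, start + length)
        for length in range(1, max_len + 1)
        for start in range(n_sentences - length + 1)
    ]


def greedy_align(scored_spans, threshold):
    """Pick non-overlapping span pairs from high to low score, stopping below threshold.

    scored_spans: iterable of (src_span, tgt_span, score).
    """
    aligned, used_src, used_tgt = [], set(), set()
    for src_span, tgt_span, score in sorted(scored_spans, key=lambda t: t[2], reverse=True):
        if score < threshold:
            break
        src_ids, tgt_ids = set(range(*src_span)), set(range(*tgt_span))
        if src_ids & used_src or tgt_ids & used_tgt:
            continue  # some sentence already belongs to a higher-scoring combination
        aligned.append((src_span, tgt_span, score))
        used_src |= src_ids
        used_tgt |= tgt_ids
    return aligned


src_spans = contiguous_spans(4, 4)   # m = 4, x = 4
tgt_spans = contiguous_spans(3, 3)   # n = 3, y = 3

# Hypothetical scores for a few of the 60 span pairs.
scored = [
    ((0, 1), (0, 2), 0.9),
    ((0, 2), (0, 1), 0.8),
    ((2, 3), (2, 3), 0.7),
    ((3, 4), (2, 3), 0.3),
]
alignment = greedy_align(scored, threshold=0.5)
```

With m = 4 and n = 3 the sketch produces 10 and 6 spans, hence 60 candidate pairs. With the made-up scores, the selection keeps the pair linking sentence 1 of document 1 to sentences 1 and 2 of document 2 and the pair linking sentence 3 to sentence 3.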

Fig. 6 is a schematic structural diagram of a parallel corpus processing apparatus according to an embodiment of the present application. As shown in fig. 6, the apparatus according to the embodiment of the present application includes: a sentence splitting module 601, an encoding module 602, a segmentation module 603, and an alignment module 604. The parallel corpus processing apparatus may be the processor 102 itself, or a chip or an integrated circuit that implements the functions of the processor 102. It should be noted here that the division into the sentence splitting module 601, the encoding module 602, the segmentation module 603, and the alignment module 604 is only a division of logical functions, and the modules may be physically integrated or independent.

The sentence splitting module 601 is configured to perform sentence splitting operation on the target parallel corpus to obtain M sentences of original text documents and N sentences of translated text documents in the target parallel corpus;

the encoding module 602 is configured to encode the M sentences of original text and the N sentences of translated text to obtain a vector corresponding to each sentence of original text and a vector corresponding to each sentence of translated text;

the segmenting module 603 is configured to perform a segmenting operation on the target parallel corpus according to the obtained vector to obtain a plurality of bilingual inter-translation segments;

and an alignment module 604, configured to perform an alignment operation on each bilingual inter-translation segment to obtain an alignment result of the target parallel corpus.
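The logical division into four modules can be pictured as a small pipeline object. The sketch below is only an illustration: the class name `ParallelCorpusProcessor`, the callable signatures and the trivial stand-in functions are all hypothetical and are not the implementation described in the application.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]
Segment = Tuple[List[str], List[str]]


@dataclass
class ParallelCorpusProcessor:
    split_sentences: Callable[[str], List[str]]   # role of module 601
    encode: Callable[[List[str]], List[Vector]]   # role of module 602
    segment: Callable[[List[str], List[str], List[Vector], List[Vector]], List[Segment]]  # 603
    align: Callable[[Segment], list]              # role of module 604

    def process(self, source_doc: str, target_doc: str) -> list:
        src = self.split_sentences(source_doc)
        tgt = self.split_sentences(target_doc)
        src_vecs, tgt_vecs = self.encode(src), self.encode(tgt)
        segments = self.segment(src, tgt, src_vecs, tgt_vecs)
        return [self.align(seg) for seg in segments]


# Trivial stand-ins, used only to show the data flow between the modules.
processor = ParallelCorpusProcessor(
    split_sentences=lambda doc: [s.strip() for s in doc.split(".") if s.strip()],
    encode=lambda sents: [[float(len(s))] for s in sents],
    segment=lambda src, tgt, sv, tv: [(src, tgt)],
    align=lambda seg: list(zip(seg[0], seg[1])),
)
result = processor.process("a. b", "x. y")
```

Passing the four steps in as callables mirrors the statement that the modules are a logical division only: each one can be replaced or merged without changing the overall flow.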

Optionally, the segmentation module 603 is specifically configured to:

and calculating, according to the obtained vectors, a first similarity for each of the M × N sentence pairs consisting of any one of the M sentences of original text and any one of the N sentences of translated text.

And determining a sentence pair for segmenting the document according to a first preset rule and the first similarity.

And carrying out segmentation operation on the target parallel corpus according to the sentence pairs of the segmented document to obtain a plurality of bilingual inter-translation segments.

Optionally, the alignment module 604 is specifically configured to:

and splicing the sentences of each bilingual inter-translation segment into a plurality of sentence combinations according to their order in the target parallel corpus.

And coding the sentence combinations to obtain a vector corresponding to each combination.

And calculating a second similarity of the original text and the translated text in the plurality of sentence combinations according to the obtained vector.

Optionally, the segmentation module 603 is specifically configured to:

Optionally, before filtering the remaining sentence pairs of the M × N sentence pairs, the segmenting module 603 is further configured to:

and determining the relation between the sentences in the remaining sentence pairs.

And if the similarity of any one sentence in the to-be-processed sentence pairs in the rest sentence pairs is lower than the similarity of the same sentence in the other sentence pairs according to the relation, deleting the to-be-processed sentence pairs.

Optionally, the first preset rule includes:

if the word-count ratio of a sentence pair exceeds a first preset ratio threshold, deleting the sentence pair; or

if the ratio of the segment lengths obtained after a sentence pair is segmented according to the label exceeds a second preset ratio threshold, deleting the sentence pair; or

after any sentence in a sentence pair is spliced with its previous sentence or next sentence in the original text, calculating the similarity of the new sentence pair, and deleting the sentence pair if the similarity score of the new sentence pair is higher; or

if the GLEU score of a sentence pair after both sentences are translated into the same language is lower than a first preset score, deleting the sentence pair; or

if the number of intersection points generated by a sentence pair with other sentence pairs is larger than a preset intersection number, deleting the sentence pair.

Fig. 7 is a schematic structural diagram of a parallel corpus processing device according to an embodiment of the present application. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not limiting to the implementations of the present application described and/or claimed herein.

As shown in fig. 7, the parallel corpus processing device includes: a processor 701 and a memory 702, which are connected to each other using different buses and may be mounted on a common motherboard or in other manners as needed. The processor 701 may process instructions executed within the parallel corpus processing device, including instructions stored in or on the memory for displaying graphical information on an external input/output device (such as a display device coupled to an interface). In other embodiments, multiple processors and/or multiple buses may be used along with multiple memories, as desired. In fig. 7, one processor 701 is taken as an example.

The memory 702, which is a non-transitory computer readable storage medium, may be used to store non-transitory software programs, non-transitory computer executable programs, and modules, such as the program instructions/modules corresponding to the parallel corpus processing method in the embodiments of the present application (e.g., the sentence splitting module 601, the encoding module 602, the segmentation module 603, and the alignment module 604 shown in fig. 6). The processor 701 executes various functional applications of the server and data processing by running the non-transitory software programs, instructions, and modules stored in the memory 702, that is, implements the parallel corpus processing method in the above-described method embodiments.

The parallel corpus processing apparatus may further include: an input device 703 and an output device 704. The processor 701, the memory 702, the input device 703 and the output device 704 may be connected by a bus or other means, and fig. 7 illustrates an example of a connection by a bus.

The input device 703 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the parallel corpus processing device. Examples of the input device include a touch screen, a keypad, a mouse, one or more mouse buttons, a trackball, and a joystick. The output device 704 may be, for example, a display device of the parallel corpus processing device. The display device may include, but is not limited to, a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) display, and a plasma display. In some implementations, the display device can be a touch screen.

The parallel corpus processing device according to the embodiment of the present application may be configured to execute the technical solutions according to the method embodiments of the present application, and the implementation principle and the technical effect are similar, which are not described herein again.

An embodiment of the present invention further provides a computer-readable storage medium, where a computer executable instruction is stored in the computer-readable storage medium, and the computer executable instruction is used for implementing any one of the above parallel corpus processing methods when executed by a processor.

In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative: the division of the units is only one logical division, and other divisions are possible in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the shown or discussed mutual coupling, direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.

Claims

1. A parallel corpus processing method is characterized by comprising the following steps:

2. The method according to claim 1, wherein said segmenting said target parallel corpus according to said obtained vector to obtain a plurality of bilingual inter-translation segments comprises:

3. The method according to claim 1, wherein said performing an alignment operation on each bilingual inter-translation segment to obtain an alignment result of the target parallel corpus comprises:

4. The method according to claim 2, wherein the determining a sentence pair for segmenting the document according to a first preset rule and the first similarity comprises:

5. The method of claim 4, further comprising, prior to said screening the remaining sentence pairs of the M x N sentence pairs:

determining the relation between sentences in the remaining sentence pairs;

6. The method according to any one of claims 2 to 5, wherein the first preset rule comprises:

or,

after any sentence in a sentence pair is spliced with its previous sentence or next sentence, recalculating the similarity score of the modified sentence pair, and deleting the sentence pair if the similarity score of the new sentence pair is higher;

or,

if the GLEU score of a sentence pair after both sentences are translated into the same language is lower than a preset score threshold, deleting the sentence pair;

or,

7. A parallel corpus processing apparatus, comprising:

8. The apparatus of claim 7, wherein the segmentation module is specifically configured to:

9. A parallel corpus processing apparatus, comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1 to 7.

10. A computer-readable storage medium, wherein computer-executable instructions are stored in the computer-readable storage medium, and when executed by a processor, the computer-executable instructions are used for implementing the parallel corpus processing method according to any one of claims 1 to 7.