CN112906371B - Parallel corpus acquisition method, device, equipment and storage medium - Google Patents

Parallel corpus acquisition method, device, equipment and storage medium

Info

Publication number
CN112906371B
CN112906371B (application number CN202110181644.5A)
Authority
CN
China
Prior art keywords
sentence
similarity value
semantic similarity
element position
current element
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110181644.5A
Other languages
Chinese (zh)
Other versions
CN112906371A (en)
Inventor
张闯
吴培昊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Youzhuju Network Technology Co Ltd
Original Assignee
Beijing Youzhuju Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Youzhuju Network Technology Co Ltd
Priority to CN202110181644.5A
Publication of CN112906371A
Application granted
Publication of CN112906371B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the present disclosure disclose a parallel corpus acquisition method, device, equipment and storage medium. The method includes the following steps: splitting a first text and a second text that are acquired in advance to obtain a first sentence list and a second sentence list, wherein the first text and the second text are in the same language and describe the same content; determining a semantic similarity value between each first sentence in the first sentence list and each second sentence in the second sentence list to obtain a similarity value matrix; determining a mapping relation between the first sentences and the second sentences according to the similarity value matrix, wherein the mapping relation includes at least one of one-to-N, N-to-one and one-to-one, and N is an integer greater than or equal to 2; and acquiring a target second sentence associated with a first sentence according to the mapping relation, and recording the first sentence and the target second sentence as a parallel corpus. The scheme determines the mapping relations between sentences based on their semantic similarity values, which improves the accuracy of the associated sentence pairs and thus the accuracy of the parallel corpus.

Description

Parallel corpus acquisition method, device, equipment and storage medium
Technical Field
Embodiments of the present disclosure relate to natural language processing technology, and in particular to a parallel corpus acquisition method, device, equipment and storage medium.
Background
Text simplification refers to rewriting text that contains difficult vocabulary and complex sentence patterns so as to reduce its difficulty, making it easier to understand and read for people with a low knowledge level or with cognitive impairment. With the development of deep learning technology, end-to-end neural network models are increasingly applied to text simplification. End-to-end neural network models typically require a large number of complex-sentence-to-simple-sentence parallel corpora for training.
Traditional methods for acquiring parallel corpora mainly include distance-based methods, methods that compute inter-sentence similarity from TF-IDF vectors, and word2vec-based methods, but none of these can acquire parallel corpora accurately.
BRIEF SUMMARY OF THE PRESENT DISCLOSURE
The embodiment of the disclosure provides a method, a device, equipment and a storage medium for acquiring parallel corpus, which can improve the accuracy of the parallel corpus.
In a first aspect, an embodiment of the present disclosure provides a parallel corpus acquisition method, including:
splitting a first text and a second text which are acquired in advance to obtain a first sentence list corresponding to the first text and a second sentence list corresponding to the second text, wherein the first text and the second text are in the same language and are used for describing the same content;
determining a semantic similarity value between each first sentence in the first sentence list and each second sentence in the second sentence list to obtain a similarity value matrix;
determining a mapping relation between the first sentences and the second sentences according to the similarity value matrix, wherein the mapping relation includes at least one of one-to-N, N-to-one and one-to-one, and N is an integer greater than or equal to 2;
and acquiring a target second sentence associated with a first sentence according to the mapping relation, and recording the first sentence and the target second sentence as a parallel corpus.
In a second aspect, an embodiment of the present disclosure further provides a parallel corpus acquisition device, including:
the splitting module is used for splitting a first text and a second text which are acquired in advance to obtain a first sentence list corresponding to the first text and a second sentence list corresponding to the second text, wherein the first text and the second text are in the same language and are used for describing the same content;
the similarity value matrix determining module is used for determining semantic similarity values between each first sentence in the first sentence list and each second sentence in the second sentence list to obtain a similarity value matrix;
the mapping relation determining module is used for determining the mapping relation between the first sentences and the second sentences according to the similarity value matrix, wherein the mapping relation includes at least one of one-to-N, N-to-one and one-to-one, and N is an integer greater than or equal to 2;
the parallel corpus acquisition module is used for acquiring a target second sentence associated with the first sentence according to the mapping relation, and recording the first sentence and the target second sentence as parallel corpus.
In a third aspect, an embodiment of the present disclosure further provides an electronic device, including:
one or more processors;
a memory for storing one or more programs;
wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the parallel corpus acquisition method described in the first aspect.
In a fourth aspect, an embodiment of the present disclosure further provides a computer readable storage medium, on which a computer program is stored, which when executed by a processor implements the parallel corpus acquisition method according to the first aspect.
Embodiments of the present disclosure provide a method, a device, equipment and a storage medium for acquiring a parallel corpus. A first sentence list corresponding to a first text and a second sentence list corresponding to a second text are obtained by splitting the first text and the second text, which are acquired in advance, are in the same language and describe the same content; a semantic similarity value between each first sentence in the first sentence list and each second sentence in the second sentence list is determined to obtain a similarity value matrix; a mapping relation between the first sentences and the second sentences is determined according to the similarity value matrix, wherein the mapping relation includes at least one of one-to-N, N-to-one and one-to-one, and N is an integer greater than or equal to 2; and a target second sentence associated with a first sentence is acquired according to the mapping relation, and the first sentence and the target second sentence are recorded as a parallel corpus. The scheme determines the mapping relations between sentences based on their semantic similarity values, which improves the accuracy of the associated sentence pairs and thus the accuracy of the parallel corpus.
Drawings
The above and other features, advantages, and aspects of embodiments of the present disclosure will become more apparent by reference to the following detailed description when taken in conjunction with the accompanying drawings. The same or similar reference numbers will be used throughout the drawings to refer to the same or like elements. It should be understood that the figures are schematic and that elements and components are not necessarily drawn to scale.
Fig. 1 is a flowchart of a parallel corpus acquisition method according to a first embodiment of the present disclosure;
fig. 2 is a flowchart of a parallel corpus acquisition method according to a second embodiment of the disclosure;
fig. 3 is a flowchart of a parallel corpus acquisition method according to a third embodiment of the present disclosure;
fig. 4 is a block diagram of a parallel corpus acquiring device according to a fourth embodiment of the present disclosure;
fig. 5 is a block diagram of an electronic device according to a fifth embodiment of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the accompanying drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that the present disclosure will be understood more thoroughly and completely. It should be understood that the drawings and embodiments of the present disclosure are for illustration purposes only and are not intended to limit the scope of the present disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order and/or performed in parallel. Furthermore, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "including" and variations thereof as used herein are open-ended, i.e., "including, but not limited to". The term "based on" means "based at least in part on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Related definitions of other terms will be given in the description below.
It should be noted that the concepts of "first", "second", etc. mentioned in this disclosure are only used to distinguish between different objects and are not intended to limit the order or interdependence of functions performed by these objects.
It should be noted that references to "a" and "an" in this disclosure are intended to be illustrative rather than limiting, and those of ordinary skill in the art will appreciate that they should be understood as "one or more" unless the context clearly indicates otherwise.
The names of messages or information interacted between the various devices in the embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of such messages or information.
Example 1
Fig. 1 is a flowchart of a parallel corpus acquisition method according to an embodiment of the present disclosure. This embodiment is applicable to situations in which parallel corpora need to be acquired. A parallel corpus consists of sentence pairs that have a certain association, for example sentence pairs with high similarity. The method can be executed by a parallel corpus acquisition device, which can be implemented in software and/or hardware and configured in an electronic device with a data processing function. As shown in fig. 1, the method may include the following steps:
s110, splitting a first text and a second text which are acquired in advance to obtain a first sentence list corresponding to the first text and a second sentence list corresponding to the second text.
The first text and the second text are in the same language and describe the same content. The first text may be text containing hard-to-understand vocabulary and complex sentence patterns; it may also be referred to as complex text, and such text is difficult. The second text may be text containing simple words and simple sentence patterns; it may also be referred to as simple text, and such text has low difficulty and is easy for foreign-language learners and for people with a low knowledge level or cognitive impairment to understand. The first text and the second text in this embodiment describe the same content, for example the same object or the same event. This embodiment does not limit the specific language type; the language may be, for example, Chinese, English or Japanese, as long as the language type of the first text and the second text is the same. Optionally, the first text and the second text describing the same content may be obtained from a graded reading website that stores texts of different difficulty levels, or locally.
The first sentence list is used for storing sentences obtained by splitting the first text, and the second sentence list is used for storing sentences obtained by splitting the second text. Alternatively, the first text and the second text may be split separately by a sentence splitting function in NLTK (Natural Language Toolkit, natural language processing toolkit). Of course, the first text and the second text may be split in other manners, and the embodiments are not limited. In order to distinguish each sentence obtained by splitting, the sentences can be optionally numbered according to the sequence of the sentences in the corresponding text, and the smaller the number is, the earlier the position of the sentence in the text is. The length of the first sentence list is the same as the number of sentences contained in the first text, and the length of the second sentence list is the same as the number of sentences contained in the second text. The number of sentences contained in the first text may be the same as or different from the number of sentences contained in the second text.
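The splitting step above can be sketched as follows. This is a minimal regex-based stand-in for the sentence splitting function the embodiment suggests (NLTK's sent_tokenize, which handles abbreviations and other edge cases this sketch does not); the sample texts are illustrative, not from the patent.

```python
import re

def split_into_sentences(text):
    """Split a text into its sentence list.

    A simple stand-in splitter: breaks on '.', '!' or '?' followed by
    whitespace. The embodiment would use NLTK's sent_tokenize instead.
    """
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

first_text = "The committee deliberated at length. A decision was reached."
second_text = "The committee talked for a long time. They made a decision."

# Sentence numbering follows position in the text: index 0 is the first sentence.
first_sentence_list = split_into_sentences(first_text)
second_sentence_list = split_into_sentences(second_text)
```

Each list's length equals the number of sentences in the corresponding text, matching the description above.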
S120, determining semantic similarity values between each first sentence in the first sentence list and each second sentence in the second sentence list, and obtaining a similarity value matrix.
The first sentences are the sentences obtained by splitting the first text, and the second sentences are the sentences obtained by splitting the second text. A semantic similarity value represents the semantic similarity between two sentences; in this embodiment it represents the semantic similarity between a first sentence and a second sentence. Optionally, the semantic similarity between two sentences may be represented by a value between 0 and 5, where a smaller value indicates lower semantic similarity: 0 indicates the lowest semantic similarity, in which case the semantics of the two sentences may be considered completely different, and 5 indicates the highest semantic similarity, in which case the semantics of the two sentences may be considered completely the same. By determining the semantic similarity values between sentences, this embodiment makes it possible to associate sentences that carry the same semantic information but differ greatly in vocabulary when the parallel corpus is later acquired, improving the accuracy of the parallel corpus. Optionally, the first sentence and the second sentence may be input into a neural network model, which outputs the semantic similarity value between them. This embodiment does not limit the specific structure of the neural network model; for example, a deep semantic model (Deep Structured Semantic Model, DSSM) or a Text-to-Text Transfer Transformer (T5) model may be used. Of course, the semantic similarity value between the first sentence and the second sentence may also be determined in other manners, which this embodiment does not limit.
The similarity value matrix stores the semantic similarity values between the first sentences and the second sentences. Optionally, the semantic similarity values between each first sentence and each second sentence may be stored row by row: each row of the similarity value matrix represents one first sentence and each column represents one second sentence, i.e., the number of rows equals the number of first sentences contained in the first sentence list and the number of columns equals the number of second sentences contained in the second sentence list. For example, if the similarity value matrix is expressed as T = [t_xy], x = 1, 2, ..., m, y = 1, 2, ..., n, where m is the number of first sentences and n is the number of second sentences, then t_23 represents the semantic similarity value between the second first sentence in the first sentence list and the third second sentence in the second sentence list.
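The matrix construction can be sketched as below. The token-overlap scorer is only a stand-in for the neural model (DSSM or T5) the embodiment describes; it is scaled to the patent's 0-to-5 range purely for illustration.

```python
def jaccard_similarity(a, b):
    """Stand-in scorer on the 0-5 scale used in the description.

    The embodiment would use a neural semantic model here; token overlap
    is used only so this sketch is self-contained.
    """
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return 5.0 * len(wa & wb) / len(wa | wb)

def build_similarity_matrix(first_sentences, second_sentences, score=jaccard_similarity):
    # Row x corresponds to the x-th first sentence and column y to the
    # y-th second sentence, matching the matrix T = [t_xy] above.
    return [[score(f, s) for s in second_sentences] for f in first_sentences]

T = build_similarity_matrix(
    ["the cat sat on the mat", "dogs bark loudly"],
    ["the cat sat on a mat", "dogs bark", "birds sing"],
)
```

The matrix has one row per first sentence and one column per second sentence, so T[1][1] scores "dogs bark loudly" against "dogs bark".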
S130, determining the mapping relation between the first statement and the second statement according to the similarity value matrix.
Wherein the mapping relation includes at least one of one-to-N, N-to-one and one-to-one, and N is an integer greater than or equal to 2. One-to-N indicates that one first sentence is associated with a plurality of second sentences, N-to-one indicates that a plurality of first sentences are associated with one second sentence, and one-to-one indicates that one first sentence is associated with one second sentence. Optionally, the mapping relation between a first sentence and a second sentence may be determined according to the semantic similarity value. For example, when the semantic similarity value equals a set threshold, the mapping relation between the corresponding first sentence and second sentence is considered one-to-one; the set threshold indicates the highest semantic similarity between the first sentence and the second sentence and may be, for example, 5, i.e., when the semantic similarity value between a first sentence and a second sentence is 5, their mapping relation is considered one-to-one.
When the semantic similarity value is less than the set threshold, in one example, the mapping relation between the first sentence and the second sentence may be determined by also considering the semantic similarity values between that first sentence and the other second sentences, and between the other first sentences and that second sentence. For example, a first sentence is considered to be associated with a plurality of different second sentences if the differences between its semantic similarity values with those second sentences are less than or equal to a preset difference. The magnitude of the preset difference may be set according to the actual situation, for example to 0.1. For example, suppose the semantic similarity value between the second first sentence in the first sentence list and the third second sentence in the second sentence list is 2, the semantic similarity value between the second first sentence and the fourth second sentence is 2.1, the semantic similarity values between the second first sentence and the other second sentences are less than 0.5, and the semantic similarity values between the third second sentence and the other first sentences are less than 1; then the mapping relation between the second first sentence and the third second sentence is considered one-to-two, i.e., the second first sentence in the first sentence list is associated with the third and fourth second sentences in the second sentence list.
When the semantic similarity value is less than the set threshold, in another example, a plurality of first sentences or a plurality of second sentences may be merged, and the mapping relation may be determined based on the semantic similarity values involving the merged sentences. For example, when the semantic similarity value between the third first sentence in the first sentence list and the first second sentence in the second sentence list is less than 5, the first and second second sentences may be merged, and the third and fourth first sentences may be merged. If the semantic similarity value between the third first sentence and the merged second sentence is greater than both the semantic similarity value between the third first sentence and the first second sentence alone and the semantic similarity value between the merged first sentence and the first second sentence, the mapping relation between the third first sentence and the merged second sentences is considered one-to-N; conversely, if the semantic similarity value between the merged first sentence and the first second sentence is the greatest, the mapping relation is considered N-to-one. Of course, the mapping relation between the first sentences and the second sentences may also be determined in other manners, which this embodiment does not limit.
It should be noted that, in addition to the above one-to-one, one-to-N and N-to-one relations, the mapping relation between the first sentences and the second sentences may also be N-to-N; an N-to-N mapping can be obtained by analyzing the one-to-N or N-to-one mappings between sentences. For example, if the second first sentence is associated with the third and fourth second sentences, and the third first sentence is also associated with the third and fourth second sentences, then the second and third sentences of the first text are considered to be associated with the third and fourth sentences of the second text, i.e., a 2-to-2 mapping. Here the n-th first sentence is the n-th sentence of the first text, and similarly the n-th second sentence is the n-th sentence of the second text.
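The derivation of N-to-N mappings from overlapping one-to-N mappings can be sketched by grouping first sentences that share the same set of associated second sentences, as in the 2-to-2 example above. The dictionary layout is an assumption for illustration.

```python
from collections import defaultdict

def derive_n_to_n(one_to_n):
    """Group one-to-N mappings into N-to-N mappings.

    `one_to_n` maps a first-sentence position to the tuple of second-sentence
    positions it is associated with; first sentences sharing the same set of
    second sentences are grouped together.
    """
    groups = defaultdict(list)
    for first_pos, second_positions in one_to_n.items():
        groups[tuple(sorted(second_positions))].append(first_pos)
    return {tuple(sorted(firsts)): seconds for seconds, firsts in groups.items()}

# Second and third first sentences both associate with the third and
# fourth second sentences, giving a 2-to-2 mapping; the fifth first
# sentence keeps its one-to-one mapping.
mapping = derive_n_to_n({2: (3, 4), 3: (3, 4), 5: (7,)})
```

The result maps groups of first-sentence positions to groups of second-sentence positions, so a 2-to-2 relation and a one-to-one relation coexist in the same structure.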
S140, acquiring a target second sentence associated with the first sentence according to the mapping relation, and marking the first sentence and the target second sentence as parallel corpus.
The target second sentence is a sentence related to the first sentence, and may be a single second sentence or a sentence obtained by combining a plurality of second sentences. Specifically, if the mapping relationship between the first sentence and the second sentence is one-to-one, the first sentence and the second sentence corresponding to the mapping relationship may be associated as a set of parallel corpus, and the second sentence is referred to as a target second sentence; if the mapping relation between the first sentence and the second sentence is a pair of N, the N second sentences can be combined, the first sentence and the combined sentences are associated to be used as a group of parallel corpus, and the target second sentence is the sentence after the N second sentences are combined; if the mapping relation between the first sentence and the second sentence is N-to-one, the N first sentences can be combined, and the combined first sentence and the second sentence are associated to be used as a group of parallel corpus, and the target second sentence is a single second sentence. If the plurality of first sentences and the plurality of second sentences are associated, the plurality of first sentences and the plurality of second sentences can be combined, the combined sentences are associated to be used as a group of parallel corpus, and the target second sentences are the sentences combined by the plurality of second sentences.
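The assembly step above can be sketched as follows. Joining merged sentences with a single space is one plausible merging strategy; the patent does not fix the join rule, and the sample sentences are illustrative.

```python
def build_parallel_corpus(first_sentences, second_sentences, mappings):
    """Assemble parallel corpus pairs from mapping relations.

    `mappings` maps a tuple of first-sentence indices (0-based) to a tuple
    of second-sentence indices; sentences on each side are merged by
    joining with a space before being recorded as a pair.
    """
    corpus = []
    for first_idxs, second_idxs in mappings.items():
        source = " ".join(first_sentences[i] for i in first_idxs)
        target = " ".join(second_sentences[j] for j in second_idxs)
        corpus.append((source, target))
    return corpus

pairs = build_parallel_corpus(
    ["A complex sentence.", "Another one."],
    ["A simple sentence.", "A short one."],
    {(0,): (0, 1)},   # one-to-two: the target merges both second sentences
)
```

A one-to-two mapping yields a pair whose target second sentence is the merge of the two associated second sentences, as described above.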
This embodiment determines the semantic similarity values between sentences based on their semantic information and determines the mapping relations between sentences according to those values, which improves the accuracy of the associated sentence pairs and thus the accuracy of the parallel corpus. Training a text simplification model with such a parallel corpus improves the accuracy of the model, and using the trained text simplification model to convert complex text into simple text improves the accuracy of the conversion result.
The first embodiment of the disclosure provides a parallel corpus acquisition method. A first sentence list corresponding to a first text and a second sentence list corresponding to a second text are obtained by splitting the first text and the second text, which are acquired in advance, are in the same language and describe the same content; a semantic similarity value between each first sentence in the first sentence list and each second sentence in the second sentence list is determined to obtain a similarity value matrix; a mapping relation between the first sentences and the second sentences is determined according to the similarity value matrix, wherein the mapping relation includes at least one of one-to-N, N-to-one and one-to-one, and N is an integer greater than or equal to 2; and a target second sentence associated with a first sentence is acquired according to the mapping relation, and the first sentence and the target second sentence are recorded as a parallel corpus. The scheme determines the mapping relations between sentences based on their semantic similarity values, which improves the accuracy of the associated sentence pairs and thus the accuracy of the parallel corpus.
Example 2
Fig. 2 is a flowchart of a parallel corpus acquisition method provided in a second embodiment of the present disclosure, where the optimization is performed based on the foregoing embodiment, and referring to fig. 2, the method may include the following steps:
s210, splitting a first text and a second text which are acquired in advance to obtain a first sentence list corresponding to the first text and a second sentence list corresponding to the second text.
S220, inputting a first sentence in the first sentence list and a second sentence in the second sentence list into a semantic similarity value model, and outputting the semantic similarity value of the first sentence and the second sentence by the semantic similarity value model.
The semantic similarity value model is obtained by training on sentence pairs with different semantic similarity values, and is then used to determine the semantic similarity value between any two sentences; in this embodiment, a T5 model is taken as an example. The T5 model may be trained prior to application. Optionally, sentence pairs with different semantic similarity values can be obtained from the public dataset STS-B as training samples; the public dataset STS-B stores sentence pairs having different semantic similarity values. The semantic similarity value may be represented by a number between 0 and 5: 0 may represent that the semantics of the two sentences are completely different; 1 may represent that the semantics differ but the described topic is consistent; 2 may represent that the semantics differ but a small portion of the information is consistent; 3 may represent that the semantics are substantially consistent but some important information is inconsistent or missing; 4 may represent that the semantics are very similar but some unimportant information is inconsistent; and 5 may represent that the semantics of the two sentences are completely the same.
In this embodiment, sentence pairs with different semantic similarity values are used as training samples and input into the semantic similarity value model to train it, so that the trained model can determine the semantic similarity value between any two sentences. Optionally, for each first sentence, that first sentence and one second sentence in the second sentence list may be input into the trained semantic similarity value model, determining the semantic similarity values between the first sentence and each second sentence in turn; or a first sentence and all the second sentences may be input into the trained model to determine the semantic similarity values between that first sentence and each second sentence; or, for each second sentence, that second sentence and all the first sentences may be input into the trained model to determine the semantic similarity values between that second sentence and each first sentence; or all the first sentences and all the second sentences may be input into the trained model to determine the semantic similarity values between each first sentence and each second sentence simultaneously, which can improve efficiency. By determining the semantic similarity values between sentences, this embodiment can accurately associate sentences that carry the same semantic information but differ greatly in vocabulary when the parallel corpus is later determined, improving the accuracy of the parallel corpus. In addition, even complex sentences with large syntactic changes and many vocabulary deletions can be accurately associated with simple sentences.
S230, sequentially arranging the semantic similarity values corresponding to the first sentences to obtain a similarity value matrix.
The number of rows of the similarity value matrix equals the number of first sentences contained in the first sentence list, and the number of columns equals the number of second sentences contained in the second sentence list. For example, the similarity value matrix may be denoted by T, an m x n matrix holding the semantic similarity values between the m first sentences and the n second sentences, where m and n are the lengths of the first sentence list and the second sentence list respectively. The first row of T then holds the semantic similarity values between the first sentence of the first text and each sentence of the second text, and the first column of T holds the semantic similarity values between the first sentence of the second text and each sentence of the first text.
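As a sketch of how S230 arranges the scores, the matrix T can be built row by row from any pairwise scorer; `score` here is a hypothetical stand-in for the trained similarity model, not an API from the disclosure:

```python
def build_similarity_matrix(first_sentences, second_sentences, score):
    """Return T with one row per first (complex) sentence and one column
    per second (simple) sentence, so T[i][j] is the semantic similarity
    value between first sentence i and second sentence j."""
    return [[score(c, s) for s in second_sentences] for c in first_sentences]

# Toy scorer for illustration: 5 for identical sentences, 1 otherwise.
def toy_score(a, b):
    return 5.0 if a == b else 1.0

T = build_similarity_matrix(["s1", "s2"], ["s1", "x", "s2"], toy_score)
# T has m = 2 rows and n = 3 columns.
```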
S240, determining the mapping relationship between the first sentence and the second sentence according to the similarity value matrix.
S250, acquiring a target second sentence associated with the first sentence according to the mapping relation, and marking the first sentence and the target second sentence as parallel corpus.
S260, inputting the parallel corpus into a text simplification model and training it to obtain a target text simplification model.
The target text simplification model is used for converting complex text into simple text. After the parallel corpus is determined, the first sentence can be input into the text simplification model, which outputs a predicted sentence; the parameters of the model are adjusted according to the deviation between the predicted sentence and the second sentence until that deviation meets a set condition, yielding the target text simplification model. A complex text can then be input into the target text simplification model, which outputs a simple text, realizing the conversion from complex text to simple text.
The second embodiment of the present disclosure provides a parallel corpus acquisition method. Building on the foregoing embodiment, a semantic similarity value model is trained with sentence pairs having different semantic similarity values; the trained model is used to determine the semantic similarity values between sentences and obtain a similarity value matrix; the mapping relationship between sentences is then determined from the similarity value matrix to obtain associated sentence pairs, thereby improving the accuracy of the associated sentence pairs.
Example III
Fig. 3 is a flowchart of a parallel corpus acquisition method according to a third embodiment of the present disclosure, where the optimization is performed based on the foregoing embodiment, and referring to fig. 3, the method may include the following steps:
S310, splitting a first text and a second text which are acquired in advance to obtain a first sentence list corresponding to the first text and a second sentence list corresponding to the second text.
S320, determining semantic similarity values between each first sentence in the first sentence list and each second sentence in the second sentence list, and obtaining a similarity value matrix.
S330, the position of the first element of the similarity value matrix is recorded as the current element position.
Assuming the similarity value matrix has m rows and n columns, the first element is the element in the first row and first column, the second element is the element in the first row and second column, the (n+1)-th element is the element in the second row and first column, and so on. This embodiment performs a similar process for each element in the similarity value matrix; the process is described here with the first element taken as the current element and its position as the current element position. Each element position corresponds to a semantic similarity value, namely the semantic similarity value between the first sentence and the second sentence corresponding to that element position. For example, the semantic similarity value corresponding to the element position in the second row and third column is the semantic similarity value between the second sentence of the first text and the third sentence of the second text, which may also be called the semantic similarity value between the second first sentence and the third second sentence.
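The row-by-row traversal order described above maps a flat element index to a (row, column) position; a one-line helper (the name is illustrative):

```python
def element_position(index, n_cols):
    """0-based (row, col) of the index-th element when an m x n matrix
    is traversed row by row: element 0 is (0, 0), element n is (1, 0)."""
    return divmod(index, n_cols)

# With n = 4 columns: the first element lands at row 0, col 0, and the
# (n+1)-th element (index 4) at row 1, col 0, matching the text above.
```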
S340, judging whether the semantic similarity value corresponding to the current element position equals a first preset value; if so, executing S350, otherwise executing S360.
The first preset value indicates that the semantic similarity between the first sentence and the second sentence corresponding to the current element position is the highest possible. It is related to the semantic similarity values of the training samples used to train the semantic similarity value model: for example, if those values range from 0 to 5, the first preset value may be 5, representing the highest degree of semantic similarity between two sentences; that is, the semantic similarity value corresponding to the current element position is either equal to 5 or smaller than 5.
S350, determining that the mapping relationship between the first sentence and the second sentence corresponding to the current element position is one-to-one.
Specifically, if the semantic similarity value between the first sentence and the second sentence corresponding to the current element position is 5, their semantics are considered identical, and the mapping relationship between them can be determined to be one-to-one. S380 is then executed, the next element position is taken as the current element position, and execution returns to S340 to continue determining the mapping relationship for the new current element position.
S360, merging the second sentence corresponding to the current element position with the second sentence corresponding to the next element position to obtain a first merged sentence; and merging the first sentence corresponding to the current element position with the first sentence corresponding to the next element position to obtain a second merged sentence.
Considering that a parallel corpus consists of two sentences with high semantic similarity, when the semantic similarity value corresponding to the current element position is smaller than 5 but larger than a set threshold, it can be further judged whether the first sentence and the second sentence corresponding to the current element position are in a one-to-N or N-to-one relationship. For example, a certain number of sentences may be merged, and the relationship determined according to the semantic similarity values between the merged sentences. The set threshold may be chosen according to the actual situation; for example, since a semantic similarity value of 3 means the semantics of two sentences are substantially the same, the threshold may be set to 3 or a value near 3.
Alternatively, the second sentence corresponding to the current element position and the second sentence corresponding to the next element position may be merged to obtain a first merged sentence S(t_y : t_{y+1}), where t_y denotes the second sentence corresponding to the current element position, t_{y+1} denotes the second sentence corresponding to the next element position, and y = 1, 2, …. Similarly, the first sentence corresponding to the current element position and the first sentence corresponding to the next element position can be merged to obtain a second merged sentence C(t_x : t_{x+1}), where t_x denotes the first sentence corresponding to the current element position, t_{x+1} denotes the first sentence corresponding to the next element position, and x = 1, 2, ….
S370, determining a mapping relation between the first sentence corresponding to the current element position and the second sentence according to the semantic similarity value between the first sentence corresponding to the current element position and the first merged sentence, the semantic similarity value between the second merged sentence and the second sentence corresponding to the current element position, and the semantic similarity value corresponding to the next element position.
Alternatively, the semantic similarity value sim_1 between the first sentence t_x corresponding to the current element position and the first merged sentence S(t_y : t_{y+1}) may be determined, together with the semantic similarity value sim_2 between the second merged sentence C(t_x : t_{x+1}) and the second sentence t_y corresponding to the current element position; the mapping relationship between t_x and t_y is then determined from sim_1, sim_2 and the semantic similarity value corresponding to the next element position. For convenience of description, sim_1 is recorded as the first semantic similarity value, sim_2 as the second semantic similarity value, and the semantic similarity values corresponding to the current element position and the next element position as the third and fourth semantic similarity values respectively.
In one example, the mapping relationship between the first sentence and the second sentence may be determined as follows:
if the first semantic similarity value is smaller than or equal to the third semantic similarity value or the first semantic similarity value is smaller than or equal to the fourth semantic similarity value, reducing the first semantic similarity value, otherwise, keeping the first semantic similarity value unchanged;
if the second semantic similarity value is smaller than or equal to the third semantic similarity value or the second semantic similarity value is smaller than or equal to the fourth semantic similarity value, reducing the second semantic similarity value, otherwise, keeping the second semantic similarity value unchanged;
Determining a maximum value of the first semantic similarity value, the second semantic similarity value and the fourth semantic similarity value;
if the maximum value is the fourth semantic similarity value, determining that the mapping relationship between the first sentence and the second sentence corresponding to the current element position is one-to-one; if the maximum value is the first semantic similarity value, determining that the mapping relationship is one-to-N; and if the maximum value is the second semantic similarity value, determining that the mapping relationship is N-to-one.
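The comparison rules above can be condensed into one function; a sketch, assuming the reduction step sets a suppressed value to 0 (the disclosure leaves the exact reduction amount open) and that ties resolve to one-to-one:

```python
def classify_mapping(sim1, sim2, sim3, sim4):
    """Apply the comparison rules above.
    sim1: first sentence vs. first merged sentence (one-to-N evidence)
    sim2: second merged sentence vs. second sentence (N-to-one evidence)
    sim3: value at the current element position
    sim4: value at the next element position
    The function name and the choice of 0 as the reduced value are
    illustrative assumptions."""
    if sim1 <= sim3 or sim1 <= sim4:
        sim1 = 0.0  # reduce: merging second sentences did not help
    if sim2 <= sim3 or sim2 <= sim4:
        sim2 = 0.0  # reduce: merging first sentences did not help
    best = max(sim1, sim2, sim4)
    if best == sim4:            # ties fall back to one-to-one here
        return "one-to-one"
    return "one-to-N" if best == sim1 else "N-to-one"

classify_mapping(4.5, 0.5, 3.0, 3.5)  # → 'one-to-N'
```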
Specifically, if the first semantic similarity value sim_1 is smaller than or equal to the third semantic similarity value, or smaller than or equal to the fourth semantic similarity value, sim_1 is reduced; otherwise sim_1 is kept unchanged. The second semantic similarity value sim_2 is then compared with the third and fourth semantic similarity values in the same way: if sim_2 is smaller than or equal to the third semantic similarity value, or smaller than or equal to the fourth semantic similarity value, sim_2 is reduced; otherwise sim_2 is kept unchanged. Of course, sim_2 may also be compared with the third and fourth semantic similarity values first and sim_1 afterwards; the process is similar. On this basis, sim_1 and sim_2 (both taken after any adjustment) are compared with the fourth semantic similarity value. If sim_1 is the largest, the mapping relationship between the first sentence and the second sentence at the current element position is considered to be one-to-N; if sim_2 is the largest, the mapping relationship is considered to be N-to-one; and if the fourth semantic similarity value is the largest, the mapping relationship is considered to be one-to-one.
In the above process, when the first semantic similarity value sim_1 or the second semantic similarity value sim_2 needs to be reduced, this embodiment does not limit the specific amount of the reduction, as long as the subsequent comparison of sim_1, sim_2 and the fourth semantic similarity value remains discriminative, so that the mapping relationship between the first sentence and the second sentence can be determined accurately. For example, when sim_1 needs to be reduced, it can simply be set to 0; likewise, when sim_2 needs to be reduced, it can be set to 0.
When the mapping relationship between the first sentence and the second sentence at the current element position is determined to be one-to-N or N-to-one, the size of N is determined next. For example, when the mapping relationship is one-to-N, the size of N may be determined as follows:
merging the second sentences from the current element position up to the second sentence corresponding to a target element position to obtain a third merged sentence, where the column of the target element position equals the column of the current element position plus N-1, with N initially set to 3;
if the semantic similarity value between the first sentence corresponding to the current element position and the third merged sentence is smaller than or equal to the semantic similarity value corresponding to the current element position, determining that N = 2;
otherwise, letting N = N + 1 and repeating the above operation until the semantic similarity value between the first sentence corresponding to the current element position and the third merged sentence is smaller than or equal to the semantic similarity value corresponding to the current element position, at which point N = N - 1 is determined.
One-to-N means that one first sentence corresponds to a plurality of second sentences, with N greater than or equal to 2. At this point N can first be increased by 1, i.e., it is determined whether N is 3: three second sentences are merged, namely the second sentence corresponding to the current element position and the second sentences corresponding to the next two element positions, to obtain a third merged sentence; note that the element positions of the three merged second sentences correspond to the same first sentence. The semantic similarity value between the first sentence corresponding to the current element position and the third merged sentence is then determined. If it is smaller than or equal to the semantic similarity value corresponding to the current element position, the search stops and N = 2 is determined; if it is larger, N is increased by 1 and the judgment continues until the semantic similarity value between the first sentence corresponding to the current element position and the third merged sentence is smaller than or equal to the semantic similarity value corresponding to the current element position, at which point N = N - 1.
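The loop that sizes N for the one-to-N case can be sketched as follows; `score` again stands in for the trained model, and merging by plain concatenation is an assumption of this sketch:

```python
def find_n_one_to_many(first_sentence, second_sentences, col, score):
    """Grow the run of merged second sentences starting at column `col`:
    beginning with three sentences (N = 3), stop as soon as the merged
    similarity drops to or below the similarity at the current element
    position, and return the previous span length N - 1."""
    base = score(first_sentence, second_sentences[col])
    n = 3
    while col + n - 1 < len(second_sentences):
        merged = " ".join(second_sentences[col:col + n])
        if score(first_sentence, merged) <= base:
            return n - 1
        n += 1
    return n - 1  # ran out of sentences; keep the last span that helped

# Toy scorer: shared tokens minus tokens the candidate sentence adds.
def overlap_score(a, b):
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) - len(tb - ta)
```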
In this method, on the basis of the similarity value matrix, the semantic similarity value between a first sentence and the merged result of several second sentences, or between the merged result of several first sentences and a second sentence, is further determined, and the mapping relationship between first and second sentences is determined from these values. This improves the accuracy of the associated sentence pairs: sentences with the same or similar semantics but large vocabulary differences can be accurately associated, and complex sentences with large syntactic changes and many deleted words can be associated with simple sentences, increasing the amount of parallel corpus. For example, "You should also be careful when taking selfies" and "Think before you take a selfie", "It's more comfortable high up in its clouds" and "But up in the clouds, it's calmer", and "The surface of Venus has burning temperatures and crushing pressures" and "It is very hot and has strong pressures" can all be accurately associated by this embodiment.
When the mapping relationship between the first sentence and the second sentence corresponding to the current element position is determined to be N-to-one, the size of N can be determined as follows:
merging the first sentences from the current element position up to the first sentence corresponding to a target element position to obtain a fourth merged sentence, where the column of the target element position equals the column of the current element position plus N-1, with N initially set to 3;
if the semantic similarity value between the fourth merged sentence and the second sentence corresponding to the current element position is smaller than or equal to the semantic similarity value corresponding to the current element position, determining that N = 2;
otherwise, letting N = N + 1 and repeating the above operation until the semantic similarity value between the fourth merged sentence and the second sentence corresponding to the current element position is smaller than or equal to the semantic similarity value corresponding to the current element position, at which point N = N - 1 is determined.
When the mapping relationship between the first sentence and the second sentence corresponding to the current element position is N-to-one, the process of determining the size of N is similar to that of the one-to-N case and is not repeated here.
S380, judging whether the current element position is the position of the last element in the similarity value matrix; if so, executing S3100; otherwise, executing S390 and returning to S340.
S390, the position of the next element is recorded as the current element position.
S3100, acquiring a target second sentence associated with the first sentence according to the mapping relation, and marking the first sentence and the target second sentence as parallel corpus.
After the traversal of element positions in the similarity value matrix is finished, the target second sentence associated with each first sentence is acquired according to the mapping relationship, and the first sentence and the target second sentence are recorded as parallel corpus.
S3110, inputting the parallel corpus into a text simplification model and training it to obtain a target text simplification model.
In one example, to facilitate collection of the parallel corpus, a sentence association matrix P may be initialized as soon as the similarity value matrix is obtained, to record whether a first sentence is associated with a second sentence. Each element of P is initialized to 0, and P has the same numbers of rows and columns as the similarity value matrix T; the value at a given element position of P indicates whether the first sentence and the second sentence at the same element position of T are associated. The parallel corpus can then be obtained directly from the values at the element positions of P. For example, when the mapping relationship between the second first sentence of the first text and the third second sentence of the second text is determined to be one-to-one, the element in the second row and third column of P can be set to 1 to record that the two sentences are associated. For another example, when the second first sentence of the first text is determined to be associated with the second to fourth second sentences of the second text, the elements in the second row, columns two to four, of P can all be set to 1. Likewise, when the second to fifth first sentences of the first text are determined to be associated with the third second sentence of the second text, the elements in the third column, rows two to five, of P can all be set to 1.
After the traversal is finished, sentences can be taken out according to the element positions of P whose value is 1, and the sentences corresponding to those positions combined into complex-sentence/simple-sentence pairs, yielding the parallel corpus.
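The extraction step above can be sketched as a scan over the 0/1 association matrix P; this version handles the one-to-one and one-to-N cases (a run of 1s along a row), while the N-to-one case, a run of 1s down a column, would need a symmetric pass over columns that is omitted here:

```python
def extract_pairs(P, first_sentences, second_sentences):
    """For each row of P, join the second sentences whose columns are
    marked 1 into one simple sentence paired with that row's first
    (complex) sentence. Merging by plain concatenation is an assumption
    of this sketch."""
    pairs = []
    for i, row in enumerate(P):
        cols = [j for j, v in enumerate(row) if v == 1]
        if cols:
            simple = " ".join(second_sentences[j] for j in cols)
            pairs.append((first_sentences[i], simple))
    return pairs

# Second first sentence associated with the second-to-fourth second
# sentences, as in the example above:
P = [[1, 0, 0, 0],
     [0, 1, 1, 1]]
pairs = extract_pairs(P, ["c1", "c2"], ["s1", "s2", "s3", "s4"])
# → [("c1", "s1"), ("c2", "s2 s3 s4")]
```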
The third embodiment of the present disclosure provides a parallel corpus acquisition method. Building on the foregoing embodiments, semantic similarity values between sentences are determined, and the mapping relationships between sentences are determined from those values. This improves the accuracy of the associated sentence pairs; sentences with the same or similar semantics but large vocabulary differences can be accurately associated, and complex sentences with large syntactic changes and many deleted words can be associated with simple sentences, increasing the amount of parallel corpus.
Example IV
Fig. 4 is a block diagram of a parallel corpus acquisition device according to a fourth embodiment of the present disclosure, where the device may execute the parallel corpus acquisition method described in the foregoing embodiment, as shown in fig. 4, and the device may include:
the splitting module 41 is configured to split a first text and a second text obtained in advance, so as to obtain a first sentence list corresponding to the first text and a second sentence list corresponding to the second text, where the first text and the second text are in the same language, and are used for describing the same content;
a similarity value matrix determining module 42, configured to determine a semantic similarity value between each first sentence in the first sentence list and each second sentence in the second sentence list, so as to obtain a similarity value matrix;
a mapping relationship determining module 43, configured to determine a mapping relationship between the first sentence and the second sentence according to the similarity value matrix, where the mapping relationship includes at least one of one-to-N, N-to-one and one-to-one, and N is an integer greater than or equal to 2;
the parallel corpus obtaining module 44 is configured to obtain a target second sentence associated with the first sentence according to the mapping relationship, and record the first sentence and the target second sentence as parallel corpus.
The fourth embodiment of the present disclosure provides a parallel corpus acquisition device. The device splits a first text and a second text acquired in advance to obtain a first sentence list corresponding to the first text and a second sentence list corresponding to the second text, where the first text and the second text are in the same language and describe the same content; determines the semantic similarity value between each first sentence in the first sentence list and each second sentence in the second sentence list to obtain a similarity value matrix; determines the mapping relationship between the first sentence and the second sentence according to the similarity value matrix, where the mapping relationship includes at least one of one-to-N, N-to-one and one-to-one, and N is an integer greater than or equal to 2; and acquires a target second sentence associated with the first sentence according to the mapping relationship, recording the first sentence and the target second sentence as parallel corpus. The scheme determines the mapping relationship between sentences based on their semantic similarity values, improving the accuracy of the associated sentence pairs and, in turn, the accuracy of the parallel corpus.
Based on the above embodiment, the similarity value matrix determining module 42 is specifically configured to:
inputting a first sentence in the first sentence list and a second sentence in the second sentence list into a semantic similarity value model, which outputs the semantic similarity value of the first sentence and the second sentence, the semantic similarity value model having been trained on sentence pairs with different semantic similarity values;
sequentially arranging the semantic similarity values corresponding to the first sentences to obtain a similarity value matrix, where the number of rows of the similarity value matrix equals the number of first sentences contained in the first sentence list and the number of columns equals the number of second sentences contained in the second sentence list.
On the basis of the above embodiment, the mapping relation determining module 43 is specifically configured to:
the position of the first element of the similarity value matrix is recorded as the current element position;
if the semantic similarity value corresponding to the current element position is equal to a first preset value, determining that the mapping relation between the first sentence corresponding to the current element position and the second sentence is one-to-one, wherein the first preset value is used for indicating that the semantic similarity degree of the first sentence corresponding to the current element position and the second sentence is highest;
The position of the next element is recorded as the current element position, and the above operation is repeatedly performed.
On the basis of the above embodiment, the mapping relation determining module 43 is specifically configured to:
the position of the first element of the similarity value matrix is recorded as the current element position;
if the semantic similarity value corresponding to the current element position is smaller than a first preset value, merging a second sentence corresponding to the current element position with a second sentence corresponding to the next element position to obtain a first merged sentence, wherein the first preset value is used for indicating that the semantic similarity degree of the first sentence corresponding to the current element position and the second sentence is highest; merging the first sentence corresponding to the current element position with the first sentence corresponding to the next element position to obtain a second merged sentence;
determining a mapping relation between the first sentence corresponding to the current element position and the second sentence according to the semantic similarity value between the first sentence corresponding to the current element position and the first merged sentence, the semantic similarity value between the second merged sentence and the second sentence corresponding to the current element position and the semantic similarity value corresponding to the next element position;
The position of the next element is recorded as the current element position, and the above operation is repeatedly performed.
On the basis of the above embodiment, the semantic similarity value of the first sentence corresponding to the current element position and the first merged sentence is recorded as a first semantic similarity value, the semantic similarity value of the second sentence corresponding to the second merged sentence and the current element position is recorded as a second semantic similarity value, the semantic similarity value corresponding to the current element position is recorded as a third semantic similarity value, and the semantic similarity value corresponding to the next element position is recorded as a fourth semantic similarity value;
the mapping relation determining module 43 is specifically configured to:
if the first semantic similarity value is smaller than or equal to the third semantic similarity value or the first semantic similarity value is smaller than or equal to the fourth semantic similarity value, reducing the first semantic similarity value, otherwise, keeping the first semantic similarity value unchanged;
if the second semantic similarity value is smaller than or equal to the third semantic similarity value or the second semantic similarity value is smaller than or equal to the fourth semantic similarity value, reducing the second semantic similarity value, otherwise, keeping the second semantic similarity value unchanged;
Determining a maximum value of the first semantic similarity value, the second semantic similarity value and the fourth semantic similarity value;
if the maximum value is the fourth semantic similarity value, determining that the mapping relationship between the first sentence and the second sentence corresponding to the current element position is one-to-one; if the maximum value is the first semantic similarity value, determining that the mapping relationship is one-to-N; and if the maximum value is the second semantic similarity value, determining that the mapping relationship is N-to-one.
On the basis of the above embodiment, when the mapping relationship between the first sentence and the second sentence corresponding to the current element position is one-to-N, the size of N is determined as follows:
merging the second sentence corresponding to the current element position into the second sentence corresponding to the target element position to obtain a third merged sentence, wherein the column of the target element position equals the column of the current element position plus N-1, with N initialized to 3;
if the semantic similarity value between the first sentence corresponding to the current element position and the third merged sentence is smaller than or equal to the semantic similarity value corresponding to the current element position, determining that N=2;
otherwise, letting N=N+1 and repeating the above operation until the semantic similarity value between the first sentence corresponding to the current element position and the resulting third merged sentence is smaller than or equal to the semantic similarity value corresponding to the current element position, and then determining that N=N-1.
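The N-search just described can be sketched as a simple growing-window loop. This is a hedged illustration: `sim` stands for the trained semantic similarity model (a toy word-overlap scorer is included only so the sketch runs), and the reading that the span at step N covers the N second sentences starting at the current column is an assumption drawn from the "column plus N-1" wording.

```python
def determine_n(first_sentence, second_sentences, start_col, current_sim, sim):
    """Return the size N (>= 2) of the one-to-N span starting at start_col."""
    n = 3  # the text starts the search at N=3
    while True:
        end = start_col + n  # target column = current column + N - 1 (inclusive)
        if end > len(second_sentences):
            return n - 1     # ran out of second sentences; keep the last valid span
        merged = " ".join(second_sentences[start_col:end])
        if sim(first_sentence, merged) <= current_sim:
            return n - 1     # merging one more sentence stopped helping: N = N - 1
        n += 1

def word_jaccard(a, b):
    """Toy stand-in for the semantic similarity model (illustration only)."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0
```

With `first_sentence = "a b c"` and second sentences `["a", "b", "c", "x y z w"]`, the window grows to three sentences (similarity 1.0) and stops when the fourth sentence drags the score down, so the search returns N=3.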
On the basis of the above embodiment, when the mapping relationship between the first sentence and the second sentence corresponding to the current element position is N-to-one, the size of N is determined as follows:
merging the first sentence corresponding to the current element position into the first sentence corresponding to the target element position to obtain a fourth merged sentence, wherein the row of the target element position equals the row of the current element position plus N-1, with N initialized to 3;
if the semantic similarity value between the fourth merged sentence and the second sentence corresponding to the current element position is smaller than or equal to the semantic similarity value corresponding to the current element position, determining that N=2;
otherwise, letting N=N+1 and repeating the above operation until the semantic similarity value between the resulting fourth merged sentence and the second sentence corresponding to the current element position is smaller than or equal to the semantic similarity value corresponding to the current element position, and then determining that N=N-1.
Based on the above embodiment, the parallel corpus obtaining module 44 is specifically configured to:
if the mapping relation is one-to-one, marking the second sentence corresponding to the mapping relation as the target second sentence;
if the mapping relation is one-to-N, merging the second sentences corresponding to the mapping relation to obtain the target second sentence;
and if the mapping relation is N-to-one, merging the first sentences corresponding to the mapping relation, and marking the second sentence corresponding to the mapping relation as the target second sentence.
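The three cases of the target-sentence assembly above can be sketched as follows. The spans are lists of the sentences matched at one element position; joining with a single space is an assumed merge strategy, since the text only says "merging".

```python
def make_parallel_pair(relation, first_span, second_span):
    """Return the (first sentence, target second sentence) parallel pair."""
    if relation == "one-to-one":
        return first_span[0], second_span[0]
    if relation == "one-to-N":
        # One first sentence maps to N second sentences: merge the second side.
        return first_span[0], " ".join(second_span)
    # N-to-one: merge the first sentences; the lone second sentence is the target.
    return " ".join(first_span), second_span[0]
```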
On the basis of the above embodiment, the apparatus may further include:
the training module is used for, after the first sentence and the target second sentence are marked as parallel corpus, inputting the parallel corpus into a text simplification model and training the text simplification model to obtain a target text simplification model, wherein the target text simplification model is used for converting complex text into simple text.
The parallel corpus acquisition device provided by this embodiment of the present disclosure is based on the same concept as the parallel corpus acquisition method provided by the above embodiments; technical details not described in detail in this embodiment can be found in the above embodiments, and this embodiment has the same beneficial effects as executing the parallel corpus acquisition method.
Example five
Referring now to fig. 5, a schematic structural diagram of an electronic device 500 suitable for implementing embodiments of the present disclosure is shown. The electronic devices in the embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), and in-vehicle terminals (e.g., in-vehicle navigation terminals), and stationary terminals such as digital TVs and desktop computers. The electronic device shown in fig. 5 is merely an example and should not impose any limitation on the functionality and scope of use of the embodiments of the present disclosure.
As shown in fig. 5, the electronic device 500 may include a processing means (e.g., a central processing unit, a graphics processor, etc.) 501, which may perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 502 or a program loaded from a storage means 508 into a Random Access Memory (RAM) 503. In the RAM 503, various programs and data required for the operation of the electronic apparatus 500 are also stored. The processing device 501, the ROM 502, and the RAM 503 are connected to each other via a bus 504. An input/output (I/O) interface 505 is also connected to bus 504.
In general, the following devices may be connected to the I/O interface 505: input devices 506 including, for example, a touch screen, touchpad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; an output device 507 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 508 including, for example, magnetic tape, hard disk, etc.; and communication means 509. The communication means 509 may allow the electronic device 500 to communicate with other devices wirelessly or by wire to exchange data. While fig. 5 shows an electronic device 500 having various means, it is to be understood that not all of the illustrated means are required to be implemented or provided. More or fewer devices may be implemented or provided instead.
In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a non-transitory computer readable medium, the computer program comprising program code for performing the method shown in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication means 509, or from the storage means 508, or from the ROM 502. The above-described functions defined in the methods of the embodiments of the present disclosure are performed when the computer program is executed by the processing device 501.
Example six
The computer readable medium described above in the present disclosure may be a computer readable signal medium or a computer readable storage medium or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present disclosure, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. 
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, fiber optic cables, RF (radio frequency), and the like, or any suitable combination of the foregoing.
In some implementations, the clients and servers may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), internetworks (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed networks.
The computer readable medium may be contained in the electronic device; or may exist alone without being incorporated into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: split a first text and a second text which are acquired in advance to obtain a first sentence list corresponding to the first text and a second sentence list corresponding to the second text, wherein the first text and the second text are in the same language and are used for describing the same content; determine semantic similarity values between each first sentence in the first sentence list and each second sentence in the second sentence list to obtain a similarity value matrix; determine a mapping relation between the first sentence and the second sentence according to the similarity value matrix, wherein the mapping relation comprises at least one of one-to-N, N-to-one, and one-to-one, and N is an integer greater than or equal to 2; and acquire a target second sentence associated with the first sentence according to the mapping relation, and mark the first sentence and the target second sentence as parallel corpus.
Computer program code for carrying out operations of the present disclosure may be written in one or more programming languages, including, but not limited to, object oriented programming languages such as Java, Smalltalk, or C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units involved in the embodiments of the present disclosure may be implemented by software or by hardware. In some cases, the name of a module does not limit the module itself; for example, the splitting module may also be described as a module for splitting a first text and a second text acquired in advance to obtain a first sentence list corresponding to the first text and a second sentence list corresponding to the second text.
The functions described above herein may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: a Field Programmable Gate Array (FPGA), an Application Specific Integrated Circuit (ASIC), an Application Specific Standard Product (ASSP), a system on a chip (SOC), a Complex Programmable Logic Device (CPLD), and the like.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
According to one or more embodiments of the present disclosure, the present disclosure provides a parallel corpus acquisition method, including:
splitting a first text and a second text which are acquired in advance to obtain a first sentence list corresponding to the first text and a second sentence list corresponding to the second text, wherein the first text and the second text are in the same language and are used for describing the same content;
determining semantic similarity values between each first sentence in the first sentence list and each second sentence in the second sentence list to obtain a similarity value matrix;
determining a mapping relation between the first sentence and the second sentence according to the similarity value matrix, wherein the mapping relation comprises at least one of one-to-N, N-to-one, and one-to-one, and N is an integer greater than or equal to 2;
and acquiring a target second sentence associated with the first sentence according to the mapping relation, and marking the first sentence and the target second sentence as parallel corpus.
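The four steps above can be walked through end to end on toy data. In this sketch, a bag-of-words overlap score stands in for the trained semantic similarity model, sentences are split naively on periods, and only the one-to-one mapping case is handled; all three simplifications are assumptions made for illustration.

```python
def split_sentences(text):
    # Step 1: split a text into a sentence list (naive period split).
    return [s.strip() for s in text.split(".") if s.strip()]

def word_overlap(a, b):
    # Toy stand-in for the semantic similarity value model.
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def mine_parallel_pairs(first_text, second_text, threshold=0.5):
    first_list = split_sentences(first_text)
    second_list = split_sentences(second_text)
    # Step 2: similarity value matrix (rows: first sentences, columns: second).
    matrix = [[word_overlap(f, s) for s in second_list] for f in first_list]
    # Steps 3-4: keep diagonal element positions whose similarity clears the
    # threshold and emit each surviving (first, target second) pair.
    return [
        (first_list[i], second_list[i])
        for i in range(min(len(first_list), len(second_list)))
        if matrix[i][i] >= threshold
    ]
```

For a complex text and its simplified counterpart, each kept pair is one parallel-corpus entry usable as training data for the text simplification model described later.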
According to one or more embodiments of the present disclosure, in the parallel corpus obtaining method provided by the present disclosure, determining a semantic similarity value between each first sentence in the first sentence list and each second sentence in the second sentence list, to obtain a similarity value matrix includes:
inputting a first sentence in the first sentence list and a second sentence in the second sentence list into a semantic similarity value model, and outputting, by the semantic similarity value model, the semantic similarity value of the first sentence and the second sentence, wherein the semantic similarity value model is trained on sentence pairs with different semantic similarity values;
sequentially arranging the semantic similarity values corresponding to each first sentence to obtain the similarity value matrix, wherein the number of rows of the similarity value matrix equals the number of first sentences in the first sentence list, and the number of columns of the similarity value matrix equals the number of second sentences in the second sentence list.
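The matrix-building step above amounts to scoring every (first sentence, second sentence) pair. In this minimal sketch, `similarity_model` stands in for the trained semantic similarity value model: any callable scoring a sentence pair fits, which is an assumption made for illustration.

```python
def build_similarity_matrix(first_list, second_list, similarity_model):
    """Rows follow the first sentence list, columns the second sentence list."""
    return [[similarity_model(f, s) for s in second_list] for f in first_list]
```

The resulting matrix therefore has `len(first_list)` rows and `len(second_list)` columns, matching the dimensions stated above.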
According to one or more embodiments of the present disclosure, in the parallel corpus obtaining method provided by the present disclosure, the determining, according to the similarity matrix, a mapping relationship between the first sentence and the second sentence includes:
the position of the first element of the similarity value matrix is recorded as the current element position;
if the semantic similarity value corresponding to the current element position is equal to a first preset value, determining that the mapping relation between the first sentence corresponding to the current element position and the second sentence is one-to-one, wherein the first preset value is used for indicating that the semantic similarity degree of the first sentence corresponding to the current element position and the second sentence is highest;
The position of the next element is recorded as the current element position, and the above operation is repeatedly performed.
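The scan just described can be sketched as a walk over element positions. Two assumptions are made for illustration: "the next element position" is taken to be the next diagonal element, and the first preset value (the score indicating the highest similarity) is taken to be 1.0; the text fixes neither.

```python
def scan_one_to_one(matrix, preset=1.0):
    """Return (row, col) element positions whose value equals the preset."""
    mappings = []
    i = j = 0
    while i < len(matrix) and j < len(matrix[0]):
        if matrix[i][j] == preset:
            mappings.append((i, j))  # one-to-one mapping at this position
        i, j = i + 1, j + 1          # record the next element position as current
    return mappings
```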
According to one or more embodiments of the present disclosure, in the parallel corpus obtaining method provided by the present disclosure, the determining, according to the similarity matrix, a mapping relationship between the first sentence and the second sentence includes:
the position of the first element of the similarity value matrix is recorded as the current element position;
if the semantic similarity value corresponding to the current element position is smaller than a first preset value, merging a second sentence corresponding to the current element position with a second sentence corresponding to the next element position to obtain a first merged sentence, wherein the first preset value is used for indicating that the semantic similarity degree of the first sentence corresponding to the current element position and the second sentence is highest; merging the first sentence corresponding to the current element position with the first sentence corresponding to the next element position to obtain a second merged sentence;
determining a mapping relation between the first sentence corresponding to the current element position and the second sentence according to the semantic similarity value between the first sentence corresponding to the current element position and the first merged sentence, the semantic similarity value between the second merged sentence and the second sentence corresponding to the current element position and the semantic similarity value corresponding to the next element position;
The position of the next element is recorded as the current element position, and the above operation is repeatedly performed.
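The candidate construction above gathers the four values later fed into the mapping decision. In this sketch, treating "the next element position" as the next diagonal element (i+1, j+1) and joining sentences with a space are assumptions; `sim` is any sentence-pair scorer standing in for the trained model, with a toy word-overlap scorer included so the sketch runs.

```python
def merge_candidates(first_list, second_list, matrix, i, j, sim):
    """Build both merged candidates at element position (i, j) and score them."""
    first_merged = second_list[j] + " " + second_list[j + 1]  # merged second sentences
    second_merged = first_list[i] + " " + first_list[i + 1]   # merged first sentences
    return {
        "first": sim(first_list[i], first_merged),     # first semantic similarity value
        "second": sim(second_merged, second_list[j]),  # second semantic similarity value
        "third": matrix[i][j],                         # current element position
        "fourth": matrix[i + 1][j + 1],                # next element position
    }

def toy_sim(a, b):
    """Toy stand-in scorer (illustration only)."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0
```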
According to one or more embodiments of the present disclosure, in the parallel corpus obtaining method provided by the present disclosure, the semantic similarity value between the first sentence corresponding to the current element position and the first merged sentence is denoted as a first semantic similarity value, the semantic similarity value between the second merged sentence and the second sentence corresponding to the current element position is denoted as a second semantic similarity value, the semantic similarity value corresponding to the current element position is denoted as a third semantic similarity value, and the semantic similarity value corresponding to the next element position is denoted as a fourth semantic similarity value;
the determining the mapping relationship between the first sentence and the second sentence corresponding to the current element position according to the semantic similarity between the first sentence and the first merged sentence corresponding to the current element position, the semantic similarity between the second merged sentence and the second sentence corresponding to the current element position, and the semantic similarity corresponding to the next element position includes:
if the first semantic similarity value is smaller than or equal to the third semantic similarity value or the first semantic similarity value is smaller than or equal to the fourth semantic similarity value, reducing the first semantic similarity value, otherwise, keeping the first semantic similarity value unchanged;
If the second semantic similarity value is smaller than or equal to the third semantic similarity value or the second semantic similarity value is smaller than or equal to the fourth semantic similarity value, reducing the second semantic similarity value, otherwise, keeping the second semantic similarity value unchanged;
determining a maximum value of the first semantic similarity value, the second semantic similarity value and the fourth semantic similarity value;
if the maximum value is the fourth semantic similarity value, determining that the mapping relation between the first sentence and the second sentence corresponding to the current element position is one-to-one; if the maximum value is the first semantic similarity value, determining that the mapping relation between the first sentence and the second sentence corresponding to the current element position is one-to-N; and if the maximum value is the second semantic similarity value, determining that the mapping relation between the first sentence and the second sentence corresponding to the current element position is N-to-one.
According to one or more embodiments of the present disclosure, in the parallel corpus obtaining method provided by the present disclosure, when the mapping relationship between the first sentence and the second sentence corresponding to the current element position is one-to-N, the size of N is determined as follows:
merging the second sentence corresponding to the current element position into the second sentence corresponding to the target element position to obtain a third merged sentence, wherein the column of the target element position equals the column of the current element position plus N-1, with N initialized to 3;
if the semantic similarity value between the first sentence corresponding to the current element position and the third merged sentence is smaller than or equal to the semantic similarity value corresponding to the current element position, determining that N=2;
otherwise, letting N=N+1 and repeating the above operation until the semantic similarity value between the first sentence corresponding to the current element position and the resulting third merged sentence is smaller than or equal to the semantic similarity value corresponding to the current element position, and then determining that N=N-1.
According to one or more embodiments of the present disclosure, in the parallel corpus obtaining method provided by the present disclosure, when the mapping relationship between the first sentence and the second sentence corresponding to the current element position is N-to-one, the size of N is determined as follows:
merging the first sentence corresponding to the current element position into the first sentence corresponding to the target element position to obtain a fourth merged sentence, wherein the row of the target element position equals the row of the current element position plus N-1, with N initialized to 3;
if the semantic similarity value between the fourth merged sentence and the second sentence corresponding to the current element position is smaller than or equal to the semantic similarity value corresponding to the current element position, determining that N=2;
otherwise, letting N=N+1 and repeating the above operation until the semantic similarity value between the resulting fourth merged sentence and the second sentence corresponding to the current element position is smaller than or equal to the semantic similarity value corresponding to the current element position, and then determining that N=N-1.
According to one or more embodiments of the present disclosure, in the parallel corpus obtaining method provided by the present disclosure, the obtaining, according to the mapping relationship, a target second sentence associated with the first sentence includes:
if the mapping relation is one-to-one, marking the second sentence corresponding to the mapping relation as the target second sentence;
if the mapping relation is one-to-N, merging the second sentences corresponding to the mapping relation to obtain the target second sentence;
and if the mapping relation is N-to-one, merging the first sentences corresponding to the mapping relation, and marking the second sentence corresponding to the mapping relation as the target second sentence.
According to one or more embodiments of the present disclosure, in the parallel corpus obtaining method provided by the present disclosure, after the first sentence and the target second sentence are recorded as parallel corpora, the method further includes:
inputting the parallel corpus into a text simplification model, and training the text simplification model to obtain a target text simplification model, wherein the target text simplification model is used for converting complex text into simple text.
According to one or more embodiments of the present disclosure, the present disclosure provides a parallel corpus acquisition device, including:
the splitting module is used for splitting a first text and a second text which are acquired in advance to obtain a first sentence list corresponding to the first text and a second sentence list corresponding to the second text, wherein the first text and the second text are in the same language and are used for describing the same content;
the similarity value matrix determining module is used for determining semantic similarity values between each first sentence in the first sentence list and each second sentence in the second sentence list to obtain a similarity value matrix;
the mapping relation determining module is used for determining the mapping relation between the first sentence and the second sentence according to the similarity value matrix, wherein the mapping relation comprises at least one of one-to-N, N-to-one, and one-to-one, and N is an integer greater than or equal to 2;
the parallel corpus acquisition module is used for acquiring a target second sentence associated with the first sentence according to the mapping relation, and recording the first sentence and the target second sentence as parallel corpus.
According to one or more embodiments of the present disclosure, the present disclosure provides an electronic device comprising:
One or more processors;
a memory for storing one or more programs;
when the one or more programs are executed by the one or more processors, the one or more processors implement the parallel corpus acquisition method as described in any embodiment of the present disclosure.
According to one or more embodiments of the present disclosure, the present disclosure provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements a parallel corpus acquisition method as described in any of the present disclosure.
The foregoing description is only of the preferred embodiments of the present disclosure and an explanation of the principles of the technology employed. It will be appreciated by persons skilled in the art that the scope of the disclosure is not limited to the specific combinations of features described above, but also covers other embodiments formed by any combination of the above features or their equivalents without departing from the spirit of the disclosure, for example, embodiments formed by substituting the above features with (but not limited to) technical features having similar functions disclosed in the present disclosure.
Moreover, although operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. In certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limiting the scope of the present disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are example forms of implementing the claims.

Claims (11)

1. A parallel corpus acquisition method, characterized by comprising the following steps:
splitting a first text and a second text which are acquired in advance to obtain a first sentence list corresponding to the first text and a second sentence list corresponding to the second text, wherein the first text and the second text are in the same language and are used for describing the same content;
determining a semantic similarity value between each first sentence in the first sentence list and each second sentence in the second sentence list to obtain a similarity value matrix, wherein the semantic similarity value is used for representing the semantic similarity degree between the two sentences;
determining a mapping relation between the first sentence and the second sentence according to the similarity value matrix, wherein the mapping relation comprises at least one of one-to-N, N-to-one, and one-to-one, and N is an integer greater than or equal to 2;
Acquiring a target second sentence associated with the first sentence according to the mapping relation, and marking the first sentence and the target second sentence as parallel corpus;
the determining the mapping relation between the first sentence and the second sentence according to the similarity value matrix includes:
the position of the first element of the similarity value matrix is recorded as the current element position;
if the semantic similarity value corresponding to the current element position is smaller than a first preset value, merging the second sentence corresponding to the current element position with the second sentence corresponding to the next element position to obtain a first merged sentence, wherein the first preset value indicates that the semantic similarity between the first sentence and the second sentence corresponding to the current element position is highest; and merging the first sentence corresponding to the current element position with the first sentence corresponding to the next element position to obtain a second merged sentence;
determining the mapping relation between the first sentence and the second sentence corresponding to the current element position according to the semantic similarity value between the first sentence corresponding to the current element position and the first merged sentence, the semantic similarity value between the second merged sentence and the second sentence corresponding to the current element position, and the semantic similarity value corresponding to the next element position;
the position of the next element is recorded as the current element position, and the above operations are repeatedly performed.
2. The method of claim 1, wherein determining the semantic similarity value between each first sentence in the first sentence list and each second sentence in the second sentence list to obtain the similarity value matrix comprises:
inputting a first sentence in the first sentence list and a second sentence in the second sentence list into a semantic similarity value model, and outputting, by the semantic similarity value model, the semantic similarity value of the first sentence and the second sentence, wherein the semantic similarity value model is trained on sentence pairs with different semantic similarity values;
sequentially arranging the semantic similarity values corresponding to each first sentence to obtain the similarity value matrix, wherein the number of rows of the similarity value matrix is equal to the number of first sentences in the first sentence list, and the number of columns of the similarity value matrix is equal to the number of second sentences in the second sentence list.
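The matrix construction of claim 2 can be sketched as follows. The trained semantic similarity value model is replaced here by a character-overlap scorer (`difflib.SequenceMatcher`) purely as a stand-in, so the scorer and the function names are illustrative assumptions, not the patented model:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    # Stand-in scorer; the patent uses a trained semantic similarity value model.
    return SequenceMatcher(None, a, b).ratio()

def build_similarity_matrix(first_list, second_list):
    # Rows index first sentences, columns index second sentences (claim 2).
    return [[similarity(f, s) for s in second_list] for f in first_list]

matrix = build_similarity_matrix(
    ["the cat sat on the mat", "it rained all day"],
    ["the cat sat on a mat", "rain fell all day"],
)
```

The row/column shape matches claim 2: one row per first sentence, one column per second sentence.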
3. The method of claim 1, wherein the determining the mapping relationship between the first sentence and the second sentence according to the similarity value matrix comprises:
the position of the first element of the similarity value matrix is recorded as the current element position;
if the semantic similarity value corresponding to the current element position is equal to a first preset value, determining that the mapping relation between the first sentence corresponding to the current element position and the second sentence is one-to-one, wherein the first preset value is used for indicating that the semantic similarity degree of the first sentence corresponding to the current element position and the second sentence is highest;
the position of the next element is recorded as the current element position, and the above operation is repeatedly performed.
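Claims 1 and 3 together describe a single pass over the matrix: equality with the first preset value yields a one-to-one mapping, and anything lower defers to the merge comparison of claim 4. A minimal sketch, assuming the "next element position" advances along the diagonal (a detail the claims do not fix):

```python
def traverse(matrix, preset=1.0):
    # Walk element positions; the diagonal step is an assumption.
    mappings = []
    i = j = 0
    while i < len(matrix) and j < len(matrix[0]):
        if matrix[i][j] == preset:
            mappings.append(("one-to-one", i, j))   # claim 3 branch
        else:
            mappings.append(("merge-check", i, j))  # claim 1 / claim 4 branch
        i, j = i + 1, j + 1
    return mappings
```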
4. The method according to claim 3, wherein the semantic similarity value between the first sentence corresponding to the current element position and the first merged sentence is denoted as a first semantic similarity value, the semantic similarity value between the second merged sentence and the second sentence corresponding to the current element position is denoted as a second semantic similarity value, the semantic similarity value corresponding to the current element position is denoted as a third semantic similarity value, and the semantic similarity value corresponding to the next element position is denoted as a fourth semantic similarity value;
the determining the mapping relation between the first sentence and the second sentence corresponding to the current element position according to the semantic similarity value between the first sentence corresponding to the current element position and the first merged sentence, the semantic similarity value between the second merged sentence and the second sentence corresponding to the current element position, and the semantic similarity value corresponding to the next element position includes:
if the first semantic similarity value is smaller than or equal to the third semantic similarity value, or the first semantic similarity value is smaller than or equal to the fourth semantic similarity value, reducing the first semantic similarity value; otherwise, keeping the first semantic similarity value unchanged;
if the second semantic similarity value is smaller than or equal to the third semantic similarity value, or the second semantic similarity value is smaller than or equal to the fourth semantic similarity value, reducing the second semantic similarity value; otherwise, keeping the second semantic similarity value unchanged;
determining a maximum value among the first semantic similarity value, the second semantic similarity value, and the fourth semantic similarity value;
if the maximum value is the fourth semantic similarity value, determining that the mapping relation between the first sentence and the second sentence corresponding to the current element position is one-to-one; if the maximum value is the first semantic similarity value, determining that the mapping relation is one-to-N; and if the maximum value is the second semantic similarity value, determining that the mapping relation is N-to-one.
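The comparison in claim 4 reduces to picking the largest of three adjusted scores. In this sketch the "reduce" step is implemented as multiplication by a penalty factor, which is an assumption; the claim only requires that the value be lowered:

```python
def decide_mapping(first_merge, second_merge, current, nxt, penalty=0.5):
    # A merge score that fails to exceed both the current and next scores is penalized.
    if first_merge <= current or first_merge <= nxt:
        first_merge *= penalty
    if second_merge <= current or second_merge <= nxt:
        second_merge *= penalty
    best = max(first_merge, second_merge, nxt)
    if best == nxt:
        return "one-to-one"
    return "one-to-N" if best == first_merge else "N-to-one"
```

For example, a high score for the first merged sentence that beats both neighbors signals that the first sentence spans several second sentences, hence one-to-N.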
5. The method of claim 4, wherein when the mapping relation between the first sentence and the second sentence corresponding to the current element position is one-to-N, the size of N is determined as follows:
merging the second sentences from the current element position to a target element position to obtain a third merged sentence, wherein the column of the target element position is the column of the current element position plus N-1, with N initialized to 3;
if the semantic similarity value between the first sentence corresponding to the current element position and the third merged sentence is smaller than or equal to the semantic similarity value corresponding to the current element position, determining that N=2;
otherwise, letting N=N+1 and repeating the above operation until the semantic similarity value between the first sentence corresponding to the current element position and the third merged sentence is smaller than or equal to the semantic similarity value corresponding to the current element position, and then determining N=N-1.
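Claims 5 and 6 grow the merge window one sentence at a time. The sketch below (one-to-N case; the N-to-one case of claim 6 is symmetric over rows) uses the closely related stopping rule of comparing successive merge scores rather than the matrix value at the current position, and joins sentences with a space; both choices, plus the character-overlap stand-in scorer, are assumptions:

```python
from difflib import SequenceMatcher

def score(a, b):
    # Stand-in for the trained semantic similarity value model.
    return SequenceMatcher(None, a, b).ratio()

def determine_n(first, seconds, col):
    # One-to-N case: extend seconds[col:col+n] while merging keeps improving.
    n = 2
    best = score(first, " ".join(seconds[col:col + n]))
    while col + n < len(seconds):
        cand = score(first, " ".join(seconds[col:col + n + 1]))
        if cand <= best:
            break  # adding one more second sentence stopped helping
        best, n = cand, n + 1
    return n
```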
6. The method of claim 4, wherein when the mapping relation between the first sentence and the second sentence corresponding to the current element position is N-to-one, the size of N is determined as follows:
merging the first sentences from the current element position to a target element position to obtain a fourth merged sentence, wherein the row of the target element position is the row of the current element position plus N-1, with N initialized to 3;
if the semantic similarity value between the fourth merged sentence and the second sentence corresponding to the current element position is smaller than or equal to the semantic similarity value corresponding to the current element position, determining that N=2;
otherwise, letting N=N+1 and repeating the above operation until the semantic similarity value between the fourth merged sentence and the second sentence corresponding to the current element position is smaller than or equal to the semantic similarity value corresponding to the current element position, and then determining N=N-1.
7. The method of claim 1, wherein the obtaining, according to the mapping relationship, a target second sentence associated with the first sentence comprises:
if the mapping relation is one-to-one, marking a second sentence corresponding to the mapping relation as a target second sentence;
if the mapping relation is one-to-N, merging the second sentences corresponding to the mapping relation to obtain the target second sentence;
and if the mapping relation is N-to-one, merging the first sentences corresponding to the mapping relation, and marking the second sentence corresponding to the mapping relation as the target second sentence.
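The three branches of claim 7 can be materialized directly; joining merged sentences with a space, and the function name, are assumptions:

```python
def make_pair(kind, first_sents, second_sents):
    # Produce one (first sentence, target second sentence) pair per resolved mapping.
    if kind == "one-to-one":
        return first_sents[0], second_sents[0]
    if kind == "one-to-N":
        return first_sents[0], " ".join(second_sents)
    if kind == "N-to-one":
        return " ".join(first_sents), second_sents[0]
    raise ValueError(f"unknown mapping relation: {kind}")
```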
8. The method of any of claims 1-7, further comprising, after marking the first sentence and the target second sentence as a parallel corpus:
inputting the parallel corpus into a text simplification model and training the text simplification model to obtain a target text simplification model, wherein the target text simplification model is used for converting a complex text into a simple text.
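Claim 8 feeds the aligned pairs into a text simplification model. A minimal sketch of the data-preparation step, writing JSON-lines records for a downstream seq2seq trainer; the field names and the convention that the first sentence is the complex side are assumptions:

```python
import json
import os
import tempfile

def export_training_pairs(parallel_corpus, path):
    # One {"complex": ..., "simple": ...} record per line (JSON Lines format).
    with open(path, "w", encoding="utf-8") as f:
        for first, target_second in parallel_corpus:
            record = {"complex": first, "simple": target_second}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

path = os.path.join(tempfile.gettempdir(), "parallel_corpus.jsonl")
export_training_pairs([("a long convoluted sentence", "a short one")], path)
```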
9. A parallel corpus acquisition device, comprising:
the splitting module is used for splitting a first text and a second text which are acquired in advance to obtain a first sentence list corresponding to the first text and a second sentence list corresponding to the second text, wherein the first text and the second text are in the same language and are used for describing the same content;
the similarity value matrix determining module is used for determining a semantic similarity value between each first sentence in the first sentence list and each second sentence in the second sentence list to obtain a similarity value matrix, wherein the semantic similarity value is used for representing the semantic similarity degree between the two sentences;
the mapping relation determining module is used for determining the mapping relation between the first sentence and the second sentence according to the similarity value matrix, wherein the mapping relation comprises at least one of one-to-N, N-to-one, and one-to-one, and N is an integer greater than or equal to 2;
the parallel corpus acquisition module is used for acquiring a target second sentence associated with the first sentence according to the mapping relation and marking the first sentence and the target second sentence as parallel corpus;
the mapping relation determining module is specifically configured to:
the position of the first element of the similarity value matrix is recorded as the current element position;
if the semantic similarity value corresponding to the current element position is smaller than a first preset value, merging the second sentence corresponding to the current element position with the second sentence corresponding to the next element position to obtain a first merged sentence, wherein the first preset value indicates that the semantic similarity between the first sentence and the second sentence corresponding to the current element position is highest; and merging the first sentence corresponding to the current element position with the first sentence corresponding to the next element position to obtain a second merged sentence;
determining the mapping relation between the first sentence and the second sentence corresponding to the current element position according to the semantic similarity value between the first sentence corresponding to the current element position and the first merged sentence, the semantic similarity value between the second merged sentence and the second sentence corresponding to the current element position, and the semantic similarity value corresponding to the next element position;
the position of the next element is recorded as the current element position, and the above operations are repeatedly performed.
10. An electronic device, comprising:
one or more processors;
a memory for storing one or more programs;
wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the parallel corpus acquisition method according to any of claims 1-8.
11. A computer-readable storage medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, implements a parallel corpus acquisition method according to any of claims 1-8.
CN202110181644.5A 2021-02-08 2021-02-08 Parallel corpus acquisition method, device, equipment and storage medium Active CN112906371B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110181644.5A CN112906371B (en) 2021-02-08 2021-02-08 Parallel corpus acquisition method, device, equipment and storage medium


Publications (2)

Publication Number Publication Date
CN112906371A CN112906371A (en) 2021-06-04
CN112906371B true CN112906371B (en) 2024-03-01

Family

ID=76123432

Country Status (1)

Country Link
CN (1) CN112906371B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105868187A (en) * 2016-03-25 2016-08-17 北京语言大学 A multi-translation version parallel corpus establishing method
CN109670178A (en) * 2018-12-20 2019-04-23 龙马智芯(珠海横琴)科技有限公司 Sentence-level bilingual alignment method and device, computer readable storage medium
CN109710950A (en) * 2018-12-20 2019-05-03 龙马智芯(珠海横琴)科技有限公司 Bilingual alignment method, apparatus and system
CN110362820A (en) * 2019-06-17 2019-10-22 昆明理工大学 A kind of bilingual parallel sentence extraction method of old man based on Bi-LSTM algorithm
CN110427629A (en) * 2019-08-13 2019-11-08 苏州思必驰信息科技有限公司 Semi-supervised text simplified model training method and system
CN110781686A (en) * 2019-10-30 2020-02-11 普信恒业科技发展(北京)有限公司 Statement similarity calculation method and device and computer equipment

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10713432B2 (en) * 2017-03-31 2020-07-14 Adobe Inc. Classifying and ranking changes between document versions
KR102637340B1 (en) * 2018-08-31 2024-02-16 삼성전자주식회사 Method and apparatus for mapping sentences


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
A Chinese-English parallel corpus for information extraction; Hui Haotian; Li Yunjian; Qian Longhua; Zhou Guodong; Computer Engineering and Science (Issue 12); full text *
Sentence alignment of bilingual comparable corpora based on Wikipedia; Hu Hongsi et al.; Journal of Chinese Information Processing (Issue 01); full text *


Similar Documents

Publication Publication Date Title
US11775761B2 (en) Method and apparatus for mining entity focus in text
CN111046677B (en) Method, device, equipment and storage medium for obtaining translation model
CN111382261B (en) Abstract generation method and device, electronic equipment and storage medium
CN114861889B (en) Deep learning model training method, target object detection method and device
CN111563390B (en) Text generation method and device and electronic equipment
CN110275962B (en) Method and apparatus for outputting information
WO2022111347A1 (en) Information processing method and apparatus, electronic device, and storage medium
WO2021259205A1 (en) Text sequence generation method, apparatus and device, and medium
CN113407814B (en) Text searching method and device, readable medium and electronic equipment
WO2023280106A1 (en) Information acquisition method and apparatus, device, and medium
CN113488050A (en) Voice awakening method and device, storage medium and electronic equipment
CN113688256A (en) Construction method and device of clinical knowledge base
CN115640815A (en) Translation method, translation device, readable medium and electronic equipment
CN111815274A (en) Information processing method and device and electronic equipment
CN110457325B (en) Method and apparatus for outputting information
CN115270717A (en) Method, device, equipment and medium for detecting vertical position
CN112380883B (en) Model training method, machine translation method, device, equipment and storage medium
CN112257459B (en) Language translation model training method, translation method, device and electronic equipment
WO2022174804A1 (en) Text simplification method and apparatus, and device and storage medium
WO2023138361A1 (en) Image processing method and apparatus, and readable storage medium and electronic device
CN112906371B (en) Parallel corpus acquisition method, device, equipment and storage medium
WO2023011260A1 (en) Translation processing method and apparatus, device and medium
CN110750994A (en) Entity relationship extraction method and device, electronic equipment and storage medium
CN115640523A (en) Text similarity measurement method, device, equipment, storage medium and program product
CN116821327A (en) Text data processing method, apparatus, device, readable storage medium and product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant