CN112906371A - Parallel corpus acquisition method, device, equipment and storage medium

Parallel corpus acquisition method, device, equipment and storage medium

Info

Publication number
CN112906371A
Authority
CN
China
Prior art keywords
sentence
similarity value
semantic similarity
statement
element position
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110181644.5A
Other languages
Chinese (zh)
Other versions
CN112906371B (en)
Inventor
张闯
吴培昊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Youzhuju Network Technology Co Ltd
Original Assignee
Beijing Youzhuju Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Youzhuju Network Technology Co Ltd filed Critical Beijing Youzhuju Network Technology Co Ltd
Priority to CN202110181644.5A priority Critical patent/CN112906371B/en
Publication of CN112906371A publication Critical patent/CN112906371A/en
Application granted granted Critical
Publication of CN112906371B publication Critical patent/CN112906371B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/211Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of the disclosure disclose a parallel corpus acquisition method, device, equipment and storage medium. The method comprises the following steps: splitting a first text and a second text acquired in advance to obtain a first sentence list and a second sentence list, wherein the first text and the second text are in the same language and describe the same content; determining semantic similarity values between each first sentence in the first sentence list and each second sentence in the second sentence list to obtain a similarity value matrix; determining a mapping relationship between the first sentences and the second sentences according to the similarity value matrix, wherein the mapping relationship comprises at least one of one-to-N, N-to-one and one-to-one, and N is an integer greater than or equal to 2; and acquiring a target second sentence associated with the first sentence according to the mapping relationship, and marking the first sentence and the target second sentence as a parallel corpus. In this scheme, the mapping relationship between sentences is determined based on the semantic similarity values between the sentences, which improves the accuracy of the associated sentence pairs and thereby the accuracy of the parallel corpora.

Description

Parallel corpus acquisition method, device, equipment and storage medium
Technical Field
The present disclosure relates to natural language processing technologies, and in particular, to a method, an apparatus, a device, and a storage medium for obtaining parallel corpora.
Background
Text simplification means rewriting a text that contains difficult words and complex sentence patterns so as to reduce its difficulty, making it easier to read and understand for people with a low knowledge level or with cognitive impairment. With the development of deep learning technology, end-to-end neural network models are increasingly applied to text simplification. End-to-end neural network models typically require a large number of complex-sentence-to-simple-sentence parallel corpora for training.
Traditional methods for obtaining parallel corpora mainly include distance-based methods, methods that compute inter-sentence similarity from TF-IDF vectors, and methods based on word2vec vectors, but none of these can obtain parallel corpora accurately.
Brief Summary of the Present Disclosure
The embodiment of the disclosure provides a method, a device, equipment and a storage medium for acquiring parallel corpora, which can improve the accuracy of the parallel corpora.
In a first aspect, an embodiment of the present disclosure provides a parallel corpus obtaining method, including:
splitting a first text and a second text which are acquired in advance to obtain a first sentence list corresponding to the first text and a second sentence list corresponding to the second text, wherein the first text and the second text are in the same language and are used for describing the same content;
determining semantic similarity values between each first sentence in the first sentence list and each second sentence in the second sentence list to obtain a similarity value matrix;
determining a mapping relationship between the first sentences and the second sentences according to the similarity value matrix, wherein the mapping relationship comprises at least one of one-to-N, N-to-one and one-to-one, and N is an integer greater than or equal to 2;
and acquiring a target second sentence associated with the first sentence according to the mapping relationship, and marking the first sentence and the target second sentence as a parallel corpus.
In a second aspect, an embodiment of the present disclosure further provides a parallel corpus acquiring apparatus, including:
a splitting module, configured to split a first text and a second text acquired in advance to obtain a first sentence list corresponding to the first text and a second sentence list corresponding to the second text, wherein the first text and the second text are in the same language and describe the same content;
a similarity value matrix determining module, configured to determine a semantic similarity value between each first sentence in the first sentence list and each second sentence in the second sentence list, to obtain a similarity value matrix;
a mapping relationship determining module, configured to determine, according to the similarity value matrix, a mapping relationship between the first sentences and the second sentences, wherein the mapping relationship comprises at least one of one-to-N, N-to-one and one-to-one, and N is an integer greater than or equal to 2;
and a parallel corpus acquiring module, configured to acquire a target second sentence associated with the first sentence according to the mapping relationship, and mark the first sentence and the target second sentence as a parallel corpus.
In a third aspect, an embodiment of the present disclosure further provides an electronic device, including:
one or more processors;
a memory for storing one or more programs;
the one or more programs, when executed by the one or more processors, implement the parallel corpus acquisition method according to the first aspect.
In a fourth aspect, an embodiment of the present disclosure further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the parallel corpus acquiring method according to the first aspect.
The embodiments of the disclosure provide a method, an apparatus, a device and a storage medium for acquiring parallel corpora. A first text and a second text acquired in advance are split to obtain a first sentence list corresponding to the first text and a second sentence list corresponding to the second text, where the first text and the second text are in the same language and describe the same content; semantic similarity values between each first sentence in the first sentence list and each second sentence in the second sentence list are determined to obtain a similarity value matrix; a mapping relationship between the first sentences and the second sentences is determined according to the similarity value matrix, where the mapping relationship comprises at least one of one-to-N, N-to-one and one-to-one, and N is an integer greater than or equal to 2; and a target second sentence associated with the first sentence is acquired according to the mapping relationship, and the first sentence and the target second sentence are marked as a parallel corpus. In this scheme, the mapping relationship between sentences is determined based on the semantic similarity values between the sentences, which improves the accuracy of the associated sentence pairs and thereby the accuracy of the parallel corpora.
Drawings
The above and other features, advantages and aspects of various embodiments of the present disclosure will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numbers refer to the same or similar elements. It should be understood that the drawings are schematic and that elements and features are not necessarily drawn to scale.
Fig. 1 is a flowchart of a parallel corpus acquiring method according to an embodiment of the present disclosure;
fig. 2 is a flowchart of a parallel corpus acquiring method according to a second embodiment of the present disclosure;
fig. 3 is a flowchart of a parallel corpus acquiring method according to a third embodiment of the present disclosure;
fig. 4 is a structural diagram of a parallel corpus acquiring device according to a fourth embodiment of the present disclosure;
fig. 5 is a structural diagram of an electronic device according to a fifth embodiment of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but rather are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order, and/or performed in parallel. Moreover, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "include" and variations thereof as used herein are open-ended, i.e., "including but not limited to". The term "based on" is "based, at least in part, on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions for other terms will be given in the following description.
It should be noted that the terms "first", "second", and the like in this disclosure are only used for distinguishing different objects, and are not used for limiting the order or interdependence relationship of the functions performed by the objects.
It is noted that references to "a", "an", and "the" modifications in this disclosure are intended to be illustrative rather than limiting, and that those skilled in the art will recognize that "one or more" may be used unless the context clearly dictates otherwise.
The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.
Example one
Fig. 1 is a flowchart of a parallel corpus acquiring method according to the first embodiment of the present disclosure, which is applicable to the situation of acquiring parallel corpora. Parallel corpora are sentences having a certain association with each other, for example sentences with a high degree of similarity. The method can be executed by a parallel corpus acquiring apparatus, which can be implemented in software and/or hardware and can be configured in an electronic device with a data processing function. As shown in fig. 1, the method may include the following steps:
s110, splitting a first text and a second text which are acquired in advance to obtain a first sentence list corresponding to the first text and a second sentence list corresponding to the second text.
The first text and the second text are in the same language and describe the same content. The first text may be a text containing hard-to-understand words and complex sentence patterns, which may also be referred to as a complex text; such text is difficult. The second text may be a text containing simple words and simple sentence patterns, which may also be referred to as a simple text; such text is of low difficulty and is easy to understand for foreign-language learners, people with a low knowledge level, or people with cognitive impairment. The first text and the second text of this embodiment describe the same content, for example the same object or the same event, and are in the same language, that is, the language types of the first text and the second text are identical. This embodiment does not limit the specific language type, which may for example be Chinese, English or Japanese. Optionally, the first text and the second text describing the same content may be obtained from a graded-reading website that stores texts of different difficulty levels, or obtained locally.
The first sentence list is used for storing the sentences obtained by splitting the first text, and the second sentence list is used for storing the sentences obtained by splitting the second text. Optionally, the first text and the second text may be split separately by the sentence splitting function in NLTK (Natural Language Toolkit). Of course, the first text and the second text may also be split in other ways, and this embodiment is not limited in this respect. In order to distinguish the sentences obtained by splitting, the sentences may optionally be numbered according to their order in the corresponding text: the smaller the number, the earlier the sentence appears in the text. The length of the first sentence list equals the number of sentences contained in the first text, and the length of the second sentence list equals the number of sentences contained in the second text. The first text may or may not contain the same number of sentences as the second text.
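To make the splitting step concrete, the following is a minimal Python sketch assuming NLTK's Punkt sentence tokenizer; the example texts and the helper name split_into_sentence_list are illustrative only and do not come from the patent.

```python
# Minimal sketch of S110, assuming NLTK's sentence tokenizer is available.
# Depending on the NLTK version, the "punkt" (or "punkt_tab") resource must be downloaded first.
import nltk
nltk.download("punkt", quiet=True)

def split_into_sentence_list(text: str) -> list[str]:
    """Split a text into sentences, kept in the order in which they appear."""
    return nltk.sent_tokenize(text)

# Hypothetical complex/simple texts describing the same content in the same language.
first_text = "The committee deliberated at length. A consensus was eventually reached."
second_text = "The group talked for a long time. In the end, everyone agreed."

first_sentence_list = split_into_sentence_list(first_text)    # sentences of the complex text
second_sentence_list = split_into_sentence_list(second_text)  # sentences of the simple text
# A sentence can then be referred to by its 1-based position (its number) in each list.
```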
S120, determining semantic similarity values between each first sentence in the first sentence list and each second sentence in the second sentence list to obtain a similarity value matrix.
A first sentence is a sentence obtained by splitting the first text, and a second sentence is a sentence obtained by splitting the second text. A semantic similarity value represents the degree of semantic similarity between two sentences; in this embodiment it represents the degree of semantic similarity between a first sentence and a second sentence. Optionally, a value between 0 and 5 may be used to represent the semantic similarity between two sentences, a smaller value indicating lower semantic similarity: for example, 0 indicates the lowest semantic similarity, i.e. the semantics of the two sentences can be considered completely different, and 5 indicates the highest semantic similarity, i.e. the semantics of the two sentences can be considered completely identical. By determining semantic similarity values between sentences, sentences with the same semantic information but large differences in vocabulary can be associated when the parallel corpora are subsequently obtained, which improves the accuracy of the parallel corpora. Optionally, the first sentence and the second sentence may be input into a neural network model, and the semantic similarity value between them may be output by the neural network model. The specific structure of the neural network model is not limited in this embodiment; for example, a Deep Structured Semantic Model (DSSM) or a Text-to-Text Transfer Transformer (T5) model may be used. Of course, the semantic similarity value between the first sentence and the second sentence may also be determined in other ways, and this embodiment is not limited in this respect.
The similarity value matrix is used for storing the semantic similarity values between the first sentences and the second sentences. Optionally, the semantic similarity values may be stored row by row, that is, each row of the similarity value matrix corresponds to one first sentence and each column corresponds to one second sentence, so that the number of rows of the similarity value matrix equals the number of first sentences contained in the first sentence list, and the number of columns equals the number of second sentences contained in the second sentence list. For example, the similarity value matrix may be denoted T = [t_xy] with m rows and n columns, where m is the number of first sentences, n is the number of second sentences, and t_23 represents the semantic similarity value between the second first sentence in the first sentence list and the third second sentence in the second sentence list.
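A minimal sketch of how such a similarity value matrix could be filled is given below; semantic_similarity stands in for whatever trained scoring model is used (for example a DSSM- or T5-style model) and is assumed rather than defined here.

```python
# Sketch of S120: build an m x n similarity value matrix T, with row x for the
# (x+1)-th first sentence and column y for the (y+1)-th second sentence (0-based indices).
from typing import Callable

def build_similarity_matrix(
    first_sentences: list[str],
    second_sentences: list[str],
    semantic_similarity: Callable[[str, str], float],  # assumed trained scoring function
) -> list[list[float]]:
    return [
        [semantic_similarity(first, second) for second in second_sentences]
        for first in first_sentences
    ]

# In the patent's notation, T[1][2] would then be t_23, the similarity between the
# second first sentence and the third second sentence.
```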
S130, determining the mapping relationship between the first sentences and the second sentences according to the similarity value matrix.
The mapping relationship comprises at least one of one-to-N, N-to-one and one-to-one, where N is an integer greater than or equal to 2. One-to-N indicates that one first sentence is associated with a plurality of second sentences, N-to-one indicates that a plurality of first sentences are associated with one second sentence, and one-to-one indicates that one first sentence is associated with one second sentence. Optionally, the mapping relationship between a first sentence and a second sentence may be determined according to the semantic similarity value. For example, when the semantic similarity value is equal to a set threshold, the mapping relationship between the first sentence and the second sentence corresponding to that semantic similarity value is considered to be one-to-one; the set threshold indicates that the semantic similarity between the first sentence and the second sentence is the highest and may be, for example, 5, that is, when the semantic similarity value between a first sentence and a second sentence is 5, their mapping relationship is considered to be one-to-one.
When the semantic similarity value is less than the set threshold, in one example, the mapping relationship between the first sentence and the second sentence may be determined by also considering the semantic similarity values between that first sentence and the other second sentences and between the other first sentences and that second sentence. For example, a first sentence is considered to be associated with a plurality of different second sentences if the differences between the semantic similarity values of the first sentence and those second sentences are less than or equal to a preset difference. The magnitude of the preset difference may be set according to the actual situation, for example to 0.1. Illustratively, if the semantic similarity value between the second first sentence in the first sentence list and the third second sentence in the second sentence list is 2, the semantic similarity value between the second first sentence and the fourth second sentence is 2.1, the semantic similarity value between the third second sentence and the second second sentence is 0.5, and the semantic similarity values between the third second sentence and the other first sentences in the first sentence list are all less than 1, then the mapping relationship between the second first sentence and the third second sentence is considered to be one-to-two, that is, the second first sentence in the first sentence list is associated with the third and fourth second sentences in the second sentence list.
When the semantic similarity value is smaller than the set threshold, in another example, several first sentences or second sentences may also be merged, and the mapping relationship between the first sentences and the second sentences may be determined based on the semantic similarity values of the merged sentences. For example, when the semantic similarity value between the third first sentence in the first sentence list and the first second sentence in the second sentence list is less than 5, the first second sentence may be merged with the second second sentence, and the third first sentence may be merged with the fourth first sentence. If the semantic similarity value between the third first sentence and the merged second sentence is larger than the semantic similarity value between the merged first sentence and the first second sentence, which in turn is larger than the semantic similarity value between the fourth first sentence and the first second sentence, the mapping relationship between the third first sentence and the first second sentence is considered to be one-to-N; if the semantic similarity value between the merged first sentence and the first second sentence is larger than the semantic similarity value between the third first sentence and the merged second sentence, which in turn is larger than the semantic similarity value between the fourth first sentence and the first second sentence, the mapping relationship between the third first sentence and the first second sentence is considered to be N-to-one. Of course, the mapping relationship between the first sentences and the second sentences may also be determined in other ways, and the embodiment is not limited in this respect.
It should be noted that, besides the above one-to-one, one-to-N or N-to-one mapping relationships between first sentences and second sentences, an N-to-N relationship is also possible, and it can be obtained by analyzing the one-to-N or N-to-one mapping relationships between the sentences. For example, if the second first sentence is associated with the third and fourth second sentences, and the third first sentence is also associated with the third and fourth second sentences, then the second and third sentences of the first text are considered to be associated with the third and fourth sentences of the second text, i.e. 2-to-2. The n-th first sentence is the n-th sentence of the first text, and similarly the n-th second sentence is the n-th sentence of the second text.
S140, acquiring a target second sentence associated with the first sentence according to the mapping relationship, and marking the first sentence and the target second sentence as a parallel corpus.
The target second sentence is the sentence associated with the first sentence, and may be a single second sentence or a sentence obtained by merging a plurality of second sentences. Specifically, if the mapping relationship between the first sentence and the second sentence is one-to-one, the first sentence and the second sentence corresponding to the mapping relationship may be associated as a set of parallel corpora, and the second sentence is the target second sentence; if the mapping relationship is one-to-N, the N second sentences may be merged, and the first sentence and the merged sentence are associated as a set of parallel corpora, the target second sentence being the sentence obtained by merging the N second sentences; if the mapping relationship is N-to-one, the N first sentences may be merged, and the merged first sentence and the second sentence are associated as a set of parallel corpora, the target second sentence being the single second sentence. If a plurality of first sentences and a plurality of second sentences are associated, the plurality of first sentences and the plurality of second sentences may each be merged, and the merged sentences are associated as a set of parallel corpora; in this case the target second sentence is the sentence formed by merging the plurality of second sentences.
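As a rough illustration of this step, the sketch below assumes each mapping is recorded as a pair of index lists (one-to-one, one-to-N and N-to-one are all covered by letting either list hold one or several indices); the data layout is an assumption for illustration, not the patent's own representation.

```python
# Sketch of S140: turn mappings into parallel corpora by merging sentences on either side.
Mapping = tuple[list[int], list[int]]  # (indices of first sentences, indices of second sentences)

def build_parallel_corpus(
    first_sentences: list[str],
    second_sentences: list[str],
    mappings: list[Mapping],
) -> list[tuple[str, str]]:
    corpus = []
    for first_indices, second_indices in mappings:
        complex_side = " ".join(first_sentences[i] for i in first_indices)     # merged for N-to-one
        target_second = " ".join(second_sentences[j] for j in second_indices)  # merged for one-to-N
        corpus.append((complex_side, target_second))
    return corpus
```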
In this embodiment, semantic similarity values between the sentences are determined based on the semantic information of the sentences, and the mapping relationship between the sentences is determined according to the semantic similarity values, so that the accuracy of the associated sentence pairs is improved, and the accuracy of the parallel corpus is improved. The accuracy of the text simplification model can be improved when the text simplification model is trained by using the parallel corpus, and the accuracy of the conversion result can be improved when the trained text simplification model is used for converting a complex text into a simple text.
The first embodiment of the present disclosure provides a parallel corpus acquiring method, in which a first text and a second text acquired in advance are split to obtain a first sentence list corresponding to the first text and a second sentence list corresponding to the second text, where the first text and the second text are in the same language and describe the same content; semantic similarity values between each first sentence in the first sentence list and each second sentence in the second sentence list are determined to obtain a similarity value matrix; a mapping relationship between the first sentences and the second sentences is determined according to the similarity value matrix, where the mapping relationship comprises at least one of one-to-N, N-to-one and one-to-one, and N is an integer greater than or equal to 2; and a target second sentence associated with the first sentence is acquired according to the mapping relationship, and the first sentence and the target second sentence are marked as a parallel corpus. In this scheme, the mapping relationship between sentences is determined based on the semantic similarity values between the sentences, which improves the accuracy of the associated sentence pairs and thereby the accuracy of the parallel corpora.
Example two
Fig. 2 is a flowchart of a parallel corpus acquiring method according to a second embodiment of the present disclosure, where the embodiment is optimized based on the foregoing embodiment, and referring to fig. 2, the method may include the following steps:
s210, splitting a first text and a second text which are acquired in advance to obtain a first sentence list corresponding to the first text and a second sentence list corresponding to the second text.
S220, inputting a first sentence in the first sentence list and a second sentence in the second sentence list into a semantic similarity value model, and outputting the semantic similarity value of the first sentence and the second sentence by the semantic similarity value model.
The semantic similarity value model is obtained by training on sentence pairs with different semantic similarity values, and is used to subsequently determine the semantic similarity value between any two sentences; in this embodiment the T5 model is taken as an example. Before application, the T5 model may be trained. Optionally, sentence pairs with different semantic similarity values may be obtained from the public data set STS-B as training samples; the public data set STS-B stores sentence pairs with different semantic similarity values. A semantic similarity value may be represented by a number between 0 and 5, where 0 may mean that the semantics of the two sentences are completely different; 1 may mean that the semantics differ but the described topics are consistent; 2 may mean that the semantics differ but a small part of the information is consistent; 3 may mean that the semantics are substantially consistent but there is partial inconsistency or loss of important information; 4 may mean that the semantics are very similar but some unimportant details are inconsistent; and 5 may mean that the semantics of the two sentences are identical.
In this embodiment, sentence pairs with different semantic similarity values are used as training samples and input into the semantic similarity value model to train it, so that the trained model can determine the semantic similarity value between any two sentences. Optionally, for each first sentence, the first sentence and one second sentence from the second sentence list may be input into the trained semantic similarity value model, and the semantic similarity values between the first sentence and each second sentence may be determined in turn; or the first sentence and all second sentences may be input into the trained model, and the semantic similarity values between the first sentence and each second sentence may be determined at the same time; or, for each second sentence, the second sentence and all first sentences may be input into the trained model, and the semantic similarity values between the second sentence and each first sentence may be determined at the same time; or all first sentences and all second sentences may be input into the trained model and the semantic similarity value between each first sentence and each second sentence determined at once, which can improve efficiency. Determining semantic similarity values between sentences allows sentences with the same semantic information but large differences in vocabulary to be accurately associated when the parallel corpora are subsequently determined, which improves the accuracy of the parallel corpora. In addition, complex sentences with large syntactic changes and many deleted words can still be accurately associated with the corresponding simple sentences.
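A hedged sketch of such a scorer is shown below, using the publicly released T5 checkpoint through the Hugging Face transformers library and the STS-B text-to-text prefix it was multitask-trained with; the patent itself only states that a model is trained on STS-B-style pairs, so the checkpoint name and prompt format here are assumptions.

```python
# Sketch: score the 0-5 semantic similarity of two sentences with a T5-style model.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

def semantic_similarity(sentence_a: str, sentence_b: str) -> float:
    prompt = f"stsb sentence1: {sentence_a} sentence2: {sentence_b}"
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    output_ids = model.generate(input_ids, max_new_tokens=4)
    text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    try:
        return float(text)   # the model emits the score as text, e.g. "3.2"
    except ValueError:
        return 0.0           # fall back if the output is not numeric
```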
And S230, sequentially arranging the semantic similarity values corresponding to the first sentences to obtain a similarity value matrix.
The number of rows of the similarity value matrix is equal to the number of first sentences contained in the first sentence list, and the number of columns is equal to the number of second sentences contained in the second sentence list. For example, the similarity value matrix may be represented by T = [t_xy] with m rows and n columns, where t_xy is the semantic similarity value between the x-th first sentence and the y-th second sentence, and m and n are the lengths of the first sentence list and the second sentence list respectively; for example, the first row of T contains the semantic similarity values between the first sentence of the first text and each sentence of the second text, and the first column of T contains the semantic similarity values between the first sentence of the second text and each sentence of the first text.
S240, determining the mapping relationship between the first sentences and the second sentences according to the similarity value matrix.
S250, acquiring a target second sentence associated with the first sentence according to the mapping relationship, and marking the first sentence and the target second sentence as a parallel corpus.
S260, inputting the parallel corpus into a text simplification model and training it to obtain a target text simplification model.
The target text simplification model is used for converting complex text into simple text. After the parallel corpus is determined, the first sentence may be input into the text simplification model, the model outputs a predicted sentence, and the parameters of the model are adjusted according to the deviation between the predicted sentence and the second sentence until that deviation satisfies a set condition, yielding the target text simplification model. A complex text can then be input into the target text simplification model, which outputs a simple text, thereby realizing the conversion from complex text to simple text.
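The patent does not fix a particular text simplification architecture; the following is a hedged sketch of such a training step using a generic seq2seq model (here a T5 checkpoint via torch and transformers, both assumed) fed with the mined complex/simple pairs.

```python
# Sketch of S260: fine-tune a seq2seq model on the parallel corpus (complex -> simple).
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

# Hypothetical mined pair; in practice this is the parallel corpus from S250.
parallel_corpus = [("The committee deliberated at length.", "The group talked for a long time.")]

model.train()
for complex_sentence, simple_sentence in parallel_corpus:
    inputs = tokenizer(complex_sentence, return_tensors="pt")
    labels = tokenizer(simple_sentence, return_tensors="pt").input_ids
    loss = model(**inputs, labels=labels).loss  # deviation between prediction and the simple sentence
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```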
The second embodiment of the disclosure provides a parallel corpus acquiring method. On the basis of the above embodiment, a semantic similarity value model is trained with sentence pairs having different semantic similarity values, the trained model is used to determine the semantic similarity values between sentences to obtain a similarity value matrix, and the mapping relationships between sentences are then determined from the similarity value matrix to obtain the associated sentence pairs, which improves the accuracy of the associated sentence pairs.
Example three
Fig. 3 is a flowchart of a parallel corpus acquiring method according to a third embodiment of the present disclosure, where the embodiment is optimized based on the foregoing embodiment, and referring to fig. 3, the method may include the following steps:
s310, splitting a first text and a second text which are acquired in advance to obtain a first sentence list corresponding to the first text and a second sentence list corresponding to the second text.
S320, determining semantic similarity values between each first sentence in the first sentence list and each second sentence in the second sentence list to obtain a similarity value matrix.
S330, recording the position of the first element of the similarity value matrix as the current element position.
Assuming that the similarity value matrix has m rows and n columns, the first element is the element in the first row and first column, the second element is the element in the first row and second column, the (n+1)-th element is the element in the second row and first column, and so on. This embodiment performs the same processing on every element of the similarity value matrix; the processing is described here with the first element taken as the current element and its position as the current element position. Each element position corresponds to a semantic similarity value, namely the semantic similarity value between the first sentence and the second sentence corresponding to that element position. For example, the semantic similarity value corresponding to the element position in the second row and third column is the semantic similarity value between the second sentence of the first text and the third sentence of the second text, which may also be called the semantic similarity value between the second first sentence and the third second sentence.
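The traversal described above visits the matrix in row-major order; the small sketch below shows how a running element index maps to a (row, column) pair, and hence to a (first sentence, second sentence) pair. The helper name is illustrative.

```python
# k-th element (1-based) of an m x n matrix, traversed row by row.
def element_position(k: int, n: int) -> tuple[int, int]:
    row = (k - 1) // n + 1     # which first sentence
    col = (k - 1) % n + 1      # which second sentence
    return row, col

assert element_position(1, 4) == (1, 1)   # first element: first row, first column
assert element_position(5, 4) == (2, 1)   # (n+1)-th element: second row, first column
```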
And S340, judging whether the semantic similarity value corresponding to the current element position is equal to a first preset value or not, if so, executing S350, otherwise, executing S360.
The first preset value is used for representing that the semantic similarity degree of the first statement and the second statement corresponding to the current element position is highest. The first preset value is related to a semantic similarity value of a training sample used for training the semantic similarity value model, for example, the semantic similarity value of the training sample used for training the semantic similarity value model is between 0 and 5, and the first preset value may be 5, so as to indicate that the semantic similarity degree of the two sentences is the highest, that is, the semantic similarity value corresponding to the current element position is either equal to 5 or less than 5.
S350, determining that the mapping relationship between the first sentence and the second sentence corresponding to the current element position is one-to-one.
Specifically, if the semantic similarity value between the first sentence and the second sentence corresponding to the current element position is 5, their semantics are considered to be completely identical, and the mapping relationship between them can be determined to be one-to-one. S380 is then executed, the next element position is taken as the current element position, and the process returns to S340 to continue determining the mapping relationship between the first sentence and the second sentence corresponding to the new current element position.
S360, merging the second sentence corresponding to the current element position and the second sentence corresponding to the next element position to obtain a first merged sentence; and merging the first sentence corresponding to the current element position and the first sentence corresponding to the next element position to obtain a second merged sentence.
Considering that a parallel corpus consists of two sentences with a high semantic similarity, when the semantic similarity value corresponding to the current element position is less than 5 but greater than the set threshold, it can further be determined whether the first sentence and the second sentence corresponding to the current element position are in a one-to-N or N-to-one relationship. For example, a certain number of sentences may be merged, and whether a one-to-N or N-to-one relationship exists between the first sentence and the second sentence corresponding to the current element position is determined according to the semantic similarity values of the merged sentences. The size of the set threshold may be determined according to the actual situation; for example, since a semantic similarity value of 3 indicates that the semantics of two sentences are substantially the same, the set threshold may be set to 3 or a value near 3.
Optionally, the second sentence corresponding to the current element position and the second sentence corresponding to the next element position may be merged to obtain the first merged sentence S(t_y : t_y+1), where t_y denotes the second sentence corresponding to the current element position, t_y+1 denotes the second sentence corresponding to the next element position, and y = 1, 2, .... Similarly, the first sentence corresponding to the current element position and the first sentence corresponding to the next element position may be merged to obtain the second merged sentence C(t_x : t_x+1), where t_x denotes the first sentence corresponding to the current element position, t_x+1 denotes the first sentence corresponding to the next element position, and x = 1, 2, ....
S370, determining the mapping relationship between the first sentence and the second sentence corresponding to the current element position according to the semantic similarity value between the first sentence corresponding to the current element position and the first merged sentence, the semantic similarity value between the second merged sentence and the second sentence corresponding to the current element position, and the semantic similarity value corresponding to the next element position.
Optionally, the semantic similarity value sim1 between the first sentence t_x corresponding to the current element position and the first merged sentence S(t_y : t_y+1) may be determined, together with the semantic similarity value sim2 between the second merged sentence C(t_x : t_x+1) and the second sentence t_y corresponding to the current element position, and the mapping relationship between the first sentence t_x and the second sentence t_y may then be determined according to sim1, sim2 and the semantic similarity value corresponding to the next element position. For convenience of description, sim1 is recorded as the first semantic similarity value, sim2 as the second semantic similarity value, the semantic similarity value corresponding to the current element position as the third semantic similarity value, and the semantic similarity value corresponding to the next element position as the fourth semantic similarity value.
In one example, the mapping relationship between the first sentence and the second sentence may be determined as follows:
if the first semantic similarity value is smaller than or equal to the third semantic similarity value or the first semantic similarity value is smaller than or equal to the fourth semantic similarity value, reducing the first semantic similarity value, otherwise, keeping the first semantic similarity value unchanged;
if the second semantic similarity value is smaller than or equal to the third semantic similarity value or the second semantic similarity value is smaller than or equal to the fourth semantic similarity value, reducing the second semantic similarity value, otherwise, keeping the second semantic similarity value unchanged;
determining a maximum value of the first semantic similarity value, the second semantic similarity value, and the fourth semantic similarity value;
if the maximum value is the fourth semantic similarity value, determining that the mapping relationship between the first sentence and the second sentence corresponding to the current element position is one-to-one; if the maximum value is the first semantic similarity value, determining that the mapping relationship is one-to-N; and if the maximum value is the second semantic similarity value, determining that the mapping relationship is N-to-one.
Specifically, if the first semantic similarity value sim1 is less than or equal to the third semantic similarity value, or sim1 is less than or equal to the fourth semantic similarity value, sim1 is reduced; otherwise sim1 is kept unchanged. The second semantic similarity value sim2 is then compared with the third and fourth semantic similarity values in the same way: if sim2 is less than or equal to the third semantic similarity value, or sim2 is less than or equal to the fourth semantic similarity value, sim2 is reduced; otherwise sim2 is kept unchanged. Of course, sim2 may also be compared with the third and fourth semantic similarity values first and sim1 afterwards; the process is similar. On this basis, the first semantic similarity value sim1, the second semantic similarity value sim2 (both taken after the above adjustment) and the fourth semantic similarity value are compared: if sim1 is the largest, the mapping relationship between the first sentence and the second sentence at the current element position is considered to be one-to-N; if sim2 is the largest, the mapping relationship is considered to be N-to-one; and if the fourth semantic similarity value is the largest, the mapping relationship is considered to be one-to-one.
In the above process, when the first semantic similarity value sim1 or the second semantic similarity value sim2 needs to be reduced, this embodiment does not limit the specific amount of the reduction, as long as, in the subsequent comparison, sim1 and sim2 can be distinguished from the fourth semantic similarity value so that the mapping relationship between the first sentence and the second sentence can be determined accurately. For example, when sim1 needs to be reduced, sim1 may be directly set to 0; when sim2 needs to be reduced, sim2 may be directly set to 0.
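The comparison rule just described can be summarised by the following sketch, where sim1 to sim4 are the first to fourth semantic similarity values; the function name and the choice of 0 as the reduced value are illustrative.

```python
# Sketch of the S370 decision rule for the current element position.
def decide_mapping(sim1: float, sim2: float, sim3: float, sim4: float) -> str:
    if sim1 <= sim3 or sim1 <= sim4:
        sim1 = 0.0   # reduce the first semantic similarity value (here simply set to 0)
    if sim2 <= sim3 or sim2 <= sim4:
        sim2 = 0.0   # reduce the second semantic similarity value
    best = max(sim1, sim2, sim4)
    if best == sim4:
        return "one-to-one"
    if best == sim1:
        return "one-to-N"
    return "N-to-one"
```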
When it is determined that the mapping relationship between the first sentence and the second sentence at the current element position is one-to-N or N-to-one, the size of N is further determined. For example, when the mapping relationship at the current element position is determined to be one-to-N, the size of N may be determined as follows:
merging the second sentences from the one corresponding to the current element position through the one corresponding to the target element position to obtain a third merged sentence, where the column of the target element position equals the column of the current element position plus N-1, with N initially set to 3;
if the semantic similarity value between the first sentence corresponding to the current element position and the third merged sentence is less than or equal to the semantic similarity value corresponding to the current element position, determining that N is 2;
otherwise, letting N equal N+1 and repeating the above operations until the semantic similarity value between the first sentence corresponding to the current element position and the third merged sentence is less than or equal to the semantic similarity value corresponding to the current element position, and then determining N as N-1.
One-to-N indicates that one first sentence corresponds to a plurality of second sentences, with N greater than or equal to 2. N can therefore first be increased by 1, that is, it is first checked whether N is 3: for example, three second sentences may be merged, namely the second sentence corresponding to the current element position and the second sentences corresponding to the next two element positions, to obtain a third merged sentence; note that the element positions of the three merged second sentences all correspond to the same first sentence. The semantic similarity value between the first sentence corresponding to the current element position and the third merged sentence is then determined. If it is less than or equal to the semantic similarity value corresponding to the current element position, the search stops and N is determined to be 2; if it is greater, N is increased by 1 and the judgment continues, until the semantic similarity value between the first sentence corresponding to the current element position and the third merged sentence is less than or equal to the semantic similarity value corresponding to the current element position, at which point N is taken as N-1.
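A hedged sketch of this search for N in the one-to-N case follows; semantic_similarity is again the assumed scoring function, indices are 0-based, and the function name is illustrative. The N-to-one case would merge first sentences in the same way.

```python
# Keep appending the next second sentence until merging no longer helps.
from typing import Callable

def find_n_one_to_n(
    first_sentence: str,
    second_sentences: list[str],
    start_col: int,                 # 0-based column of the current element position
    current_similarity: float,      # semantic similarity value at the current element position
    semantic_similarity: Callable[[str, str], float],
) -> int:
    n = 2
    while start_col + n < len(second_sentences):
        merged = " ".join(second_sentences[start_col : start_col + n + 1])  # try N = n + 1
        if semantic_similarity(first_sentence, merged) <= current_similarity:
            break       # merging one more sentence no longer raises the similarity
        n += 1
    return n
```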
In this embodiment, the semantic similarity value between a first sentence and several merged second sentences, or between several merged first sentences and a second sentence, is further determined on the basis of the similarity value matrix, and the mapping relationship between the first sentences and the second sentences is determined on that basis. This improves the accuracy of the associated sentence pairs: sentences with the same or similar semantics but large differences in vocabulary can be accurately associated, complex sentences with large syntactic changes and many deleted words can be associated with the corresponding simple sentences, and the number of parallel corpora is increased; for example, complex/simple sentence pairs that share meaning while differing substantially in wording can be accurately associated by this embodiment.
When it is determined that the mapping relationship between the first sentence and the second sentence corresponding to the current element position is N-to-one, the size of N may be determined as follows:
merging the first sentences from the one corresponding to the current element position through the one corresponding to the target element position to obtain a fourth merged sentence, where the row of the target element position equals the row of the current element position plus N-1, with N initially set to 3;
if the semantic similarity value between the fourth merged sentence and the second sentence corresponding to the current element position is less than or equal to the semantic similarity value corresponding to the current element position, determining that N is 2;
otherwise, letting N equal N+1 and repeating the above operations until the semantic similarity value between the fourth merged sentence and the second sentence corresponding to the current element position is less than or equal to the semantic similarity value corresponding to the current element position, and then determining N as N-1.
In the N-to-one case, the process of determining the size of N is similar to that for the one-to-N case described above, and the details are not repeated here.
S380, judging whether the current element position is the position of the last element in the similarity value matrix; if so, executing S3100, otherwise executing S390 and then returning to S340.
And S390, recording the position of the next element as the current element position.
S3100, acquiring a target second sentence associated with the first sentence according to the mapping relationship, and marking the first sentence and the target second sentence as a parallel corpus.
After the traversal of the element positions in the similarity value matrix is finished, the target second sentence associated with each first sentence is acquired according to the mapping relationship, and the first sentence and the target second sentence are recorded as a parallel corpus.
S3110, inputting the parallel corpus into a text simplification model and training it to obtain a target text simplification model.
In one example, to facilitate collecting the parallel corpora, a sentence association matrix P may be initialized synchronously after the similarity value matrix is obtained, to store whether a first sentence and a second sentence are associated. The initial value of every element of the sentence association matrix P is 0, P has the same numbers of rows and columns as the similarity value matrix T, and the value at a given element position of P indicates whether the first sentence and the second sentence at the same element position of T are associated; the parallel corpora can then be obtained directly from the values at each element position of P. For example, when the mapping relationship between the second first sentence of the first text and the third second sentence of the second text is determined to be one-to-one, the element in the second row and third column of P can be synchronously set to 1 to indicate that the first sentence and the second sentence at that position are associated. For another example, when it is determined that the second first sentence of the first text is associated with the second to fourth second sentences of the second text, the elements in the second row, columns two to four, of P can be set to 1. For another example, when it is determined that the second to fifth first sentences of the first text are associated with the third second sentence of the second text, the elements in rows two to five of the third column of P can be set to 1. After the traversal is finished, the element positions of P whose value is 1 are taken out, and the sentences corresponding to those positions are combined into complex-sentence/simple-sentence pairs, thereby obtaining the parallel corpora.
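An illustrative sketch of reading the parallel corpus off such a matrix P is given below; for brevity it only merges along rows (one-to-one and one-to-N), while an N-to-one reading would merge the first sentences of several rows symmetrically. Indices are 0-based and the function name is illustrative.

```python
# Sketch: P[x][y] == 1 means the (x+1)-th first sentence and (y+1)-th second sentence are associated.
def extract_parallel_corpus(
    P: list[list[int]],
    first_sentences: list[str],
    second_sentences: list[str],
) -> list[tuple[str, str]]:
    corpus = []
    for x, row in enumerate(P):
        cols = [y for y, flag in enumerate(row) if flag == 1]
        if cols:
            complex_sentence = first_sentences[x]
            simple_sentence = " ".join(second_sentences[y] for y in cols)  # merge one-to-N
            corpus.append((complex_sentence, simple_sentence))
    return corpus
```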
The third embodiment of the disclosure provides a parallel corpus acquiring method. On the basis of the above embodiments, semantic similarity values between sentences are determined and the mapping relationships between sentences are determined from these values, which improves the accuracy of the associated sentence pairs, allows sentences with the same or similar semantics but large differences in vocabulary to be accurately associated, allows complex sentences with large syntactic changes and many deleted words to be associated with simple sentences, and increases the number of parallel corpora.
Example four
Fig. 4 is a structural diagram of a parallel corpus acquiring apparatus according to a fourth embodiment of the present disclosure, where the apparatus may execute the parallel corpus acquiring method according to the foregoing embodiment, and as shown in fig. 4, the apparatus may include:
a splitting module 41, configured to split a first text and a second text that are obtained in advance to obtain a first sentence list corresponding to the first text and a second sentence list corresponding to the second text, where the first text and the second text are in the same language and are used to describe the same content;
a similarity value matrix determining module 42, configured to determine semantic similarity values between each first sentence in the first sentence list and each second sentence in the second sentence list, so as to obtain a similarity value matrix;
a mapping relation determining module 43, configured to determine, according to the similarity value matrix, a mapping relation between the first statement and the second statement, where the mapping relation includes at least one of one-to-N and N-to-one, and N is an integer greater than or equal to 2;
a parallel corpus obtaining module 44, configured to obtain a target second sentence associated with the first sentence according to the mapping relationship, and mark the first sentence and the target second sentence as a parallel corpus.
A fourth embodiment of the present disclosure provides a parallel corpus acquiring apparatus, where a first sentence list corresponding to a first text and a second sentence list corresponding to a second text are obtained by splitting the first text and the second text which are acquired in advance, where the first text and the second text are in the same language and are used to describe the same content; semantic similarity values between each first sentence in the first sentence list and each second sentence in the second sentence list are determined to obtain a similarity value matrix; a mapping relationship between the first statement and the second statement is determined according to the similarity value matrix, wherein the mapping relationship comprises at least one of one-to-N and N-to-one, and N is an integer greater than or equal to 2; and a target second statement associated with the first statement is acquired according to the mapping relation, and the first statement and the target second statement are marked as parallel corpora. According to this scheme, the mapping relation between the sentences is determined based on the semantic similarity values between the sentences, which improves the accuracy of the associated sentence pairs and further improves the accuracy of the parallel corpora.
On the basis of the foregoing embodiment, the similarity value matrix determining module 42 is specifically configured to:
inputting a first sentence in the first sentence list and a second sentence in the second sentence list into a semantic similarity value model, and outputting a semantic similarity value of the first sentence and the second sentence by the semantic similarity value model, wherein the semantic similarity value model is obtained by training pairs of sentences with different semantic similarity values;
and sequentially arranging semantic similarity values corresponding to the first sentences to obtain a similarity value matrix, wherein the row number of the similarity value matrix is equal to the number of the first sentences contained in the first sentence list, and the column number of the similarity value matrix is equal to the number of the second sentences contained in the second sentence list.
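As a rough sketch of how such a matrix can be assembled, the following Python fragment builds the similarity value matrix T row by row; the token-overlap score used here is only a stand-in for the trained semantic similarity value model described above and is an assumption of this sketch.

import numpy as np

def similarity(a: str, b: str) -> float:
    # crude token-overlap score in [0, 1]; a placeholder for the trained
    # semantic similarity value model
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta | tb), 1)

def build_similarity_matrix(first_sentences, second_sentences):
    # rows correspond to first sentences, columns to second sentences
    T = np.zeros((len(first_sentences), len(second_sentences)))
    for i, s1 in enumerate(first_sentences):
        for j, s2 in enumerate(second_sentences):
            T[i, j] = similarity(s1, s2)
    return T

T = build_similarity_matrix(
    ["The cat sat on the mat."],
    ["A cat sat on a mat.", "Dogs bark loudly."])
print(T.shape)  # (1, 2): one first sentence, two second sentences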
On the basis of the foregoing embodiment, the mapping relationship determining module 43 is specifically configured to:
recording the position of the first element of the similarity value matrix as the current element position;
if the semantic similarity value corresponding to the current element position is equal to a first preset value, determining that the mapping relation between a first statement and a second statement corresponding to the current element position is one-to-one, wherein the first preset value is used for indicating that the semantic similarity degree between the first statement and the second statement corresponding to the current element position is the highest;
the position of the next element is recorded as the current element position, and the above operation is repeatedly performed.
On the basis of the foregoing embodiment, the mapping relationship determining module 43 is specifically configured to:
recording the position of the first element of the similarity value matrix as the current element position;
if the semantic similarity value corresponding to the current element position is smaller than a first preset value, combining a second statement corresponding to the current element position with a second statement corresponding to a next element position to obtain a first combined statement, wherein the first preset value is used for indicating that the semantic similarity degree of the first statement and the second statement corresponding to the current element position is highest; merging the first statement corresponding to the current element position and the first statement corresponding to the next element position to obtain a second merged statement;
determining the mapping relation between the first statement and the second statement corresponding to the current element position according to the semantic similarity value between the first statement and the first combined statement corresponding to the current element position, the semantic similarity value between the second combined statement and the second statement corresponding to the current element position, and the semantic similarity value corresponding to the next element position;
the position of the next element is recorded as the current element position, and the above operation is repeatedly performed.
On the basis of the above embodiment, a semantic similarity value of the first sentence and the first merged sentence corresponding to the current element position is recorded as a first semantic similarity value, a semantic similarity value of the second merged sentence and the second sentence corresponding to the current element position is recorded as a second semantic similarity value, a semantic similarity value corresponding to the current element position is recorded as a third semantic similarity value, and a semantic similarity value corresponding to the next element position is recorded as a fourth semantic similarity value;
the mapping relationship determining module 43 is specifically configured to:
if the first semantic similarity value is smaller than or equal to the third semantic similarity value or the first semantic similarity value is smaller than or equal to the fourth semantic similarity value, reducing the first semantic similarity value, otherwise, keeping the first semantic similarity value unchanged;
if the second semantic similarity value is smaller than or equal to the third semantic similarity value or the second semantic similarity value is smaller than or equal to the fourth semantic similarity value, reducing the second semantic similarity value, otherwise, keeping the second semantic similarity value unchanged;
determining a maximum value of the first semantic similarity value, the second semantic similarity value, and the fourth semantic similarity value;
if the maximum value is the fourth semantic similarity value, determining that the mapping relation between the first statement and the second statement corresponding to the current element position is one-to-one; if the maximum value is the first semantic similarity value, determining that the mapping relation between the first statement and the second statement corresponding to the current element position is one-to-N; and if the maximum value is the second semantic similarity value, determining that the mapping relation between the first statement and the second statement corresponding to the current element position is N-to-one.
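The decision rule above can be summarized by the following sketch. The disclosure does not specify how much a similarity value is "reduced", nor is the exact traversal order reproduced here, so the reduction factor PENALTY and the assumption that the next element position is the diagonally adjacent element (i+1, j+1) are placeholders of this sketch; `similarity` refers to the stand-in scoring function from the earlier sketch.

PENALTY = 0.5  # assumed reduction factor; the disclosure only says "reduce"

def decide_mapping(T, i, j, first_sentences, second_sentences, similarity):
    # assumes i+1 and j+1 are still inside the matrix
    first_merged = second_sentences[j] + " " + second_sentences[j + 1]   # second sentences at j, j+1
    second_merged = first_sentences[i] + " " + first_sentences[i + 1]    # first sentences at i, i+1

    s1 = similarity(first_sentences[i], first_merged)    # first semantic similarity value
    s2 = similarity(second_merged, second_sentences[j])  # second semantic similarity value
    s3 = T[i, j]                                         # third: current element position
    s4 = T[i + 1, j + 1]                                 # fourth: assumed next element position

    if s1 <= s3 or s1 <= s4:
        s1 *= PENALTY
    if s2 <= s3 or s2 <= s4:
        s2 *= PENALTY

    best = max(s1, s2, s4)
    if best == s4:
        return "one-to-one"
    return "one-to-N" if best == s1 else "N-to-one"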
On the basis of the above embodiment, when the mapping relationship between the first statement and the second statement corresponding to the current element position is one-to-N, the size of N is determined as follows:
merging the second sentence corresponding to the current element position to the second sentence corresponding to the target element position to obtain a third merged sentence, wherein the column in which the target element position is located is the sum of the column in which the current element position is located and N-1, and N is initialized to 3;
if the semantic similarity value of the first sentence and the third combined sentence corresponding to the current element position is less than or equal to the semantic similarity value corresponding to the current element position, determining that N is 2;
otherwise, making N equal to N +1, and repeatedly executing the above operations until the obtained semantic similarity value of the first sentence and the third merged sentence corresponding to the current element position is less than or equal to the semantic similarity value corresponding to the current element position, and determining N equal to N-1.
On the basis of the above embodiment, when the mapping relationship between the first statement and the second statement corresponding to the current element position is N-to-one, the determination process of the size of N is as follows:
merging the first sentence corresponding to the current element position to the first sentence corresponding to the target element position to obtain a fourth merged sentence, wherein the column in which the target element position is located is the sum of the column in which the current element position is located and N-1, and N is initialized to 3;
if the semantic similarity value of the fourth merged sentence and the second sentence corresponding to the current element position is less than or equal to the semantic similarity value corresponding to the current element position, determining that N is 2;
otherwise, making N equal to N +1, and repeatedly executing the above operations until the semantic similarity value of the obtained fourth merged statement and the second statement corresponding to the current element position is less than or equal to the semantic similarity value corresponding to the current element position, and determining that N is equal to N-1.
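A compact sketch of the N-expansion loop described above is given below for the one-to-N direction; the N-to-one direction is symmetric and would merge first sentences instead of second sentences. The boundary handling at the end of the sentence list and the helper names are assumptions of this sketch, and `similarity` is again the stand-in scoring function used earlier.

def determine_n_one_to_n(T, i, j, first_sentences, second_sentences, similarity):
    # start with N = 3, i.e. try merging the second sentences in columns j..j+2
    n = 3
    while True:
        end = j + n - 1
        if end >= len(second_sentences):
            return n - 1  # assumed boundary handling: stop at the end of the list
        merged = " ".join(second_sentences[j:end + 1])
        if similarity(first_sentences[i], merged) <= T[i, j]:
            # adding one more second sentence no longer improves the match,
            # so the previous size is kept (N = 2 when the first attempt fails)
            return n - 1
        n += 1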
On the basis of the foregoing embodiment, the parallel corpus acquiring module 44 is specifically configured to:
if the mapping relation is one-to-one, recording a second statement corresponding to the mapping relation as a target second statement;
if the mapping relation is one-to-N, merging the second sentences corresponding to the mapping relation to obtain a target second sentence;
and if the mapping relation is N-to-one, merging the first statements corresponding to the mapping relation, and marking the second statement corresponding to the mapping relation as the target second statement.
On the basis of the above embodiment, the apparatus may further include:
and the training module is used for inputting the parallel corpora into a text simplification model after the first sentence and the target second sentence are marked as parallel corpora, and training the text simplification model to obtain a target text simplification model, where the target text simplification model is used for converting complex text into simple text.
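As an illustrative sketch of such a training module, the fragment below fine-tunes a generic sequence-to-sequence model on the (complex sentence, simple sentence) pairs; the model name, optimizer settings and single-epoch loop are placeholders chosen for demonstration, not the text simplification model or training procedure of this disclosure.

import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

def train_simplification_model(parallel_corpus, model_name="t5-small", epochs=1, lr=3e-5):
    # parallel_corpus: iterable of (complex_sentence, simple_sentence) pairs
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for complex_sent, simple_sent in parallel_corpus:
            inputs = tokenizer(complex_sent, return_tensors="pt", truncation=True)
            labels = tokenizer(simple_sent, return_tensors="pt", truncation=True).input_ids
            loss = model(**inputs, labels=labels).loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
    return model  # the resulting "target text simplification model"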
The parallel corpus acquiring apparatus provided in this embodiment of the present disclosure belongs to the same concept as the parallel corpus acquiring method provided in the foregoing embodiments; for technical details not described in detail in this embodiment, reference may be made to the foregoing embodiments, and this embodiment has the same beneficial effects as the parallel corpus acquiring method.
EXAMPLE five
Reference is now made to fig. 5, which illustrates a schematic diagram of an electronic device 500 suitable for use in implementing embodiments of the present disclosure. The electronic devices in the embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), in-vehicle terminals (e.g., car navigation terminals), and the like, and fixed terminals such as digital TVs, desktop computers, and the like. The electronic device shown in fig. 5 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 5, electronic device 500 may include a processing means (e.g., central processing unit, graphics processor, etc.) 501 that may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM)502 or a program loaded from a storage means 508 into a Random Access Memory (RAM) 503. In the RAM 503, various programs and data necessary for the operation of the electronic apparatus 500 are also stored. The processing device 501, the ROM 502, and the RAM 503 are connected to each other through a bus 504. An input/output (I/O) interface 505 is also connected to bus 504.
Generally, the following devices may be connected to the I/O interface 505: input devices 506 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 507 including, for example, a Liquid Crystal Display (LCD), speakers, vibrators, and the like; storage devices 508 including, for example, magnetic tape, hard disk, etc.; and a communication device 509. The communication means 509 may allow the electronic device 500 to communicate with other devices wirelessly or by wire to exchange data. While fig. 5 illustrates an electronic device 500 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer readable medium, the computer program containing program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication means 509, or installed from the storage means 508, or installed from the ROM 502. The computer program performs the above-described functions defined in the methods of the embodiments of the present disclosure when executed by the processing device 501.
EXAMPLE six
The computer readable medium described above in this disclosure may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
In some embodiments, the clients and servers may communicate using any currently known or future developed network protocol, such as HTTP (Hyper Text Transfer Protocol), and may be interconnected with any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed network.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: splitting a first text and a second text which are acquired in advance to obtain a first sentence list corresponding to the first text and a second sentence list corresponding to the second text, wherein the first text and the second text are in the same language and are used for describing the same content; determining semantic similarity values between each first sentence in the first sentence list and each second sentence in the second sentence list to obtain a similarity value matrix; determining a mapping relationship between the first statement and the second statement according to the similarity value matrix, wherein the mapping relationship comprises at least one of one-to-N and N-to-one, and N is an integer greater than or equal to 2; and acquiring a target second statement associated with the first statement according to the mapping relation, and marking the first statement and the target second statement as parallel corpora.
Computer program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including but not limited to object oriented programming languages such as Java, Smalltalk, C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented by software or hardware, and the name of a unit does not in some cases constitute a limitation on the unit itself. For example, the splitting module may also be described as "a module that splits a first text and a second text obtained in advance to obtain a first sentence list corresponding to the first text and a second sentence list corresponding to the second text".
The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
According to one or more embodiments of the present disclosure, the present disclosure provides a parallel corpus obtaining method, including:
splitting a first text and a second text which are acquired in advance to obtain a first sentence list corresponding to the first text and a second sentence list corresponding to the second text, wherein the first text and the second text are in the same language and are used for describing the same content;
determining semantic similarity values between each first sentence in the first sentence list and each second sentence in the second sentence list to obtain a similarity value matrix;
determining a mapping relationship between the first statement and the second statement according to the similarity value matrix, wherein the mapping relationship comprises at least one of one-to-N and N-to-one, and N is an integer greater than or equal to 2;
and acquiring a target second statement associated with the first statement according to the mapping relation, and marking the first statement and the target second statement as parallel corpora.
According to one or more embodiments of the present disclosure, in the parallel corpus obtaining method provided by the present disclosure, the determining semantic similarity values between each first sentence in the first sentence list and each second sentence in the second sentence list to obtain a similarity value matrix includes:
inputting a first sentence in the first sentence list and a second sentence in the second sentence list into a semantic similarity value model, and outputting a semantic similarity value of the first sentence and the second sentence by the semantic similarity value model, wherein the semantic similarity value model is obtained by training pairs of sentences with different semantic similarity values;
and sequentially arranging semantic similarity values corresponding to the first sentences to obtain a similarity value matrix, wherein the row number of the similarity value matrix is equal to the number of the first sentences contained in the first sentence list, and the column number of the similarity value matrix is equal to the number of the second sentences contained in the second sentence list.
According to one or more embodiments of the present disclosure, in the parallel corpus obtaining method provided by the present disclosure, the determining a mapping relationship between the first sentence and the second sentence according to the similarity value matrix includes:
recording the position of the first element of the similarity value matrix as the current element position;
if the semantic similarity value corresponding to the current element position is equal to a first preset value, determining that the mapping relation between a first statement and a second statement corresponding to the current element position is one-to-one, wherein the first preset value is used for indicating that the semantic similarity degree between the first statement and the second statement corresponding to the current element position is the highest;
the position of the next element is recorded as the current element position, and the above operation is repeatedly performed.
According to one or more embodiments of the present disclosure, in the parallel corpus obtaining method provided by the present disclosure, the determining a mapping relationship between the first sentence and the second sentence according to the similarity value matrix includes:
recording the position of the first element of the similarity value matrix as the current element position;
if the semantic similarity value corresponding to the current element position is smaller than a first preset value, combining a second statement corresponding to the current element position with a second statement corresponding to a next element position to obtain a first combined statement, wherein the first preset value is used for indicating that the semantic similarity degree of the first statement and the second statement corresponding to the current element position is highest; merging the first statement corresponding to the current element position and the first statement corresponding to the next element position to obtain a second merged statement;
determining the mapping relation between the first statement and the second statement corresponding to the current element position according to the semantic similarity value between the first statement and the first combined statement corresponding to the current element position, the semantic similarity value between the second combined statement and the second statement corresponding to the current element position, and the semantic similarity value corresponding to the next element position;
the position of the next element is recorded as the current element position, and the above operation is repeatedly performed.
According to one or more embodiments of the present disclosure, in the parallel corpus acquiring method provided by the present disclosure, a semantic similarity value of the first sentence corresponding to the current element position and the first merged sentence is recorded as a first semantic similarity value, a semantic similarity value of the second merged sentence and the second sentence corresponding to the current element position is recorded as a second semantic similarity value, a semantic similarity value corresponding to the current element position is recorded as a third semantic similarity value, and a semantic similarity value corresponding to the next element position is recorded as a fourth semantic similarity value;
determining a mapping relationship between a first sentence and a second sentence corresponding to the current element position according to the semantic similarity value between the first sentence and the first merged sentence corresponding to the current element position, the semantic similarity value between the second merged sentence and the second sentence corresponding to the current element position, and the semantic similarity value corresponding to the next element position, including:
if the first semantic similarity value is smaller than or equal to the third semantic similarity value or the first semantic similarity value is smaller than or equal to the fourth semantic similarity value, reducing the first semantic similarity value, otherwise, keeping the first semantic similarity value unchanged;
if the second semantic similarity value is smaller than or equal to the third semantic similarity value or the second semantic similarity value is smaller than or equal to the fourth semantic similarity value, reducing the second semantic similarity value, otherwise, keeping the second semantic similarity value unchanged;
determining a maximum value of the first semantic similarity value, the second semantic similarity value, and the fourth semantic similarity value;
if the maximum value is the fourth semantic similarity value, determining that the mapping relation between the first statement and the second statement corresponding to the current element position is one-to-one; if the maximum value is the first semantic similarity value, determining that the mapping relation between the first statement and the second statement corresponding to the current element position is one-to-N; and if the maximum value is the second semantic similarity value, determining that the mapping relation between the first statement and the second statement corresponding to the current element position is N-to-one.
According to one or more embodiments of the present disclosure, in the parallel corpus obtaining method provided by the present disclosure, when the mapping relationship between the first sentence and the second sentence corresponding to the current element position is one-to-N, the size of N is determined as follows:
merging the second sentence corresponding to the current element position to the second sentence corresponding to the target element position to obtain a third merged sentence, wherein the column in which the target element position is located is the sum of the column in which the current element position is located and N-1, and N is initialized to 3;
if the semantic similarity value of the first sentence and the third combined sentence corresponding to the current element position is less than or equal to the semantic similarity value corresponding to the current element position, determining that N is 2;
otherwise, making N equal to N +1, and repeatedly executing the above operations until the obtained semantic similarity value of the first sentence and the third merged sentence corresponding to the current element position is less than or equal to the semantic similarity value corresponding to the current element position, and determining N equal to N-1.
According to one or more embodiments of the present disclosure, in the parallel corpus obtaining method provided by the present disclosure, when the mapping relationship between the first sentence and the second sentence corresponding to the current element position is N-to-one, a determination process of the size of N is as follows:
merging the first sentence corresponding to the current element position to the first sentence corresponding to the target element position to obtain a fourth merged sentence, wherein the column in which the target element position is located is the sum of the column in which the current element position is located and N-1, and N is initialized to 3;
if the semantic similarity value of the fourth merged sentence and the second sentence corresponding to the current element position is less than or equal to the semantic similarity value corresponding to the current element position, determining that N is 2;
otherwise, making N equal to N +1, and repeatedly executing the above operations until the semantic similarity value of the obtained fourth merged statement and the second statement corresponding to the current element position is less than or equal to the semantic similarity value corresponding to the current element position, and determining that N is equal to N-1.
According to one or more embodiments of the present disclosure, in the parallel corpus obtaining method provided by the present disclosure, the obtaining a target second sentence associated with the first sentence according to the mapping relationship includes:
if the mapping relation is one-to-one, recording a second statement corresponding to the mapping relation as a target second statement;
if the mapping relation is one-to-N, merging the second sentences corresponding to the mapping relation to obtain a target second sentence;
and if the mapping relation is N-to-one, merging the first statements corresponding to the mapping relation, and marking the second statement corresponding to the mapping relation as the target second statement.
According to one or more embodiments of the present disclosure, in the parallel corpus acquiring method provided by the present disclosure, after the first sentence and the target second sentence are recorded as parallel corpuses, the method further includes:
and inputting the parallel corpus into a text simplification model, and training the text simplification model to obtain a target text simplification model, wherein the target text simplification model is used for converting complex text into simple text.
According to one or more embodiments of the present disclosure, there is provided a parallel corpus acquiring apparatus including:
the system comprises a splitting module, a first obtaining module and a second obtaining module, wherein the splitting module is used for splitting a first text and a second text which are obtained in advance to obtain a first sentence list corresponding to the first text and a second sentence list corresponding to the second text, and the first text and the second text are in the same language and are used for describing the same content;
a similarity value matrix determining module, configured to determine a semantic similarity value between each first sentence in the first sentence list and each second sentence in the second sentence list, to obtain a similarity value matrix;
a mapping relation determining module, configured to determine, according to the similarity value matrix, a mapping relation between the first statement and the second statement, where the mapping relation includes at least one of one-to-N and N-to-one, and N is an integer greater than or equal to 2;
and the parallel corpus acquiring module is used for acquiring a target second statement associated with the first statement according to the mapping relation and marking the first statement and the target second statement as parallel corpora.
In accordance with one or more embodiments of the present disclosure, there is provided an electronic device including:
one or more processors;
a memory for storing one or more programs;
the one or more programs, when executed by the one or more processors, implement a parallel corpus acquisition method according to any of the present disclosure.
According to one or more embodiments of the present disclosure, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements a parallel corpus acquisition method according to any one of the present disclosure.
The foregoing description is only illustrative of the preferred embodiments of the present disclosure and of the technical principles employed. It will be appreciated by those skilled in the art that the scope of the disclosure is not limited to technical solutions formed by the particular combination of features described above, but also encompasses other technical solutions formed by any combination of the above features or their equivalents without departing from the concept of the disclosure, for example, technical solutions formed by replacing the above features with (but not limited to) features having similar functions disclosed in this disclosure.
Further, while operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limitations on the scope of the disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.

Claims (12)

1. A parallel corpus acquiring method is characterized by comprising the following steps:
splitting a first text and a second text which are acquired in advance to obtain a first sentence list corresponding to the first text and a second sentence list corresponding to the second text, wherein the first text and the second text are in the same language and are used for describing the same content;
determining semantic similarity values between each first sentence in the first sentence list and each second sentence in the second sentence list to obtain a similarity value matrix;
determining a mapping relationship between the first statement and the second statement according to the similarity value matrix, wherein the mapping relationship comprises at least one of one-to-N and N-to-one, and N is an integer greater than or equal to 2;
and acquiring a target second statement associated with the first statement according to the mapping relation, and marking the first statement and the target second statement as parallel corpora.
2. The method of claim 1, wherein determining semantic similarity values between each first sentence in the first sentence list and each second sentence in the second sentence list to obtain a similarity value matrix comprises:
inputting a first sentence in the first sentence list and a second sentence in the second sentence list into a semantic similarity value model, and outputting a semantic similarity value of the first sentence and the second sentence by the semantic similarity value model, wherein the semantic similarity value model is obtained by training pairs of sentences with different semantic similarity values;
and sequentially arranging semantic similarity values corresponding to the first sentences to obtain a similarity value matrix, wherein the row number of the similarity value matrix is equal to the number of the first sentences contained in the first sentence list, and the column number of the similarity value matrix is equal to the number of the second sentences contained in the second sentence list.
3. The method of claim 1, wherein determining the mapping relationship between the first sentence and the second sentence according to the similarity value matrix comprises:
recording the position of the first element of the similarity value matrix as the current element position;
if the semantic similarity value corresponding to the current element position is equal to a first preset value, determining that the mapping relation between a first statement and a second statement corresponding to the current element position is one-to-one, wherein the first preset value is used for indicating that the semantic similarity degree between the first statement and the second statement corresponding to the current element position is the highest;
the position of the next element is recorded as the current element position, and the above operation is repeatedly performed.
4. The method of claim 1, wherein determining the mapping relationship between the first sentence and the second sentence according to the similarity value matrix comprises:
recording the position of the first element of the similarity value matrix as the current element position;
if the semantic similarity value corresponding to the current element position is smaller than a first preset value, combining a second statement corresponding to the current element position with a second statement corresponding to a next element position to obtain a first combined statement, wherein the first preset value is used for indicating that the semantic similarity degree of the first statement and the second statement corresponding to the current element position is highest; merging the first statement corresponding to the current element position and the first statement corresponding to the next element position to obtain a second merged statement;
determining the mapping relation between the first statement and the second statement corresponding to the current element position according to the semantic similarity value between the first statement and the first combined statement corresponding to the current element position, the semantic similarity value between the second combined statement and the second statement corresponding to the current element position, and the semantic similarity value corresponding to the next element position;
the position of the next element is recorded as the current element position, and the above operation is repeatedly performed.
5. The method according to claim 4, wherein the semantic similarity value of the first sentence corresponding to the current element position and the first merged sentence is denoted as a first semantic similarity value, the semantic similarity value of the second merged sentence and the second sentence corresponding to the current element position is denoted as a second semantic similarity value, the semantic similarity value corresponding to the current element position is denoted as a third semantic similarity value, and the semantic similarity value corresponding to the next element position is denoted as a fourth semantic similarity value;
determining a mapping relationship between a first sentence and a second sentence corresponding to the current element position according to the semantic similarity value between the first sentence and the first merged sentence corresponding to the current element position, the semantic similarity value between the second merged sentence and the second sentence corresponding to the current element position, and the semantic similarity value corresponding to the next element position, including:
if the first semantic similarity value is smaller than or equal to the third semantic similarity value or the first semantic similarity value is smaller than or equal to the fourth semantic similarity value, reducing the first semantic similarity value, otherwise, keeping the first semantic similarity value unchanged;
if the second semantic similarity value is smaller than or equal to the third semantic similarity value or the second semantic similarity value is smaller than or equal to the fourth semantic similarity value, reducing the second semantic similarity value, otherwise, keeping the second semantic similarity value unchanged;
determining a maximum value of the first semantic similarity value, the second semantic similarity value, and the fourth semantic similarity value;
if the maximum value is the fourth semantic similarity value, determining that the mapping relation between the first statement and the second statement corresponding to the current element position is one-to-one; if the maximum value is the first semantic similarity value, determining that the mapping relation between the first statement and the second statement corresponding to the current element position is one-to-N; and if the maximum value is the second semantic similarity value, determining that the mapping relation between the first statement and the second statement corresponding to the current element position is N-to-one.
6. The method of claim 5, wherein when the mapping relationship between the first sentence and the second sentence corresponding to the current element position is one-to-N, the size of N is determined as follows:
merging the second sentence corresponding to the current element position to the second sentence corresponding to the target element position to obtain a third merged sentence, wherein the column in which the target element position is located is the sum of the column in which the current element position is located and N-1, and N is initialized to 3;
if the semantic similarity value of the first sentence and the third combined sentence corresponding to the current element position is less than or equal to the semantic similarity value corresponding to the current element position, determining that N is 2;
otherwise, making N equal to N +1, and repeatedly executing the above operations until the obtained semantic similarity value of the first sentence and the third merged sentence corresponding to the current element position is less than or equal to the semantic similarity value corresponding to the current element position, and determining N equal to N-1.
7. The method of claim 5, wherein when the mapping relationship between the first sentence and the second sentence corresponding to the current element position is N-to-one, the size of N is determined as follows:
merging the first sentence corresponding to the current element position to the first sentence corresponding to the target element position to obtain a fourth merged sentence, wherein the column in which the target element position is located is the sum of the column in which the current element position is located and N-1, and N is initialized to 3;
if the semantic similarity value of the fourth merged sentence and the second sentence corresponding to the current element position is less than or equal to the semantic similarity value corresponding to the current element position, determining that N is 2;
otherwise, making N equal to N +1, and repeatedly executing the above operations until the semantic similarity value of the obtained fourth merged statement and the second statement corresponding to the current element position is less than or equal to the semantic similarity value corresponding to the current element position, and determining that N is equal to N-1.
8. The method of claim 1, wherein obtaining the target second sentence associated with the first sentence according to the mapping relationship comprises:
if the mapping relation is one-to-one, recording a second statement corresponding to the mapping relation as a target second statement;
if the mapping relation is one-to-N, merging the second sentences corresponding to the mapping relation to obtain a target second sentence;
and if the mapping relation is N-to-one, merging the first statements corresponding to the mapping relation, and marking the second statement corresponding to the mapping relation as the target second statement.
9. The method according to any of claims 1-8, wherein after said first sentence and said target second sentence are denoted as parallel corpora, further comprising:
and inputting the parallel corpus into a text simplification model, and training the text simplification model to obtain a target text simplification model, wherein the target text simplification model is used for converting a complex text into a simple text.
10. A parallel corpus acquiring apparatus, comprising:
the system comprises a splitting module, a first obtaining module and a second obtaining module, wherein the splitting module is used for splitting a first text and a second text which are obtained in advance to obtain a first sentence list corresponding to the first text and a second sentence list corresponding to the second text, and the first text and the second text are in the same language and are used for describing the same content;
a similarity value matrix determining module, configured to determine a semantic similarity value between each first sentence in the first sentence list and each second sentence in the second sentence list, to obtain a similarity value matrix;
a mapping relation determining module, configured to determine, according to the similarity value matrix, a mapping relation between the first statement and the second statement, where the mapping relation includes at least one of one-to-N and N-to-one, and N is an integer greater than or equal to 2;
and the parallel corpus acquiring module is used for acquiring a target second statement associated with the first statement according to the mapping relation and marking the first statement and the target second statement as parallel corpora.
11. An electronic device, comprising:
one or more processors;
a memory for storing one or more programs;
the one or more programs, when executed by the one or more processors, implement the parallel corpus acquisition method of any of claims 1-9.
12. A computer-readable storage medium on which a computer program is stored, the program, when being executed by a processor, implementing the parallel corpus acquisition method according to any one of claims 1-9.
CN202110181644.5A 2021-02-08 2021-02-08 Parallel corpus acquisition method, device, equipment and storage medium Active CN112906371B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110181644.5A CN112906371B (en) 2021-02-08 2021-02-08 Parallel corpus acquisition method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110181644.5A CN112906371B (en) 2021-02-08 2021-02-08 Parallel corpus acquisition method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112906371A true CN112906371A (en) 2021-06-04
CN112906371B CN112906371B (en) 2024-03-01

Family

ID=76123432

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110181644.5A Active CN112906371B (en) 2021-02-08 2021-02-08 Parallel corpus acquisition method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112906371B (en)


Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105868187A (en) * 2016-03-25 2016-08-17 北京语言大学 A multi-translation version parallel corpus establishing method
US20180285326A1 (en) * 2017-03-31 2018-10-04 Adobe Systems Incorporated Classifying and ranking changes between document versions
US20200073948A1 (en) * 2018-08-31 2020-03-05 Samsung Electronics Co., Ltd. Method and apparatus with sentence mapping
CN109670178A (en) * 2018-12-20 2019-04-23 龙马智芯(珠海横琴)科技有限公司 Sentence-level bilingual alignment method and device, computer readable storage medium
CN109710950A (en) * 2018-12-20 2019-05-03 龙马智芯(珠海横琴)科技有限公司 Bilingual alignment method, apparatus and system
CN110362820A (en) * 2019-06-17 2019-10-22 昆明理工大学 A kind of bilingual parallel sentence extraction method of old man based on Bi-LSTM algorithm
CN110427629A (en) * 2019-08-13 2019-11-08 苏州思必驰信息科技有限公司 Semi-supervised text simplified model training method and system
CN110781686A (en) * 2019-10-30 2020-02-11 普信恒业科技发展(北京)有限公司 Statement similarity calculation method and device and computer equipment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
HUI Haotian; LI Yunjian; QIAN Longhua; ZHOU Guodong: "A Chinese-English Parallel Corpus for Information Extraction", Computer Engineering and Science, no. 12 *
HU Hongsi et al.: "Sentence Alignment of Bilingual Comparable Corpora Based on Wikipedia", Journal of Chinese Information Processing, no. 01 *

Also Published As

Publication number Publication date
CN112906371B (en) 2024-03-01

Similar Documents

Publication Publication Date Title
JP7112536B2 (en) Method and apparatus for mining entity attention points in text, electronic device, computer-readable storage medium and computer program
CN110969012B (en) Text error correction method and device, storage medium and electronic equipment
CN112633947B (en) Text generation model generation method, text generation method, device and equipment
CN111046677B (en) Method, device, equipment and storage medium for obtaining translation model
CN112163076B (en) Knowledge question bank construction method, question and answer processing method, device, equipment and medium
CN111382261B (en) Abstract generation method and device, electronic equipment and storage medium
CN114861889B (en) Deep learning model training method, target object detection method and device
WO2020182123A1 (en) Method and device for pushing statement
CN111339789A (en) Translation model training method and device, electronic equipment and storage medium
CN113139391A (en) Translation model training method, device, equipment and storage medium
CN113011169B (en) Method, device, equipment and medium for processing conference summary
CN112668339A (en) Corpus sample determination method and device, electronic equipment and storage medium
CN113407814A (en) Text search method and device, readable medium and electronic equipment
CN111815274A (en) Information processing method and device and electronic equipment
CN111400454A (en) Abstract generation method and device, electronic equipment and storage medium
CN115270717A (en) Method, device, equipment and medium for detecting vertical position
US20230315990A1 (en) Text detection method and apparatus, electronic device, and storage medium
CN111104796A (en) Method and device for translation
CN112257459B (en) Language translation model training method, translation method, device and electronic equipment
CN114298007A (en) Text similarity determination method, device, equipment and medium
CN112380883B (en) Model training method, machine translation method, device, equipment and storage medium
CN114881008B (en) Text generation method and device, electronic equipment and medium
CN110750994A (en) Entity relationship extraction method and device, electronic equipment and storage medium
CN115640523A (en) Text similarity measurement method, device, equipment, storage medium and program product
CN112906371B (en) Parallel corpus acquisition method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant