CN112347755A - Bilingual corpus generation method, text processing system and subtitle file processing method - Google Patents

Publication number
CN112347755A
Authority
CN
China
Prior art keywords
subtitle
text
language
line
bilingual
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910733742.8A
Other languages
Chinese (zh)
Inventor
葛鑫
施杨斌
赵宇
骆卫华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Application filed by Alibaba Group Holding Ltd
Priority to CN201910733742.8A
Publication of CN112347755A

Landscapes

  • Machine Translation (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The application discloses a bilingual corpus generation method comprising the following steps: obtaining a bilingual mixed subtitle file, the bilingual mixed subtitle file comprising a first language subtitle text and a second language subtitle text; generating, according to the bilingual mixed subtitle file, a subtitle text in which the first language subtitle text and the second language subtitle text are aligned; and extracting sentence pairs according to the aligned subtitle text and taking the extracted sentence pairs as the bilingual corpus. The method addresses the low quality of bilingual corpora extracted by existing methods that extract parallel corpora from monolingual subtitle files.

Description

Bilingual corpus generation method, text processing system and subtitle file processing method
Technical Field
The application relates to the technical field of computers, in particular to a bilingual corpus generating method, and further relates to a bilingual corpus generating device, electronic equipment and storage equipment. The application also relates to a text processing system and a subtitle file processing method.
Background
With the development of computer technology and artificial intelligence, machine translation has advanced rapidly. Bilingual parallel corpus data is a basic ingredient for training a machine translation model. Parallel corpora come from a variety of sources, and subtitles are an important one. Subtitle files have advantages such as high quality and continuous production, and fall into two types: monolingual subtitle files and bilingual mixed subtitle files. A monolingual subtitle file contains only one natural language. A bilingual mixed subtitle file contains two languages in a single file; for example, most foreign movies shown in China are provided with bilingual Chinese and English subtitles.
In the prior art, monolingual subtitle files are usually used as the source of parallel corpora: structural information such as time axis overlap, film duration, user ratings of the subtitles, and subtitle upload time is used to align the monolingual subtitle files and find matching subtitle text pairs. However, using monolingual subtitle files as the source has problems. For example, when a long sentence is displayed across two lines in each of Chinese and English, the different word order of the two languages means that the first Chinese line and the first English line do not necessarily match completely, which lowers the quality of the extracted corpus.
Therefore, the method for extracting the parallel corpus by adopting the monolingual subtitle file in the prior art has the problem of low quality of the extracted corpus.
Disclosure of Invention
The application provides a bilingual corpus generation method and device, electronic equipment and storage equipment, and aims to solve the problem that the quality of extracted corpora is not high in the existing method for extracting parallel corpora by adopting a monolingual subtitle file.
The application provides a bilingual corpus generation method, which comprises the following steps:
obtaining a bilingual mixed subtitle file; the bilingual mixed subtitle file comprises a first language subtitle text and a second language subtitle text;
generating, according to the bilingual mixed subtitle file, a subtitle text in which the first language subtitle text and the second language subtitle text are aligned;
and extracting sentence pairs according to the aligned subtitle text, and taking the extracted sentence pairs as the bilingual corpus.
Optionally, the generating of the subtitle text in which the first language subtitle text and the second language subtitle text are aligned according to the bilingual mixed subtitle file includes:
generating, according to the bilingual mixed subtitle file, a subtitle text pair in which the first language subtitle text and the second language subtitle text are aligned; wherein the subtitle text pair includes a first language subtitle text file and a second language subtitle text file.
Optionally, the obtaining the bilingual mixed subtitle file includes:
obtaining an original subtitle file;
and screening out bilingual mixed subtitle files from the original subtitle files.
Optionally, the screening of the bilingual mixed subtitle file from the original subtitle file includes:
if the original subtitle file contains two languages whose scores are each greater than or equal to a preset language score threshold,
treating the original subtitle file as a bilingual mixed subtitle file.
Optionally, the method further includes:
preprocessing the original subtitle file to obtain a preprocessed subtitle file;
the screening of the bilingual mixed subtitle file from the original subtitle file then includes:
screening out the bilingual mixed subtitle file from the preprocessed subtitle files.
Optionally, the preprocessing the original subtitle file includes at least one of the following processing:
removing impurity data contained in the original subtitle file;
performing language identification on the text of the original subtitle file;
acquiring a file name of an original subtitle file;
converting the text contained in the original subtitle file between Traditional and Simplified Chinese;
and deleting the original subtitle file which does not accord with the preset language direction.
Optionally, the generating of the subtitle text pair in which the first language subtitle text and the second language subtitle text are aligned according to the bilingual mixed subtitle file includes:
performing line splitting on the bilingual mixed subtitle file, and generating a line record for each line text in the bilingual mixed subtitle file;
merging the line texts in the line records according to the language information of the line texts in the line records;
and sorting the merged line texts to generate the aligned subtitle text pair.
Optionally, the merging the line texts in the line record according to the language information of the line texts in the line record includes:
storing the line texts in the line records of the first language belonging to the same bilingual mixed subtitle file in a first array according to the language information of the line texts in the line records;
and storing the line texts in the line records of the second language belonging to the same bilingual mixed subtitle file in a second array according to the language information of the line texts in the line records.
Optionally, the sorting the combined line texts to generate aligned text pairs includes:
sorting the line texts in the first array and the line texts in the second array, respectively, according to the line numbers of the line texts;
generating a first language subtitle text file according to the sorted first array, and generating a second language subtitle text file according to the sorted second array;
and combining the first language subtitle text file and the second language subtitle text file to generate an aligned text pair.
Optionally, the method further includes: and determining the language to which the line text in the line record belongs according to the content of the line text in the line record.
Optionally, the determining, according to the content of the line text in the line record, the language to which the line text in the line record belongs includes:
obtaining the language to which the content contained in the line text belongs according to the content of the line text in the line record;
and taking the language with the highest score in the languages to which the contents contained in the line text belong as the language of the line text in the line record.
Optionally, the information of the line record includes:
identification information of the bilingual mixed subtitle file to which the line record belongs;
the line text information;
line number information of the line record.
Optionally, the performing sentence pair extraction according to the aligned subtitle text, and using the extracted sentence pair as a bilingual corpus includes:
extracting sentence pairs from the aligned caption texts;
and judging whether the difference of the time axis information of the two sentences contained in the sentence pair is greater than a time axis difference threshold value, if so, deleting the sentence pair.
Optionally, the method further includes:
if not, calculating the similarity of the two sentences contained in the sentence pair, and taking the sentence pair with the similarity larger than a preset similarity threshold value as the bilingual corpus.
The present application further provides a bilingual corpus generating device, including:
a bilingual mixed subtitle file obtaining unit for obtaining a bilingual mixed subtitle file; the bilingual mixed subtitle file comprises a text in a first language and a text in a second language;
an aligned subtitle text generating unit for generating, according to the bilingual mixed subtitle file, a subtitle text in which the first language subtitle text and the second language subtitle text are aligned;
and a bilingual corpus generating unit for extracting sentence pairs according to the aligned subtitle text and taking the extracted sentence pairs as the bilingual corpus.
The present application further provides an electronic device, comprising:
a processor; and
a memory for storing a program of a bilingual corpus generating method, the device being powered on and executing the program of the bilingual corpus generating method by the processor, and then performing the following steps:
obtaining a bilingual mixed subtitle file; the bilingual mixed subtitle file comprises a first language subtitle text and a second language subtitle text;
generating, according to the bilingual mixed subtitle file, a subtitle text in which the first language subtitle text and the second language subtitle text are aligned;
and extracting sentence pairs according to the aligned subtitle text, and taking the extracted sentence pairs as the bilingual corpus.
The present application further provides a storage device storing a program of a bilingual corpus generation method, the program being executed by a processor to perform the steps of:
obtaining a bilingual mixed subtitle file; the bilingual mixed subtitle file comprises a first language subtitle text and a second language subtitle text;
generating, according to the bilingual mixed subtitle file, a subtitle text in which the first language subtitle text and the second language subtitle text are aligned;
and extracting sentence pairs according to the aligned subtitle text, and taking the extracted sentence pairs as the bilingual corpus.
Compared with the prior art, the method has the following advantages:
The application provides a bilingual corpus generation method and device, an electronic device, and a storage device. A subtitle text in which the first language subtitle text and the second language subtitle text are aligned is generated according to a bilingual mixed subtitle file, and sentence pairs are extracted from the aligned subtitle text, so that the bilingual corpus is obtained from the bilingual mixed subtitle file. This handles the "one-to-one", "one-to-many", "many-to-one", and "many-to-many" corpus alignment scenarios, and the generated bilingual corpus is of higher quality than a bilingual corpus obtained from monolingual subtitle files.
Drawings
Fig. 1 is a flowchart of a bilingual corpus generating method according to a first embodiment of the present application.
Fig. 2 is a flowchart for generating aligned text pairs from the bilingual mixed-caption file according to the first embodiment of the present application.
Fig. 3 is an example of generating bilingual corpus from an original subtitle file using the method according to the first embodiment of the present application.
Fig. 4 is a schematic diagram of a bilingual corpus generating apparatus according to a second embodiment of the present application.
Fig. 5 is a schematic diagram of an electronic device according to a third embodiment of the present application.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. This invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein.
A first embodiment of the present application provides a method for generating bilingual corpus, which is described below with reference to fig. 1.
As shown in fig. 1, in step S101, a bilingual mixed-subtitle file is obtained; the bilingual mixed-caption file includes a first language caption text and a second language caption text.
A bilingual mixed subtitle file is a subtitle file containing text in two languages. For example, a subtitle file that contains both English and Chinese, with English as the first language and Chinese as the second language, is a bilingual mixed subtitle file. In a Chinese-English bilingual mixed subtitle file, one line of Chinese text and one line of English text are generally arranged under the same time axis entry. The following is an example of a bilingual mixed subtitle file:
1
00:00:06,673-->00:00:09,541
dog is X
"Y is the one with the tail."
2
00:00:10,777-->00:00:12,844
O me want to see this
God,I want to see this.
3
00:00:12,880-->00:00:14,246
Find the desired mollewis
Finding everything okay,louis?
...
104
00:04:22,896-->00:04:24,829
Hiccup is the bed bottom of I's sleep in Jie xi Ka
Uh,it's just...I found some tapes
105
00:04:24,865-->00:04:26,965
Some video discs are found
Under jessica's side of the bed.
106
00:04:27,868-->00:04:29,134
All of which are half-nude actors
Shirtless-men tapes.
Here the numbers 1, 2, 3, ..., 106 are row numbers. The line after each row number is the time axis information, for example: 00:00:06,673-->00:00:09,541. The line after the time axis information is the Chinese text, for example: dog is X. The line after the Chinese text is the English text corresponding to that Chinese text, for example: "Y is the one with the tail." Here, Y is the English translation of X.
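The layout just described (row number, time axis line, then one text line per language) can be read with a short parser. The following is a minimal sketch, not the application's implementation: the names `Cue` and `parse_blocks` are hypothetical, and it assumes timestamps of the form `HH:MM:SS,mmm` as in the example above.

```python
import re
from typing import List, NamedTuple

# Time axis line such as "00:00:06,673-->00:00:09,541" (spaces around the arrow allowed).
TIMELINE = re.compile(
    r"(\d{2}):(\d{2}):(\d{2}),(\d{3})\s*-->\s*(\d{2}):(\d{2}):(\d{2}),(\d{3})"
)

class Cue(NamedTuple):
    index: int        # row number shown above the time axis line
    start_ms: int     # start time in milliseconds
    end_ms: int       # end time in milliseconds
    lines: List[str]  # text lines under this time axis line

def _to_ms(h: str, m: str, s: str, ms: str) -> int:
    return ((int(h) * 60 + int(m)) * 60 + int(s)) * 1000 + int(ms)

def _starts_cue(rows: List[str], i: int) -> bool:
    # A cue starts with an all-digit row immediately followed by a time axis row.
    return (
        rows[i].isdigit()
        and i + 1 < len(rows)
        and TIMELINE.fullmatch(rows[i + 1]) is not None
    )

def parse_blocks(text: str) -> List[Cue]:
    """Parse 'row number / time axis / text lines' blocks (hypothetical helper)."""
    rows = [r.strip() for r in text.splitlines() if r.strip()]
    cues: List[Cue] = []
    i = 0
    while i < len(rows):
        if not _starts_cue(rows, i):
            i += 1  # skip rows that do not belong to any cue
            continue
        groups = TIMELINE.fullmatch(rows[i + 1]).groups()
        j = i + 2
        body: List[str] = []
        # Collect text rows until the next cue begins.
        while j < len(rows) and not _starts_cue(rows, j):
            body.append(rows[j])
            j += 1
        cues.append(Cue(int(rows[i]), _to_ms(*groups[:4]), _to_ms(*groups[4:]), body))
        i = j
    return cues
```

Keeping the start and end times in milliseconds makes a later time axis comparison between two sentences a simple integer subtraction.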
The obtaining of the bilingual mixed subtitle file includes:
obtaining an original subtitle file;
and screening out bilingual mixed subtitle files from the original subtitle files.
The original caption file is a caption file input to the bilingual corpus generation platform. Since the original subtitle files include not only the bilingual hybrid subtitle files but also the monolingual subtitle files, it is necessary to screen out the bilingual hybrid subtitle files from a large number of original subtitle files (e.g., millions of original subtitle files).
Specifically, the method for screening the bilingual mixed subtitle file from the original subtitle file comprises the following steps:
determining language information contained in an original subtitle file;
if the original subtitle file contains two languages whose scores are each greater than or equal to a preset language score threshold,
treating the original subtitle file as a bilingual mixed subtitle file.
For example, suppose the scores of all languages contained in an original subtitle file total 100 points and the preset language score threshold is 30 points. If two languages contained in a given original subtitle file each score 30 points or more, that original subtitle file can be treated as a bilingual mixed subtitle file.
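The threshold test in this example can be sketched as follows. The application does not say how language scores are computed, so the scoring here is purely an assumption for illustration: a language's score is its share of the letters in the file, with CJK ideographs counted as Chinese and ASCII letters as English. The names `language_scores` and `is_bilingual_mixed` are hypothetical.

```python
from typing import Dict

def language_scores(text: str) -> Dict[str, float]:
    """Return per-language scores that sum to 100 (assumed scoring scheme).

    Assumption: a language's score is its share of the letters in the text,
    with CJK ideographs counted as Chinese and ASCII letters as English.
    """
    counts = {"zh": 0, "en": 0}
    for ch in text:
        if "\u4e00" <= ch <= "\u9fff":
            counts["zh"] += 1
        elif ch.isascii() and ch.isalpha():
            counts["en"] += 1
    total = sum(counts.values())
    if total == 0:
        return {lang: 0.0 for lang in counts}
    return {lang: 100.0 * n / total for lang, n in counts.items()}

def is_bilingual_mixed(text: str, threshold: float = 30.0) -> bool:
    """True when exactly two languages score at or above the threshold."""
    scores = language_scores(text)
    return sum(1 for s in scores.values() if s >= threshold) == 2
```

Character share favors alphabetic languages, since a Chinese sentence usually needs fewer characters than its English translation; a production system would more likely take the scores from a dedicated language identification model.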
In order to make the bilingual corpus extracted from the bilingual mixed-caption file more accurate, the first embodiment of the present application may further include: and preprocessing the original subtitle file to obtain a preprocessed subtitle file.
In that case, screening out the bilingual mixed subtitle file from the original subtitle file comprises:
screening out the bilingual mixed subtitle file from the preprocessed subtitle files.
Preprocessing the original subtitle file includes: removing noise data contained in the original subtitle file, such as speaker information, subtitle source information, time axis information, and other non-essential information; performing language identification on the text of the original subtitle file, that is, identifying the language of a given piece of text; acquiring the file name of the original subtitle file; converting the text contained in the original subtitle file between Traditional and Simplified Chinese; deleting original subtitle files that do not match a preset language direction (for example, if the platform does not support the Japanese-to-Chinese direction, original subtitle files containing Japanese and Chinese are deleted); and acquiring file identification information of the original subtitle file.
Still using the above example, after the original subtitle file is preprocessed, the preprocessed subtitle file is obtained as follows:
dog is X
"Y is the one with the tail."
O me want to see this
God,I want to see this.
Find the desired mollewis
Finding everything okay,louis?
...
Hiccup is the bed bottom of I's sleep in Jie xi Ka
Uh,it's just...I found some tapes
Some video discs are found
Under jessica's side of the bed.
All of which are half-nude actors
Shirtless-men tapes.
Wherein, Y is English translation of X.
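The step from the raw example to the preprocessed example above amounts to dropping the row numbers and time axis lines. A minimal sketch of that one preprocessing step follows; the function name `strip_structure` is hypothetical, and the other preprocessing steps (speaker and source removal, Traditional/Simplified conversion, language direction filtering) are not covered.

```python
import re
from typing import List

# Time axis line such as "00:00:06,673-->00:00:09,541".
TIMELINE = re.compile(r"\d{2}:\d{2}:\d{2},\d{3}\s*-->\s*\d{2}:\d{2}:\d{2},\d{3}")

def strip_structure(raw: str) -> List[str]:
    """Drop blank rows, time axis rows, and the row number preceding a time axis row."""
    rows = [r.strip() for r in raw.splitlines()]
    kept: List[str] = []
    for i, row in enumerate(rows):
        if not row or TIMELINE.fullmatch(row):
            continue
        # An all-digit row is treated as a row number only if a time axis row follows it,
        # so that subtitle text consisting of digits is preserved.
        nxt = rows[i + 1] if i + 1 < len(rows) else ""
        if row.isdigit() and TIMELINE.fullmatch(nxt):
            continue
        kept.append(row)
    return kept
```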
As shown in fig. 1, in step S102, subtitle texts with aligned first language subtitle texts and second language subtitle texts are generated according to the bilingual mixed subtitle file.
The generating of the subtitle text in which the first language subtitle text and the second language subtitle text are aligned according to the bilingual mixed subtitle file includes:
generating, according to the bilingual mixed subtitle file, a subtitle text pair in which the first language subtitle text and the second language subtitle text are aligned; wherein the subtitle text pair includes a first language subtitle text file and a second language subtitle text file.
The aligned caption text pair may refer to two caption text files of different languages generated from the same bilingual mixed caption file. For example, if a bilingual mixed-caption file contains a chinese text and an english text, the aligned pair of caption texts may contain a chinese caption text file and an english caption text file, both of which are generated from the same bilingual mixed-caption file.
The alignment granularity of the aligned subtitle text may include: short sentence, long sentence, paragraph. I.e. the aligned subtitle text may be aligned in short sentence alignment, long sentence alignment, or paragraph alignment.
The process of generating, from the bilingual mixed subtitle file, a subtitle text pair in which the first language subtitle text and the second language subtitle text are aligned is described below with reference to fig. 2; see steps S102-1 to S102-3.
As shown in fig. 2, in step S102-1, the bilingual mixed subtitle file is subjected to line splitting processing, and a line record is generated for each line text in the bilingual mixed subtitle file.
The line record information may include: identification information of the bilingual mixed subtitle file to which the line record belongs; the line text information; and line number information of the line record.
In one implementation, when the bilingual mixed subtitle file is split into lines to generate the line records, the Map operation of MapReduce may be used to perform the line splitting, with the file identification information of the bilingual mixed subtitle file containing the line text as the key, and to generate the line number information of each line record (that is, the line number of the record's line text within the bilingual mixed subtitle file). The format of the generated line record may include: uid (file identification information of the bilingual mixed subtitle file containing the line text), line number of the line record, line text, media number, subtitle file name, and so on.
Still continuing with the above example, the result of performing line division processing on the bilingual mixed-caption file is:
1. dog is X
2. "Y is the one with the tail."
3. O me want to see this
4. God,I want to see this.
5. Find the desired mollewis
6. Finding everything okay,louis?
7. Hiccup is the bed bottom of I's sleep in Jie xi Ka
8. Uh,it's just...I found some tapes
9. Some video discs are found
10. Under jessica's side of the bed.
11. All of which are half-nude actors
12. Shirtless-men tapes.
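The line splitting step can be pictured as a map function that emits one keyed record per line. In the sketch below a plain Python generator stands in for the Map operation of a MapReduce framework, which is an assumption for illustration; `LineRecord` and `map_lines` are hypothetical names, and only the uid, line number, and line text fields of the record format are modeled.

```python
from typing import Iterator, List, NamedTuple, Tuple

class LineRecord(NamedTuple):
    uid: str      # identifier of the bilingual mixed subtitle file
    line_no: int  # 1-based line number within that file
    text: str     # the line text

def map_lines(uid: str, lines: List[str]) -> Iterator[Tuple[str, LineRecord]]:
    """Emit one (key, record) pair per non-empty line, keyed by the file uid."""
    line_no = 0
    for text in lines:
        text = text.strip()
        if not text:
            continue  # blank lines get no record and no line number
        line_no += 1
        yield uid, LineRecord(uid, line_no, text)
```

Keying every record by the file uid is what lets a later reduce step see all lines of one file together.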
As shown in fig. 2, in step S102-2, the line texts in the line record are merged according to the language information of the line texts in the line record.
The merging the line texts in the line records according to the language information of the line texts in the line records includes:
storing the line texts in the line records of the first language belonging to the same bilingual mixed subtitle file in a first array according to the language information of the line texts in the line records;
and storing the line texts in the line records of the second language belonging to the same bilingual mixed subtitle file in a second array according to the language information of the line texts in the line records.
Because the key generated in the map stage is the file identification information of the bilingual mixed subtitle file containing the line text, each reducer in the reduce stage processes the lines that share the same file identification information. Arrays can be used to store the line texts of the two languages together with their line numbers. After all lines under the same file identification information have been processed, the line texts in the arrays are sorted by line number to generate the aligned text pair.
As an implementation manner, the first embodiment of the present application may further include: and determining the language to which the line text in the line record belongs according to the content of the line text in the line record.
The determining the language to which the line text in the line record belongs according to the content of the line text in the line record includes:
obtaining the language to which the content contained in the line text belongs according to the content of the line text in the line record;
and taking the language with the highest score in the languages to which the contents contained in the line text belong as the language of the line text in the line record.
For example, if the languages detected for a line text are Chinese and other languages, and Chinese has the highest score, the language of the line text in the line record is determined to be Chinese.
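Choosing the highest-scoring language for a line can be sketched as below. The scoring itself is an assumption, since the application does not specify it: each language is scored by a character count, with CJK ideographs counted as Chinese and ASCII letters as English. `score_line` and `line_language` are hypothetical names.

```python
from typing import Dict

def score_line(text: str) -> Dict[str, int]:
    """Count characters per language (assumed scoring: zh = CJK ideographs, en = ASCII letters)."""
    scores = {"zh": 0, "en": 0}
    for ch in text:
        if "\u4e00" <= ch <= "\u9fff":
            scores["zh"] += 1
        elif ch.isascii() and ch.isalpha():
            scores["en"] += 1
    return scores

def line_language(text: str) -> str:
    """Return the highest-scoring language, or 'und' when no language scores at all."""
    scores = score_line(text)
    best = max(scores, key=scores.get)  # on a tie, the first key ("zh") wins
    return best if scores[best] > 0 else "und"
```

A line that mixes a few Latin letters into mostly Chinese text is still labeled Chinese under this rule, which matches the "highest score" criterion described above.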
Still using the previous example, the language of the line text in each line record is determined as follows:
1. dog is X --> the language is Chinese
2. "Y is the one with the tail." --> the language is English
3. O me want to see this --> the language is Chinese
4. God,I want to see this. --> the language is English
5. Find the desired mollewis --> the language is Chinese
6. Finding everything okay,louis? --> the language is English
7. Hiccup is the bed bottom of I's sleep in Jie xi Ka --> the language is Chinese
8. Uh,it's just...I found some tapes --> the language is English
9. Some video discs are found --> the language is Chinese
10. Under jessica's side of the bed. --> the language is English
11. All of which are half-nude actors --> the language is Chinese
12. Shirtless-men tapes. --> the language is English
After the language information of the line text in each line record has been determined, the line texts in the line records are merged according to that language information.
As shown in fig. 2, in step S102-3, the merged line text is subjected to a sorting process to generate an aligned pair of subtitle texts.
The sorting the combined line texts to generate aligned text pairs includes:
respectively sequencing the line texts in the first array and the line texts in the second array according to the line numbers of the line texts;
generating a first language subtitle text file according to the sorted first array, and generating a second language subtitle text file according to the sorted second array;
and combining the first language subtitle text file and the second language subtitle text file to generate an aligned text pair.
Because line texts stored in the arrays under the MapReduce mechanism are generally unordered, the line texts in each array need to be sorted by line number to generate the aligned text pair. After the line texts in the arrays are sorted, a first language subtitle text file can be generated from the sorted first array and a second language subtitle text file from the sorted second array; the two text files form an aligned text pair.
Still following the previous example, aligned text pairs are generated as follows:
Chinese text (first language subtitle text file):
dog is X
O me want to see this
Find the desired mollewis
Hiccup is the bed bottom of I's sleep in Jie xi Ka
Some video discs are found
All of which are half-nude actors
English text (second language subtitle text file):
"Y is the one with the tail."
God,I want to see this.
Finding everything okay,louis?
Uh,it's just...I found some tapes
Under jessica's side of the bed.
Shirtless-men tapes.
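The merge and sort steps that produce the two files above can be sketched as a single reduce-style function over the records of one file. This is an illustrative sketch under assumptions: `reduce_file` is a hypothetical name, the language detector is passed in by the caller, and records are simplified to (line number, text) tuples.

```python
from typing import Callable, Iterable, List, Tuple

def reduce_file(
    records: Iterable[Tuple[int, str]],
    detect: Callable[[str], str],
    first_lang: str = "zh",
    second_lang: str = "en",
) -> Tuple[List[str], List[str]]:
    """Split one file's (line_no, text) records by language and sort by line number."""
    first: List[Tuple[int, str]] = []
    second: List[Tuple[int, str]] = []
    for line_no, text in records:  # records may arrive in any order
        lang = detect(text)
        if lang == first_lang:
            first.append((line_no, text))
        elif lang == second_lang:
            second.append((line_no, text))
        # Lines in any other language are ignored in this sketch.
    first.sort(key=lambda r: r[0])
    second.sort(key=lambda r: r[0])
    return [t for _, t in first], [t for _, t in second]
```

The two returned lists correspond to the first language subtitle text file and the second language subtitle text file of the aligned text pair.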
As shown in fig. 1, in step S103, sentence pair extraction is performed according to the aligned subtitle text, and the extracted sentence pairs are used as the bilingual corpus.
Performing sentence pair extraction according to the aligned subtitle text here means performing sentence pair extraction according to the aligned text pair.
The sentence pair extraction according to the aligned text pair refers to extracting aligned sentences from two caption text files contained in the aligned text pair, namely extracting sentences with the same order from the first language caption text file and the second language caption text file respectively, and forming one sentence pair by the two extracted sentences with the same order. For example, a first sentence is extracted from the first language subtitle text file, and a first sentence is also extracted from the second language subtitle text file, and the two sentences form a sentence pair.
The bilingual corpus refers to texts written in two different languages and having a translation relationship with each other.
Still continuing with the previous example, the extracted pairs of sentences are as follows:
Sentence pair one: dog is X / "Y is the one with the tail."
Sentence pair two: O me want to see this / God,I want to see this.
Sentence pair three: Find the desired mollewis / Finding everything okay,louis?
Sentence pair four: Hiccup is the bed bottom of I's sleep in Jie xi Ka Some video discs are found All of which are half-nude actors / Uh,it's just...I found some tapes Under jessica's side of the bed. Shirtless-men tapes.
The sentence pair extraction according to the aligned caption text and taking the extracted sentence pair as bilingual corpus comprises:
extracting sentence pairs from the aligned caption texts;
judging whether the difference of the time axis information of two sentences contained in the sentence pair is greater than a time axis difference threshold value or not, and if so, deleting the sentence pair;
if not, calculating the similarity of the two sentences contained in the sentence pair, and taking the sentence pair with the similarity larger than a preset similarity threshold value as the bilingual corpus.
Because extracted sentence pairs are sometimes inaccurate if used directly as the bilingual corpus, the sentence pairs can be filtered using structural information such as time axis information in order to improve the quality of the bilingual corpus and ensure that it consists of aligned sentence pairs.
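The two-stage filter described above (time axis difference first, then similarity) can be sketched as follows. The application does not specify how the time axis difference or the similarity is computed, so both are assumptions here: the difference is taken between sentence start times in milliseconds, the similarity function is supplied by the caller, and the default thresholds are arbitrary. `Sentence`, `SentencePair`, and `filter_pairs` are hypothetical names.

```python
from typing import Callable, List, NamedTuple

class Sentence(NamedTuple):
    text: str
    start_ms: int  # start time of the sentence on the time axis, in milliseconds

class SentencePair(NamedTuple):
    first: Sentence
    second: Sentence

def filter_pairs(
    pairs: List[SentencePair],
    similarity: Callable[[str, str], float],
    max_time_diff_ms: int = 500,   # assumed time axis difference threshold
    min_similarity: float = 0.5,   # assumed similarity threshold
) -> List[SentencePair]:
    """Keep pairs that pass the time axis check and then the similarity check."""
    kept: List[SentencePair] = []
    for pair in pairs:
        # Step 1: delete the pair if the time axis difference exceeds the threshold.
        if abs(pair.first.start_ms - pair.second.start_ms) > max_time_diff_ms:
            continue
        # Step 2: keep the pair only if its similarity exceeds the threshold.
        if similarity(pair.first.text, pair.second.text) > min_similarity:
            kept.append(pair)
    return kept
```

Running the cheap time axis check first means the similarity function, which for cross-lingual text is typically the more expensive computation, is only evaluated on pairs that survive the first check.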
An example of generating a bilingual corpus from original subtitle files using the method of the first embodiment of the present application is described below with reference to fig. 3. As shown in fig. 3, in step S301, subtitle preprocessing is performed on the original subtitle file, including removing invalid subtitle content, identifying the text language, obtaining the file name, converting Traditional Chinese to Simplified Chinese, and filtering by language direction. In step S302, a mixed subtitle alignment operation is performed to generate text pairs, including the following sub-steps: screening the bilingual mixed subtitles, splitting the bilingual mixed subtitles into lines, identifying the language of each line, merging the line texts by language, and generating text pairs (subtitle pair 1 ... subtitle pair n). In step S303, sentence alignment processing is performed, including the sub-steps of sentence pair extraction, sentence pair filtering, and sentence pair scoring. The bilingual corpus is generated through steps S301 to S303.
This concludes the description of the first embodiment of the present application, which provides a big data processing scheme using MapReduce that combines NLP (natural language processing) with an alignment method based on structured information to extract a bilingual corpus from bilingual mixed subtitle files. According to the first embodiment, aligned subtitle text is obtained from the bilingual mixed subtitle file, sentence pairs are extracted from the aligned subtitle text, and the sentence pairs are filtered and used as the bilingual corpus. The bilingual corpus is thus obtained from the bilingual mixed subtitle file, the "one-to-one", "one-to-many", "many-to-one", and "many-to-many" corpus alignment scenarios are handled, and the generated bilingual corpus is of higher quality than a bilingual corpus obtained from monolingual subtitle files.
Corresponding to the method for generating bilingual corpus provided in the first embodiment of the present application, a second embodiment of the present application further provides a device for generating bilingual corpus.
As shown in fig. 4, the apparatus for generating bilingual corpus includes:
a bilingual mixed-caption file obtaining unit 401, configured to obtain a bilingual mixed-caption file; the bilingual mixed subtitle file comprises a text in a first language and a text in a second language;
an aligned caption text generating unit 402, configured to generate a caption text in which the first language caption text and the second language caption text are aligned according to the bilingual mixed caption file;
a bilingual corpus generating unit 403, configured to perform sentence pair extraction according to the aligned subtitle text, and use the extracted sentence pair as a bilingual corpus.
Optionally, the aligned subtitle text generating unit is specifically configured to:
generating a subtitle text pair aligned with the first language subtitle text and the second language subtitle text according to the bilingual mixed subtitle file; wherein the subtitle text pair includes a first language subtitle text file and a second language subtitle text file.
Optionally, the bilingual mixed-caption file obtaining unit is specifically configured to:
obtaining an original subtitle file;
and screening out bilingual mixed subtitle files from the original subtitle files.
Optionally, the bilingual mixed-caption file obtaining unit is specifically configured to:
if the original subtitle file contains two languages whose score values are each greater than or equal to a preset language score threshold, treating the original subtitle file as a bilingual mixed subtitle file.
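The screening rule above can be sketched as follows. This is only an illustration: the text does not define how a language's score value is computed, so the sketch assumes the score is the share of lines identified as that language, and the names `language_scores`, `is_bilingual_mixed` and the threshold of 0.2 are hypothetical.

```python
from collections import Counter


def language_scores(line_langs):
    """Return each language's share of the identified lines (an assumed scoring rule)."""
    counts = Counter(line_langs)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {lang: n / total for lang, n in counts.items()}


def is_bilingual_mixed(line_langs, threshold=0.2):
    """True when exactly two languages score at or above the threshold.

    Requiring exactly two qualifying languages is an assumption of this sketch.
    """
    scores = language_scores(line_langs)
    qualifying = [lang for lang, score in scores.items() if score >= threshold]
    return len(qualifying) == 2
```

A file with a handful of stray lines in a second language stays below the threshold and is therefore not treated as bilingual mixed.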
Optionally, the apparatus further comprises:
a preprocessing unit, configured to preprocess the original subtitle file to obtain a preprocessed subtitle file;
the bilingual mixed-caption file obtaining unit is specifically configured to:
and screening out bilingual mixed subtitle files from the preprocessed subtitle files.
Optionally, the preprocessing unit is specifically configured to:
removing impurity data contained in the original subtitle file;
performing language identification on the text of the original subtitle file;
acquiring a file name of an original subtitle file;
performing traditional-to-simplified Chinese conversion on the text contained in the original subtitle file;
and deleting the original subtitle file which does not accord with the preset language direction.
Optionally, the aligned subtitle text generating unit is specifically configured to:
performing line division processing on the bilingual mixed subtitle file, and generating a line record for each line text in the bilingual mixed subtitle file;
merging the line texts in the line records according to the language information of the line texts in the line records;
and sequencing the combined line texts to generate aligned caption text pairs.
Optionally, the aligned subtitle text generating unit is specifically configured to:
storing the line texts in the line records of the first language belonging to the same bilingual mixed subtitle file in a first array according to the language information of the line texts in the line records;
and storing the line texts in the line records of the second language belonging to the same bilingual mixed subtitle file in a second array according to the language information of the line texts in the line records.
Optionally, the aligned subtitle text generating unit is specifically configured to:
respectively sequencing the line texts in the first array and the line texts in the second array according to the line numbers of the line texts;
generating a first language subtitle text file according to the sorted first array, and generating a second language subtitle text file according to the sorted second array;
and combining the first language subtitle text file and the second language subtitle text file to generate an aligned text pair.
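The merge-and-sort steps described above can be sketched as follows, assuming the line records of one bilingual mixed subtitle file have already been collected and language-tagged. The `LineRecord` type, the `build_text_pair` function and the language codes are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LineRecord:
    file_id: str   # identifies the bilingual mixed subtitle file
    line_no: int   # position of the line in that file
    text: str      # the line text
    lang: str      # language identified for the line


def build_text_pair(records, first_lang, second_lang):
    """Split one file's records into two arrays by language, sort each by line number,
    and return the pair (first-language text, second-language text)."""
    first, second = [], []
    for rec in records:
        if rec.lang == first_lang:
            first.append(rec)
        elif rec.lang == second_lang:
            second.append(rec)
    first.sort(key=lambda r: r.line_no)
    second.sort(key=lambda r: r.line_no)
    return (
        "\n".join(r.text for r in first),
        "\n".join(r.text for r in second),
    )
```

Sorting by line number restores the original order even if the records arrive shuffled, as they may after a distributed grouping step.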
Optionally, the apparatus further comprises: a language determining unit, configured to determine the language to which the line text in the line record belongs according to the content of the line text in the line record.
Optionally, the language determining unit is specifically configured to:
obtaining the language to which the content contained in the line text belongs according to the content of the line text in the line record;
and taking the language with the highest score in the languages to which the contents contained in the line text belong as the language of the line text in the line record.
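As a rough illustration of choosing the highest-scoring language for a line, the sketch below scores a line by counting characters per script. This heuristic is a stand-in, not the language identifier used in the application; it only distinguishes Chinese from Latin-script text, and the function names are hypothetical.

```python
def script_scores(text):
    """Score candidate languages by the share of CJK ideographs versus ASCII letters."""
    cjk = sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")
    latin = sum(1 for ch in text if ch.isascii() and ch.isalpha())
    total = cjk + latin
    if total == 0:
        return {}  # digits or punctuation only: no language evidence
    return {"zh": cjk / total, "en": latin / total}


def line_language(text):
    """Return the language with the highest score, or None if nothing was scored."""
    scores = script_scores(text)
    if not scores:
        return None
    return max(scores, key=scores.get)
```

A line such as "OK 我们走吧" contains both scripts; the Chinese share is larger, so the whole line is labelled as Chinese.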
Optionally, the information in the line record includes:
identification information of the bilingual mixed subtitle file to which the line record belongs;
text information;
and line number information of the line record.
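Carrying the file identifier and the line number in every line record is what lets lines from many files be processed together and regrouped afterwards, as in the MapReduce scheme mentioned for the first embodiment. The following single-process sketch imitates a map step and a grouping step; the function names are hypothetical and no MapReduce framework is used.

```python
from collections import defaultdict


def map_lines(file_id, content):
    """Map step: emit (file_id, (line_no, text)) for every non-empty line."""
    for line_no, text in enumerate(content.splitlines(), start=1):
        if text.strip():
            yield file_id, (line_no, text.strip())


def group_by_file(pairs):
    """Reduce step: collect records per file and order them by line number."""
    grouped = defaultdict(list)
    for file_id, record in pairs:
        grouped[file_id].append(record)
    return {file_id: sorted(records) for file_id, records in grouped.items()}
```

Because each record is keyed by its file identifier, the order in which records reach the grouping step does not matter.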
Optionally, the bilingual corpus generating unit is specifically configured to:
extracting sentence pairs from the aligned caption texts;
and judging whether the difference of the time axis information of the two sentences contained in the sentence pair is greater than a time axis difference threshold value, if so, deleting the sentence pair.
Optionally, the bilingual corpus generating unit is further configured to:
if not, calculating the similarity of the two sentences contained in the sentence pair, and taking the sentence pair with the similarity larger than a preset similarity threshold value as the bilingual corpus.
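The two-stage filter (time axis difference first, similarity second) can be sketched as below. The application does not specify the similarity measure here, so the similarity function is passed in as a parameter; `length_ratio` is only a toy placeholder, and the threshold values and type names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TimedSentence:
    text: str
    start: float  # start time on the time axis, in seconds


def length_ratio(a, b):
    """Toy similarity: ratio of the shorter length to the longer one (placeholder only)."""
    if not a or not b:
        return 0.0
    return min(len(a), len(b)) / max(len(a), len(b))


def filter_pairs(pairs, similarity, max_time_gap=1.0, min_similarity=0.5):
    """Drop pairs whose time axis difference exceeds max_time_gap, then keep
    only the remaining pairs whose similarity is above min_similarity."""
    kept = []
    for src, tgt in pairs:
        if abs(src.start - tgt.start) > max_time_gap:
            continue  # time axis difference too large: delete the pair
        if similarity(src.text, tgt.text) > min_similarity:
            kept.append((src.text, tgt.text))
    return kept
```

In practice the similarity would more plausibly come from a translation or embedding model; the cheap time axis check runs first so that the more expensive comparison is applied only to plausible pairs.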
It should be noted that, for the detailed description of the apparatus provided in the second embodiment of the present application, reference may be made to the related description of the first embodiment of the present application, and details are not described here again.
Corresponding to the method for generating bilingual corpus provided in the first embodiment of the present application, a third embodiment of the present application further provides an electronic device.
As shown in fig. 5, the electronic device includes:
a processor 501; and
a memory 502 for storing a program of a bilingual corpus generating method, wherein the following steps are executed after the device is powered on and the program of the bilingual corpus generating method is executed by the processor:
obtaining a bilingual mixed subtitle file; the bilingual mixed subtitle file comprises a first language subtitle text and a second language subtitle text;
generating a subtitle text aligned with the first language subtitle text and the second language subtitle text according to the bilingual mixed subtitle file;
and extracting sentence pairs according to the aligned subtitle text, and using the extracted sentence pairs as the bilingual corpus.
Optionally, the generating of the subtitle text aligned between the first language subtitle text and the second language subtitle text according to the bilingual mixed subtitle file includes:
generating a subtitle text pair aligned with the first language subtitle text and the second language subtitle text according to the bilingual mixed subtitle file; wherein the subtitle text pair includes a first language subtitle text file and a second language subtitle text file.
Optionally, the obtaining the bilingual mixed subtitle file includes:
obtaining an original subtitle file;
and screening out bilingual mixed subtitle files from the original subtitle files.
Optionally, the screening of the bilingual mixed subtitle file from the original subtitle file includes:
if the original subtitle file contains two languages whose score values are each greater than or equal to a preset language score threshold, treating the original subtitle file as a bilingual mixed subtitle file.
Optionally, the electronic device further performs the following steps:
preprocessing the original subtitle file to obtain a preprocessed subtitle file;
the method for screening out the bilingual mixed subtitle file from the original subtitle file comprises the following steps:
and screening out bilingual mixed subtitle files from the preprocessed subtitle files.
Optionally, the preprocessing the original subtitle file includes at least one of the following processing:
removing impurity data contained in the original subtitle file;
performing language identification on the text of the original subtitle file;
acquiring a file name of an original subtitle file;
performing traditional-to-simplified Chinese conversion on the text contained in the original subtitle file;
and deleting the original subtitle file which does not accord with the preset language direction.
Optionally, the generating a caption text pair in which the first language caption text and the second language caption text are aligned according to the bilingual mixed caption file includes:
performing line division processing on the bilingual mixed subtitle file, and generating a line record for each line text in the bilingual mixed subtitle file;
merging the line texts in the line records according to the language information of the line texts in the line records;
and sequencing the combined line texts to generate aligned caption text pairs.
Optionally, the merging the line texts in the line record according to the language information of the line texts in the line record includes:
storing the line texts in the line records of the first language belonging to the same bilingual mixed subtitle file in a first array according to the language information of the line texts in the line records;
and storing the line texts in the line records of the second language belonging to the same bilingual mixed subtitle file in a second array according to the language information of the line texts in the line records.
Optionally, the sorting the combined line texts to generate aligned text pairs includes:
respectively sequencing the line texts in the first array and the line texts in the second array according to the line numbers of the line texts;
generating a first language subtitle text file according to the sorted first array, and generating a second language subtitle text file according to the sorted second array;
and combining the first language subtitle text file and the second language subtitle text file to generate an aligned text pair.
Optionally, the electronic device further performs the following step: determining the language to which the line text in the line record belongs according to the content of the line text in the line record.
Optionally, the determining, according to the content of the line text in the line record, the language to which the line text in the line record belongs includes:
obtaining the language to which the content contained in the line text belongs according to the content of the line text in the line record;
and taking the language with the highest score in the languages to which the contents contained in the line text belong as the language of the line text in the line record.
Optionally, the information in the line record includes:
identification information of the bilingual mixed subtitle file to which the line record belongs;
text information;
and line number information of the line record.
Optionally, the performing sentence pair extraction according to the aligned subtitle text, and using the extracted sentence pair as a bilingual corpus includes:
extracting sentence pairs from the aligned caption texts;
and judging whether the difference of the time axis information of the two sentences contained in the sentence pair is greater than a time axis difference threshold value, if so, deleting the sentence pair.
Optionally, the electronic device further performs the following operations:
if not, calculating the similarity of the two sentences contained in the sentence pair, and taking the sentence pair with the similarity larger than a preset similarity threshold value as the bilingual corpus.
It should be noted that, for the detailed description of the electronic device provided in the third embodiment of the present application, reference may be made to the related description of the first embodiment of the present application, and details are not repeated here.
Corresponding to the bilingual corpus generating method provided in the first embodiment of the present application, a fourth embodiment of the present application further provides a storage device, in which a program of the bilingual corpus generating method is stored, where the program is executed by a processor to perform the following steps:
obtaining a bilingual mixed subtitle file; the bilingual mixed subtitle file comprises a first language subtitle text and a second language subtitle text;
generating a subtitle text aligned with the first language subtitle text and the second language subtitle text according to the bilingual mixed subtitle file;
and extracting sentence pairs according to the aligned subtitle text, and using the extracted sentence pairs as the bilingual corpus.
It should be noted that, for the detailed description of the storage device provided in the fourth embodiment of the present application, reference may be made to the related description of the first embodiment of the present application, and details are not described here again.
A fifth embodiment of the present application provides a text processing system, which includes a subtitle file preprocessing module, a bilingual mixed subtitle file screening module, a subtitle text pair generating module, and a bilingual corpus generating module.
The subtitle file preprocessing module is configured to preprocess the original subtitle file to obtain a preprocessed subtitle file.
The preprocessing of the original subtitle file includes: removing impurity data contained in the original subtitle file, that is, non-essential information such as the speaker, the subtitle source, and the time axis; performing language identification on the text of the original subtitle file, where language identification means determining the language in which a given piece of text is written; acquiring the file name of the original subtitle file; performing traditional-to-simplified Chinese conversion on the text contained in the original subtitle file; deleting original subtitle files that do not conform to the preset language direction, for example, if the platform does not support the Japanese-to-Chinese direction, deleting original subtitle files containing Japanese and Chinese; and acquiring file identification information of the original subtitle file, and the like.
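The impurity-removal step might look like the following sketch, which assumes SRT-style input (a numeric cue index, a timestamp line, then text lines). The subtitle format, the regular expressions and the function name `strip_noise` are assumptions made for illustration; speaker labels and subtitle-source credits are not handled here.

```python
import re

# Assumed SRT-style timestamp line, e.g. "00:00:01,000 --> 00:00:02,500".
TIMESTAMP = re.compile(
    r"^\d{2}:\d{2}:\d{2}[,.]\d{3}\s*-->\s*\d{2}:\d{2}:\d{2}[,.]\d{3}"
)
INDEX = re.compile(r"^\d+$")               # cue index lines such as "1", "2"
TAG = re.compile(r"<[^>]+>|\{[^}]*\}")     # markup such as <i>...</i> or {\an8}


def strip_noise(raw):
    """Return only the subtitle text lines, in their original order."""
    kept = []
    for line in raw.splitlines():
        line = TAG.sub("", line).strip()
        # Note: a subtitle line consisting only of digits is also dropped by INDEX.
        if not line or INDEX.match(line) or TIMESTAMP.match(line):
            continue
        kept.append(line)
    return kept
```

The output is a flat list of text lines, which is the form the later line-division and language-identification steps operate on.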
The bilingual mixed subtitle file screening module is configured to screen out bilingual mixed subtitle files from the preprocessed subtitle files.
The subtitle text pair generating module is configured to generate, according to the bilingual mixed subtitle file, a subtitle text pair in which the first language subtitle text and the second language subtitle text are aligned.
For a specific implementation of generating the subtitle text pair in which the first language subtitle text and the second language subtitle text are aligned, reference may be made to the related description of the first embodiment of the present application.
The bilingual corpus generation module is configured to extract sentence pairs according to the aligned subtitle texts and to use the extracted sentence pairs as the bilingual corpus.
The bilingual corpus generating module is specifically configured to:
extracting sentence pairs from the aligned caption text pairs;
judging whether the difference of the time axis information of two sentences contained in the sentence pair is greater than a time axis difference threshold value or not, and if so, deleting the sentence pair;
if not, calculating the similarity of the two sentences contained in the sentence pair, and taking the sentence pair with the similarity larger than a preset similarity threshold value as the bilingual corpus.
It should be noted that, for the detailed description of the text processing system provided in the fifth embodiment of the present application, reference may be made to the related description of the first embodiment of the present application, and details are not described here again.
A sixth embodiment of the present application provides a method for processing a subtitle file, including:
acquiring a bilingual mixed subtitle file sent by a client; the bilingual mixed subtitle file comprises a first language subtitle text and a second language subtitle text;
generating a subtitle text aligned with the first language subtitle text and the second language subtitle text according to the bilingual mixed subtitle file;
sentence pair extraction is carried out according to the aligned caption texts, and the extracted sentence pairs are used as bilingual corpus;
and returning the bilingual corpus to the client.
The client may be a client for playing films or television programs, or a client for intelligent translation. The subtitle file processing method provided in the sixth embodiment of the present application processes the bilingual mixed subtitle file sent by the client, so that the client obtains a high-quality aligned bilingual corpus.
Although the present application has been described with reference to preferred embodiments, these embodiments are not intended to limit the present application. Those skilled in the art can make variations and modifications without departing from the spirit and scope of the present application; therefore, the scope of protection of the present application should be determined by the claims.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

Claims (21)

1. A method for generating bilingual corpus, comprising:
obtaining a bilingual mixed subtitle file; the bilingual mixed subtitle file comprises a first language subtitle text and a second language subtitle text;
generating a subtitle text aligned with the first language subtitle text and the second language subtitle text according to the bilingual mixed subtitle file;
and extracting sentence pairs according to the aligned subtitle text, and using the extracted sentence pairs as the bilingual corpus.
2. The method of claim 1, wherein generating aligned subtitle text for a first language subtitle text and a second language subtitle text from the bilingual mixed subtitle file comprises:
generating a subtitle text pair aligned with the first language subtitle text and the second language subtitle text according to the bilingual mixed subtitle file; wherein the subtitle text pair includes a first language subtitle text file and a second language subtitle text file.
3. The method of claim 1, wherein obtaining the bilingual mixed-caption file comprises:
obtaining an original subtitle file;
and screening out bilingual mixed subtitle files from the original subtitle files.
4. The method of claim 3, wherein the screening of the bilingual mixture subtitle file from the original subtitle file comprises:
if the original subtitle file contains two languages whose score values are each greater than or equal to a preset language score threshold, treating the original subtitle file as a bilingual mixed subtitle file.
5. The method of claim 3, further comprising:
preprocessing the original subtitle file to obtain a preprocessed subtitle file;
the method for screening out the bilingual mixed subtitle file from the original subtitle file comprises the following steps:
and screening out bilingual mixed subtitle files from the preprocessed subtitle files.
6. The method of claim 5, wherein the pre-processing of the original subtitle file comprises at least one of:
removing impurity data contained in the original subtitle file;
performing language identification on the text of the original subtitle file;
acquiring a file name of an original subtitle file;
performing traditional-to-simplified Chinese conversion on the text contained in the original subtitle file;
and deleting the original subtitle file which does not accord with the preset language direction.
7. The method of claim 2, wherein generating aligned subtitle text pairs for a first language subtitle text and a second language subtitle text from the bilingual mixed subtitle file comprises:
performing line division processing on the bilingual mixed subtitle file, and generating a line record for each line text in the bilingual mixed subtitle file;
merging the line texts in the line records according to the language information of the line texts in the line records;
and sequencing the combined line texts to generate aligned caption text pairs.
8. The method according to claim 7, wherein the merging the line texts in the line records according to the language information of the line texts in the line records comprises:
storing the line texts in the line records of the first language belonging to the same bilingual mixed subtitle file in a first array according to the language information of the line texts in the line records;
and storing the line texts in the line records of the second language belonging to the same bilingual mixed subtitle file in a second array according to the language information of the line texts in the line records.
9. The method of claim 8, wherein the sorting the merged line text to generate aligned text pairs comprises:
respectively sequencing the line texts in the first array and the line texts in the second array according to the line numbers of the line texts;
generating a first language subtitle text file according to the sorted first array, and generating a second language subtitle text file according to the sorted second array;
and combining the first language subtitle text file and the second language subtitle text file to generate an aligned text pair.
10. The method of claim 7, further comprising: determining the language to which the line text in the line record belongs according to the content of the line text in the line record.
11. The method according to claim 10, wherein the determining the language to which the line text in the line record belongs according to the content of the line text in the line record comprises:
obtaining the language to which the content contained in the line text belongs according to the content of the line text in the line record;
and taking the language with the highest score in the languages to which the contents contained in the line text belong as the language of the line text in the line record.
12. The method of claim 7, wherein the information in the line record comprises:
identification information of the bilingual mixed subtitle file to which the line record belongs;
text information;
and line number information of the line record.
13. The method of claim 1, wherein the extracting sentence pairs according to the aligned caption text and using the extracted sentence pairs as bilingual corpus comprises:
extracting sentence pairs from the aligned caption texts;
and judging whether the difference of the time axis information of the two sentences contained in the sentence pair is greater than a time axis difference threshold value, if so, deleting the sentence pair.
14. The method of claim 13, further comprising:
if not, calculating the similarity of the two sentences contained in the sentence pair, and taking the sentence pair with the similarity larger than a preset similarity threshold value as the bilingual corpus.
15. The method of claim 1, wherein the alignment granularity of the aligned subtitle text comprises at least one of:
short sentence, long sentence, paragraph.
16. An apparatus for generating bilingual corpus, comprising:
a bilingual mixed subtitle file obtaining unit for obtaining a bilingual mixed subtitle file; the bilingual mixed subtitle file comprises a text in a first language and a text in a second language;
the aligned caption text generating unit is used for generating a caption text aligned with the first language caption text and the second language caption text according to the bilingual mixed caption file;
and the bilingual corpus generating unit is used for extracting sentence pairs according to the aligned caption texts and taking the extracted sentence pairs as bilingual corpora.
17. An electronic device, comprising:
a processor; and
a memory for storing a program of a bilingual corpus generating method, wherein after the device is powered on and the program of the bilingual corpus generating method is run by the processor, the following steps are performed:
obtaining a bilingual mixed subtitle file; the bilingual mixed subtitle file comprises a first language subtitle text and a second language subtitle text;
generating a subtitle text aligned with the first language subtitle text and the second language subtitle text according to the bilingual mixed subtitle file;
and extracting sentence pairs according to the aligned subtitle text, and using the extracted sentence pairs as the bilingual corpus.
18. A storage device, characterized in that
it stores a program of a method for generating bilingual corpus, the program being executed by a processor to perform the following steps:
obtaining a bilingual mixed subtitle file; the bilingual mixed subtitle file comprises a first language subtitle text and a second language subtitle text;
generating a subtitle text aligned with the first language subtitle text and the second language subtitle text according to the bilingual mixed subtitle file;
and extracting sentence pairs according to the aligned subtitle text, and using the extracted sentence pairs as the bilingual corpus.
19. A text processing system, comprising:
the subtitle file preprocessing module is used for preprocessing the original subtitle file to obtain a preprocessed subtitle file;
the bilingual mixed subtitle file screening module is used for screening the bilingual mixed subtitle files from the preprocessed subtitle files;
the subtitle text pair generating module is used for generating a subtitle text pair aligned with the first language subtitle text and the second language subtitle text according to the bilingual mixed subtitle file;
and the bilingual corpus generation module is used for performing sentence pair extraction according to the aligned caption text pairs and taking the extracted sentence pairs as bilingual corpus.
20. The system of claim 19, wherein the bilingual corpus generation module is specifically configured to:
extracting sentence pairs from the aligned caption text pairs;
judging whether the difference of the time axis information of two sentences contained in the sentence pair is greater than a time axis difference threshold value or not, and if so, deleting the sentence pair;
if not, calculating the similarity of the two sentences contained in the sentence pair, and taking the sentence pair with the similarity larger than a preset similarity threshold value as the bilingual corpus.
21. A subtitle file processing method is characterized by comprising the following steps:
acquiring a bilingual mixed subtitle file sent by a client; the bilingual mixed subtitle file comprises a first language subtitle text and a second language subtitle text;
generating a subtitle text aligned with the first language subtitle text and the second language subtitle text according to the bilingual mixed subtitle file;
sentence pair extraction is carried out according to the aligned caption texts, and the extracted sentence pairs are used as bilingual corpus;
and returning the bilingual corpus to the client.
CN201910733742.8A 2019-08-09 2019-08-09 Bilingual corpus generation method, text processing system and subtitle file processing method Pending CN112347755A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910733742.8A CN112347755A (en) 2019-08-09 2019-08-09 Bilingual corpus generation method, text processing system and subtitle file processing method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910733742.8A CN112347755A (en) 2019-08-09 2019-08-09 Bilingual corpus generation method, text processing system and subtitle file processing method

Publications (1)

Publication Number Publication Date
CN112347755A true CN112347755A (en) 2021-02-09

Family

ID=74367609

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910733742.8A Pending CN112347755A (en) 2019-08-09 2019-08-09 Bilingual corpus generation method, text processing system and subtitle file processing method

Country Status (1)

Country Link
CN (1) CN112347755A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination