CN113657381A - Subtitle generating method, device, computer equipment and storage medium


Info

Publication number: CN113657381A
Application number: CN202110951249.0A
Authority: CN (China)
Prior art keywords: subtitle, text, information, target, difference
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 郭晋 (Guo Jin), 段恒昌 (Duan Hengchang), 郑伟强 (Zheng Weiqiang)
Current Assignee: Beijing Lexuebang Network Technology Co., Ltd.
Original Assignee: Beijing Lexuebang Network Technology Co., Ltd.
Application filed by Beijing Lexuebang Network Technology Co., Ltd.
Priority to CN202110951249.0A
Publication of CN113657381A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/10 Text processing
    • G06F 40/194 Calculation of difference between files
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/26 Speech to text systems

Abstract

The present disclosure provides a subtitle generating method, apparatus, computer device, and storage medium. The method comprises: acquiring a target audio for which subtitles are to be generated and a standard text corresponding to the target audio; generating a corresponding reference subtitle file based on the target audio, wherein the reference subtitle file comprises a subtitle text and display times of the subtitle text; determining whether the subtitle text differs from the standard text, and if so, determining difference information and adjusting the subtitle text based on the difference information to obtain a target subtitle text; and determining a target display time of the target subtitle text based on the display time of the subtitle text, to obtain a target subtitle file containing the target subtitle text and the target display time. In this way, problems in the subtitle text can be identified in a timely manner, and the subtitle text can be further adjusted according to its difference information relative to the standard text, so that a correct subtitle text is obtained and the accuracy of the subtitle file is greatly improved.

Description

Subtitle generating method, device, computer equipment and storage medium
Technical Field
The present disclosure relates to the field of audio recognition technology, and in particular, to a subtitle generating method, apparatus, computer device, and storage medium.
Background
In order to visually display the content carried by audio and video, the audio and video are usually configured with corresponding subtitles. The prior art provides schemes for generating subtitles corresponding to audio and video based on speech recognition technology; however, due to limits on recognition accuracy, subtitles generated by speech recognition may contain errors such as missing characters, extra characters, and incorrect characters. How to find and correct erroneous content in subtitles generated by speech recognition in a timely manner has therefore become an urgent problem.
Disclosure of Invention
The embodiment of the disclosure at least provides a subtitle generating method, a subtitle generating device, computer equipment and a storage medium.
In a first aspect, an embodiment of the present disclosure provides a method for generating a subtitle, where the method includes:
acquiring a target audio of a subtitle to be generated and a standard text corresponding to the target audio;
generating a corresponding reference subtitle file based on the target audio, wherein the reference subtitle file comprises subtitle texts and display time of the subtitle texts;
determining whether the subtitle text is different from the standard text or not, if so, determining difference information, and adjusting the subtitle text based on the difference information to obtain a target subtitle text;
and determining the target display time of the target subtitle text based on the display time of the subtitle text to obtain a target subtitle file containing the target subtitle text and the target display time.
In one possible implementation, the generating a corresponding reference subtitle file based on the target audio includes:
generating corresponding subtitle texts based on the target audio;
determining the display time of each word in the subtitle text based on the playing time of the audio clip corresponding to each word in the subtitle text in the target audio;
and generating a reference subtitle file based on the subtitle text and the display time.
In a possible implementation manner, the determining the difference information includes:
determining position information of difference content between the subtitle text and the standard text in the subtitle text and difference type information corresponding to the difference content;
wherein the difference type information includes at least one of missing information, redundant information, and error information.
In one possible embodiment, the method further comprises:
generating difference prompt information according to the position information and the difference type information;
and prompting the user according to the difference prompting information.
In a possible implementation manner, the prompting the user according to the difference prompt information includes:
displaying a prompt text corresponding to the difference prompt information;
and/or playing a prompt audio/video corresponding to the difference prompt information.
In one possible embodiment, the method further comprises:
displaying the subtitle text;
and displaying the difference content in the subtitle text based on the display form corresponding to the difference type information and the position information.
In one possible implementation, the displaying the difference content in the subtitle text based on the display form corresponding to the difference type information and the position information includes:
when the difference type information is the missing information, determining, according to the position information, the position in the subtitle text where characters are missing, and displaying a preset symbol at that position;
when the difference type information is the redundant information, determining, according to the position information, the positions of the redundant characters in the subtitle text, and displaying the redundant characters in a first display form;
and when the difference type information is the error information, determining, according to the position information, the positions of the erroneous characters in the subtitle text, and displaying the erroneous characters in a second display form.
In one possible implementation, the adjusting the subtitle text based on the difference information includes:
when the difference type information is the missing information, determining, according to the position information, the position in the subtitle text where characters are missing, and adding the corresponding characters at that position according to the missing information;
when the difference type information is the redundant information, determining, according to the position information, the positions of the redundant characters in the subtitle text, and deleting the corresponding characters at those positions according to the redundant information;
and when the difference type information is the error information, determining, according to the position information, the position of the erroneous character in the subtitle text, and correcting the corresponding character at that position according to the error information.
In a second aspect, an embodiment of the present disclosure provides a subtitle generating apparatus, including:
the data acquisition module is used for acquiring a target audio of a subtitle to be generated and a standard text corresponding to the target audio;
a reference file generation module, configured to generate a corresponding reference subtitle file based on the target audio, where the reference subtitle file includes a subtitle text and a display time of the subtitle text;
the target text generation module is used for determining whether the subtitle text is different from the standard text or not, if so, determining difference information, and adjusting the subtitle text based on the difference information to obtain a target subtitle text;
and the target file generation module is used for determining the target display time of the target subtitle text based on the display time of the subtitle text to obtain the target subtitle file containing the target subtitle text and the target display time.
In a possible implementation manner, the reference file generating module, when configured to generate the corresponding reference subtitle file based on the target audio, is specifically configured to:
generating corresponding subtitle texts based on the target audio;
determining the display time of each word in the subtitle text based on the playing time of the audio clip corresponding to each word in the subtitle text in the target audio;
and generating a reference subtitle file based on the subtitle text and the display time.
In a possible implementation manner, when the target text generation module is configured to determine the difference information, the target text generation module is specifically configured to:
determining position information of difference content between the subtitle text and the standard text in the subtitle text and difference type information corresponding to the difference content;
wherein the difference type information includes at least one of missing information, redundant information, and error information.
In a possible implementation, the apparatus further includes a first information presentation module, and the first information presentation module is configured to:
generating difference prompt information according to the position information and the difference type information;
and prompting the user according to the difference prompting information.
In a possible implementation manner, when the first information presentation module is configured to prompt the user according to the difference prompt information, the first information presentation module is specifically configured to:
displaying a prompt text corresponding to the difference prompt information;
and/or playing a prompt audio/video corresponding to the difference prompt information.
In a possible implementation, the apparatus further includes a second information presentation module, and the second information presentation module is configured to:
displaying the subtitle text;
and displaying the difference content in the subtitle text based on the display form corresponding to the difference type information and the position information.
In a possible implementation manner, when the second information presentation module is configured to display the difference content in the subtitle text based on the display form corresponding to the difference type information and the position information, the second information presentation module is specifically configured to:
when the difference type information is the missing information, determine, according to the position information, the position in the subtitle text where characters are missing, and display a preset symbol at that position;
when the difference type information is the redundant information, determine, according to the position information, the positions of the redundant characters in the subtitle text, and display the redundant characters in a first display form;
and when the difference type information is the error information, determine, according to the position information, the positions of the erroneous characters in the subtitle text, and display the erroneous characters in a second display form.
In a possible implementation manner, the target text generation module, when configured to adjust the subtitle text based on the difference information, is specifically configured to:
when the difference type information is the missing information, determine, according to the position information, the position in the subtitle text where characters are missing, and add the corresponding characters at that position according to the missing information;
when the difference type information is the redundant information, determine, according to the position information, the positions of the redundant characters in the subtitle text, and delete the corresponding characters at those positions according to the redundant information;
and when the difference type information is the error information, determine, according to the position information, the position of the erroneous character in the subtitle text, and correct the corresponding character at that position according to the error information.
In a third aspect, an embodiment of the present disclosure further provides a computer device, including: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory communicating via the bus when the computer device is running, the machine-readable instructions when executed by the processor performing the steps of the first aspect described above, or any one of the possible subtitle generating methods of the first aspect.
In a fourth aspect, an embodiment of the present disclosure further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and the computer program, when executed by a processor, performs the steps of the first aspect described above, or any one of the possible subtitle generating methods in the first aspect.
According to the subtitle generating method and apparatus, computer device, and storage medium provided by the embodiments of the present disclosure, after the reference subtitle file of the target audio is obtained, the problems existing in the subtitle text are determined in a timely manner by comparing the subtitle text of the reference subtitle file with the standard text corresponding to the target audio, and the subtitle text can be further adjusted according to its difference information relative to the standard text, so that a correct subtitle text is obtained and the accuracy of the subtitle file is greatly improved.
In order to make the aforementioned objects, features and advantages of the present disclosure more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings required in the embodiments are briefly described below. The drawings, which are incorporated in and form a part of the specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the technical solutions of the present disclosure. The following drawings depict only certain embodiments of the disclosure and are therefore not to be considered limiting of its scope; those skilled in the art may derive other related drawings from them without inventive effort.
Fig. 1 is a flowchart of a subtitle generating method according to an embodiment of the present disclosure;
fig. 2 is a schematic diagram of a subtitle generating apparatus according to an embodiment of the present disclosure;
fig. 3 is a second schematic diagram of a subtitle generating apparatus according to an embodiment of the present disclosure;
fig. 4 is a third schematic diagram of a subtitle generating apparatus according to an embodiment of the present disclosure;
fig. 5 is a schematic structural diagram of a computer device according to an embodiment of the present disclosure.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present disclosure more clear, the technical solutions of the embodiments of the present disclosure will be described clearly and completely with reference to the drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are only a part of the embodiments of the present disclosure, not all of the embodiments. The components of the embodiments of the present disclosure, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present disclosure, presented in the figures, is not intended to limit the scope of the claimed disclosure, but is merely representative of selected embodiments of the disclosure. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the disclosure without making creative efforts, shall fall within the protection scope of the disclosure.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
The term "and/or" herein merely describes an associative relationship, meaning that three relationships may exist, e.g., a and/or B, may mean: a exists alone, A and B exist simultaneously, and B exists alone. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality, for example, including at least one of A, B, C, and may mean including any one or more elements selected from the group consisting of A, B and C.
It is found that in order to visually display the content represented by the audio, the audio is usually configured with corresponding subtitles. The related art provides a scheme for generating subtitles corresponding to audio based on a voice recognition technology, however, due to a limitation in recognition accuracy, the subtitles generated based on the voice recognition may have errors. Therefore, how to timely find and correct the wrong content in the subtitles generated based on the speech recognition becomes a problem which needs to be solved urgently.
Based on this research, the present disclosure provides a subtitle generating method in which, after the reference subtitle file of a target audio is obtained, the subtitle text of the reference subtitle file is compared with the standard text corresponding to the target audio to determine, in a timely manner, the problems existing in the subtitle text; the subtitle text can then be further adjusted according to its difference information relative to the standard text, so that a correct subtitle text is obtained and the accuracy of the subtitle file is greatly improved.
To facilitate understanding of the present embodiment, first, a subtitle generating method disclosed in the embodiments of the present disclosure is described in detail, where an execution subject of the subtitle generating method provided in the embodiments of the present disclosure is generally a computer device with certain computing power, and the computer device includes, for example: terminal devices, which may be User Equipment (UE), mobile devices, User terminal devices, cellular phones, cordless phones, Personal Digital Assistants (PDAs), handheld devices, computing devices, vehicle mounted devices, wearable devices, and the like, servers, and other processing devices. In some possible implementations, the subtitle generating method may be implemented by a processor calling computer readable instructions stored in a memory.
The following describes the subtitle generating method provided by the embodiments of the present disclosure, taking a terminal device as the execution subject. Referring to fig. 1, a flowchart of a subtitle generating method provided in an embodiment of the present disclosure is shown; the method includes steps S110 to S140, where:
s110: and acquiring target audio of the subtitle to be generated and a standard text corresponding to the target audio.
In this step, it is verified that the text content included in the standard text is consistent with the speech content included in the target audio. If they are inconsistent, the standard text has been located incorrectly; a prompt is issued, and the corresponding standard text is pulled from the database again.
The target audio may be teaching-material audio provided by an educational institution for assisting learning; for example, the target audio may be poetry audio, prose audio, foreign-language audio, lesson audio, or songbook audio. For example, if the speech content included in the target audio is the poem line 白日依山尽，黄河入海流 ("the white sun sets behind the mountains; the Yellow River flows into the sea"), the text content included in the standard text should also be 白日依山尽，黄河入海流.
Optionally, the target audio may be obtained by having a designated person (for example, a professional reader) record read-aloud audio of the standard text. The target audio may be stored in advance at a preset location on the terminal device, from which the terminal device acquires it; the terminal device may record, in real time, the read-aloud audio of a user (such as a professional reader) for the standard text to obtain the target audio; or the target audio may be pre-stored on a designated device (e.g., a server) other than the terminal device, from which the terminal device downloads it.
In some possible embodiments, the target audio may also be audio recorded by a user (e.g., a student) according to the standard text, so that evaluation, analysis, and the like of the user's pronunciation can subsequently be implemented, which is not described in detail here.
Optionally, the standard text may be stored in advance at a preset location on the terminal device, from which the terminal device acquires it; the terminal device may obtain the standard text in response to a user's input operation for the standard text; or the standard text may be pre-stored on a designated device (e.g., a server) other than the terminal device, from which the terminal device downloads it.
S120: based on the target audio, a corresponding reference subtitle file is generated.
The reference subtitle file may be a lyric (LRC) file and may include the subtitle text and the display time of the subtitle text. In the embodiment of the present disclosure, the terminal device can determine the display time of the subtitle text at the granularity of a single word, that is, determine the display time of each word in the subtitle text.
It should be noted that the display time of each word may include both the display time point and the duration of the word; for example, the display time point is 2 minutes 3.05 seconds and the duration is 3.01 seconds, which is not described in detail. Of course, the display time may also include only the display time point of the word, in which case the period between that display time point and the next one (the display time point of the next word) is taken as the word's duration, which is not described here again.
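To make the duration-from-gap rule above concrete, the following is a minimal sketch in Python; the list-of-pairs representation and the fallback duration for the final word are assumptions for illustration, not part of the patent:

```python
def word_durations(entries, last_word_duration=0.4):
    """entries: list of (word, start_seconds) pairs in playback order,
    each pair holding only a display time point.

    Derive each word's duration as the gap to the next word's display
    time point; the final word gets an assumed fallback duration,
    since it has no successor.
    """
    result = []
    for i, (word, start) in enumerate(entries):
        if i + 1 < len(entries):
            duration = entries[i + 1][1] - start
        else:
            duration = last_word_duration  # assumed fallback for the last word
        result.append((word, start, duration))
    return result
```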
Optionally, the terminal device may generate a corresponding subtitle text based on the target audio; determining the display time of each word in the subtitle text based on the playing time of the audio clip corresponding to each word in the subtitle text in the target audio; and generating a reference subtitle file based on the subtitle text and the display time.
Specifically, the terminal device may recognize the speech content included in the target audio through speech recognition technology and generate the corresponding subtitle text. The terminal device may take the start time of the target audio as the start time of the subtitle text and, each time a word is recognized in the target audio, determine the display time of that word from the playing time of its corresponding audio segment in the target audio. A reference subtitle file corresponding to the target audio is then obtained from each recognized word and its display time. Taking the standard text 白日依山尽，黄河入海流 as an example, the format of the correct reference subtitle file corresponding to the target audio may be as follows:
[00:5.60]白 [00:6.00]日 [00:6.40]依 [00:6.80]山 [00:7.20]尽
[00:8.00]黄 [00:8.40]河 [00:8.80]入 [00:9.20]海 [00:9.60]流
The content in the symbols "[ ]" preceding each word indicates that word's display time. For example, the word 白 is displayed 5.6 seconds after the subtitle file starts playing, and the word 日 is displayed 6 seconds after the subtitle file starts playing.
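As a concrete illustration of this format, the sketch below builds LRC-style lines from per-word recognition results in Python; the (word, start_seconds) input structure is hypothetical, since real speech recognizers expose word timings through their own APIs:

```python
def to_lrc(recognized_words):
    """recognized_words: list of (word, start_seconds) pairs, e.g. the
    per-word timings produced by speech recognition.

    Returns one LRC-style line per word, in the "[mm:ss.xx]word" form
    shown above.
    """
    lines = []
    for word, start in recognized_words:
        minutes, seconds = divmod(start, 60)
        lines.append(f"[{int(minutes):02d}:{seconds:.2f}]{word}")
    return "\n".join(lines)

print(to_lrc([("白", 5.6), ("日", 6.0), ("依", 6.4), ("山", 6.8), ("尽", 7.2)]))
# [00:5.60]白
# [00:6.00]日
# ...
```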
S130: and determining whether the subtitle text is different from the standard text, if so, determining difference information, and adjusting the subtitle text based on the difference information to obtain a target subtitle text.
When the subtitle text differs from the standard text, the terminal device may determine the difference information and automatically adjust the subtitle text based on it, or the terminal device may adjust the subtitle text in response to the user's modification operations. Optionally, after the subtitle text is adjusted, the display time corresponding to the subtitle text may be further adjusted. It should be understood that when the subtitle text is the same as the standard text, the reference subtitle file generated in step S120 may be determined as the target subtitle file.
In the embodiment of the present disclosure, the difference information may include position information of the difference content in the subtitle text and difference type information corresponding to the difference content. The difference type information may include at least one of missing information, redundant information, and error information of the subtitle text.
Here, the missing information of the subtitle text may indicate which characters the subtitle text lacks compared with the standard text, and the position information may indicate where in the subtitle text the characters are absent. The redundant information may indicate which redundant characters are present in the subtitle text compared with the standard text, and the position information may indicate their positions. The error information may indicate which erroneous characters exist in the subtitle text compared with the standard text, and the position information may indicate their positions.
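The patent does not specify how the comparison is performed; one straightforward realization is a character-level alignment. The sketch below uses Python's standard difflib, mapping alignment opcodes onto the three difference types (the mapping and the data layout are assumptions for illustration):

```python
import difflib

def diff_info(subtitle_text, standard_text):
    """Align subtitle_text against standard_text character by character
    and map each difference onto the three types described above:
      'delete'  -> redundant characters present in the subtitle text
      'insert'  -> missing characters absent from the subtitle text
      'replace' -> erroneous characters in the subtitle text
    Positions (start, end) are indices into the subtitle text.
    """
    matcher = difflib.SequenceMatcher(a=subtitle_text, b=standard_text)
    diffs = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "delete":
            diffs.append(("redundant", i1, i2, subtitle_text[i1:i2]))
        elif op == "insert":
            diffs.append(("missing", i1, i1, standard_text[j1:j2]))
        elif op == "replace":
            diffs.append(("error", i1, i2, standard_text[j1:j2]))
    return diffs

# Subtitle text missing the character 尽 at position 4:
print(diff_info("白日依山黄河入海流", "白日依山尽黄河入海流"))
# [('missing', 4, 4, '尽')]
```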
In this step, the position information and the difference type information are determined first, and then the subtitle text is adjusted based on the position information and the difference type information. The subtitle text may include a plurality of difference contents, and the difference type of each difference content may be different, and for each difference content, the subtitle text needs to be adjusted in an adjustment manner corresponding to the difference type of the difference content. It is understood that after the subtitle text is adjusted for the case that the difference type information is at least one of the missing information, the redundant information, and the error information, the target subtitle text can be obtained, and the specific steps of adjusting the subtitle text that need to be executed for each difference type information will be described below.
When the difference type information is the missing information, the position in the subtitle text where characters are missing is determined according to the position information, and the corresponding characters are added at that position according to the missing information.
Taking the standard text 白日依山尽，黄河入海流 as an example, assume the subtitle text is 白日依山黄河入海流. The missing information may indicate that, compared with the standard text, the subtitle text lacks the character 尽, and the position information indicates that the missing 尽 belongs between the fourth and fifth characters (i.e., 山 and 黄); therefore, 尽 may be added between 山 and 黄 in the subtitle text.
In addition, sentence-break symbols such as commas, pause marks, or spaces may be added to the subtitles according to the pauses in the target audio, for example 白日依山尽，黄河入海流, which is not described in detail here.
When the difference type information is the redundant information, the positions of the redundant characters in the subtitle text are determined according to the position information, and the corresponding characters at those positions are deleted according to the redundant information.
Again taking the standard text 白日依山尽，黄河入海流 as an example, assume the subtitle text is 白日依山尽黄河如入海流. The redundant information may indicate that, compared with the standard text, the subtitle text contains a redundant character 如, and the position information indicates that 如 is the eighth character in the subtitle text; therefore, the eighth character 如 may be deleted from the subtitle text.
When the difference type information is the error information, the position of the erroneous character in the subtitle text is determined according to the position information, and the corresponding character at that position is corrected according to the error information.
Again taking the standard text 白日依山尽，黄河入海流 as an example, assume the subtitle text is 白日依山尽黄和入海流. The error information may indicate that, compared with the standard text, the subtitle text contains the wrong character 和, and the position information indicates that 和 is the seventh character in the subtitle text; therefore, 和 in the subtitle text may be corrected to 河.
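Continuing the sketch above, the three adjustment rules can be applied to the diff results to produce the target subtitle text; applying the edits right-to-left keeps earlier positions valid (again an illustrative sketch, not the patent's prescribed implementation):

```python
def adjust_subtitle(subtitle_text, diffs):
    """Apply the three adjustment rules: add missing characters, delete
    redundant ones, and correct erroneous ones. diffs is the output of
    diff_info() above; edits are applied right-to-left so earlier
    positions remain valid after each edit.
    """
    chars = list(subtitle_text)
    for kind, start, end, payload in sorted(diffs, key=lambda d: d[1], reverse=True):
        if kind == "missing":
            chars[start:start] = list(payload)   # add absent characters
        elif kind == "redundant":
            del chars[start:end]                 # delete extra characters
        elif kind == "error":
            chars[start:end] = list(payload)     # correct wrong characters
    return "".join(chars)

print(adjust_subtitle("白日依山黄河入海流", [("missing", 4, 4, "尽")]))
# 白日依山尽黄河入海流
```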
S140: and determining the target display time of the target subtitle text based on the display time of the subtitle text to obtain a target subtitle file containing the target subtitle text and the target display time.
It can be understood that when the difference type information is the missing information, the newly added characters lack corresponding display times, so the display time of each newly added character can be further determined. In the embodiment of the present disclosure, the display time of a newly added character may be determined based on the display times of the adjacent characters in the subtitle text, so as to obtain the corrected target display time.
Optionally, for a newly added character, the display time of a character adjacent to the newly added character may be determined, and the display time of the newly added character is obtained based on the display time of the adjacent character and a preset time adjustment amount. For example, the display time of the previous character of the newly added character may be determined, the new display time may be obtained by increasing the display time of the previous character by a preset time length, and the new display time may be used as the display time of the newly added character, so as to obtain the corrected target display time. Or, the display time of the next character after the newly added character may be determined, the newly added display time may be obtained after the display time of the next character is reduced by a preset time length, and the newly added display time may be used as the display time of the newly added character, so as to obtain the corrected target display time.
Alternatively, for a newly added character, the display time of two characters adjacent to the newly added character in front and back may be determined, and the display time of the newly added character may be determined based on the display time of the two characters adjacent to the newly added character in front and back, so as to obtain the corrected target display time. For example, the display time of the character preceding the newly added character and the display time of the character succeeding the newly added character may be determined; and taking the intermediate time between the display time of the previous character and the display time of the next character as the display time of the newly added character, thereby obtaining the corrected target display time.
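A minimal sketch of the midpoint strategy follows, assuming the display times are kept in a list parallel to the subtitle characters; the 0.2-second preset adjustment amount is an assumed value:

```python
def inserted_char_time(times, index, delta=0.2):
    """Display time (seconds) for a character newly inserted at `index`
    in the subtitle text, given the existing characters' display times.

    Uses the midpoint of the neighbouring characters' display times;
    at either boundary, falls back to offsetting the single neighbour
    by a preset adjustment amount `delta` (assumed value).
    """
    if 0 < index < len(times):
        return (times[index - 1] + times[index]) / 2
    if index == 0:
        return times[0] - delta
    return times[-1] + delta

# 尽 inserted at position 4, between 山 (6.8 s) and 黄 (8.0 s):
times = [5.6, 6.0, 6.4, 6.8, 8.0, 8.4, 8.8, 9.2, 9.6]
print(inserted_char_time(times, 4))  # 7.4
```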
Optionally, when the difference type information is the redundant information, after the redundant characters are deleted, the storage locations of their corresponding display times may be determined and those display times deleted, so as to obtain the corrected target display time.
Alternatively, when the difference type information is error information, the display time of the correct character obtained by correcting the error character may be further determined. Alternatively, the display time corresponding to the corrected wrong word in the subtitle text may be determined as the display time of the correct word, so as to obtain the corrected target display time.
In the embodiment of the present disclosure, after the adjustment of the subtitle text is completed based on the difference information, the standard text and the adjusted subtitle text may be displayed on the same interface, and the adjusted characters in the subtitle text are displayed in a preset form, for example, the adjusted characters in the subtitle text are displayed in a highlight manner, so that a user can conveniently review the adjusted subtitle text based on the standard text.
In the embodiment of the present disclosure, after determining that the subtitle text is different from the standard text, the user may be prompted in a corresponding manner that the subtitle text is different from the standard text. For example, the discrepancy cue information may be generated from the location information and the discrepancy type information; and prompting the user according to the difference prompting information. The prompting method may include: displaying a prompt text corresponding to the difference prompt information, and playing a prompt audio and video corresponding to the difference prompt information.
As described above, the difference type information includes at least one of the missing information, the redundant information, and the error information of the subtitle text. The content of the difference prompt information may include the character difference indicated by the difference type information and the position, indicated by the position information, of the differing characters in the subtitle text.
Taking the case where the difference type information is the missing information as an example, assume the speech content included in the target audio is 白日依山尽，黄河入海流 and the subtitle text is 白日依山黄河入海流. The missing information may indicate that the subtitle text lacks the character 尽 compared with the standard text, and the position information indicates that the missing 尽 belongs between the fourth and fifth characters (i.e., 山 and 黄). Therefore, the content of the difference prompt information may be: the subtitle text lacks 尽 between 山 and 黄.
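A small sketch of how such a prompt message could be assembled from the position and type information; the wording of the messages is illustrative, not mandated by the patent:

```python
def prompt_message(kind, subtitle_text, start, payload):
    """Build a human-readable difference prompt from the difference type
    (kind), the position (start, an index into subtitle_text), and the
    characters involved (payload)."""
    if kind == "missing":
        left = subtitle_text[start - 1] if start > 0 else "the beginning"
        right = subtitle_text[start] if start < len(subtitle_text) else "the end"
        return f"The subtitle text lacks '{payload}' between '{left}' and '{right}'."
    if kind == "redundant":
        return f"The subtitle text has a redundant '{payload}' as character {start + 1}."
    return f"Character {start + 1} of the subtitle text should be '{payload}'."

print(prompt_message("missing", "白日依山黄河入海流", 4, "尽"))
# The subtitle text lacks '尽' between '山' and '黄'.
```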
For more flexibility and interest, the difference prompt information can be displayed in cartoon animation, audio/video or pictures, which is not described further.
In a possible implementation manner, after the difference between the subtitle text and the standard text is determined, the subtitle text can be displayed, and the difference content in the subtitle text can be displayed based on the position information and the display form corresponding to the difference type information. As described above, the difference type information includes at least one of the missing information, the redundant information, and the error information of the subtitle text. Each difference content needs to be displayed in the display form corresponding to its difference type. Displaying the subtitle text in this way, on the one hand, lets the user intuitively see the difference content of the subtitle text and, on the other hand, lets the user quickly locate the difference content so as to adjust the subtitle text manually.
When the difference type information is the missing information, the position in the subtitle text where characters are missing is determined according to the position information, and a preset symbol is displayed at that position.
Optionally, the preset symbol may be a number, an English letter, a Greek letter, a mathematical symbol, or another preset symbol, such as, but not limited to, an insertion mark or a deletion mark.
Taking the standard text 白日依山尽，黄河入海流 as an example, assume the subtitle text is 白日依山黄河入海流. The position information indicates that characters are missing between the fourth and fifth characters (i.e., 山 and 黄). Accordingly, the symbol "&" may be displayed between 山 and 黄 in the subtitle text so that the user can intuitively locate the position where a character is missing and add the missing character at the position of the "&".
When the difference type information is the redundant information, the positions of the redundant characters in the subtitle text are determined according to the position information, and the redundant characters are displayed in the first display form.
Optionally, the first display form may be, but is not limited to, displaying the characters in a predetermined color, filling the area where the characters are located with a predetermined color, or displaying a box around the characters.
Again taking the standard text 白日依山尽，黄河入海流 as an example, assume the subtitle text is 白日依山尽黄河如入海流. The position information indicates that the redundant character is the eighth character (i.e., 如), so the eighth character may be displayed in red, allowing the user to intuitively locate the redundant character in the subtitle text and delete it.
When the difference type information is the error information, the position of the erroneous character in the subtitle text is determined according to the position information, and the erroneous character is displayed in the second display form.
Optionally, the second display form may be, but is not limited to, displaying the characters in a predetermined color, filling the area where the characters are located with a predetermined color, or displaying a box around the characters.
Again taking the standard text 白日依山尽，黄河入海流 as an example, assume the subtitle text is 白日依山尽黄和入海流. The position information indicates that the erroneous character is the seventh character, so the area where the seventh character is located may be filled with blue, allowing the user to intuitively locate the erroneous character in the subtitle text and correct it.
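The three display forms can be imitated in plain text for illustration; the sketch below marks missing characters with "&" and stands in for the colored first and second display forms with brackets (in a real UI these would be colors or highlights; the markers are assumptions):

```python
def annotate(subtitle_text, diffs):
    """Render the subtitle text with plain-text markers standing in for
    the display forms: '&' where a character is missing, [...] around
    redundant characters, {...} around erroneous characters. diffs is
    the output of diff_info() above; edits are applied right-to-left.
    """
    out = list(subtitle_text)
    for kind, start, end, _ in sorted(diffs, key=lambda d: d[1], reverse=True):
        if kind == "missing":
            out[start:start] = ["&"]
        elif kind == "redundant":
            out[start:end] = ["[", *out[start:end], "]"]
        else:  # error
            out[start:end] = ["{", *out[start:end], "}"]
    return "".join(out)

print(annotate("白日依山黄河入海流", [("missing", 4, 4, "尽")]))
# 白日依山&黄河入海流
```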
In the application stage, the target subtitle text in the target subtitle file can be presented synchronously while the target audio is played. Specifically, the text content corresponding to the currently playing audio content may be displayed in real time according to the subtitle text and the target display times in the target subtitle file. Alternatively, the words of the subtitle text may be grouped in advance, for example by dividing the subtitle text into several sentences. For each character in a sentence, when playback reaches that character's display time, the character is displayed and remains displayed; when playback reaches the display time of the first character of the next sentence, all characters of the current sentence are displayed, and the characters of the next sentence are then displayed in sequence.
Taking the subtitle text 白日依山尽黄河入海流 in the target subtitle file as an example, the subtitle text is divided into two sentences: the first sentence is 白日依山尽 and the second is 黄河入海流. When the target subtitle file is played to the display time of 白 in the first sentence, 白 starts to be continuously displayed; likewise, each word of the first sentence starts to be continuously displayed when playback reaches its display time. When playback reaches the display time of 黄 in the second sentence, all characters of the first sentence are displayed, and the characters of the second sentence are displayed in sequence.
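A sketch of this sentence-grouped display rule (the data layout, with one list of (word, display_seconds) pairs per sentence, is an assumption for illustration):

```python
def visible_text(sentences, t):
    """sentences: list of sentences, each a list of (word, seconds)
    pairs. A word stays on screen from its display time onwards; once
    the first word of the next sentence is due, the whole previous
    sentence is shown in full, per the grouping rule above.
    """
    shown = []
    for i, sentence in enumerate(sentences):
        if t < sentence[0][1]:
            break  # this sentence (and all later ones) has not started
        next_started = i + 1 < len(sentences) and t >= sentences[i + 1][0][1]
        shown.append("".join(w for w, ts in sentence if next_started or t >= ts))
    return " ".join(shown)

line1 = [("白", 5.6), ("日", 6.0), ("依", 6.4), ("山", 6.8), ("尽", 7.2)]
line2 = [("黄", 8.0), ("河", 8.4), ("入", 8.8), ("海", 9.2), ("流", 9.6)]
print(visible_text([line1, line2], 8.5))
# 白日依山尽 黄河
```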
The subtitle generating method provided by the embodiments of the present disclosure may be applied to the field of education. As described above, the target audio may be textbook audio provided by an educational institution for assisting learning; for example, poetry audio, prose audio, lesson audio, foreign-language audio, or songbook audio.
Taking poetry audio as an example: a reference poetry subtitle file comprising a poetry subtitle text and the display time of each character in that text is generated based on the poetry audio; whether the poetry subtitle text differs from the standard poetry text corresponding to the poetry audio is then determined; if so, difference information is determined and the poetry subtitle text is adjusted based on it, yielding a target poetry subtitle file comprising the adjusted poetry subtitle text and the corresponding display times.
In the application stage, the poetry audio and the target poetry subtitle file can be played synchronously, with the characters corresponding to the currently playing audio content displayed in real time according to the display times in the target poetry subtitle file, so that students can learn the poem efficiently.
It will be understood by those skilled in the art that in the method of the present invention, the order of writing the steps does not imply a strict order of execution and any limitations on the implementation, and the specific order of execution of the steps should be determined by their function and possible inherent logic.
Based on the same inventive concept, the embodiment of the present disclosure further provides a subtitle generating apparatus corresponding to the subtitle generating method, and because the principle of the subtitle generating apparatus in the embodiment of the present disclosure for solving the problem is similar to the subtitle generating method in the embodiment of the present disclosure, the implementation of the apparatus may refer to the implementation of the method, and repeated details are not repeated.
Referring to fig. 2 to 4, fig. 2 is a schematic diagram of a subtitle generating apparatus according to an embodiment of the present disclosure, fig. 3 is a second schematic diagram of a subtitle generating apparatus according to an embodiment of the present disclosure, and fig. 4 is a third schematic diagram of a subtitle generating apparatus according to an embodiment of the present disclosure. As shown in fig. 2, the subtitle generating apparatus 200 includes a data obtaining module 210, a reference file generating module 220, a target text generating module 230, and a target file generating module 240.
The data obtaining module 210 is configured to obtain a target audio of a subtitle to be generated and a standard text corresponding to the target audio.
The reference file generating module 220 is configured to generate a corresponding reference subtitle file based on the target audio, where the reference subtitle file includes a subtitle text and the display time of the subtitle text.
The target text generation module 230 is configured to determine whether the subtitle text and the standard text are different, if so, determine difference information, and adjust the subtitle text based on the difference information to obtain a target subtitle text;
and the target file generating module 240 is configured to determine target display time of a target subtitle text based on the display time of the subtitle text, and obtain a target subtitle file including the target subtitle text and the target display time.
According to the subtitle generating apparatus provided by the embodiment of the present disclosure, after the reference subtitle file of the target audio is obtained, the problems existing in the subtitle text are determined in a timely manner by comparing the subtitle text of the reference subtitle file with the standard text corresponding to the target audio, and the subtitle text can be further adjusted according to its difference information relative to the standard text, so that a correct subtitle text is obtained and the accuracy of the subtitle file is greatly improved.
In a possible implementation manner, the reference file generating module 220, when configured to generate the corresponding reference subtitle file based on the target audio, is specifically configured to: generate a corresponding subtitle text based on the target audio; determine the display time of each word in the subtitle text based on the playing time, in the target audio, of the audio clip corresponding to that word; and generate the reference subtitle file based on the subtitle text and the display times.
In a possible implementation manner, when the target text generation module 230 is configured to determine the difference information, it is specifically configured to: determining position information of difference content between the subtitle text and the standard text in the subtitle text and difference type information corresponding to the difference content; wherein the difference type information includes at least one of missing information, redundant information, and error information.
In one possible implementation, as shown in fig. 3, the subtitle generating apparatus 200 may further include a first information presentation module 250. The first information presentation module 250 is configured to: generating difference prompt information according to the position information and the difference type information; and prompting the user according to the difference prompting information.
In a possible implementation, the first information presentation module 250, when configured to prompt the user according to the difference prompt information, is specifically configured to: displaying a prompt text corresponding to the difference prompt information; and/or playing a prompt audio/video corresponding to the difference prompt information.
In one possible implementation, as shown in fig. 4, the subtitle generating apparatus 200 may further include a second information presentation module 260. The second information presentation module 260 is configured to: displaying the subtitle text; and displaying the difference content in the subtitle text based on the display form and the position information corresponding to the difference type information.
In a possible implementation manner, the second information presentation module 260, when configured to display the difference content in the subtitle text based on the display form and the position information corresponding to the difference type information, is specifically configured to:
when the difference type information is the missing information, determine, according to the position information, the position in the subtitle text where characters are missing, and display a preset symbol at that position;
when the difference type information is the redundant information, determine, according to the position information, the positions of the redundant characters in the subtitle text, and display the redundant characters in the first display form;
and when the difference type information is the error information, determine, according to the position information, the positions of the erroneous characters in the subtitle text, and display the erroneous characters in the second display form.
In a possible implementation, the target text generating module 230, when configured to adjust the subtitle text based on the difference information, is specifically configured to:
when the difference type information is the missing information, determine, according to the position information, the position in the subtitle text where characters are missing, and add the corresponding characters at that position according to the missing information;
when the difference type information is the redundant information, determine, according to the position information, the positions of the redundant characters in the subtitle text, and delete the corresponding characters at those positions according to the redundant information;
and when the difference type information is the error information, determine, according to the position information, the position of the erroneous character in the subtitle text, and correct the corresponding character at that position according to the error information.
The description of the processing flow of each module in the device and the interaction flow between the modules may refer to the related description in the above method embodiments, and will not be described in detail here.
Corresponding to the subtitle generating method in fig. 1, an embodiment of the present disclosure further provides a computer device 500, as shown in fig. 5, for a schematic structural diagram of the computer device provided in the embodiment of the present disclosure, the computer device 500 includes a processor 510, a memory 520, and a bus 530. The memory 520 is used for storing instructions for execution and includes a memory 521 and an external memory 522. The memory 521 is also referred to as an internal memory, and is used for temporarily storing operation data in the processor 510 and data exchanged with an external memory 522 such as a hard disk, the processor 510 exchanges data with the external memory 522 through the memory 521, and when the computer device 500 operates, the processor 510 communicates with the memory 520 through the bus 530, so that the processor 510 executes the following instructions:
acquiring a target audio of a subtitle to be generated and a standard text corresponding to the target audio; generating a corresponding reference subtitle file based on the target audio, wherein the reference subtitle file comprises a subtitle text and the display time of the subtitle text; determining whether the subtitle text is different from the standard text, and if so, determining difference information and adjusting the subtitle text based on the difference information to obtain a target subtitle text; and determining the target display time of the target subtitle text based on the display time of the subtitle text, to obtain a target subtitle file containing the target subtitle text and the target display time.
The embodiments of the present disclosure also provide a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the computer program performs the steps of the subtitle generating method in the above-mentioned method embodiments. The storage medium may be a volatile or non-volatile computer-readable storage medium.
The embodiments of the present disclosure also provide a computer program product, where the computer program product carries a program code, and instructions included in the program code may be used to execute the steps of the subtitle generating method in the foregoing method embodiments, which may be referred to specifically for the foregoing method embodiments, and are not described herein again.
The computer program product may be implemented by hardware, software, or a combination thereof. In an alternative embodiment, the computer program product is embodied in a computer storage medium; in another alternative embodiment, the computer program product is embodied in a software product, such as a software development kit (SDK) or the like.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the system and the apparatus described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again. In the several embodiments provided in the present disclosure, it should be understood that the disclosed system, apparatus, and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions when actually implemented, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, the functional units in the embodiments of the present disclosure may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit.
If the functions are implemented in the form of software functional units and sold or used as a stand-alone product, they may be stored in a non-volatile computer-readable storage medium executable by a processor. Based on this understanding, the technical solution of the present disclosure may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or some of the steps of the methods described in the embodiments of the present disclosure. The aforementioned storage medium includes any medium capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
Finally, it should be noted that the above embodiments are merely specific embodiments of the present disclosure, used to illustrate rather than limit its technical solutions, and the scope of the present disclosure is not limited thereto. Although the present disclosure has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that any person skilled in the art may still modify the technical solutions described in the foregoing embodiments, readily conceive of changes to them, or make equivalent substitutions for some of their technical features within the technical scope of the present disclosure; such modifications, changes, or substitutions do not cause the corresponding technical solutions to depart from the spirit and scope of the embodiments of the present disclosure, and shall all be covered by the protection scope of the present disclosure. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.

Claims (11)

1. A method for generating subtitles, the method comprising:
acquiring a target audio for which a subtitle is to be generated, and a standard text corresponding to the target audio;
generating a corresponding reference subtitle file based on the target audio, wherein the reference subtitle file comprises a subtitle text and a display time of the subtitle text;
determining whether the subtitle text differs from the standard text, and if so, determining difference information and adjusting the subtitle text based on the difference information to obtain a target subtitle text;
and determining a target display time of the target subtitle text based on the display time of the subtitle text, to obtain a target subtitle file containing the target subtitle text and the target display time.
2. The method of claim 1, wherein the generating a corresponding reference subtitle file based on the target audio comprises:
generating a corresponding subtitle text based on the target audio;
determining the display time of each word in the subtitle text based on the play time, in the target audio, of the audio segment corresponding to that word;
and generating the reference subtitle file based on the subtitle text and the display time.
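As an illustrative, non-limiting sketch of this timing step, assuming the recognition step already yields (word, start, end) audio segments in seconds; that input format is an assumption, since the claim does not fix one:

```python
# A minimal sketch of claim 2's timing step under the assumed input format.
def build_reference_subtitle(word_segments):
    # Each word is displayed over the play interval of its audio segment.
    subtitle_text = "".join(word for word, _, _ in word_segments)
    display_times = [(start, end) for _, start, end in word_segments]
    return subtitle_text, display_times

text, times = build_reference_subtitle([("字", 0.0, 0.4), ("幕", 0.4, 0.8)])
# text == "字幕", times == [(0.0, 0.4), (0.4, 0.8)]
```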
3. The method of claim 1, wherein determining difference information comprises:
determining position information, in the subtitle text, of difference content between the subtitle text and the standard text, and difference type information corresponding to the difference content;
wherein the difference type information includes at least one of missing information, redundant information, and error information.
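A minimal, non-limiting sketch of such difference determination using Python's difflib; the (position, type, subtitle characters, standard characters) tuple is an assumed representation of the position information and difference type information:

```python
# Difference detection sketch; the tuple format is an assumption.
import difflib

def find_differences(subtitle_text, standard_text):
    diffs = []
    matcher = difflib.SequenceMatcher(None, subtitle_text, standard_text)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace":    # error information: wrong characters
            diffs.append((i1, "error", subtitle_text[i1:i2], standard_text[j1:j2]))
        elif tag == "delete":   # redundant information: extra characters
            diffs.append((i1, "redundant", subtitle_text[i1:i2], ""))
        elif tag == "insert":   # missing information: absent characters
            diffs.append((i1, "missing", "", standard_text[j1:j2]))
    return diffs
```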
4. The method of claim 3, further comprising:
generating difference prompt information according to the position information and the difference type information;
and prompting the user according to the difference prompting information.
5. The method of claim 4, wherein the prompting the user according to the difference prompt information comprises:
displaying a prompt text corresponding to the difference prompt information;
and/or playing a prompt audio/video corresponding to the difference prompt information.
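A toy sketch of claims 4 and 5, reusing the tuple format assumed in the claim 3 sketch above; the prompt wording is invented for illustration:

```python
# Turning difference information into a displayable prompt text (assumed
# message wording; the claims leave the prompt content open).
PROMPTS = {
    "missing":   "characters are missing at position {pos}",
    "redundant": "redundant characters at position {pos}",
    "error":     "wrong characters at position {pos}",
}

def difference_prompt(diffs):
    return "; ".join(PROMPTS[kind].format(pos=pos) for pos, kind, *_ in diffs)
```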
6. The method of claim 3, further comprising:
displaying the subtitle text;
and displaying the difference content in the subtitle text based on the display form corresponding to the difference type information and the position information.
7. The method according to claim 6, wherein the displaying the difference content in the subtitle text based on the display form corresponding to the difference type information and the position information comprises:
in a case where the difference type information is the missing information, determining, according to the position information, a position in the subtitle text where characters are missing, and displaying a preset symbol at that position;
in a case where the difference type information is the redundant information, determining, according to the position information, a position of redundant characters in the subtitle text, and displaying the redundant characters in a first display form;
and in a case where the difference type information is the error information, determining, according to the position information, a position of wrong characters in the subtitle text, and displaying the wrong characters in a second display form.
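One way to realize these three display rules as plain text, again reusing the assumed tuple format; the underscore, brackets, and braces stand in for the preset symbol and the first and second display forms, all of which the claim leaves open:

```python
# Plain-text rendering sketch; concrete markers are assumptions.
def render_with_markers(subtitle_text, diffs):
    chars = list(subtitle_text)
    # Apply from the end backwards so earlier positions stay valid.
    for pos, kind, sub_chars, std_chars in sorted(diffs, reverse=True):
        if kind == "missing":      # preset symbol where characters are missing
            chars[pos:pos] = "_" * len(std_chars)
        elif kind == "redundant":  # first display form for redundant characters
            chars[pos:pos + len(sub_chars)] = "[" + sub_chars + "]"
        else:                      # second display form for wrong characters
            chars[pos:pos + len(sub_chars)] = "{" + sub_chars + "}"
    return "".join(chars)
```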
8. The method of claim 3, wherein the adjusting the subtitle text based on the difference information comprises:
in a case where the difference type information is the missing information, determining, according to the position information, a position in the subtitle text where characters are missing, and adding the corresponding characters at that position according to the missing information;
in a case where the difference type information is the redundant information, determining, according to the position information, a position of redundant characters in the subtitle text, and deleting the corresponding characters at that position according to the redundant information;
and in a case where the difference type information is the error information, determining, according to the position information, a position of wrong characters in the subtitle text, and correcting the corresponding characters at that position according to the error information.
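A non-limiting sketch of these three adjustment rules in the same assumed representation; edits are applied from the end of the text backwards so that earlier positions are not shifted by later edits:

```python
# Adjustment sketch reusing the assumed find_differences() tuples.
def adjust_subtitle(subtitle_text, diffs):
    chars = list(subtitle_text)
    for pos, kind, sub_chars, std_chars in sorted(diffs, reverse=True):
        if kind == "missing":      # add the corresponding characters
            chars[pos:pos] = std_chars
        elif kind == "redundant":  # delete the corresponding characters
            del chars[pos:pos + len(sub_chars)]
        else:                      # "error": correct the corresponding characters
            chars[pos:pos + len(sub_chars)] = std_chars
    return "".join(chars)

# adjust_subtitle("字莫文文本", find_differences("字莫文文本", "字幕文本"))
# -> "字幕文本"
```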
9. A subtitle generating apparatus, comprising:
the data acquisition module is used for acquiring a target audio for which a subtitle is to be generated, and a standard text corresponding to the target audio;
a reference file generation module, configured to generate a corresponding reference subtitle file based on the target audio, where the reference subtitle file includes a subtitle text and a display time of the subtitle text;
the target text generation module is used for determining whether the subtitle text differs from the standard text, and if so, determining difference information and adjusting the subtitle text based on the difference information to obtain a target subtitle text;
and the target file generation module is used for determining the target display time of the target subtitle text based on the display time of the subtitle text to obtain the target subtitle file containing the target subtitle text and the target display time.
10. A computer device, comprising: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory communicating over the bus when the computer device is running, the machine-readable instructions when executed by the processor performing the steps of the subtitle generating method according to any one of claims 1 to 8.
11. A computer-readable storage medium, having stored thereon a computer program which, when being executed by a processor, carries out the steps of the subtitle generating method according to any one of claims 1 to 8.
CN202110951249.0A 2021-08-18 2021-08-18 Subtitle generating method, device, computer equipment and storage medium Pending CN113657381A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110951249.0A CN113657381A (en) 2021-08-18 2021-08-18 Subtitle generating method, device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113657381A (en) 2021-11-16

Family

ID=78481095

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110951249.0A Pending CN113657381A (en) 2021-08-18 2021-08-18 Subtitle generating method, device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113657381A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108259971A (en) * 2018-01-31 2018-07-06 百度在线网络技术(北京)有限公司 Subtitle adding method, device, server and storage medium
CN110534100A (en) * 2019-08-27 2019-12-03 北京海天瑞声科技股份有限公司 A kind of Chinese speech proofreading method and device based on speech recognition
CN110475146A (en) * 2019-09-05 2019-11-19 珠海市杰理科技股份有限公司 Subtitle antidote, device and intelligent sound box
CN111968649A (en) * 2020-08-27 2020-11-20 腾讯科技(深圳)有限公司 Subtitle correction method, subtitle display method, device, equipment and medium
CN112995754A (en) * 2021-02-26 2021-06-18 北京奇艺世纪科技有限公司 Subtitle quality detection method and device, computer equipment and storage medium

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114827717A (en) * 2022-04-12 2022-07-29 Oppo广东移动通信有限公司 Subtitle display method, device and equipment and storage medium

Similar Documents

Publication Publication Date Title
US9916295B1 (en) Synchronous context alignments
CN111343496A (en) Video processing method and device
CN109155110B (en) Information processing apparatus, control method therefor, and computer program
CN112509609B (en) Audio processing method and device, electronic equipment and storage medium
US20160217704A1 (en) Information processing device, control method therefor, and computer program
CN113657381A (en) Subtitle generating method, device, computer equipment and storage medium
CN113409791A (en) Voice recognition processing method and device, electronic equipment and storage medium
CN111079489B (en) Content identification method and electronic equipment
TW201816636A (en) Digitized book content interaction system and method capable of adding digitized values to physical texts and providing teachers with effective digitized assistance in teaching
CN112584252B (en) Instant translation display method and device, mobile terminal and computer storage medium
CN110766997A (en) Copy display method, device and storage medium
KR20110024880A (en) System and method for learning a sentence using augmented reality technology
CN110428668B (en) Data extraction method and device, computer system and readable storage medium
US20210134177A1 (en) System and method for displaying voice-animated multimedia content
CN111081083A (en) Method for dictating, reporting and reading and electronic equipment
CN113038053A (en) Data synthesis method and device, electronic equipment and storage medium
CN111047924A (en) Visualization method and system for memorizing English words
CN111079414A (en) Dictation detection method, electronic equipment and storage medium
KR20160086169A (en) Method for Learning English Phonics Using User's Terminal utilizing the overlapping mode of English alphabet and Korean, and Media Being Recorded with Program Executing the Method for Learning English Phonics
CN111028560A (en) Method for starting functional module in learning application and electronic equipment
CN111031232B (en) Dictation real-time detection method and electronic equipment
US20240013668A1 (en) Information Processing Method, Program, And Information Processing Apparatus
KR102613350B1 (en) Method and device for providing contents using text
CN102929859B (en) Reading assistive method and device
CN108513158A (en) A kind of playback method and device of online question-answering content

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination