CN110381389B - Subtitle generating method and device based on artificial intelligence - Google Patents
Subtitle generating method and device based on artificial intelligence
- Publication number
- CN110381389B (application number CN201910740413.6A)
- Authority
- CN
- China
- Prior art keywords
- text
- processed
- subtitle
- voice
- group
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/439—Processing of audio elementary streams
- H04N21/4394—Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/44—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
- H04N21/4402—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display
- H04N21/440236—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display by media transcoding, e.g. video is transformed into a slideshow of still pictures, audio is converted into text
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/47—End-user applications
- H04N21/488—Data services, e.g. news ticker
- H04N21/4884—Data services, e.g. news ticker for displaying subtitles
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/80—Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
- H04N21/85—Assembly of content; Generation of multimedia applications
- H04N21/854—Content authoring
- H04N21/8547—Content authoring involving timestamps for synchronizing content
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N5/00—Details of television systems
- H04N5/222—Studio circuitry; Studio devices; Studio equipment
- H04N5/262—Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects ; Cameras specially adapted for the electronic generation of special effects
- H04N5/278—Subtitling
Landscapes
- Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Computer Security & Cryptography (AREA)
- Studio Circuits (AREA)
Abstract
The embodiments of the present application disclose a subtitle generating method and apparatus based on artificial intelligence, involving at least the speech processing and natural language processing technologies within artificial intelligence. For a plurality of voice segments, speech recognition is performed to obtain the text corresponding to each voice segment, and the duration of each silence segment is determined. Following the order of the audio-stream time axis and starting from the target voice segment, it is determined in sequence whether the duration of each silence segment exceeds a preset duration; the texts of the voice segments lying between the target voice segment and the first silence segment longer than the preset duration are added to a text group to be processed, and the separators in the text group to be processed serve as the basis for determining the subtitle text. Because the text between separators forms complete sentences and reflects reasonable semantics, and the preset duration makes it possible to judge whether a silence segment is an expression pause between sentences, the possibility of incomplete sentences appearing in the subtitle text is reduced, helping users who watch the audio and video understand its content.
Description
This application is a divisional application of the Chinese patent application No. 201811355311.4, filed on November 14, 2018 and entitled "Subtitle generating method and apparatus".
Technical Field
The present application relates to the field of audio processing, and in particular, to a subtitle generating method and apparatus based on artificial intelligence.
Background
When watching audio and video content such as live webcasts and movies, users can rely on the subtitles displayed on the playback screen to understand the content.
In a conventional approach to audio and video subtitle generation, the audio stream is processed mainly according to silence segments. A silence segment is a portion of the audio stream that contains no speech; the audio stream is cut into a plurality of voice segments at the silence segments, and the text recognized from any one voice segment can be turned into the subtitle for that segment.
However, because the conventional method segments the audio stream by a single acoustic feature, the silence segment, it has difficulty distinguishing an expression pause within a sentence from an expression pause between sentences. Voice segments are therefore often cut at inappropriate points, so the generated subtitles contain incomplete sentences that do little to help users understand the content and may even mislead them, resulting in a poor experience.
Disclosure of Invention
To solve this technical problem, the present application provides a subtitle generating method and apparatus that greatly reduce the possibility of incomplete sentences appearing in the subtitle text determined by separators. When the subtitle text is displayed as a subtitle over the corresponding audio-stream time-axis interval, it helps users who watch the audio and video understand the content and improves the user experience.
The embodiment of the application discloses the following technical scheme:
in a first aspect, an embodiment of the present application provides a method for generating a subtitle, where the method includes:
acquiring a plurality of voice fragments which come from the same audio stream and are segmented according to the mute fragments;
performing voice recognition on the voice fragments to obtain texts corresponding to the voice fragments respectively, wherein the texts corresponding to the voice fragments respectively comprise separators added according to text semantics;
when a subtitle is determined according to a text corresponding to a target voice fragment in the voice fragments, determining a text group to be processed, wherein the text group to be processed at least comprises a text corresponding to the target voice fragment;
determining a subtitle text from the text group to be processed according to the separators in the text group to be processed;
and taking the subtitle text as the subtitle of the time axis interval of the corresponding audio stream.
In a second aspect, an embodiment of the present application provides a subtitle generating apparatus, which includes an obtaining unit, an identifying unit, a first determining unit, a second determining unit, and a generating unit:
the acquiring unit is used for acquiring a plurality of voice fragments which are from the same audio stream and are segmented according to the mute fragments;
the recognition unit is used for performing voice recognition on the voice fragments to obtain texts corresponding to the voice fragments respectively, and the texts corresponding to the voice fragments respectively comprise separators added according to text semantics;
the first determining unit is configured to determine a to-be-processed text group when a subtitle is determined according to a text corresponding to a target speech segment in the plurality of speech segments, where the to-be-processed text group at least includes a text of the target speech segment;
the second determining unit is used for determining a subtitle text from the text group to be processed according to the separators in the text group to be processed;
and the generating unit is used for taking the subtitle text as the subtitle of the corresponding audio stream time axis interval.
In a third aspect, an embodiment of the present application provides an apparatus for subtitle generation, where the apparatus includes a processor and a memory:
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to execute the subtitle generating method according to any one of the first aspect according to instructions in the program code.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium for storing program codes, where the program codes are used to execute the subtitle generating method according to any one of the first aspects.
According to the above technical solutions, in the process of generating subtitles from a plurality of voice segments that come from the same audio stream and are segmented according to silence segments, speech recognition is performed on the voice segments to obtain the text corresponding to each of them, and each text includes separators added according to text semantics. When a subtitle is to be determined from the text corresponding to a target voice segment, a text group to be processed is determined for generating the subtitle, and this group includes at least the text corresponding to the target voice segment. The separators in the text group to be processed then serve as the basis for determining the subtitle text from the group. Because these separators are added based on semantics when the words in the voice segments are recognized, the text between separators forms complete sentences and reflects reasonable semantics, so the possibility of incomplete sentences appearing in the subtitle text determined by the separators is greatly reduced. When the subtitle text is displayed as a subtitle over the corresponding audio-stream time-axis interval, it helps users who watch the audio and video understand the content and improves the user experience.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without inventive exercise.
Fig. 1 is a schematic view of an application scenario of a subtitle generating method according to an embodiment of the present application;
FIG. 2 is a diagram illustrating a relationship between an audio stream, a silence segment, and a speech segment according to an embodiment of the present application;
fig. 3 is a flowchart of a subtitle generating method according to an embodiment of the present application;
fig. 4 is a flowchart of a method for determining a pending text group according to an embodiment of the present application;
fig. 5 is a flowchart of a method for generating subtitles according to a time axis interval of an audio stream according to a subtitle text according to an embodiment of the present application;
fig. 6 is an exemplary diagram for determining a time axis interval of an audio stream corresponding to a subtitle text according to an embodiment of the present application;
fig. 7 is a flowchart of a subtitle generating method according to an embodiment of the present application;
fig. 8 is a flowchart illustrating a structure of subtitle generation according to an embodiment of the present disclosure;
fig. 9a is a structural diagram of a subtitle generating apparatus according to an embodiment of the present application;
fig. 9b is a structural diagram of a subtitle generating apparatus according to an embodiment of the present application;
fig. 9c is a structural diagram of a subtitle generating apparatus according to an embodiment of the present application;
fig. 9d is a structural diagram of a subtitle generating apparatus according to an embodiment of the present application;
fig. 9e is a structural diagram of a subtitle generating apparatus according to an embodiment of the present application;
fig. 10 is a block diagram of an apparatus for generating subtitles according to an embodiment of the present application;
fig. 11 is a block diagram of an apparatus for generating subtitles according to an embodiment of the present application.
Detailed Description
In order to make the technical solutions of the present application better understood, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In a conventional subtitle generating method, the audio stream is processed mainly according to silence segments to generate subtitles. A silence segment can, to some extent, reflect a speaker's pause between sentences, but different speakers have different habits of expression, and some may pause in the middle of a sentence. For example, consider the sentence "On this sunny day, two children are playing hide-and-seek", spoken as "On this sunny [pause] day two children are playing hide-and-seek": because of the speaker's habits, or because the speaker is thinking while talking, the pause falls inside the phrase "sunny day" rather than between sentences.
If the audio stream carrying this sentence is cut into voice segments at that pause, and each voice segment corresponds to one subtitle, then "On this sunny" becomes one subtitle and "day two children are playing hide-and-seek" becomes another, and each generated subtitle contains an incomplete sentence. When the subtitles are displayed, the user first sees "On this sunny" and then "day two children are playing hide-and-seek", which may hamper understanding and lead to a poor experience.
Therefore, the embodiments of the present application provide a subtitle generating method that, on the basis of cutting the audio stream into a plurality of voice segments according to silence segments, adopts a new way of generating subtitles: separators are used as the basis for determining the subtitle text.
The subtitle generating method provided by the embodiments of the present application is implemented based on Artificial Intelligence (AI). AI is the theory, method, technology and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use that knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines capable of reacting in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines can perceive, reason and make decisions.
Artificial intelligence is a comprehensive discipline involving a wide range of fields, covering both hardware-level and software-level technologies. Basic AI technologies generally include sensors, dedicated AI chips, cloud computing, distributed storage, big data processing, operation/interaction systems and mechatronics. AI software technologies mainly include computer vision, speech processing, natural language processing and machine learning/deep learning.
In the embodiment of the present application, the artificial intelligence software technology mainly involved includes the above-mentioned voice processing technology and natural language processing technology.
For example, the embodiments may involve the speech recognition technology within Speech Technology, including speech signal preprocessing, speech signal frequency-domain analysis, speech signal feature extraction, speech signal feature matching/recognition, speech training, and so on.
For example, text preprocessing and machine translation in Natural Language Processing (NLP) may be involved, including word and sentence segmentation, word tagging, sentence classification, translation word selection, sentence generation, word adaptation, and editing and output, among others.
It can be understood that, compared with the conventional subtitle generating method, the method provided by the embodiments of the present application reduces the possibility of incomplete sentences appearing in the subtitle text and requires no manual proofreading afterwards. It can therefore be applied to scenarios with real-time requirements, such as live video streaming, video chat and games, and of course also to non-live scenarios, for example generating subtitles for recorded audio and video files.
The subtitle generating method provided by the embodiment of the application can be applied to audio and video processing equipment with subtitle generating capacity, and the audio and video processing equipment can be terminal equipment or a server.
The audio and video processing device may be capable of implementing automatic speech recognition (ASR), voiceprint recognition and other speech technologies. Enabling machines to listen, see and feel is the future direction of human-computer interaction, and speech is one of the most promising modes of such interaction.
In the embodiments of the present application, the audio and video processing device can, by implementing speech technology, perform speech recognition on the acquired voice segments to obtain the text corresponding to each voice segment, among other functions.
The audio and video processing device may also be capable of implementing Natural Language Processing (NLP), an important direction in the fields of computer science and artificial intelligence. NLP studies the theories and methods that enable effective communication between humans and computers in natural language. It is a science integrating linguistics, computer science and mathematics; research in this field involves natural language, the language people use every day, and is therefore closely related to linguistics. NLP technologies typically include text processing, semantic understanding, machine translation, robot question answering, knowledge graphs, and the like.
In the embodiments of the present application, by implementing NLP technology, the audio and video processing device can determine the subtitle text from the recognized text, translate the subtitle text, and perform other such functions.
If the audio/video processing device is a terminal device, the terminal device may be an intelligent terminal, a computer, a Personal Digital Assistant (PDA for short), a tablet computer, or the like.
If the audio and video processing device is a server, it may be a standalone server or a server cluster. After obtaining the subtitle text with the subtitle generating method, the server displays it, as the subtitle of the corresponding audio-stream time-axis interval, on the user's terminal device, thereby achieving real-time subtitle display during live video streaming.
In order to facilitate understanding of the technical solution of the present application, the following describes a subtitle generating method provided in the embodiments of the present application with reference to an actual application scenario.
Referring to fig. 1, fig. 1 is a schematic view of an application scenario of a subtitle generating method according to an embodiment of the present application. The scenario is introduced with the subtitle generating method applied to a server (that is, the audio and video processing device is a server). The scenario includes the server 101, which can acquire, in the order in which they are generated, a plurality of voice segments that come from the same audio stream and are segmented according to silence segments, such as voice segment 1, voice segment 2 and voice segment 3 in fig. 1.
The audio stream contains the speech uttered by a character in an object to be processed. The object to be processed, which contains the audio stream, may be audio and video generated in a live-streaming scenario, or an existing audio/video file, such as one that has been recorded or downloaded. The speech uttered by a character may be a streamer speaking in a live scenario, or a played audio file containing speech, such as a recording or a song.
A voice segment is a portion of the audio stream that contains speech information; a silence segment is a portion of the audio stream without speech information, and it may correspond to an expression pause within a sentence or an expression pause between sentences made by the speaker.
The relationship among the audio stream, the silence segments and the voice segments is shown in fig. 2. As can be seen from fig. 2, for the audio stream covering time points 0–t1 on the time axis, the audio stream can be divided, while it is being acquired, into a plurality of voice segments according to the silence segments, for example voice segment 1, voice segment 2, voice segment 3 and voice segment 4 in fig. 2.
It should be noted that the voice segment may be segmented by the server according to the silence segment when the server acquires the audio stream, or the server may directly acquire the segmented voice segment.
The server 101 performs voice recognition on the acquired voice fragments to obtain texts corresponding to the voice fragments, wherein the texts corresponding to the voice fragments respectively comprise separators added according to text semantics.
Because the separators are added based on semantics when the words in the voice segments are recognized, the text between separators forms complete sentences and reflects reasonable semantics, which greatly reduces the possibility of incomplete sentences appearing in the subtitle text later determined by the separators.
The delimiters may include punctuation marks and special symbols, wherein the punctuation marks may include periods, commas, exclamation marks, question marks, and the like; the special symbols may include space characters, underlines, vertical lines, diagonal lines, and the like.
When the server 101 determines a subtitle from the text corresponding to a certain voice segment among the plurality of voice segments, referred to as the target voice segment (in fig. 1, voice segment 2 is the target voice segment), a text group to be processed may be determined, and the text group to be processed includes at least the text corresponding to the target voice segment.
It should be noted that, in the embodiments of the present application, the subtitle text is determined based on separators for each voice segment. When a subtitle is determined from the text corresponding to the target voice segment, the target voice segment is not necessarily the first voice segment of the audio stream to be processed; part of its text may already have been used the last time a subtitle text was determined based on a separator. The text corresponding to the target voice segment may therefore not be the entire text of that segment, but only the portion remaining after the previous subtitle was generated.
Taking "two children play and hide on this clear day" as an example, the text corresponding to the segmented voice segments may be "two children play and hide on this clear day", where "," is a separator, and when determining the caption text based on the separator "," the text corresponding to the voice segments "on the clear day", the text corresponding to the two children play and hide "on the clear day", and the text corresponding to the voice segments "on the clear" are generated together, so that, when the voice segments "on the clear day" and two children play and hide "as the target voice segments, the text corresponding to the target voice segments is" two children play and hide ", and is the partial text corresponding to the target voice segments.
Of course, the text corresponding to the target voice segment may also be the entire text of the target voice segment, which is not limited in the embodiments of the present application.
For example, if the target voice segment is the first voice segment of the audio stream to be processed, or if none of its text was used the last time a subtitle text was determined according to a separator, then the text corresponding to the target voice segment is the entire text of that segment.
It should be noted that the text group to be processed may include only the text corresponding to the target voice segment, or it may include the texts of several voice segments, including that of the target voice segment. In the latter case, the text group to be processed may be formed by splicing the text corresponding to the target voice segment with the texts of one or more subsequent voice segments; how the text group to be processed is determined is described later. The subtitle text is then determined from the text group to be processed by means of the separators, and the subtitle for the corresponding audio-stream time-axis interval is generated from it.
In this embodiment, the subtitle text may be recognized in the language spoken in the voice segments, but the subtitle is not limited to that language. The language of the subtitle can be determined according to user requirements: it may be the language in the voice segments, another language, or several languages at once. For example, if the subtitle text is in English, the displayed subtitle may be an English subtitle, a Chinese subtitle, or a bilingual Chinese-English subtitle.
Next, a subtitle generating method provided by an embodiment of the present application will be described with reference to the accompanying drawings.
Referring to fig. 3, fig. 3 shows a flowchart of a subtitle generating method, which includes:
s301, a plurality of voice segments which are from the same audio stream and are segmented according to the mute segments are obtained.
Voice segments segmented according to silence segments may be numerous and may belong to different audio streams. In this embodiment, the acquired voice segments come from the same audio stream and are acquired sequentially in the order in which they were generated.
S302, carrying out voice recognition on the voice fragments to obtain texts corresponding to the voice fragments respectively.
When the texts corresponding to the voice segments are obtained through speech recognition, separators can be added to those texts according to text semantics, so that the subtitle text can later be determined by means of the separators.
And S303, determining a text group to be processed when determining the subtitles according to the text corresponding to the target voice fragment in the plurality of voice fragments.
For the voice segments of the same audio stream, their texts need to be processed in the order in which the voice segments were generated. The text on which the subtitle is currently to be based, that is, the text corresponding to the target voice segment, is determined, and the text group to be processed that is then determined includes at least the text of the target voice segment.
A voice segment boundary may result from an expression pause between sentences or from an expression pause within a sentence. To reduce the possibility that the text group to be processed contains an incomplete sentence because of a pause within a sentence, this embodiment provides a method for determining the text group to be processed.
Referring to fig. 4, the method includes:
s401, determining the time length of a silence segment among the voice segments.
The duration of a silence segment reflects, to a certain extent, whether it is an expression pause between sentences or within a sentence. In general, a silence segment produced by a pause within a sentence is relatively short, while one produced by a pause between sentences is relatively long. The determined silence durations therefore indicate which voice segments may need to be spliced with the target voice segment to form the text group to be processed.
The duration of a silence segment may be determined as follows: when the voice segments are acquired, record the end timestamp T_sil_begin of the current voice segment and the start timestamp T_sil_end of the next voice segment, and compute in turn the duration T_sil of the silence segment following the current voice segment, that is, T_sil = T_sil_end − T_sil_begin.
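As a minimal illustration of this bookkeeping (the segment representation and field names below are assumptions for the sketch, not taken from the patent), the silence durations between consecutive voice segments could be computed as follows:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class VoiceSegment:
    start: float  # start timestamp on the audio-stream time axis, in seconds
    end: float    # end timestamp on the audio-stream time axis, in seconds
    text: str     # recognized text, with separators added according to semantics

def silence_durations(segments: List[VoiceSegment]) -> List[float]:
    """Return T_sil for the silence segment following each voice segment.

    T_sil_begin is the end timestamp of the current voice segment and
    T_sil_end is the start timestamp of the next one, so
    T_sil = T_sil_end - T_sil_begin.
    """
    return [nxt.start - cur.end for cur, nxt in zip(segments, segments[1:])]
```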
S402, according to the sequence of the time axis of the audio stream, sequentially determining whether the time length of the mute section is greater than a preset time length from the target voice section.
The preset duration is determined according to the duration of the expression pause between sentences in the expression of the user under normal conditions, and whether the silent segment is the expression pause between sentences or the expression pause in a sentence can be determined according to the preset duration.
Referring to fig. 2, along the audio-stream time axis there are, in order, voice segment 1, silence segment A, voice segment 2, silence segment B, voice segment 3, silence segment C and voice segment 4, where the duration of silence segment A is T_sil-1, that of silence segment B is T_sil-2 and that of silence segment C is T_sil-3. If voice segment 1 is the target voice segment, it is necessary to determine, starting from silence segment A, whether the duration of each silence segment exceeds the preset duration. If it does not, the silence segment may be an expression pause within a sentence, and the check continues with silence segment B, and so on, until some silence segment is found whose duration exceeds the preset duration. That silence segment may be an expression pause between sentences, meaning that the texts of the voice segments before and after it may belong to two different sentences.
S403, if the time length of the target mute segment is determined to be greater than the preset time length, adding the text corresponding to the voice segment between the target mute segment and the target voice segment into the text group to be processed.
While it is being determined in sequence whether each silence segment exceeds the preset duration, all silence segments before the target silence segment are shorter than the preset duration, so the texts of the voice segments between the target silence segment and the target voice segment may belong to the same sentence and can therefore be spliced together. Once a silence segment (the target silence segment) is found whose duration exceeds the preset duration, the checking can stop; to reduce the possibility that the text group to be processed contains an incomplete sentence caused by a pause within a sentence, the texts of the voice segments between the target silence segment and the target voice segment are spliced to form the text group to be processed.
Referring to fig. 2, if it is determined in turn that T_sil-1 is less than the preset duration, that T_sil-2 is less than the preset duration, and that T_sil-3 is greater than the preset duration, then silence segments A and B may be expression pauses within a sentence while silence segment C may be an expression pause between sentences. Silence segment C is taken as the target silence segment, and the texts of voice segment 1, voice segment 2 and voice segment 3 are spliced into the text group to be processed.
In this way, the method uses the silence durations to determine, one by one, whether the silence segments after the target voice segment represent pauses within a sentence, so that texts separated by such pauses are spliced into the text group to be processed, reducing the possibility that the group contains an incomplete sentence caused by a pause within a sentence.
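A simplified reading of steps S401–S403, under the segment representation assumed in the earlier sketch, could look like the following; it is an illustration, not the patent's reference implementation:

```python
def build_pending_text_group(segments, target_index, preset_duration):
    """Splice texts from the target voice segment up to (and including) the
    voice segment that precedes the first silence longer than preset_duration.

    Returns the spliced text group and the index of the last segment used.
    """
    texts = [segments[target_index].text]
    i = target_index
    while i + 1 < len(segments):
        silence = segments[i + 1].start - segments[i].end
        if silence > preset_duration:
            break  # target silence segment found: likely a pause between sentences
        texts.append(segments[i + 1].text)  # pause within a sentence: keep splicing
        i += 1
    return "".join(texts), i
```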
S304, determining a subtitle text from the text group to be processed according to the separators in the text group to be processed.
Because the separators in the text group to be processed are added based on semantics when the words in the voice segments are recognized, the text between separators forms complete sentences and reflects reasonable semantics, so the possibility of incomplete sentences appearing in the subtitle text determined by the separators is greatly reduced.
For example, the text corresponding to voice segment 1 is "On this sunny" and the text corresponding to voice segment 2 is "day, two children are playing hide-and-seek". When voice segment 1 is the target voice segment, the text group to be processed determined through S303 is "On this sunny day, two children are playing hide-and-seek", where "," is a separator, and "On this sunny day" can then be determined as the subtitle text according to that separator. As processing continues, the "day" part of the text of voice segment 2 has already been used to generate the previous subtitle text, but the text "two children are playing hide-and-seek" still remains. When a subtitle is then determined from the text corresponding to voice segment 2 (now the target voice segment), the text corresponding to the target voice segment is the remaining part, "two children are playing hide-and-seek", rather than "day, two children are playing hide-and-seek", and S303–S305 are performed again for this remaining text.
By contrast, in the conventional approach the text of voice segment 1, "On this sunny", and the text of voice segment 2, "day, two children are playing hide-and-seek", would each become a subtitle text, and both subtitle texts contain incomplete sentences.
And S305, taking the subtitle text as the subtitle of the corresponding audio stream time axis interval.
When the subtitle text is displayed as the subtitle of the corresponding audio-stream time-axis interval, it helps users who watch the audio and video understand the content and improves the user experience.
According to the above technical solutions, in the process of generating subtitles from a plurality of voice segments that come from the same audio stream and are segmented according to silence segments, speech recognition is performed on the voice segments to obtain the text corresponding to each of them, and each text includes separators added according to text semantics. When a subtitle is to be determined from the text corresponding to a target voice segment, a text group to be processed is determined for generating the subtitle, and this group includes at least the text corresponding to the target voice segment. The separators in the text group to be processed then serve as the basis for determining the subtitle text from the group. Because these separators are added based on semantics when the words in the voice segments are recognized, the text between separators forms complete sentences and reflects reasonable semantics, so the possibility of incomplete sentences appearing in the subtitle text determined by the separators is greatly reduced. When the subtitle text is displayed as a subtitle over the corresponding audio-stream time-axis interval, it helps users who watch the audio and video understand the content and improves the user experience.
The foregoing embodiment describes a subtitle generating method. In the process of generating a subtitle, the subtitle text needs to be determined from the text group to be processed according to the separators, and the text group and its separators may present different situations; for example, the length of the displayed subtitle may need to be considered when deciding at which separator to cut, so the way of determining the subtitle text can differ from case to case. In this embodiment, the subtitle text may be determined in the different cases with reference to the following quantities and formula:
L_text denotes the length of the determined subtitle text; L_sil denotes the text length of the text group to be processed; L_seg denotes the preset number, which is determined according to the display subtitle length; L_punc denotes the length of the text from the first character of the text group to be processed to the last separator within the first L_seg characters of the group; L_max denotes the maximum number, i.e. the number of characters corresponding to the longest display subtitle length.
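Collected into a single piecewise expression, the three cases described below give the following (a reconstruction from those cases, with the boundary case L_sil = L_seg grouped with the first case as in the prose):

```latex
L_{\mathrm{text}} =
\begin{cases}
L_{\mathrm{sil}}, & L_{\mathrm{sil}} \le L_{\mathrm{seg}} \\
L_{\mathrm{punc}}, & L_{\mathrm{sil}} > L_{\mathrm{seg}} \ \text{and}\ L_{\mathrm{punc}} > 0 \\
\min(L_{\mathrm{sil}}, L_{\mathrm{max}}), & L_{\mathrm{sil}} > L_{\mathrm{seg}} \ \text{and}\ L_{\mathrm{punc}} = 0
\end{cases}
```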
Based on the above formula, the appropriate subtitle text can be determined under different circumstances. Next, the way of determining the subtitle text from the text group to be processed under different conditions will be described one by one.
The first case is that the text length of the text group to be processed is less than the preset number, i.e. L_sil < L_seg. In this case the subtitle text is determined by L_text = L_sil.
Specifically, the length of a displayed subtitle is generally constrained by the subtitle font size, the size of the display screen, user experience and other factors, so the displayed subtitle needs a reasonable length, namely the display subtitle length, which can be expressed as a number of characters. Therefore, after the text group to be processed is obtained, it can be judged whether the number of characters in the group exceeds the preset number, that is, whether L_sil is greater than L_seg; the preset number is determined according to the display subtitle length and is the number of characters in a subtitle of that length. If not, the text group to be processed already meets the display-length requirement and can be determined directly as the subtitle text, i.e. L_text = L_sil.
The second case is that the text length of the text group to be processed is greater than the preset number and a separator is present, i.e. L_sil > L_seg and L_punc > 0. In this case the subtitle text is determined by L_text = L_punc.
That is, if the number of characters in the text group to be processed is greater than the preset number (L_sil > L_seg), the group has too many characters and must be cut so that a subtitle text meeting the display-length requirement is obtained from it; if the group contains a separator (L_punc > 0), S304 can be executed to determine the subtitle text, i.e. L_text = L_punc.
The manner of determining the subtitle text according to the separators was briefly described in the embodiment corresponding to fig. 3 (S304). The following describes how the subtitle text is determined from the text group to be processed according to the separators, that is, how L_punc is determined.
Determining the subtitle text from the text group to be processed according to the separators can be done in two ways. In the first determination manner, the part between the first character of the text group to be processed and its last separator is determined as the subtitle text; that is, L_punc is the length of the text from the first character of the group to its last separator.
For example, suppose the text group to be processed is "On this sunny day, two children are playing hide-and-seek, and they are having great fun. However". According to the first determination manner, the first character of the group is the "O" of "On" and the last separator is the period, so the part between them can be taken as the subtitle text, namely "On this sunny day, two children are playing hide-and-seek, and they are having great fun.".
However, in some cases, to further ensure that the subtitle text determined from the text group to be processed according to the separators meets the display-length requirement, the display subtitle length can be taken into account at the same time. In the second determination manner, the part between the first character of the text group to be processed and the last separator within the first preset number of characters of the group is determined as the subtitle text, where the preset number is determined according to the display subtitle length; that is, L_punc is the length of the text from the first character of the group to the last separator within its first preset number of characters.
For example, suppose again that the text group to be processed is "On this sunny day, two children are playing hide-and-seek, and they are having great fun. However" and the preset number is 25. According to the second determination manner, the last separator within the first 25 characters of the group is the comma after the hide-and-seek clause, so the part from the first character up to and including that comma can be taken as the subtitle text, namely "On this sunny day, two children are playing hide-and-seek,". The subtitle text determined by the second determination manner thus contains 19 characters (the character counts follow the original Chinese-language example), meets the display-length requirement, and gives a better user experience.
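A small sketch of this second determination manner is given below; the SEPARATORS set is illustrative only, standing in for whatever punctuation marks and special symbols the recognizer emits:

```python
SEPARATORS = frozenset("，。！？；,.!?; _|/")  # punctuation and special symbols; illustrative

def cut_at_last_separator(pending_text: str, preset_number: int) -> str:
    """Return the part of pending_text from its first character up to (and
    including) the last separator found within its first preset_number
    characters; return "" if no separator occurs in that window."""
    window = pending_text[:preset_number]
    for i in range(len(window) - 1, -1, -1):
        if window[i] in SEPARATORS:
            return pending_text[:i + 1]
    return ""
```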
The third case is that the text length of the text group to be processed is greater than the preset number and no separator is present, i.e. L_sil > L_seg and L_punc = 0. In this case the subtitle text is determined by L_text = min(L_sil, L_max).
It should be noted that determining the subtitle text from the text group to be processed according to its separators (S304) presupposes that the group contains separators. In some cases, however, the text group to be processed contains no separator; for example, it might be "The home address of the child in red clothes is Room 301, Unit 2, Building 3, No. 5 Courtyard, Zhongguancun South Street, Haidian District, Beijing". The following therefore describes how the subtitle text is determined when the number of characters in the text group to be processed is greater than the preset number and the group contains no separator.
The display subtitle length is a reasonable length for a displayed subtitle, and the subtitle length is further bounded by the longest display subtitle length, so the subtitle text can also be determined with reference to that maximum. That the number of characters in the text group to be processed exceeds the preset number only means it exceeds the usual display subtitle length; it does not mean the group is unacceptable as a subtitle text, as long as its number of characters does not exceed the number of characters corresponding to the longest display subtitle length.
Specifically, when the number of characters in the text group to be processed is greater than the preset number and the group contains no separator, it can further be judged whether the number of characters in the group exceeds the maximum number, i.e. whether L_sil is greater than L_max, the maximum number L_max being the number of characters corresponding to the longest display subtitle length. If it does, the group exceeds the longest acceptable display subtitle length and part of it must be cut out as the subtitle text; for example, the first L_max characters of the group can be determined as the subtitle text. If it does not, the group is within the longest acceptable display subtitle length and can be determined directly as the subtitle text. In other words, the shorter of the two lengths is used, i.e. L_text = min(L_sil, L_max).
For example, the text group to be processed is "the home address of the child wearing red clothes is" the home address of the child is "building 301 room of department 2, building 3, and room 5 in the south guancun avenue of the hai lake region of beijing city", the maximum number is 30, at this time, the number of characters of the text group to be processed is 43, then, the number of characters of the text group to be processed is 43 greater than the maximum number 30, the character at the front 30 of the text group to be processed can be determined as the caption text, that is, the caption text is "the home address of the child wearing red clothes is" the south guancun avenue of the south guan province of the hai lake region of beijing city ".
As another example, suppose the text group to be processed is "The child's home address is No. 5 Courtyard, Zhongguancun South Street, Haidian District, Beijing" and the maximum number is 30. The group contains 26 characters, fewer than the maximum number 30, so the whole group can be determined as the subtitle text, namely "The child's home address is No. 5 Courtyard, Zhongguancun South Street, Haidian District, Beijing".
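Combining the three cases, the overall selection of the subtitle text can be sketched as follows; this is a simplified reading of the cases above (for instance, a separator lying beyond the first preset_number characters is treated the same as no separator), not the patent's reference implementation:

```python
SEPARATORS = frozenset("，。！？；,.!?; _|/")  # illustrative separator set, as before

def select_subtitle_text(pending_text: str, preset_number: int, max_number: int) -> str:
    """Pick the subtitle text from the pending text group.

    Case 1: the group already fits the display subtitle length.
    Case 2: the group is too long but a separator lies within the window.
    Case 3: the group is too long and no separator lies within the window.
    """
    if len(pending_text) <= preset_number:            # case 1: L_sil <= L_seg
        return pending_text                           # L_text = L_sil
    window = pending_text[:preset_number]
    cut = max((i + 1 for i, ch in enumerate(window) if ch in SEPARATORS), default=0)
    if cut > 0:                                       # case 2: L_punc > 0
        return pending_text[:cut]                     # L_text = L_punc
    return pending_text[:max_number]                  # case 3: L_text = min(L_sil, L_max)
```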
The purpose of determining the subtitle text is to generate subtitles for the corresponding audio stream, and then, how to generate subtitles for the time axis interval of the corresponding audio stream according to the subtitle text will be described.
It should be noted that, in the conventional method for generating subtitles, only the silence segments are used to segment the audio and determine the subtitle text, and the subtitle for the corresponding time axis interval of the audio stream is then generated from that text, so only the time offset of each voice segment needs to be recorded during segmentation. In the embodiment of the present application, however, the subtitle text may be further subdivided according to the separators when it is determined from the text group to be processed, and the time offset of the voice segment alone is no longer sufficient to guarantee that the determined subtitle text is placed accurately on the time axis. Therefore, this embodiment provides a method for generating the subtitle of the corresponding audio stream time axis interval according to the subtitle text; referring to fig. 5, the method includes:
S501, determining the relative starting time of the first character in the caption text in the corresponding voice segment.
S502, determining the starting time of the time axis interval of the audio stream corresponding to the subtitle text according to the relative starting time and the time offset of the voice clip corresponding to the first character on the time axis of the audio stream.
S503, determining the relative end time of the last character in the caption text in the corresponding voice segment.
S504, determining the end time of the time axis interval of the audio stream corresponding to the subtitle text according to the relative end time and the time offset of the voice clip corresponding to the last character on the time axis of the audio stream.
In this way, based on the start time and the end time of the audio stream time axis interval corresponding to the subtitle text, the subtitle of the corresponding audio stream time axis interval can be generated according to the subtitle text.
It will be appreciated that, when the subtitle text is subdivided according to the separators, the relative start time and relative end time of each character in the subtitle text may be determined by the speech recognition engine. The relative start time and the relative end time can be expressed per character; for example, it can be determined that Word_1 in the subtitle text has a relative start time of 500 ms and a relative end time of 750 ms, and so on.
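The exact expression format used by the speech recognition engine is not reproduced in this text; the snippet below only sketches one plausible shape that is consistent with the example above, and the field names are assumptions made for illustration.

```python
# Hypothetical per-word relative timing records; "word", "start" and "end" are
# illustrative field names, with times in milliseconds relative to the start
# of the voice segment that contains the word.
word_timings = [
    {"word": "Word_1", "start": 500, "end": 750},
    {"word": "Word_2", "start": 760, "end": 1020},  # made-up follow-on entry
]
```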
Referring to fig. 6, the determined subtitle text is shown between positions A and B in the figure, where A corresponds to the first character in the subtitle text and B corresponds to the last character. The start time of the audio stream time axis interval corresponding to the subtitle text is the time corresponding to position A, and the end time is the time corresponding to position B.
As can be seen from fig. 6, the relative start time of the first character in its corresponding voice segment is t1, that voice segment is voice segment 2, and the time offset of voice segment 2 on the audio stream time axis is t2; therefore, according to the relative start time t1 and the time offset t2, the start time of the audio stream time axis interval corresponding to the subtitle text is t1 + t2. The relative end time of the last character in its corresponding voice segment is t3, that voice segment is voice segment 3, and the time offset of voice segment 3 on the audio stream time axis is t4; therefore, according to the relative end time t3 and the time offset t4, the end time of the audio stream time axis interval corresponding to the subtitle text is t3 + t4.
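The computation of S501-S504 can be sketched as follows, assuming the relative times of the first and last characters and the time offsets of their voice segments are already available from the recognition step; all times are in milliseconds and the function name is an illustrative assumption.

```python
def subtitle_interval(first_rel_start: int, first_seg_offset: int,
                      last_rel_end: int, last_seg_offset: int) -> tuple[int, int]:
    """Map a subtitle text onto the audio stream time axis (S501-S504).

    In the notation of fig. 6: start = t1 + t2, end = t3 + t4.
    """
    start = first_rel_start + first_seg_offset  # relative start + segment offset (t1 + t2)
    end = last_rel_end + last_seg_offset        # relative end + segment offset (t3 + t4)
    return start, end
```

For the quantities of fig. 6, subtitle_interval(t1, t2, t3, t4) would return (t1 + t2, t3 + t4).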
In this way, the method combines the relative times of the characters within their corresponding voice segments with the time offsets of those voice segments, thereby ensuring the accuracy of the time corresponding to the determined subtitle text on the time axis.
It can be understood that, in many cases, the language of the voice segment in the audio/video is not the language used by the user on a daily basis, and at this time, in order to help the user watching the audio/video understand the audio/video content, the subtitle text as the subtitle should be represented by the language used by the user on a daily basis. Therefore, in this embodiment, the subtitle text determined in S304 may also be translated according to the subtitle display language to obtain a translated subtitle text, and the translated subtitle text is used as a subtitle of the time axis interval of the corresponding audio stream.
The caption display language can be set by the user as required. For example, if the language of the voice segments in the audio/video is English and the user is a Chinese speaker, the caption display language can be set to Chinese; the subtitle text in English is then translated into Chinese, and the Chinese text is used as the translated subtitle text and displayed as the subtitle of the corresponding audio stream time axis interval, which makes it easier for the user to understand the audio/video content.
Next, the subtitle generating method provided in this embodiment of the present application is described with reference to a specific scenario: a live video of a speaker. Assume that the speaker gives the speech in English; to help viewers of the live video understand the speech, subtitles need to be generated in real time, and the generated subtitles may be bilingual Chinese-English subtitles so that viewers can both learn and understand. In this scenario, referring to fig. 7, the subtitle generating method includes:
S701, obtaining a plurality of voice segments which come from the same audio stream and are segmented according to the mute segments.
S702, determining the time length of the mute segments among the voice segments.
S703, performing voice recognition on the plurality of voice segments to obtain texts corresponding to the plurality of voice segments respectively.
S704, when determining the subtitle according to the text corresponding to the target voice segment among the plurality of voice segments, sequentially determining, in the order of the audio stream time axis and starting from the target voice segment, whether the time length of each mute segment is greater than a preset time length; if so, executing S705; if not, continuing with S704.
S705, adding the text corresponding to the voice segment between the target mute segment and the target voice segment into the text group to be processed.
S706, judging whether the number of the characters of the text group to be processed is larger than a preset number, if so, executing S707, and if not, executing S711.
S707, determining whether the text group to be processed includes a separator, if so, executing S708, and if not, executing S709.
S708, determining a subtitle text from the text group to be processed according to the separators in the text group to be processed.
S709, judging whether the number of the characters of the text group to be processed is larger than the maximum number, if so, executing S710, and if not, executing S711.
S710, determining the maximum number of characters in front of the text group to be processed as the subtitle text.
S711, determining the text group to be processed as the subtitle text.
S712, translating the subtitle text through machine translation.
S713, taking the caption text and the translated caption text as the caption of the corresponding audio stream time axis interval.
In this application scenario, the structural flow of subtitle generation can be seen in fig. 8. Segmenting the audio based on the silence segments to obtain voice segment 1, ..., voice segment 4, and so on corresponds to S701 in fig. 7; re-dividing the texts corresponding to the voice segments based on the mute segments and semantics to obtain the subtitle texts corresponds to S702-S711 in fig. 7; translating the subtitle texts by machine translation to obtain the translated subtitle texts, for example machine-translating subtitle text 1 to obtain subtitle text 1', corresponds to S712 in fig. 7; and merging the subtitle texts and the machine-translated subtitle texts with the audio stream time axis to generate the corresponding subtitles corresponds to S713 in fig. 7. After the subtitles are obtained, they can be pushed and played in real time.
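As a rough illustration of how S704-S713 fit together for one target voice segment, the sketch below strings the decisions into a single function. The recognized texts (with semantic separators) and silence durations are assumed to be available from S701-S703, the machine translation step is represented by a caller-supplied placeholder, and the separator set, preset number, maximum number, and silence threshold are illustrative assumptions rather than values taken from the original method.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Segment:
    text: str             # recognized text containing semantic separators (S703)
    silence_after: float  # duration of the mute segment that follows, in seconds (S702)

def build_caption(segments: List[Segment],
                  translate: Callable[[str], str],   # placeholder for machine translation
                  preset_chars: int = 20,            # "preset number" (normal display length)
                  max_chars: int = 30,               # "maximum number" (longest display length)
                  silence_threshold: float = 0.5) -> Tuple[str, str]:
    """Sketch of S704-S713 for one target voice segment (the first in `segments`)."""
    # S704-S705: accumulate texts until a sufficiently long mute segment is reached.
    pending = ""
    for seg in segments:
        pending += seg.text
        if seg.silence_after > silence_threshold:
            break

    # S706-S711: choose the caption text from the pending text group.
    if len(pending) <= preset_chars:
        caption = pending                                    # S711
    else:
        separator_positions = [i for i, ch in enumerate(pending) if ch in ",.!?;"]
        if separator_positions:
            caption = pending[:separator_positions[-1] + 1]  # S708: up to the last separator
        elif len(pending) > max_chars:
            caption = pending[:max_chars]                    # S710: first max_chars characters
        else:
            caption = pending                                # S711

    # S712-S713: translate and pair original and translated text as the bilingual subtitle.
    return caption, translate(caption)
```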
According to the technical scheme, in the process of generating subtitles from a plurality of voice segments which come from the same audio stream and are segmented according to the mute segments, voice recognition is performed on the plurality of voice segments to obtain the texts corresponding to the plurality of voice segments respectively, and these texts include separators added according to text semantics. When the subtitle is determined according to the text corresponding to the target voice segment, a text group to be processed for generating the subtitle is determined, and the text group to be processed at least includes the text corresponding to the target voice segment. After the text group to be processed is determined, the separators in it can be used as the basis for determining the caption text. Because these separators are added based on semantics when the text in the voice segments is recognized, and the text between separators forms complete sentences with reasonable semantics, the possibility that incomplete sentences appear in the caption text determined by the separators is greatly reduced. When the caption text is displayed as the caption of the corresponding audio stream time axis interval, it therefore helps users watching the audio/video to understand the content and improves user experience.
Based on a subtitle generating method provided by the foregoing embodiment, the present embodiment provides a subtitle generating apparatus 900, referring to fig. 9a, the apparatus 900 includes an obtaining unit 901, an identifying unit 902, a first determining unit 903, a second determining unit 904, and a generating unit 905:
the acquiring unit 901 is configured to acquire a plurality of voice segments that are derived from the same audio stream and segmented according to the silence segment;
the recognition unit 902 is configured to perform speech recognition on the multiple speech segments to obtain texts corresponding to the multiple speech segments, where the texts corresponding to the multiple speech segments include separators added according to text semantics;
the first determining unit 903 is configured to determine a to-be-processed text group when processing a text of a target speech segment in the plurality of speech segments, where the to-be-processed text group at least includes a text of the target speech segment;
the second determining unit 904 is configured to determine a subtitle text from the to-be-processed text group according to the separator in the to-be-processed text group;
the generating unit 905 is configured to use the subtitle text as a subtitle of a time axis interval of the corresponding audio stream.
In one implementation, referring to fig. 9b, the apparatus 900 further comprises a third determining unit 906:
the third determining unit 906, configured to determine a time length of a silence segment between the plurality of voice segments;
the first determining unit 903 is specifically configured to determine, according to an order of a time axis of an audio stream, whether a time length of a silence segment is greater than a preset time length from the target voice segment in sequence;
and if the time length of the target mute segment is determined to be greater than the preset time length, adding the text corresponding to the voice segment between the target mute segment and the target voice segment into the text group to be processed.
In one implementation, referring to fig. 9c, the apparatus 900 further includes a first determining unit 907 and a fourth determining unit 908:
the first determining unit 907 is configured to determine whether the number of characters in the to-be-processed text group is greater than a preset number, where the preset number is determined according to a length of a displayed subtitle;
if the first determining unit 907 determines that the number of characters in the to-be-processed text group is greater than the preset number, the second determining unit 904 is triggered to execute the step of determining the subtitle text from the to-be-processed text group according to the separators in the to-be-processed text group;
the fourth determining unit 908 is configured to determine the text group to be processed as the subtitle text if the first determining unit 907 determines that the number of characters of the text group to be processed is not greater than a preset number.
In an implementation manner, the second determining unit 904 is specifically configured to:
determining a part from the first character to the last separator in the text group to be processed as a subtitle text; or,
and determining a part between a first character in the text group to be processed and a last separator in a preset number of characters before the text group to be processed as a subtitle text, wherein the preset number is determined according to the length of a displayed subtitle.
In one implementation, if the first determining unit 907 determines that the number of characters in the to-be-processed text group is greater than the preset number and the to-be-processed text group does not include a separator, referring to fig. 9d, the apparatus 900 further includes a second determining unit 909 and a fifth determining unit 910:
the second determining unit 909 is configured to determine whether the number of characters of the to-be-processed text group is greater than a maximum number, where the maximum number is a number of characters corresponding to a longest length of a displayed subtitle;
the fifth determining unit 910 is configured to determine, if the second determining unit 909 determines that the number of characters of the text group to be processed is greater than the maximum number, the maximum number of characters before the text group to be processed is determined as the subtitle text;
if the second determining unit 909 determines that the number of characters in the to-be-processed text group is not greater than the maximum number, the fourth determining unit 908 is triggered to execute the step of determining the to-be-processed text group as the subtitle text.
In one implementation, referring to fig. 9e, the apparatus 900 further includes a sixth determining unit 911, a seventh determining unit 912, an eighth determining unit 913, and a ninth determining unit 914:
the sixth determining unit 911 is configured to determine a relative starting time of a first character in the subtitle text in the corresponding speech segment;
the seventh determining unit 912, configured to determine a starting time of an audio stream time axis interval corresponding to the subtitle text according to the relative starting time and a time offset of the voice segment corresponding to the first character on the audio stream time axis;
the eighth determining unit 913 is configured to determine a relative end time of the last character in the subtitle text in the corresponding speech segment;
the ninth determining unit 914 is configured to determine the ending time of the time axis interval of the audio stream corresponding to the subtitle text according to the relative ending time and the time offset of the voice segment corresponding to the last character on the time axis of the audio stream.
According to the technical scheme, in the process of generating subtitles from a plurality of voice segments which come from the same audio stream and are segmented according to the mute segments, voice recognition is performed on the plurality of voice segments to obtain the texts corresponding to the plurality of voice segments respectively, and these texts include separators added according to text semantics. When the subtitle is determined according to the text corresponding to the target voice segment, a text group to be processed for generating the subtitle is determined, and the text group to be processed at least includes the text corresponding to the target voice segment. After the text group to be processed is determined, the separators in it can be used as the basis for determining the caption text. Because these separators are added based on semantics when the text in the voice segments is recognized, and the text between separators forms complete sentences with reasonable semantics, the possibility that incomplete sentences appear in the caption text determined by the separators is greatly reduced. When the caption text is displayed as the caption of the corresponding audio stream time axis interval, it therefore helps users watching the audio/video to understand the content and improves user experience.
The embodiment of the present application further provides an apparatus for generating subtitles, which is described below with reference to the accompanying drawings. Referring to fig. 10, an embodiment of the present application provides an apparatus 1000 for subtitle generation. The apparatus 1000 may be a server, which may differ considerably in configuration or performance, and may include one or more central processing units (CPUs) 1022 (e.g., one or more processors), a memory 1032, and one or more storage media 1030 (e.g., one or more mass storage devices) storing an application program 1042 or data 1044. The memory 1032 and the storage medium 1030 may be transient or persistent storage. The program stored on the storage medium 1030 may include one or more modules (not shown), and each module may include a series of instruction operations for the server. Still further, the central processing unit 1022 may be configured to communicate with the storage medium 1030 and execute, on the apparatus 1000 for subtitle generation, the series of instruction operations in the storage medium 1030.
The apparatus 1000 for subtitle generation may also include one or more power supplies 1026, one or more wired or wireless network interfaces 1050, one or more input/output interfaces 1058, and/or one or more operating systems 1041, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and so on.
The steps performed by the server in the above embodiment may be based on the server structure shown in fig. 10.
The CPU 1022 is configured to execute the following steps:
acquiring a plurality of voice fragments which come from the same audio stream and are segmented according to the mute fragments;
performing voice recognition on the voice fragments to obtain texts corresponding to the voice fragments respectively, wherein the texts corresponding to the voice fragments respectively comprise separators added according to text semantics;
when a subtitle is determined according to a text corresponding to a target voice fragment in the voice fragments, determining a text group to be processed, wherein the text group to be processed at least comprises a text corresponding to the target voice fragment;
determining a subtitle text from the text group to be processed according to the separators in the text group to be processed;
and taking the subtitle text as the subtitle of the time axis interval of the corresponding audio stream.
Referring to fig. 11, an apparatus 1100 for generating subtitles is provided in an embodiment of the present application. The apparatus 1100 may also be a terminal device, which may be any terminal device including a mobile phone, a tablet computer, a personal digital assistant (PDA), a point-of-sale (POS) terminal, a vehicle-mounted computer, and the like; the following takes the case where the terminal device is a mobile phone as an example:
fig. 11 is a block diagram illustrating a partial structure of a mobile phone related to a terminal device provided in an embodiment of the present application. Referring to fig. 11, the cellular phone includes: a Radio Frequency (RF) circuit 1110, a memory 1120, an input unit 1130, a display unit 1140, a sensor 1150, an audio circuit 1160, a wireless fidelity (WiFi) module 1170, a processor 1180, and a power supply 1190. Those skilled in the art will appreciate that the handset configuration shown in fig. 11 is not intended to be limiting and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components.
The following describes each component of the mobile phone in detail with reference to fig. 11:
The memory 1120 may be used to store software programs and modules, and the processor 1180 may execute various functional applications and data processing of the mobile phone by operating the software programs and modules stored in the memory 1120. The memory 1120 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data (such as audio data, a phonebook, etc.) created according to the use of the cellular phone, and the like. Further, the memory 1120 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device.
The input unit 1130 may be used to receive input numeric or character information and generate key signal inputs related to user settings and function control of the cellular phone. Specifically, the input unit 1130 may include a touch panel 1131 and other input devices 1132. Touch panel 1131, also referred to as a touch screen, can collect touch operations of a user on or near the touch panel 1131 (for example, operations of the user on or near touch panel 1131 by using any suitable object or accessory such as a finger or a stylus pen), and drive corresponding connection devices according to a preset program. Alternatively, the touch panel 1131 may include two parts, namely, a touch detection device and a touch controller. The touch detection device detects the touch direction of a user, detects a signal brought by touch operation and transmits the signal to the touch controller; the touch controller receives touch information from the touch sensing device, converts the touch information into touch point coordinates, sends the touch point coordinates to the processor 1180, and can receive and execute commands sent by the processor 1180. In addition, the touch panel 1131 can be implemented by using various types, such as resistive, capacitive, infrared, and surface acoustic wave. The input unit 1130 may include other input devices 1132 in addition to the touch panel 1131. In particular, other input devices 1132 may include, but are not limited to, one or more of a physical keyboard, function keys (e.g., volume control keys, switch keys, etc.), a trackball, a mouse, a joystick, and the like.
The display unit 1140 may be used to display information input by the user or information provided to the user and various menus of the cellular phone. The Display unit 1140 may include a Display panel 1141, and optionally, the Display panel 1141 may be configured in a form of a Liquid Crystal Display (LCD), an Organic Light-Emitting Diode (OLED), or the like. Further, the touch panel 1131 can cover the display panel 1141, and when the touch panel 1131 detects a touch operation on or near the touch panel, the touch panel is transmitted to the processor 1180 to determine the type of the touch event, and then the processor 1180 provides a corresponding visual output on the display panel 1141 according to the type of the touch event. Although in fig. 11, the touch panel 1131 and the display panel 1141 are two independent components to implement the input and output functions of the mobile phone, in some embodiments, the touch panel 1131 and the display panel 1141 may be integrated to implement the input and output functions of the mobile phone.
The handset may also include at least one sensor 1150, such as a light sensor, motion sensor, and other sensors. Specifically, the light sensor may include an ambient light sensor and a proximity sensor, wherein the ambient light sensor may adjust the brightness of the display panel 1141 according to the brightness of ambient light, and the proximity sensor may turn off the display panel 1141 and/or the backlight when the mobile phone moves to the ear. As one of the motion sensors, the accelerometer sensor can detect the magnitude of acceleration in each direction (generally, three axes), can detect the magnitude and direction of gravity when stationary, and can be used for applications of recognizing the posture of a mobile phone (such as horizontal and vertical screen switching, related games, magnetometer posture calibration), vibration recognition related functions (such as pedometer and tapping), and the like; as for other sensors such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor, which can be configured on the mobile phone, further description is omitted here.
WiFi belongs to short-distance wireless transmission technology, and the cell phone can help a user to receive and send e-mails, browse webpages, access streaming media and the like through the WiFi module 1170, and provides wireless broadband internet access for the user. Although fig. 11 shows the WiFi module 1170, it is understood that it does not belong to the essential constitution of the handset, and can be omitted entirely as needed within the scope not changing the essence of the invention.
The processor 1180 is a control center of the mobile phone, and is connected to various parts of the whole mobile phone through various interfaces and lines, and executes various functions of the mobile phone and processes data by operating or executing software programs and/or modules stored in the memory 1120 and calling data stored in the memory 1120, thereby performing overall monitoring of the mobile phone. Optionally, processor 1180 may include one or more processing units; preferably, the processor 1180 may integrate an application processor, which mainly handles operating systems, user interfaces, application programs, etc., and a modem processor, which mainly handles wireless communications. It will be appreciated that the modem processor described above may not be integrated within processor 1180.
The handset also includes a power supply 1190 (e.g., a battery) for powering the various components. Preferably, the power supply may be logically connected to the processor 1180 via a power management system, so that charging, discharging, and power consumption are managed through the power management system.
Although not shown, the mobile phone may further include a camera, a bluetooth module, etc., which are not described herein.
In this embodiment, the processor 1180 included in the terminal device further has the following functions:
acquiring a plurality of voice fragments which come from the same audio stream and are segmented according to the mute fragments;
performing voice recognition on the voice fragments to obtain texts corresponding to the voice fragments respectively, wherein the texts corresponding to the voice fragments respectively comprise separators added according to text semantics;
when a subtitle is determined according to a text corresponding to a target voice fragment in the voice fragments, determining a text group to be processed, wherein the text group to be processed at least comprises a text corresponding to the target voice fragment;
determining a subtitle text from the text group to be processed according to the separators in the text group to be processed;
and taking the subtitle text as the subtitle of the time axis interval of the corresponding audio stream.
An embodiment of the present application further provides a computer-readable storage medium for storing a program code, where the program code is configured to execute any one implementation of a subtitle generating method described in the foregoing embodiments.
The terms "first," "second," "third," "fourth," and the like in the description of the application and the above-described figures, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are, for example, capable of operation in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be understood that in the present application, "at least one" means one or more, "a plurality" means two or more. "and/or" for describing an association relationship of associated objects, indicating that there may be three relationships, e.g., "a and/or B" may indicate: only A, only B and both A and B are present, wherein A and B may be singular or plural. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. "at least one of the following" or similar expressions refer to any combination of these items, including any combination of single item(s) or plural items. For example, at least one (one) of a, b, or c, may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", wherein a, b, c may be single or plural.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be substantially implemented or contributed to by the prior art, or all or part of the technical solution may be embodied in a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.
Claims (11)
1. A subtitle generating method based on artificial intelligence is characterized by comprising the following steps:
acquiring a plurality of voice fragments which come from the same audio stream and are segmented according to the mute fragments;
performing voice recognition on the voice fragments to obtain texts corresponding to the voice fragments respectively;
determining a time length of a silence segment between the plurality of speech segments;
determining whether the time length of a silent segment is greater than a preset time length in sequence from a target voice segment in the plurality of voice segments according to the sequence of a time axis of the audio stream, wherein the preset time length is determined according to the time length of an expression pause between sentences;
if the time length of the target mute segment is determined to be greater than the preset time length, adding a text corresponding to the voice segment between the target mute segment and the target voice segment into a text group to be processed; the text group to be processed at least comprises a text corresponding to the target voice fragment;
determining a subtitle text from the text group to be processed according to the separators in the text group to be processed; the separator in the text group to be processed is added according to text semantics;
determining the relative starting time of a first character in the caption text in the corresponding voice segment;
determining the starting time of the time axis interval of the audio stream corresponding to the subtitle text according to the relative starting time and the time offset of the voice clip corresponding to the first character on the time axis of the audio stream;
determining the relative ending time of the last character in the caption text in the corresponding voice segment;
determining the end time of the time axis interval of the audio stream corresponding to the subtitle text according to the relative end time and the time offset of the voice clip corresponding to the last character on the time axis of the audio stream;
taking the caption text as the caption of the time axis interval of the corresponding audio stream;
the formula for determining the subtitle text is as follows:
wherein L_text is the determined text length of the subtitle; L_sil is the length of the text group to be processed; L_seg is the preset number, which is determined according to the length of a displayed subtitle; L_punc is the text length from the first character to the last separator within the preset number of characters at the front of the text group to be processed, or the text length from the first character to the last separator in the text group to be processed; and L_max is the number of characters corresponding to the longest length of a displayed subtitle.
2. The method of claim 1, further comprising:
judging whether the number of characters of the text group to be processed is larger than a preset number, wherein the preset number is determined according to the length of a displayed caption;
if so, determining a caption text from the text group to be processed according to the separators in the text group to be processed;
and if not, determining the text group to be processed as the subtitle text.
3. The method according to claim 1, wherein the determining caption text from the group of texts to be processed according to separators in the group of texts to be processed comprises:
determining a part from the first character to the last separator in the text group to be processed as a subtitle text; or,
and determining a part between a first character in the text group to be processed and a last separator in a preset number of characters before the text group to be processed as a subtitle text, wherein the preset number is determined according to the length of a displayed subtitle.
4. The method according to claim 2, wherein if it is determined that the number of characters in the to-be-processed text group is greater than the preset number and the to-be-processed text group does not include a separator, the method further comprises:
judging whether the number of characters of the text group to be processed is larger than the maximum number, wherein the maximum number is the number of characters corresponding to the longest length of the displayed caption;
if so, determining the maximum number of characters in front of the text group to be processed as the subtitle text;
and if not, determining the text group to be processed as the subtitle text.
5. The method of claim 1, further comprising:
translating the subtitle text according to the subtitle display language to obtain a translated subtitle text;
the taking the caption text as the caption of the time axis interval of the corresponding audio stream includes:
and taking the translated caption text as the caption of the time axis interval of the corresponding audio stream.
6. An artificial intelligence-based subtitle generating apparatus, comprising an obtaining unit, an identifying unit, a first determining unit, a second determining unit, a third determining unit, and a generating unit, the apparatus further comprising a sixth determining unit, a seventh determining unit, an eighth determining unit, and a ninth determining unit:
the acquiring unit is used for acquiring a plurality of voice fragments which are from the same audio stream and are segmented according to the mute fragments;
the recognition unit is used for performing voice recognition on the voice fragments to obtain texts corresponding to the voice fragments respectively;
the third determining unit is configured to determine a time length of a silence segment between the plurality of voice segments;
the first determining unit is used for sequentially determining whether the time length of the silent segments is greater than a preset time length from the target voice segment according to the sequence of the time axis of the audio stream, and adding the text corresponding to the voice segment between the target silent segment and the target voice segment into a text group to be processed if the time length of the target silent segment is greater than the preset time length; the text group to be processed at least comprises the text of the target voice fragment, and the preset duration is determined according to the duration of the expression pause between sentences;
the second determining unit is used for determining a subtitle text from the text group to be processed according to the separators in the text group to be processed; the separator in the text group to be processed is added according to text semantics;
the sixth determining unit is configured to determine a relative start time of a first character in the subtitle text in the corresponding speech segment;
the seventh determining unit is configured to determine, according to the relative start time and a time offset of the voice segment corresponding to the first character on an audio stream time axis, a start time of an audio stream time axis interval corresponding to the subtitle text;
the eighth determining unit is configured to determine a relative end time of a last character in the subtitle text in the corresponding speech segment;
the ninth determining unit is configured to determine, according to the relative end time and a time offset of the voice segment corresponding to the last character on an audio stream time axis, an end time of an audio stream time axis interval corresponding to the subtitle text;
the generating unit is used for taking the caption text as the caption of the corresponding audio stream time axis interval;
wherein, the formula of the caption text determined by the second determining unit is:
wherein L_text is the determined text length of the subtitle; L_sil is the length of the text group to be processed; L_seg is the preset number, which is determined according to the length of a displayed subtitle; L_punc is the text length from the first character to the last separator within the preset number of characters at the front of the text group to be processed, or the text length from the first character to the last separator in the text group to be processed; and L_max is the number of characters corresponding to the longest length of a displayed subtitle.
7. The apparatus according to claim 6, further comprising a first judging unit and a fourth determining unit:
the first judging unit is used for judging whether the number of the characters of the text group to be processed is larger than a preset number, and the preset number is determined according to the length of the displayed caption;
if the first judging unit judges that the number of the characters of the text group to be processed is larger than the preset number, triggering the second determining unit to execute the step of determining the caption text from the text group to be processed according to the separators in the text group to be processed;
the fourth determining unit is configured to determine the text group to be processed as the subtitle text if the first determining unit determines that the number of characters of the text group to be processed is not greater than a preset number.
8. The apparatus according to claim 6, wherein the second determining unit is specifically configured to:
determining a part from the first character to the last separator in the text group to be processed as a subtitle text; or,
and determining a part between a first character in the text group to be processed and a last separator in a preset number of characters before the text group to be processed as a subtitle text, wherein the preset number is determined according to the length of a displayed subtitle.
9. The apparatus according to claim 7, wherein if the first determining unit determines that the number of characters in the text group to be processed is greater than the preset number and the text group to be processed does not include a separator, the apparatus further comprises a second determining unit and a fifth determining unit:
the second judging unit is configured to judge whether the number of characters of the text group to be processed is greater than a maximum number, where the maximum number is a number of characters corresponding to a longest length of a displayed subtitle;
the fifth determining unit is configured to determine, if the second determining unit determines that the number of characters of the text group to be processed is greater than the maximum number, the maximum number of characters before the text group to be processed as the subtitle text;
and if the second judging unit judges that the number of the characters of the text group to be processed is not more than the maximum number, triggering the fourth determining unit to execute the step of determining the text group to be processed as the subtitle text.
10. An apparatus for artificial intelligence based subtitle generation, the apparatus comprising a processor and a memory:
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to execute the artificial intelligence based subtitle generating method of any one of claims 1-5 according to instructions in the program code.
11. A computer-readable storage medium for storing program code, which when executed by a processor, is configured to perform the artificial intelligence based subtitle generating method of any one of claims 1-5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910740413.6A CN110381389B (en) | 2018-11-14 | 2018-11-14 | Subtitle generating method and device based on artificial intelligence |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811355311.4A CN109379641B (en) | 2018-11-14 | 2018-11-14 | Subtitle generating method and device |
CN201910740413.6A CN110381389B (en) | 2018-11-14 | 2018-11-14 | Subtitle generating method and device based on artificial intelligence |
Related Parent Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811355311.4A Division CN109379641B (en) | 2018-11-14 | 2018-11-14 | Subtitle generating method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110381389A CN110381389A (en) | 2019-10-25 |
CN110381389B true CN110381389B (en) | 2022-02-25 |
Family
ID=65389096
Family Applications (4)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910741161.9A Active CN110418208B (en) | 2018-11-14 | 2018-11-14 | Subtitle determining method and device based on artificial intelligence |
CN201910740413.6A Active CN110381389B (en) | 2018-11-14 | 2018-11-14 | Subtitle generating method and device based on artificial intelligence |
CN201910740405.1A Active CN110381388B (en) | 2018-11-14 | 2018-11-14 | Subtitle generating method and device based on artificial intelligence |
CN201811355311.4A Active CN109379641B (en) | 2018-11-14 | 2018-11-14 | Subtitle generating method and device |
Family Applications Before (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910741161.9A Active CN110418208B (en) | 2018-11-14 | 2018-11-14 | Subtitle determining method and device based on artificial intelligence |
Family Applications After (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910740405.1A Active CN110381388B (en) | 2018-11-14 | 2018-11-14 | Subtitle generating method and device based on artificial intelligence |
CN201811355311.4A Active CN109379641B (en) | 2018-11-14 | 2018-11-14 | Subtitle generating method and device |
Country Status (1)
Country | Link |
---|---|
CN (4) | CN110418208B (en) |
Families Citing this family (32)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111797632B (en) * | 2019-04-04 | 2023-10-27 | 北京猎户星空科技有限公司 | Information processing method and device and electronic equipment |
CN112037768B (en) * | 2019-05-14 | 2024-10-22 | 北京三星通信技术研究有限公司 | Speech translation method, device, electronic equipment and computer readable storage medium |
CN110379413B (en) * | 2019-06-28 | 2022-04-19 | 联想(北京)有限公司 | Voice processing method, device, equipment and storage medium |
CN110400580B (en) * | 2019-08-30 | 2022-06-17 | 北京百度网讯科技有限公司 | Audio processing method, apparatus, device and medium |
CN110648653A (en) * | 2019-09-27 | 2020-01-03 | 安徽咪鼠科技有限公司 | Subtitle realization method, device and system based on intelligent voice mouse and storage medium |
CN110933485A (en) * | 2019-10-21 | 2020-03-27 | 天脉聚源(杭州)传媒科技有限公司 | Video subtitle generating method, system, device and storage medium |
CN110660393B (en) * | 2019-10-31 | 2021-12-03 | 广东美的制冷设备有限公司 | Voice interaction method, device, equipment and storage medium |
CN110992960A (en) * | 2019-12-18 | 2020-04-10 | Oppo广东移动通信有限公司 | Control method, control device, electronic equipment and storage medium |
CN112750425B (en) | 2020-01-22 | 2023-11-03 | 腾讯科技(深圳)有限公司 | Speech recognition method, device, computer equipment and computer readable storage medium |
CN111639233B (en) * | 2020-05-06 | 2024-05-17 | 广东小天才科技有限公司 | Learning video subtitle adding method, device, terminal equipment and storage medium |
CN111353038A (en) * | 2020-05-25 | 2020-06-30 | 深圳市友杰智新科技有限公司 | Data display method and device, computer equipment and storage medium |
CN111832279B (en) * | 2020-07-09 | 2023-12-05 | 抖音视界有限公司 | Text partitioning method, apparatus, device and computer readable medium |
CN111986654B (en) * | 2020-08-04 | 2024-01-19 | 云知声智能科技股份有限公司 | Method and system for reducing delay of voice recognition system |
CN111916053B (en) * | 2020-08-17 | 2022-05-20 | 北京字节跳动网络技术有限公司 | Voice generation method, device, equipment and computer readable medium |
CN111968657B (en) * | 2020-08-17 | 2022-08-16 | 北京字节跳动网络技术有限公司 | Voice processing method and device, electronic equipment and computer readable medium |
CN112188241A (en) * | 2020-10-09 | 2021-01-05 | 上海网达软件股份有限公司 | Method and system for real-time subtitle generation of live stream |
CN112272277B (en) * | 2020-10-23 | 2023-07-18 | 岭东核电有限公司 | Voice adding method and device in nuclear power test and computer equipment |
CN113886612A (en) * | 2020-11-18 | 2022-01-04 | 北京字跳网络技术有限公司 | Multimedia browsing method, device, equipment and medium |
CN112686018B (en) * | 2020-12-23 | 2024-08-23 | 中国科学技术大学 | Text segmentation method, device, equipment and storage medium |
CN113099292A (en) * | 2021-04-21 | 2021-07-09 | 湖南快乐阳光互动娱乐传媒有限公司 | Multi-language subtitle generating method and device based on video |
CN112995736A (en) * | 2021-04-22 | 2021-06-18 | 南京亿铭科技有限公司 | Speech subtitle synthesis method, apparatus, computer device, and storage medium |
CN113225618A (en) * | 2021-05-06 | 2021-08-06 | 阿里巴巴新加坡控股有限公司 | Video editing method and device |
CN113343675B (en) * | 2021-06-30 | 2024-09-06 | 北京搜狗科技发展有限公司 | Subtitle generation method and device and subtitle generation device |
CN113596579B (en) * | 2021-07-29 | 2023-04-07 | 北京字节跳动网络技术有限公司 | Video generation method, device, medium and electronic equipment |
CN113660432B (en) * | 2021-08-17 | 2024-05-28 | 安徽听见科技有限公司 | Translation subtitle making method and device, electronic equipment and storage medium |
CN113938758A (en) * | 2021-12-08 | 2022-01-14 | 沈阳开放大学 | Method for quickly adding subtitles in video editor |
CN114268829B (en) * | 2021-12-22 | 2024-01-16 | 中电金信软件有限公司 | Video processing method, video processing device, electronic equipment and computer readable storage medium |
CN114554238B (en) * | 2022-02-23 | 2023-08-11 | 北京有竹居网络技术有限公司 | Live broadcast voice simultaneous transmission method, device, medium and electronic equipment |
CN114626359A (en) * | 2022-03-24 | 2022-06-14 | 阳光保险集团股份有限公司 | Text display method and device of audio data |
CN115831120B (en) * | 2023-02-03 | 2023-06-16 | 北京探境科技有限公司 | Corpus data acquisition method and device, electronic equipment and readable storage medium |
CN116471436B (en) * | 2023-04-12 | 2024-05-31 | 央视国际网络有限公司 | Information processing method and device, storage medium and electronic equipment |
CN116612781B (en) * | 2023-07-20 | 2023-09-29 | 深圳市亿晟科技有限公司 | Visual processing method, device and equipment for audio data and storage medium |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104143331A (en) * | 2013-05-24 | 2014-11-12 | 腾讯科技(深圳)有限公司 | Method and system for adding punctuations |
CN105244022A (en) * | 2015-09-28 | 2016-01-13 | 科大讯飞股份有限公司 | Audio and video subtitle generation method and apparatus |
CN105845129A (en) * | 2016-03-25 | 2016-08-10 | 乐视控股(北京)有限公司 | Method and system for dividing sentences in audio and automatic caption generation method and system for video files |
CN106331893A (en) * | 2016-08-31 | 2017-01-11 | 科大讯飞股份有限公司 | Real-time subtitle display method and system |
CN107632980A (en) * | 2017-08-03 | 2018-01-26 | 北京搜狗科技发展有限公司 | Voice translation method and device, the device for voiced translation |
CN107657947A (en) * | 2017-09-20 | 2018-02-02 | 百度在线网络技术(北京)有限公司 | Method of speech processing and its device based on artificial intelligence |
Family Cites Families (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6697564B1 (en) * | 2000-03-03 | 2004-02-24 | Siemens Corporate Research, Inc. | Method and system for video browsing and editing by employing audio |
KR100521914B1 (en) * | 2002-04-24 | 2005-10-13 | 엘지전자 주식회사 | Method for managing a summary of playlist information |
AU2003241204B2 (en) * | 2002-06-24 | 2009-02-05 | Lg Electronics Inc. | Recording medium having data structure including navigation control information for managing reproduction of video data recorded thereon and recording and reproducing methods and apparatuses |
CN100547670C (en) * | 2004-03-17 | 2009-10-07 | Lg电子株式会社 | Be used to reproduce recording medium, the method and apparatus of text subtitle stream |
DE202010018551U1 (en) * | 2009-03-12 | 2017-08-24 | Google, Inc. | Automatically deliver content associated with captured information, such as information collected in real-time |
US20150318020A1 (en) * | 2014-05-02 | 2015-11-05 | FreshTake Media, Inc. | Interactive real-time video editor and recorder |
US9898773B2 (en) * | 2014-11-18 | 2018-02-20 | Microsoft Technology Licensing, Llc | Multilingual content based recommendation system |
CN106878805A (en) * | 2017-02-06 | 2017-06-20 | 广东小天才科技有限公司 | Mixed language subtitle file generation method and device |
CN110444196B (en) * | 2018-05-10 | 2023-04-07 | 腾讯科技(北京)有限公司 | Data processing method, device and system based on simultaneous interpretation and storage medium |
-
2018
- 2018-11-14 CN CN201910741161.9A patent/CN110418208B/en active Active
- 2018-11-14 CN CN201910740413.6A patent/CN110381389B/en active Active
- 2018-11-14 CN CN201910740405.1A patent/CN110381388B/en active Active
- 2018-11-14 CN CN201811355311.4A patent/CN109379641B/en active Active
Also Published As
Publication number | Publication date |
---|---|
CN110381388A (en) | 2019-10-25 |
CN109379641B (en) | 2022-06-03 |
CN110418208A (en) | 2019-11-05 |
CN110381389A (en) | 2019-10-25 |
CN109379641A (en) | 2019-02-22 |
CN110381388B (en) | 2021-04-13 |
CN110418208B (en) | 2021-07-27 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110381389B (en) | Subtitle generating method and device based on artificial intelligence | |
CN109344291B (en) | Video generation method and device | |
CN110598046B (en) | Artificial intelligence-based identification method and related device for title party | |
WO2021036644A1 (en) | Voice-driven animation method and apparatus based on artificial intelligence | |
CN110570840B (en) | Intelligent device awakening method and device based on artificial intelligence | |
CN108833969A (en) | A kind of clipping method of live stream, device and equipment | |
KR101870849B1 (en) | Information transmission method and transmission apparatus | |
CN110544488A (en) | Method and device for separating multi-person voice | |
CN109783798A (en) | Method, apparatus, terminal and the storage medium of text information addition picture | |
CN106203235B (en) | Living body identification method and apparatus | |
CN111816168B (en) | A model training method, a voice playback method, a device and a storage medium | |
CN109815363A (en) | Generation method, device, terminal and the storage medium of lyrics content | |
CN111324409B (en) | Artificial intelligence-based interaction method and related device | |
CN111491123A (en) | Video background processing method and device and electronic equipment | |
CN111314771B (en) | Video playing method and related equipment | |
CN110784762B (en) | Video data processing method, device, equipment and storage medium | |
CN110111795B (en) | Voice processing method and terminal equipment | |
CN109725798B (en) | Intelligent role switching method and related device | |
CN110750198A (en) | Expression sending method and mobile terminal | |
CN108491471B (en) | Text information processing method and mobile terminal | |
CN111611369A (en) | Interactive method based on artificial intelligence and related device | |
CN116708899B (en) | Video processing method, device and storage medium applied to virtual image synthesis | |
CN111723783B (en) | Content identification method and related device | |
CN116453005A (en) | Video cover extraction method and related device | |
CN112489619A (en) | Voice processing method, terminal device and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |