CN110381389B - Subtitle generating method and device based on artificial intelligence - Google Patents
Subtitle generating method and device based on artificial intelligence
- Publication number
- CN110381389B (application number CN201910740413.6A)
- Authority
- CN
- China
- Prior art keywords
- text
- processed
- subtitle
- voice
- group
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/439—Processing of audio elementary streams
- H04N21/4394—Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/44—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
- H04N21/4402—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display
- H04N21/440236—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display by media transcoding, e.g. video is transformed into a slideshow of still pictures, audio is converted into text
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/47—End-user applications
- H04N21/488—Data services, e.g. news ticker
- H04N21/4884—Data services, e.g. news ticker for displaying subtitles
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/80—Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
- H04N21/85—Assembly of content; Generation of multimedia applications
- H04N21/854—Content authoring
- H04N21/8547—Content authoring involving timestamps for synchronizing content
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N5/00—Details of television systems
- H04N5/222—Studio circuitry; Studio devices; Studio equipment
- H04N5/262—Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects ; Cameras specially adapted for the electronic generation of special effects
- H04N5/278—Subtitling
Landscapes
- Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Computer Security & Cryptography (AREA)
- Studio Circuits (AREA)
Abstract
The embodiments of the present application disclose a subtitle generating method and apparatus based on artificial intelligence, involving at least the speech processing and natural language processing technologies within artificial intelligence. For a plurality of voice segments, speech recognition is performed to obtain the text corresponding to each voice segment, and the duration of each silence segment is determined. Following the order of the audio-stream time axis and starting from the target voice segment, it is determined in sequence whether the duration of each silence segment exceeds a preset duration; the texts of the voice segments lying between the target voice segment and the first silence segment longer than the preset duration are added to a text group to be processed, and the separators in the text group to be processed serve as the basis for determining the subtitle text. Because the text between separators forms complete sentences and reflects reasonable semantics, and the preset duration makes it possible to judge whether a silence segment is an expression pause between sentences, the possibility of incomplete sentences appearing in the subtitle text is reduced, helping users who watch the audio and video understand its content.
Description
This application is a divisional application of the Chinese patent application No. 201811355311.4, filed on November 14, 2018 and entitled "Subtitle generating method and apparatus".
Technical Field
The present application relates to the field of audio processing, and in particular, to a subtitle generating method and apparatus based on artificial intelligence.
Background
When watching audio and video content such as live webcasts and movies, users can rely on the subtitles displayed on the playback screen to understand the content.
In a conventional approach to audio and video subtitle generation, the audio stream is processed mainly according to silence segments. A silence segment is a portion of the audio stream that contains no speech; the audio stream is cut into a plurality of voice segments at the silence segments, and the text recognized from any one voice segment can be turned into the subtitle for that segment.
However, because the conventional method segments the audio stream by a single acoustic feature, the silence segment, it has difficulty distinguishing an expression pause within a sentence from an expression pause between sentences. Voice segments are therefore often cut at inappropriate points, so the generated subtitles contain incomplete sentences that do little to help users understand the content and may even mislead them, resulting in a poor experience.
Disclosure of Invention
To solve this technical problem, the present application provides a subtitle generating method and apparatus that greatly reduce the possibility of incomplete sentences appearing in the subtitle text determined by separators. When the subtitle text is displayed as a subtitle over the corresponding audio-stream time-axis interval, it helps users who watch the audio and video understand the content and improves the user experience.
The embodiment of the application discloses the following technical scheme:
in a first aspect, an embodiment of the present application provides a method for generating a subtitle, where the method includes:
acquiring a plurality of voice fragments which come from the same audio stream and are segmented according to the mute fragments;
performing voice recognition on the voice fragments to obtain texts corresponding to the voice fragments respectively, wherein the texts corresponding to the voice fragments respectively comprise separators added according to text semantics;
when a subtitle is determined according to a text corresponding to a target voice fragment in the voice fragments, determining a text group to be processed, wherein the text group to be processed at least comprises a text corresponding to the target voice fragment;
determining a subtitle text from the text group to be processed according to the separators in the text group to be processed;
and taking the subtitle text as the subtitle of the time axis interval of the corresponding audio stream.
In a second aspect, an embodiment of the present application provides a subtitle generating apparatus, which includes an obtaining unit, an identifying unit, a first determining unit, a second determining unit, and a generating unit:
the acquiring unit is used for acquiring a plurality of voice fragments which are from the same audio stream and are segmented according to the mute fragments;
the recognition unit is used for performing voice recognition on the voice fragments to obtain texts corresponding to the voice fragments respectively, and the texts corresponding to the voice fragments respectively comprise separators added according to text semantics;
the first determining unit is configured to determine a to-be-processed text group when a subtitle is determined according to a text corresponding to a target speech segment in the plurality of speech segments, where the to-be-processed text group at least includes a text of the target speech segment;
the second determining unit is used for determining a subtitle text from the text group to be processed according to the separators in the text group to be processed;
and the generating unit is used for taking the subtitle text as the subtitle of the corresponding audio stream time axis interval.
In a third aspect, an embodiment of the present application provides an apparatus for subtitle generation, where the apparatus includes a processor and a memory:
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to execute the subtitle generating method according to any one of the first aspect according to instructions in the program code.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium for storing program codes, where the program codes are used to execute the subtitle generating method according to any one of the first aspects.
According to the above technical solutions, in the process of generating subtitles from a plurality of voice segments that come from the same audio stream and are segmented according to silence segments, speech recognition is performed on the voice segments to obtain the text corresponding to each of them, and each text includes separators added according to text semantics. When a subtitle is to be determined from the text corresponding to a target voice segment, a text group to be processed is determined for generating the subtitle, and this group includes at least the text corresponding to the target voice segment. The separators in the text group to be processed then serve as the basis for determining the subtitle text from the group. Because these separators are added based on semantics when the words in the voice segments are recognized, the text between separators forms complete sentences and reflects reasonable semantics, so the possibility of incomplete sentences appearing in the subtitle text determined by the separators is greatly reduced. When the subtitle text is displayed as a subtitle over the corresponding audio-stream time-axis interval, it helps users who watch the audio and video understand the content and improves the user experience.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without inventive exercise.
Fig. 1 is a schematic view of an application scenario of a subtitle generating method according to an embodiment of the present application;
FIG. 2 is a diagram illustrating a relationship between an audio stream, a silence segment, and a speech segment according to an embodiment of the present application;
fig. 3 is a flowchart of a subtitle generating method according to an embodiment of the present application;
fig. 4 is a flowchart of a method for determining a pending text group according to an embodiment of the present application;
fig. 5 is a flowchart of a method for generating subtitles according to a time axis interval of an audio stream according to a subtitle text according to an embodiment of the present application;
fig. 6 is an exemplary diagram for determining a time axis interval of an audio stream corresponding to a subtitle text according to an embodiment of the present application;
fig. 7 is a flowchart of a subtitle generating method according to an embodiment of the present application;
fig. 8 is a flowchart illustrating a structure of subtitle generation according to an embodiment of the present disclosure;
fig. 9a is a structural diagram of a subtitle generating apparatus according to an embodiment of the present application;
fig. 9b is a structural diagram of a subtitle generating apparatus according to an embodiment of the present application;
fig. 9c is a structural diagram of a subtitle generating apparatus according to an embodiment of the present application;
fig. 9d is a structural diagram of a subtitle generating apparatus according to an embodiment of the present application;
fig. 9e is a structural diagram of a subtitle generating apparatus according to an embodiment of the present application;
fig. 10 is a block diagram of an apparatus for generating subtitles according to an embodiment of the present application;
fig. 11 is a block diagram of an apparatus for generating subtitles according to an embodiment of the present application.
Detailed Description
In order to make the technical solutions of the present application better understood, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In a conventional subtitle generating method, the audio stream is processed mainly according to silence segments to generate subtitles. A silence segment can, to some extent, reflect a speaker's pause between sentences, but different speakers have different habits of expression, and some may pause in the middle of a sentence. For example, consider the sentence "On this sunny day, two children are playing hide-and-seek", spoken as "On this sunny [pause] day two children are playing hide-and-seek": because of the speaker's habits, or because the speaker is thinking while talking, the pause falls inside the phrase "sunny day" rather than between sentences.
If the audio stream carrying this sentence is cut into voice segments at that pause, and each voice segment corresponds to one subtitle, then "On this sunny" becomes one subtitle and "day two children are playing hide-and-seek" becomes another, and each generated subtitle contains an incomplete sentence. When the subtitles are displayed, the user first sees "On this sunny" and then "day two children are playing hide-and-seek", which may hamper understanding and lead to a poor experience.
Therefore, the embodiments of the present application provide a subtitle generating method that, on the basis of cutting the audio stream into a plurality of voice segments according to silence segments, adopts a new way of generating subtitles: separators are used as the basis for determining the subtitle text.
The subtitle generating method provided by the embodiments of the present application is implemented based on Artificial Intelligence (AI). AI is the theory, method, technology and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use that knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines capable of reacting in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines can perceive, reason and make decisions.
Artificial intelligence is a comprehensive discipline involving a wide range of fields, covering both hardware-level and software-level technologies. Basic AI technologies generally include sensors, dedicated AI chips, cloud computing, distributed storage, big data processing, operation/interaction systems and mechatronics. AI software technologies mainly include computer vision, speech processing, natural language processing and machine learning/deep learning.
In the embodiment of the present application, the artificial intelligence software technology mainly involved includes the above-mentioned voice processing technology and natural language processing technology.
For example, the embodiments may involve the speech recognition technology within Speech Technology, including speech signal preprocessing, speech signal frequency-domain analysis, speech signal feature extraction, speech signal feature matching/recognition, speech training, and so on.
For example, text preprocessing and machine translation in Natural Language Processing (NLP) may be involved, including word and sentence segmentation, word tagging, sentence classification, translation word selection, sentence generation, word adaptation, and editing and output, among others.
It can be understood that, compared with the conventional subtitle generating method, the method provided by the embodiments of the present application reduces the possibility of incomplete sentences appearing in the subtitle text and requires no manual proofreading afterwards. It can therefore be applied to scenarios with real-time requirements, such as live video streaming, video chat and games, and of course also to non-live scenarios, for example generating subtitles for recorded audio and video files.
The subtitle generating method provided by the embodiment of the application can be applied to audio and video processing equipment with subtitle generating capacity, and the audio and video processing equipment can be terminal equipment or a server.
The audio and video processing device may be capable of implementing automatic speech recognition (ASR), voiceprint recognition and other speech technologies. Enabling machines to listen, see and feel is the future direction of human-computer interaction, and speech is one of the most promising modes of such interaction.
In the embodiments of the present application, the audio and video processing device can, by implementing speech technology, perform speech recognition on the acquired voice segments to obtain the text corresponding to each voice segment, among other functions.
The audio and video processing device may also be capable of implementing Natural Language Processing (NLP), an important direction in the fields of computer science and artificial intelligence. NLP studies the theories and methods that enable effective communication between humans and computers in natural language. It is a science integrating linguistics, computer science and mathematics; research in this field involves natural language, the language people use every day, and is therefore closely related to linguistics. NLP technologies typically include text processing, semantic understanding, machine translation, robot question answering, knowledge graphs, and the like.
In the embodiments of the present application, by implementing NLP technology, the audio and video processing device can determine the subtitle text from the recognized text, translate the subtitle text, and perform other such functions.
If the audio/video processing device is a terminal device, the terminal device may be an intelligent terminal, a computer, a Personal Digital Assistant (PDA for short), a tablet computer, or the like.
If the audio and video processing device is a server, it may be a standalone server or a server cluster. After obtaining the subtitle text with the subtitle generating method, the server displays it, as the subtitle of the corresponding audio-stream time-axis interval, on the user's terminal device, thereby achieving real-time subtitle display during live video streaming.
In order to facilitate understanding of the technical solution of the present application, the following describes a subtitle generating method provided in the embodiments of the present application with reference to an actual application scenario.
Referring to fig. 1, fig. 1 is a schematic view of an application scenario of a subtitle generating method according to an embodiment of the present application. The scenario is introduced with the subtitle generating method applied to a server (that is, the audio and video processing device is a server). The scenario includes the server 101, which can acquire, in the order in which they are generated, a plurality of voice segments that come from the same audio stream and are segmented according to silence segments, such as voice segment 1, voice segment 2 and voice segment 3 in fig. 1.
The audio stream contains the speech uttered by a character in an object to be processed. The object to be processed, which contains the audio stream, may be audio and video generated in a live-streaming scenario, or an existing audio/video file, such as one that has been recorded or downloaded. The speech uttered by a character may be a streamer speaking in a live scenario, or a played audio file containing speech, such as a recording or a song.
A voice segment is a portion of the audio stream that contains speech information; a silence segment is a portion of the audio stream without speech information, and it may correspond to an expression pause within a sentence or an expression pause between sentences made by the speaker.
The relationship among the audio stream, the silence segments and the voice segments is shown in fig. 2. As can be seen from fig. 2, for the audio stream covering time points 0–t1 on the time axis, the audio stream can be divided, while it is being acquired, into a plurality of voice segments according to the silence segments, for example voice segment 1, voice segment 2, voice segment 3 and voice segment 4 in fig. 2.
It should be noted that the voice segment may be segmented by the server according to the silence segment when the server acquires the audio stream, or the server may directly acquire the segmented voice segment.
The server 101 performs voice recognition on the acquired voice fragments to obtain texts corresponding to the voice fragments, wherein the texts corresponding to the voice fragments respectively comprise separators added according to text semantics.
Because the separators are added based on semantics when the words in the voice segments are recognized, the text between separators forms complete sentences and reflects reasonable semantics, which greatly reduces the possibility of incomplete sentences appearing in the subtitle text later determined by the separators.
The delimiters may include punctuation marks and special symbols, wherein the punctuation marks may include periods, commas, exclamation marks, question marks, and the like; the special symbols may include space characters, underlines, vertical lines, diagonal lines, and the like.
When the server 101 determines a subtitle from the text corresponding to a certain voice segment among the plurality of voice segments, referred to as the target voice segment (in fig. 1, voice segment 2 is the target voice segment), a text group to be processed may be determined, and the text group to be processed includes at least the text corresponding to the target voice segment.
It should be noted that, in the embodiments of the present application, the subtitle text is determined based on separators for each voice segment. When a subtitle is determined from the text corresponding to the target voice segment, the target voice segment is not necessarily the first voice segment of the audio stream to be processed; part of its text may already have been used the last time a subtitle text was determined based on a separator. The text corresponding to the target voice segment may therefore not be the entire text of that segment, but only the portion remaining after the previous subtitle was generated.
Taking "two children play and hide on this clear day" as an example, the text corresponding to the segmented voice segments may be "two children play and hide on this clear day", where "," is a separator, and when determining the caption text based on the separator "," the text corresponding to the voice segments "on the clear day", the text corresponding to the two children play and hide "on the clear day", and the text corresponding to the voice segments "on the clear" are generated together, so that, when the voice segments "on the clear day" and two children play and hide "as the target voice segments, the text corresponding to the target voice segments is" two children play and hide ", and is the partial text corresponding to the target voice segments.
Of course, the text corresponding to the target voice segment may also be the entire text of the target voice segment, which is not limited in the embodiments of the present application.
For example, if the target voice segment is the first voice segment of the audio stream to be processed, or if none of its text was used the last time a subtitle text was determined according to a separator, then the text corresponding to the target voice segment is the entire text of that segment.
It should be noted that the text group to be processed may include only the text corresponding to the target voice segment, or it may include the texts of several voice segments, including that of the target voice segment. In the latter case, the text group to be processed may be formed by splicing the text corresponding to the target voice segment with the texts of one or more subsequent voice segments; how the text group to be processed is determined is described later. The subtitle text is then determined from the text group to be processed by means of the separators, and the subtitle for the corresponding audio-stream time-axis interval is generated from it.
In this embodiment, the subtitle text may be recognized in the language spoken in the voice segments, but the subtitle is not limited to that language. The language of the subtitle can be determined according to user requirements: it may be the language in the voice segments, another language, or several languages at once. For example, if the subtitle text is in English, the displayed subtitle may be an English subtitle, a Chinese subtitle, or a bilingual Chinese-English subtitle.
Next, a subtitle generating method provided by an embodiment of the present application will be described with reference to the accompanying drawings.
Referring to fig. 3, fig. 3 shows a flowchart of a subtitle generating method, which includes:
s301, a plurality of voice segments which are from the same audio stream and are segmented according to the mute segments are obtained.
Voice segments segmented according to silence segments may be numerous and may belong to different audio streams. In this embodiment, the acquired voice segments come from the same audio stream and are acquired sequentially in the order in which they were generated.
S302, carrying out voice recognition on the voice fragments to obtain texts corresponding to the voice fragments respectively.
When the texts corresponding to the voice segments are obtained through speech recognition, separators can be added to those texts according to text semantics, so that the subtitle text can later be determined by means of the separators.
And S303, determining a text group to be processed when determining the subtitles according to the text corresponding to the target voice fragment in the plurality of voice fragments.
For the voice segments of the same audio stream, their texts need to be processed in the order in which the voice segments were generated. The text on which the subtitle is currently to be based, that is, the text corresponding to the target voice segment, is determined, and the text group to be processed that is then determined includes at least the text of the target voice segment.
A voice segment boundary may result from an expression pause between sentences or from an expression pause within a sentence. To reduce the possibility that the text group to be processed contains an incomplete sentence because of a pause within a sentence, this embodiment provides a method for determining the text group to be processed.
Referring to fig. 4, the method includes:
s401, determining the time length of a silence segment among the voice segments.
The duration of a silence segment reflects, to a certain extent, whether it is an expression pause between sentences or within a sentence. In general, a silence segment produced by a pause within a sentence is relatively short, while one produced by a pause between sentences is relatively long. The determined silence durations therefore indicate which voice segments may need to be spliced with the target voice segment to form the text group to be processed.
The duration of a silence segment may be determined as follows: when the voice segments are acquired, record the end timestamp T_sil_begin of the current voice segment and the start timestamp T_sil_end of the next voice segment, and compute in turn the duration T_sil of the silence segment following the current voice segment, that is, T_sil = T_sil_end − T_sil_begin.
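As a minimal illustration of this bookkeeping (the segment representation and field names below are assumptions for the sketch, not taken from the patent), the silence durations between consecutive voice segments could be computed as follows:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class VoiceSegment:
    start: float  # start timestamp on the audio-stream time axis, in seconds
    end: float    # end timestamp on the audio-stream time axis, in seconds
    text: str     # recognized text, with separators added according to semantics

def silence_durations(segments: List[VoiceSegment]) -> List[float]:
    """Return T_sil for the silence segment following each voice segment.

    T_sil_begin is the end timestamp of the current voice segment and
    T_sil_end is the start timestamp of the next one, so
    T_sil = T_sil_end - T_sil_begin.
    """
    return [nxt.start - cur.end for cur, nxt in zip(segments, segments[1:])]
```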
S402, according to the sequence of the time axis of the audio stream, sequentially determining whether the time length of the mute section is greater than a preset time length from the target voice section.
The preset duration is determined according to the duration of the expression pause between sentences in the expression of the user under normal conditions, and whether the silent segment is the expression pause between sentences or the expression pause in a sentence can be determined according to the preset duration.
Referring to fig. 2, along the audio-stream time axis there are, in order, voice segment 1, silence segment A, voice segment 2, silence segment B, voice segment 3, silence segment C and voice segment 4, where the duration of silence segment A is T_sil-1, that of silence segment B is T_sil-2 and that of silence segment C is T_sil-3. If voice segment 1 is the target voice segment, it is necessary to determine, starting from silence segment A, whether the duration of each silence segment exceeds the preset duration. If it does not, the silence segment may be an expression pause within a sentence, and the check continues with silence segment B, and so on, until some silence segment is found whose duration exceeds the preset duration. That silence segment may be an expression pause between sentences, meaning that the texts of the voice segments before and after it may belong to two different sentences.
S403, if the time length of the target mute segment is determined to be greater than the preset time length, adding the text corresponding to the voice segment between the target mute segment and the target voice segment into the text group to be processed.
While it is being determined in sequence whether each silence segment exceeds the preset duration, all silence segments before the target silence segment are shorter than the preset duration, so the texts of the voice segments between the target silence segment and the target voice segment may belong to the same sentence and can therefore be spliced together. Once a silence segment (the target silence segment) is found whose duration exceeds the preset duration, the checking can stop; to reduce the possibility that the text group to be processed contains an incomplete sentence caused by a pause within a sentence, the texts of the voice segments between the target silence segment and the target voice segment are spliced to form the text group to be processed.
Referring to fig. 2, if it is determined in turn that T_sil-1 is less than the preset duration, that T_sil-2 is less than the preset duration, and that T_sil-3 is greater than the preset duration, then silence segments A and B may be expression pauses within a sentence while silence segment C may be an expression pause between sentences. Silence segment C is taken as the target silence segment, and the texts of voice segment 1, voice segment 2 and voice segment 3 are spliced into the text group to be processed.
In this way, the method uses the silence durations to determine, one by one, whether the silence segments after the target voice segment represent pauses within a sentence, so that texts separated by such pauses are spliced into the text group to be processed, reducing the possibility that the group contains an incomplete sentence caused by a pause within a sentence.
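A simplified reading of steps S401–S403, under the segment representation assumed in the earlier sketch, could look like the following; it is an illustration, not the patent's reference implementation:

```python
def build_pending_text_group(segments, target_index, preset_duration):
    """Splice texts from the target voice segment up to (and including) the
    voice segment that precedes the first silence longer than preset_duration.

    Returns the spliced text group and the index of the last segment used.
    """
    texts = [segments[target_index].text]
    i = target_index
    while i + 1 < len(segments):
        silence = segments[i + 1].start - segments[i].end
        if silence > preset_duration:
            break  # target silence segment found: likely a pause between sentences
        texts.append(segments[i + 1].text)  # pause within a sentence: keep splicing
        i += 1
    return "".join(texts), i
```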
S304, determining a subtitle text from the text group to be processed according to the separators in the text group to be processed.
Because the separators in the text group to be processed are added based on semantics when the words in the voice segments are recognized, the text between separators forms complete sentences and reflects reasonable semantics, so the possibility of incomplete sentences appearing in the subtitle text determined by the separators is greatly reduced.
For example, the text corresponding to voice segment 1 is "On this sunny" and the text corresponding to voice segment 2 is "day, two children are playing hide-and-seek". When voice segment 1 is the target voice segment, the text group to be processed determined through S303 is "On this sunny day, two children are playing hide-and-seek", where "," is a separator, and "On this sunny day" can then be determined as the subtitle text according to that separator. As processing continues, the "day" part of the text of voice segment 2 has already been used to generate the previous subtitle text, but the text "two children are playing hide-and-seek" still remains. When a subtitle is then determined from the text corresponding to voice segment 2 (now the target voice segment), the text corresponding to the target voice segment is the remaining part, "two children are playing hide-and-seek", rather than "day, two children are playing hide-and-seek", and S303–S305 are performed again for this remaining text.
By contrast, in the conventional approach the text of voice segment 1, "On this sunny", and the text of voice segment 2, "day, two children are playing hide-and-seek", would each become a subtitle text, and both subtitle texts contain incomplete sentences.
And S305, taking the subtitle text as the subtitle of the corresponding audio stream time axis interval.
When the subtitle text is displayed as the subtitle of the corresponding audio-stream time-axis interval, it helps users who watch the audio and video understand the content and improves the user experience.
According to the above technical solutions, in the process of generating subtitles from a plurality of voice segments that come from the same audio stream and are segmented according to silence segments, speech recognition is performed on the voice segments to obtain the text corresponding to each of them, and each text includes separators added according to text semantics. When a subtitle is to be determined from the text corresponding to a target voice segment, a text group to be processed is determined for generating the subtitle, and this group includes at least the text corresponding to the target voice segment. The separators in the text group to be processed then serve as the basis for determining the subtitle text from the group. Because these separators are added based on semantics when the words in the voice segments are recognized, the text between separators forms complete sentences and reflects reasonable semantics, so the possibility of incomplete sentences appearing in the subtitle text determined by the separators is greatly reduced. When the subtitle text is displayed as a subtitle over the corresponding audio-stream time-axis interval, it helps users who watch the audio and video understand the content and improves the user experience.
The foregoing embodiment describes a subtitle generating method. In the process of generating a subtitle, the subtitle text needs to be determined from the text group to be processed according to the separators, and the text group and its separators may present different situations; for example, the length of the displayed subtitle may need to be considered when deciding at which separator to cut, so the way of determining the subtitle text can differ from case to case. In this embodiment, the subtitle text may be determined in the different cases with reference to the following quantities and formula:
L_text denotes the length of the determined subtitle text; L_sil denotes the text length of the text group to be processed; L_seg denotes the preset number, which is determined according to the display subtitle length; L_punc denotes the length of the text from the first character of the text group to be processed to the last separator within the first L_seg characters of the group; L_max denotes the maximum number, i.e. the number of characters corresponding to the longest display subtitle length.
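Collected into a single piecewise expression, the three cases described below give the following (a reconstruction from those cases, with the boundary case L_sil = L_seg grouped with the first case as in the prose):

```latex
L_{\mathrm{text}} =
\begin{cases}
L_{\mathrm{sil}}, & L_{\mathrm{sil}} \le L_{\mathrm{seg}} \\
L_{\mathrm{punc}}, & L_{\mathrm{sil}} > L_{\mathrm{seg}} \ \text{and}\ L_{\mathrm{punc}} > 0 \\
\min(L_{\mathrm{sil}}, L_{\mathrm{max}}), & L_{\mathrm{sil}} > L_{\mathrm{seg}} \ \text{and}\ L_{\mathrm{punc}} = 0
\end{cases}
```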
Based on the above formula, the appropriate subtitle text can be determined under different circumstances. Next, the way of determining the subtitle text from the text group to be processed under different conditions will be described one by one.
The first case is that the text length of the text group to be processed is less than the preset number, i.e. L_sil < L_seg. In this case the subtitle text is determined by L_text = L_sil.
Specifically, the length of a displayed subtitle is generally constrained by the subtitle font size, the size of the display screen, user experience and other factors, so the displayed subtitle needs a reasonable length, namely the display subtitle length, which can be expressed as a number of characters. Therefore, after the text group to be processed is obtained, it can be judged whether the number of characters in the group exceeds the preset number, that is, whether L_sil is greater than L_seg; the preset number is determined according to the display subtitle length and is the number of characters in a subtitle of that length. If not, the text group to be processed already meets the display-length requirement and can be determined directly as the subtitle text, i.e. L_text = L_sil.
The second case is that the text length of the text group to be processed is greater than the preset number and a separator is present, i.e. L_sil > L_seg and L_punc > 0. In this case the subtitle text is determined by L_text = L_punc.
That is, if the number of characters in the text group to be processed is greater than the preset number (L_sil > L_seg), the group has too many characters and must be cut so that a subtitle text meeting the display-length requirement is obtained from it; if the group contains a separator (L_punc > 0), S304 can be executed to determine the subtitle text, i.e. L_text = L_punc.
The manner of determining the subtitle text according to the separators was briefly described in the embodiment corresponding to fig. 3 (S304). The following describes how the subtitle text is determined from the text group to be processed according to the separators, that is, how L_punc is determined.
Determining the subtitle text from the text group to be processed according to the separators can be done in two ways. In the first determination manner, the part between the first character of the text group to be processed and its last separator is determined as the subtitle text; that is, L_punc is the length of the text from the first character of the group to its last separator.
For example, suppose the text group to be processed is "On this sunny day, two children are playing hide-and-seek, and they are having great fun. However". According to the first determination manner, the first character of the group is the "O" of "On" and the last separator is the period, so the part between them can be taken as the subtitle text, namely "On this sunny day, two children are playing hide-and-seek, and they are having great fun.".
However, in some cases, to further ensure that the subtitle text determined from the text group to be processed according to the separators meets the display-length requirement, the display subtitle length can be taken into account at the same time. In the second determination manner, the part between the first character of the text group to be processed and the last separator within the first preset number of characters of the group is determined as the subtitle text, where the preset number is determined according to the display subtitle length; that is, L_punc is the length of the text from the first character of the group to the last separator within its first preset number of characters.
For example, suppose again that the text group to be processed is "On this sunny day, two children are playing hide-and-seek, and they are having great fun. However" and the preset number is 25. According to the second determination manner, the last separator within the first 25 characters of the group is the comma after the hide-and-seek clause, so the part from the first character up to and including that comma can be taken as the subtitle text, namely "On this sunny day, two children are playing hide-and-seek,". The subtitle text determined by the second determination manner thus contains 19 characters (the character counts follow the original Chinese-language example), meets the display-length requirement, and gives a better user experience.
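A small sketch of this second determination manner is given below; the SEPARATORS set is illustrative only, standing in for whatever punctuation marks and special symbols the recognizer emits:

```python
SEPARATORS = frozenset("，。！？；,.!?; _|/")  # punctuation and special symbols; illustrative

def cut_at_last_separator(pending_text: str, preset_number: int) -> str:
    """Return the part of pending_text from its first character up to (and
    including) the last separator found within its first preset_number
    characters; return "" if no separator occurs in that window."""
    window = pending_text[:preset_number]
    for i in range(len(window) - 1, -1, -1):
        if window[i] in SEPARATORS:
            return pending_text[:i + 1]
    return ""
```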
The third case is that the text length of the text group to be processed is greater than the preset number and no separator is present, i.e. L_sil > L_seg and L_punc = 0. In this case the subtitle text is determined by L_text = min(L_sil, L_max).
It should be noted that determining the subtitle text from the text group to be processed according to its separators (S304) presupposes that the group contains separators. In some cases, however, the text group to be processed contains no separator; for example, it might be "The home address of the child in red clothes is Room 301, Unit 2, Building 3, No. 5 Courtyard, Zhongguancun South Street, Haidian District, Beijing". The following therefore describes how the subtitle text is determined when the number of characters in the text group to be processed is greater than the preset number and the group contains no separator.
The display subtitle length is a reasonable length for a displayed subtitle, and the subtitle length is further bounded by the longest display subtitle length, so the subtitle text can also be determined with reference to that maximum. That the number of characters in the text group to be processed exceeds the preset number only means it exceeds the usual display subtitle length; it does not mean the group is unacceptable as a subtitle text, as long as its number of characters does not exceed the number of characters corresponding to the longest display subtitle length.
Specifically, when the number of characters in the text group to be processed is greater than the preset number and the group contains no separator, it can further be judged whether the number of characters in the group exceeds the maximum number, i.e. whether L_sil is greater than L_max, the maximum number L_max being the number of characters corresponding to the longest display subtitle length. If it does, the group exceeds the longest acceptable display subtitle length and part of it must be cut out as the subtitle text; for example, the first L_max characters of the group can be determined as the subtitle text. If it does not, the group is within the longest acceptable display subtitle length and can be determined directly as the subtitle text. In other words, the shorter of the two lengths is used, i.e. L_text = min(L_sil, L_max).
For example, the text group to be processed is "the home address of the child wearing red clothes is" the home address of the child is "building 301 room of department 2, building 3, and room 5 in the south guancun avenue of the hai lake region of beijing city", the maximum number is 30, at this time, the number of characters of the text group to be processed is 43, then, the number of characters of the text group to be processed is 43 greater than the maximum number 30, the character at the front 30 of the text group to be processed can be determined as the caption text, that is, the caption text is "the home address of the child wearing red clothes is" the south guancun avenue of the south guan province of the hai lake region of beijing city ".
As another example, suppose the text group to be processed is "The child's home address is No. 5 Courtyard, Zhongguancun South Street, Haidian District, Beijing" and the maximum number is 30. The group contains 26 characters, fewer than the maximum number 30, so the whole group can be determined as the subtitle text, namely "The child's home address is No. 5 Courtyard, Zhongguancun South Street, Haidian District, Beijing".
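Combining the three cases, the overall selection of the subtitle text can be sketched as follows; this is a simplified reading of the cases above (for instance, a separator lying beyond the first preset_number characters is treated the same as no separator), not the patent's reference implementation:

```python
SEPARATORS = frozenset("，。！？；,.!?; _|/")  # illustrative separator set, as before

def select_subtitle_text(pending_text: str, preset_number: int, max_number: int) -> str:
    """Pick the subtitle text from the pending text group.

    Case 1: the group already fits the display subtitle length.
    Case 2: the group is too long but a separator lies within the window.
    Case 3: the group is too long and no separator lies within the window.
    """
    if len(pending_text) <= preset_number:            # case 1: L_sil <= L_seg
        return pending_text                           # L_text = L_sil
    window = pending_text[:preset_number]
    cut = max((i + 1 for i, ch in enumerate(window) if ch in SEPARATORS), default=0)
    if cut > 0:                                       # case 2: L_punc > 0
        return pending_text[:cut]                     # L_text = L_punc
    return pending_text[:max_number]                  # case 3: L_text = min(L_sil, L_max)
```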
The purpose of determining the subtitle text is to generate subtitles for the corresponding audio stream, and then, how to generate subtitles for the time axis interval of the corresponding audio stream according to the subtitle text will be described.
It should be noted that, in the conventional method for generating subtitles, only the silence segments are used to segment the audio and determine the subtitle text, and the subtitle for the corresponding time axis interval of the audio stream is then generated from that text, so only the time offset of each voice segment needs to be recorded during segmentation. In the embodiment of the present application, however, the subtitle text may be further subdivided according to the separators when it is determined from the text group to be processed, and the time offset of the voice segment alone is no longer sufficient to guarantee that the determined subtitle text is placed accurately on the time axis. Therefore, this embodiment provides a method for generating the subtitle of the corresponding audio stream time axis interval according to the subtitle text; referring to fig. 5, the method includes:
S501, determining the relative starting time of the first character in the caption text in the corresponding voice segment.
S502, determining the starting time of the time axis interval of the audio stream corresponding to the subtitle text according to the relative starting time and the time offset of the voice clip corresponding to the first character on the time axis of the audio stream.
S503, determining the relative end time of the last character in the caption text in the corresponding voice segment.
S504, determining the end time of the time axis interval of the audio stream corresponding to the subtitle text according to the relative end time and the time offset of the voice clip corresponding to the last character on the time axis of the audio stream.
In this way, based on the start time and the end time of the audio stream time axis interval corresponding to the subtitle text, the subtitle of the corresponding audio stream time axis interval can be generated according to the subtitle text.
It will be appreciated that, when the subtitle text is subdivided according to the separators, the relative start time and relative end time of each character in the subtitle text may be determined by the speech recognition engine. The relative start time and the relative end time can be expressed per character; for example, it can be determined that Word_1 in the subtitle text has a relative start time of 500 ms and a relative end time of 750 ms, and so on.
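The exact expression format used by the speech recognition engine is not reproduced in this text; the snippet below only sketches one plausible shape that is consistent with the example above, and the field names are assumptions made for illustration.

```python
# Hypothetical per-word relative timing records; "word", "start" and "end" are
# illustrative field names, with times in milliseconds relative to the start
# of the voice segment that contains the word.
word_timings = [
    {"word": "Word_1", "start": 500, "end": 750},
    {"word": "Word_2", "start": 760, "end": 1020},  # made-up follow-on entry
]
```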
Referring to fig. 6, the determined subtitle text is shown between positions A and B in the figure, where A corresponds to the first character in the subtitle text and B corresponds to the last character. The start time of the audio stream time axis interval corresponding to the subtitle text is the time corresponding to position A, and the end time is the time corresponding to position B.
As can be seen from fig. 6, the relative start time of the first character in its corresponding voice segment is t1, that voice segment is voice segment 2, and the time offset of voice segment 2 on the audio stream time axis is t2; therefore, according to the relative start time t1 and the time offset t2, the start time of the audio stream time axis interval corresponding to the subtitle text is t1 + t2. The relative end time of the last character in its corresponding voice segment is t3, that voice segment is voice segment 3, and the time offset of voice segment 3 on the audio stream time axis is t4; therefore, according to the relative end time t3 and the time offset t4, the end time of the audio stream time axis interval corresponding to the subtitle text is t3 + t4.
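The computation of S501-S504 can be sketched as follows, assuming the relative times of the first and last characters and the time offsets of their voice segments are already available from the recognition step; all times are in milliseconds and the function name is an illustrative assumption.

```python
def subtitle_interval(first_rel_start: int, first_seg_offset: int,
                      last_rel_end: int, last_seg_offset: int) -> tuple[int, int]:
    """Map a subtitle text onto the audio stream time axis (S501-S504).

    In the notation of fig. 6: start = t1 + t2, end = t3 + t4.
    """
    start = first_rel_start + first_seg_offset  # relative start + segment offset (t1 + t2)
    end = last_rel_end + last_seg_offset        # relative end + segment offset (t3 + t4)
    return start, end
```

For the quantities of fig. 6, subtitle_interval(t1, t2, t3, t4) would return (t1 + t2, t3 + t4).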
In this way, the method combines the relative times of the characters within their corresponding voice segments with the time offsets of those voice segments, thereby ensuring the accuracy of the time corresponding to the determined subtitle text on the time axis.
It can be understood that, in many cases, the language of the voice segment in the audio/video is not the language used by the user on a daily basis, and at this time, in order to help the user watching the audio/video understand the audio/video content, the subtitle text as the subtitle should be represented by the language used by the user on a daily basis. Therefore, in this embodiment, the subtitle text determined in S304 may also be translated according to the subtitle display language to obtain a translated subtitle text, and the translated subtitle text is used as a subtitle of the time axis interval of the corresponding audio stream.
The caption display language can be set by the user as required. For example, if the language of the voice segments in the audio/video is English and the user is a Chinese speaker, the caption display language can be set to Chinese; the subtitle text in English is then translated into Chinese, and the Chinese text is used as the translated subtitle text and displayed as the subtitle of the corresponding audio stream time axis interval, which makes it easier for the user to understand the audio/video content.
Next, the subtitle generating method provided in this embodiment of the present application is described with reference to a specific scenario: a live video of a speaker. Assume that the speaker gives the speech in English; to help viewers of the live video understand the speech, subtitles need to be generated in real time, and the generated subtitles may be bilingual Chinese-English subtitles so that viewers can both learn and understand. In this scenario, referring to fig. 7, the subtitle generating method includes:
S701, obtaining a plurality of voice segments which come from the same audio stream and are segmented according to the mute segments.
S702, determining the time length of the mute segments among the voice segments.
S703, performing voice recognition on the plurality of voice segments to obtain texts corresponding to the plurality of voice segments respectively.
S704, when determining the subtitle according to the text corresponding to the target voice segment among the plurality of voice segments, sequentially determining, in the order of the audio stream time axis and starting from the target voice segment, whether the time length of each mute segment is greater than a preset time length; if so, executing S705; if not, continuing with S704.
S705, adding the text corresponding to the voice segment between the target mute segment and the target voice segment into the text group to be processed.
S706, judging whether the number of the characters of the text group to be processed is larger than a preset number, if so, executing S707, and if not, executing S711.
S707, determining whether the text group to be processed includes a separator, if so, executing S708, and if not, executing S709.
S708, determining a subtitle text from the text group to be processed according to the separators in the text group to be processed.
S709, judging whether the number of the characters of the text group to be processed is larger than the maximum number, if so, executing S710, and if not, executing S711.
S710, determining the maximum number of characters in front of the text group to be processed as the subtitle text.
S711, determining the text group to be processed as the subtitle text.
S712, translating the subtitle text through machine translation.
S713, taking the caption text and the translated caption text as the caption of the corresponding audio stream time axis interval.
In this application scenario, the structural flow of subtitle generation can be seen in fig. 8. Segmenting the audio based on the silence segments to obtain voice segment 1, ..., voice segment 4, and so on corresponds to S701 in fig. 7; re-dividing the texts corresponding to the voice segments based on the mute segments and semantics to obtain the subtitle texts corresponds to S702-S711 in fig. 7; translating the subtitle texts by machine translation to obtain the translated subtitle texts, for example machine-translating subtitle text 1 to obtain subtitle text 1', corresponds to S712 in fig. 7; and merging the subtitle texts and the machine-translated subtitle texts with the audio stream time axis to generate the corresponding subtitles corresponds to S713 in fig. 7. After the subtitles are obtained, they can be pushed and played in real time.
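As a rough illustration of how S704-S713 fit together for one target voice segment, the sketch below strings the decisions into a single function. The recognized texts (with semantic separators) and silence durations are assumed to be available from S701-S703, the machine translation step is represented by a caller-supplied placeholder, and the separator set, preset number, maximum number, and silence threshold are illustrative assumptions rather than values taken from the original method.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Segment:
    text: str             # recognized text containing semantic separators (S703)
    silence_after: float  # duration of the mute segment that follows, in seconds (S702)

def build_caption(segments: List[Segment],
                  translate: Callable[[str], str],   # placeholder for machine translation
                  preset_chars: int = 20,            # "preset number" (normal display length)
                  max_chars: int = 30,               # "maximum number" (longest display length)
                  silence_threshold: float = 0.5) -> Tuple[str, str]:
    """Sketch of S704-S713 for one target voice segment (the first in `segments`)."""
    # S704-S705: accumulate texts until a sufficiently long mute segment is reached.
    pending = ""
    for seg in segments:
        pending += seg.text
        if seg.silence_after > silence_threshold:
            break

    # S706-S711: choose the caption text from the pending text group.
    if len(pending) <= preset_chars:
        caption = pending                                    # S711
    else:
        separator_positions = [i for i, ch in enumerate(pending) if ch in ",.!?;"]
        if separator_positions:
            caption = pending[:separator_positions[-1] + 1]  # S708: up to the last separator
        elif len(pending) > max_chars:
            caption = pending[:max_chars]                    # S710: first max_chars characters
        else:
            caption = pending                                # S711

    # S712-S713: translate and pair original and translated text as the bilingual subtitle.
    return caption, translate(caption)
```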
According to the technical scheme, in the process of generating subtitles from a plurality of voice segments which come from the same audio stream and are segmented according to the mute segments, voice recognition is performed on the plurality of voice segments to obtain the texts corresponding to the plurality of voice segments respectively, and these texts include separators added according to text semantics. When the subtitle is determined according to the text corresponding to the target voice segment, a text group to be processed for generating the subtitle is determined, and the text group to be processed at least includes the text corresponding to the target voice segment. After the text group to be processed is determined, the separators in it can be used as the basis for determining the caption text. Because these separators are added based on semantics when the text in the voice segments is recognized, and the text between separators forms complete sentences with reasonable semantics, the possibility that incomplete sentences appear in the caption text determined by the separators is greatly reduced. When the caption text is displayed as the caption of the corresponding audio stream time axis interval, it therefore helps users watching the audio/video to understand the content and improves user experience.
Based on a subtitle generating method provided by the foregoing embodiment, the present embodiment provides a subtitle generating apparatus 900, referring to fig. 9a, the apparatus 900 includes an obtaining unit 901, an identifying unit 902, a first determining unit 903, a second determining unit 904, and a generating unit 905:
the acquiring unit 901 is configured to acquire a plurality of voice segments that are derived from the same audio stream and segmented according to the silence segment;
the recognition unit 902 is configured to perform speech recognition on the multiple speech segments to obtain texts corresponding to the multiple speech segments, where the texts corresponding to the multiple speech segments include separators added according to text semantics;
the first determining unit 903 is configured to determine a to-be-processed text group when processing a text of a target speech segment in the plurality of speech segments, where the to-be-processed text group at least includes a text of the target speech segment;
the second determining unit 904 is configured to determine a subtitle text from the to-be-processed text group according to the separator in the to-be-processed text group;
the generating unit 905 is configured to use the subtitle text as a subtitle of a time axis interval of the corresponding audio stream.
In one implementation, referring to fig. 9b, the apparatus 900 further comprises a third determining unit 906:
the third determining unit 906, configured to determine a time length of a silence segment between the plurality of voice segments;
the first determining unit 903 is specifically configured to determine, according to an order of a time axis of an audio stream, whether a time length of a silence segment is greater than a preset time length from the target voice segment in sequence;
and if the time length of the target mute segment is determined to be greater than the preset time length, adding the text corresponding to the voice segment between the target mute segment and the target voice segment into the text group to be processed.
In one implementation, referring to fig. 9c, the apparatus 900 further includes a first determining unit 907 and a fourth determining unit 908:
the first determining unit 907 is configured to determine whether the number of characters in the to-be-processed text group is greater than a preset number, where the preset number is determined according to a length of a displayed subtitle;
if the first determining unit 907 determines that the number of characters in the to-be-processed text group is greater than the preset number, the second determining unit 904 is triggered to execute the step of determining the subtitle text from the to-be-processed text group according to the separators in the to-be-processed text group;
the fourth determining unit 908 is configured to determine the text group to be processed as the subtitle text if the first determining unit 907 determines that the number of characters of the text group to be processed is not greater than a preset number.
In an implementation manner, the second determining unit 904 is specifically configured to:
determining a part from the first character to the last separator in the text group to be processed as a subtitle text; or,
and determining a part between a first character in the text group to be processed and a last separator in a preset number of characters before the text group to be processed as a subtitle text, wherein the preset number is determined according to the length of a displayed subtitle.
In one implementation, if the first determining unit 907 determines that the number of characters in the to-be-processed text group is greater than the preset number and the to-be-processed text group does not include a separator, referring to fig. 9d, the apparatus 900 further includes a second determining unit 909 and a fifth determining unit 910:
the second determining unit 909 is configured to determine whether the number of characters of the to-be-processed text group is greater than a maximum number, where the maximum number is a number of characters corresponding to a longest length of a displayed subtitle;
the fifth determining unit 910 is configured to determine, if the second determining unit 909 determines that the number of characters of the text group to be processed is greater than the maximum number, the maximum number of characters before the text group to be processed is determined as the subtitle text;
if the second determining unit 909 determines that the number of characters in the to-be-processed text group is not greater than the maximum number, the fourth determining unit 908 is triggered to execute the step of determining the to-be-processed text group as the subtitle text.
In one implementation, referring to fig. 9e, the apparatus 900 further includes a sixth determining unit 911, a seventh determining unit 912, an eighth determining unit 913, and a ninth determining unit 914:
the sixth determining unit 911 is configured to determine a relative starting time of a first character in the subtitle text in the corresponding speech segment;
the seventh determining unit 912, configured to determine a starting time of an audio stream time axis interval corresponding to the subtitle text according to the relative starting time and a time offset of the voice segment corresponding to the first character on the audio stream time axis;
the eighth determining unit 913 is configured to determine a relative end time of the last character in the subtitle text in the corresponding speech segment;
the ninth determining unit 914 is configured to determine the ending time of the time axis interval of the audio stream corresponding to the subtitle text according to the relative ending time and the time offset of the voice segment corresponding to the last character on the time axis of the audio stream.
According to the technical scheme, in the process of generating subtitles from a plurality of voice segments which come from the same audio stream and are segmented according to the mute segments, voice recognition is performed on the plurality of voice segments to obtain the texts corresponding to the plurality of voice segments respectively, and these texts include separators added according to text semantics. When the subtitle is determined according to the text corresponding to the target voice segment, a text group to be processed for generating the subtitle is determined, and the text group to be processed at least includes the text corresponding to the target voice segment. After the text group to be processed is determined, the separators in it can be used as the basis for determining the caption text. Because these separators are added based on semantics when the text in the voice segments is recognized, and the text between separators forms complete sentences with reasonable semantics, the possibility that incomplete sentences appear in the caption text determined by the separators is greatly reduced. When the caption text is displayed as the caption of the corresponding audio stream time axis interval, it therefore helps users watching the audio/video to understand the content and improves user experience.
The embodiment of the present application further provides an apparatus for generating subtitles, which is described below with reference to the accompanying drawings. Referring to fig. 10, an embodiment of the present application provides an apparatus 1000 for subtitle generation. The apparatus 1000 may be a server, which may differ considerably in configuration or performance, and may include one or more central processing units (CPUs) 1022 (e.g., one or more processors), a memory 1032, and one or more storage media 1030 (e.g., one or more mass storage devices) storing an application program 1042 or data 1044. The memory 1032 and the storage medium 1030 may be transient or persistent storage. The program stored on the storage medium 1030 may include one or more modules (not shown), and each module may include a series of instruction operations for the server. Still further, the central processing unit 1022 may be configured to communicate with the storage medium 1030 and execute, on the apparatus 1000 for subtitle generation, the series of instruction operations in the storage medium 1030.
The apparatus 1000 for subtitle generation may also include one or more power supplies 1026, one or more wired or wireless network interfaces 1050, one or more input/output interfaces 1058, and/or one or more operating systems 1041, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and so on.
The steps performed by the server in the above embodiment may be based on the server structure shown in fig. 10.
The CPU 1022 is configured to execute the following steps:
acquiring a plurality of voice fragments which come from the same audio stream and are segmented according to the mute fragments;
performing voice recognition on the voice fragments to obtain texts corresponding to the voice fragments respectively, wherein the texts corresponding to the voice fragments respectively comprise separators added according to text semantics;
when a subtitle is determined according to a text corresponding to a target voice fragment in the voice fragments, determining a text group to be processed, wherein the text group to be processed at least comprises a text corresponding to the target voice fragment;
determining a subtitle text from the text group to be processed according to the separators in the text group to be processed;
and taking the subtitle text as the subtitle of the time axis interval of the corresponding audio stream.
Referring to fig. 11, an apparatus 1100 for generating subtitles is provided in an embodiment of the present application. The apparatus 1100 may also be a terminal device, which may be any terminal device including a mobile phone, a tablet computer, a personal digital assistant (PDA), a point-of-sale (POS) terminal, a vehicle-mounted computer, and the like; the following takes the case where the terminal device is a mobile phone as an example:
fig. 11 is a block diagram illustrating a partial structure of a mobile phone related to a terminal device provided in an embodiment of the present application. Referring to fig. 11, the cellular phone includes: a Radio Frequency (RF) circuit 1110, a memory 1120, an input unit 1130, a display unit 1140, a sensor 1150, an audio circuit 1160, a wireless fidelity (WiFi) module 1170, a processor 1180, and a power supply 1190. Those skilled in the art will appreciate that the handset configuration shown in fig. 11 is not intended to be limiting and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components.
The following describes each component of the mobile phone in detail with reference to fig. 11:
The memory 1120 may be used to store software programs and modules, and the processor 1180 may execute various functional applications and data processing of the mobile phone by operating the software programs and modules stored in the memory 1120. The memory 1120 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data (such as audio data, a phonebook, etc.) created according to the use of the cellular phone, and the like. Further, the memory 1120 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device.
The input unit 1130 may be used to receive input numeric or character information and generate key signal inputs related to user settings and function control of the cellular phone. Specifically, the input unit 1130 may include a touch panel 1131 and other input devices 1132. Touch panel 1131, also referred to as a touch screen, can collect touch operations of a user on or near the touch panel 1131 (for example, operations of the user on or near touch panel 1131 by using any suitable object or accessory such as a finger or a stylus pen), and drive corresponding connection devices according to a preset program. Alternatively, the touch panel 1131 may include two parts, namely, a touch detection device and a touch controller. The touch detection device detects the touch direction of a user, detects a signal brought by touch operation and transmits the signal to the touch controller; the touch controller receives touch information from the touch sensing device, converts the touch information into touch point coordinates, sends the touch point coordinates to the processor 1180, and can receive and execute commands sent by the processor 1180. In addition, the touch panel 1131 can be implemented by using various types, such as resistive, capacitive, infrared, and surface acoustic wave. The input unit 1130 may include other input devices 1132 in addition to the touch panel 1131. In particular, other input devices 1132 may include, but are not limited to, one or more of a physical keyboard, function keys (e.g., volume control keys, switch keys, etc.), a trackball, a mouse, a joystick, and the like.
The display unit 1140 may be used to display information input by the user or information provided to the user and various menus of the cellular phone. The Display unit 1140 may include a Display panel 1141, and optionally, the Display panel 1141 may be configured in a form of a Liquid Crystal Display (LCD), an Organic Light-Emitting Diode (OLED), or the like. Further, the touch panel 1131 can cover the display panel 1141, and when the touch panel 1131 detects a touch operation on or near the touch panel, the touch panel is transmitted to the processor 1180 to determine the type of the touch event, and then the processor 1180 provides a corresponding visual output on the display panel 1141 according to the type of the touch event. Although in fig. 11, the touch panel 1131 and the display panel 1141 are two independent components to implement the input and output functions of the mobile phone, in some embodiments, the touch panel 1131 and the display panel 1141 may be integrated to implement the input and output functions of the mobile phone.
The handset may also include at least one sensor 1150, such as a light sensor, motion sensor, and other sensors. Specifically, the light sensor may include an ambient light sensor and a proximity sensor, wherein the ambient light sensor may adjust the brightness of the display panel 1141 according to the brightness of ambient light, and the proximity sensor may turn off the display panel 1141 and/or the backlight when the mobile phone moves to the ear. As one of the motion sensors, the accelerometer sensor can detect the magnitude of acceleration in each direction (generally, three axes), can detect the magnitude and direction of gravity when stationary, and can be used for applications of recognizing the posture of a mobile phone (such as horizontal and vertical screen switching, related games, magnetometer posture calibration), vibration recognition related functions (such as pedometer and tapping), and the like; as for other sensors such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor, which can be configured on the mobile phone, further description is omitted here.
WiFi belongs to short-distance wireless transmission technology, and the cell phone can help a user to receive and send e-mails, browse webpages, access streaming media and the like through the WiFi module 1170, and provides wireless broadband internet access for the user. Although fig. 11 shows the WiFi module 1170, it is understood that it does not belong to the essential constitution of the handset, and can be omitted entirely as needed within the scope not changing the essence of the invention.
The processor 1180 is a control center of the mobile phone, and is connected to various parts of the whole mobile phone through various interfaces and lines, and executes various functions of the mobile phone and processes data by operating or executing software programs and/or modules stored in the memory 1120 and calling data stored in the memory 1120, thereby performing overall monitoring of the mobile phone. Optionally, processor 1180 may include one or more processing units; preferably, the processor 1180 may integrate an application processor, which mainly handles operating systems, user interfaces, application programs, etc., and a modem processor, which mainly handles wireless communications. It will be appreciated that the modem processor described above may not be integrated within processor 1180.
The handset also includes a power supply 1190 (e.g., a battery) for powering the various components. Preferably, the power supply may be logically connected to the processor 1180 via a power management system, so that charging, discharging, and power consumption are managed through the power management system.
Although not shown, the mobile phone may further include a camera, a bluetooth module, etc., which are not described herein.
In this embodiment, the processor 1180 included in the terminal device further has the following functions:
acquiring a plurality of voice fragments which come from the same audio stream and are segmented according to the mute fragments;
performing voice recognition on the voice fragments to obtain texts corresponding to the voice fragments respectively, wherein the texts corresponding to the voice fragments respectively comprise separators added according to text semantics;
when a subtitle is determined according to a text corresponding to a target voice fragment in the voice fragments, determining a text group to be processed, wherein the text group to be processed at least comprises a text corresponding to the target voice fragment;
determining a subtitle text from the text group to be processed according to the separators in the text group to be processed;
and taking the subtitle text as the subtitle of the time axis interval of the corresponding audio stream.
An embodiment of the present application further provides a computer-readable storage medium for storing a program code, where the program code is configured to execute any one implementation of a subtitle generating method described in the foregoing embodiments.
The terms "first," "second," "third," "fourth," and the like in the description of the application and the above-described figures, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are, for example, capable of operation in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be understood that in the present application, "at least one" means one or more, "a plurality" means two or more. "and/or" for describing an association relationship of associated objects, indicating that there may be three relationships, e.g., "a and/or B" may indicate: only A, only B and both A and B are present, wherein A and B may be singular or plural. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. "at least one of the following" or similar expressions refer to any combination of these items, including any combination of single item(s) or plural items. For example, at least one (one) of a, b, or c, may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", wherein a, b, c may be single or plural.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be substantially implemented or contributed to by the prior art, or all or part of the technical solution may be embodied in a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.
Claims (11)
1. A subtitle generating method based on artificial intelligence is characterized by comprising the following steps:
acquiring a plurality of voice fragments which come from the same audio stream and are segmented according to the mute fragments;
performing voice recognition on the voice fragments to obtain texts corresponding to the voice fragments respectively;
determining a time length of a silence segment between the plurality of speech segments;
determining whether the time length of a silent segment is greater than a preset time length in sequence from a target voice segment in the plurality of voice segments according to the sequence of a time axis of the audio stream, wherein the preset time length is determined according to the time length of an expression pause between sentences;
if the time length of the target mute segment is determined to be greater than the preset time length, adding a text corresponding to the voice segment between the target mute segment and the target voice segment into a text group to be processed; the text group to be processed at least comprises a text corresponding to the target voice fragment;
determining a subtitle text from the text group to be processed according to the separators in the text group to be processed; the separator in the text group to be processed is added according to text semantics;
determining the relative starting time of a first character in the caption text in the corresponding voice segment;
determining the starting time of the time axis interval of the audio stream corresponding to the subtitle text according to the relative starting time and the time offset of the voice clip corresponding to the first character on the time axis of the audio stream;
determining the relative ending time of the last character in the caption text in the corresponding voice segment;
determining the end time of the time axis interval of the audio stream corresponding to the subtitle text according to the relative end time and the time offset of the voice clip corresponding to the last character on the time axis of the audio stream;
taking the caption text as the caption of the time axis interval of the corresponding audio stream;
the formula for determining the subtitle text is as follows:
wherein L_text is the determined text length of the subtitle; L_sil is the length of the text group to be processed; L_seg is the preset number, which is determined according to the length of a displayed subtitle; L_punc is the text length from the first character to the last separator within the preset number of characters at the front of the text group to be processed, or the text length from the first character to the last separator in the text group to be processed; and L_max is the number of characters corresponding to the longest length of a displayed subtitle.
2. The method of claim 1, further comprising:
judging whether the number of characters of the text group to be processed is larger than a preset number, wherein the preset number is determined according to the length of a displayed caption;
if so, determining a caption text from the text group to be processed according to the separators in the text group to be processed;
and if not, determining the text group to be processed as the subtitle text.
3. The method according to claim 1, wherein the determining caption text from the group of texts to be processed according to separators in the group of texts to be processed comprises:
determining a part from the first character to the last separator in the text group to be processed as a subtitle text; or,
and determining a part between a first character in the text group to be processed and a last separator in a preset number of characters before the text group to be processed as a subtitle text, wherein the preset number is determined according to the length of a displayed subtitle.
4. The method according to claim 2, wherein if it is determined that the number of characters in the to-be-processed text group is greater than the preset number and the to-be-processed text group does not include a separator, the method further comprises:
judging whether the number of characters of the text group to be processed is larger than the maximum number, wherein the maximum number is the number of characters corresponding to the longest length of the displayed caption;
if so, determining the maximum number of characters in front of the text group to be processed as the subtitle text;
and if not, determining the text group to be processed as the subtitle text.
5. The method of claim 1, further comprising:
translating the subtitle text according to the subtitle display language to obtain a translated subtitle text;
the taking the caption text as the caption of the time axis interval of the corresponding audio stream includes:
and taking the translated caption text as the caption of the time axis interval of the corresponding audio stream.
6. An artificial intelligence-based subtitle generating apparatus, comprising an obtaining unit, an identifying unit, a first determining unit, a second determining unit, a third determining unit, and a generating unit, the apparatus further comprising a sixth determining unit, a seventh determining unit, an eighth determining unit, and a ninth determining unit:
the acquiring unit is used for acquiring a plurality of voice fragments which are from the same audio stream and are segmented according to the mute fragments;
the recognition unit is used for performing voice recognition on the voice fragments to obtain texts corresponding to the voice fragments respectively;
the third determining unit is configured to determine a time length of a silence segment between the plurality of voice segments;
the first determining unit is used for sequentially determining whether the time length of the silent segments is greater than a preset time length from the target voice segment according to the sequence of the time axis of the audio stream, and adding the text corresponding to the voice segment between the target silent segment and the target voice segment into a text group to be processed if the time length of the target silent segment is greater than the preset time length; the text group to be processed at least comprises the text of the target voice fragment, and the preset duration is determined according to the duration of the expression pause between sentences;
the second determining unit is used for determining a subtitle text from the text group to be processed according to the separators in the text group to be processed; the separator in the text group to be processed is added according to text semantics;
the sixth determining unit is configured to determine a relative start time of a first character in the subtitle text in the corresponding speech segment;
the seventh determining unit is configured to determine, according to the relative start time and a time offset of the voice segment corresponding to the first character on an audio stream time axis, a start time of an audio stream time axis interval corresponding to the subtitle text;
the eighth determining unit is configured to determine a relative end time of a last character in the subtitle text in the corresponding speech segment;
the ninth determining unit is configured to determine, according to the relative end time and a time offset of the voice segment corresponding to the last character on an audio stream time axis, an end time of an audio stream time axis interval corresponding to the subtitle text;
the generating unit is used for taking the caption text as the caption of the corresponding audio stream time axis interval;
wherein, the formula of the caption text determined by the second determining unit is:
wherein L_text is the determined text length of the subtitle; L_sil is the length of the text group to be processed; L_seg is the preset number, which is determined according to the length of a displayed subtitle; L_punc is the text length from the first character to the last separator within the preset number of characters at the front of the text group to be processed, or the text length from the first character to the last separator in the text group to be processed; and L_max is the number of characters corresponding to the longest length of a displayed subtitle.
7. The apparatus according to claim 6, further comprising a first judging unit and a fourth determining unit:
the first judging unit is used for judging whether the number of the characters of the text group to be processed is larger than a preset number, and the preset number is determined according to the length of the displayed caption;
if the first judging unit judges that the number of the characters of the text group to be processed is larger than the preset number, triggering the second determining unit to execute the step of determining the caption text from the text group to be processed according to the separators in the text group to be processed;
the fourth determining unit is configured to determine the text group to be processed as the subtitle text if the first determining unit determines that the number of characters of the text group to be processed is not greater than a preset number.
8. The apparatus according to claim 6, wherein the second determining unit is specifically configured to:
determining a part from the first character to the last separator in the text group to be processed as a subtitle text; or,
and determining a part between a first character in the text group to be processed and a last separator in a preset number of characters before the text group to be processed as a subtitle text, wherein the preset number is determined according to the length of a displayed subtitle.
9. The apparatus according to claim 7, wherein if the first determining unit determines that the number of characters in the text group to be processed is greater than the preset number and the text group to be processed does not include a separator, the apparatus further comprises a second determining unit and a fifth determining unit:
the second judging unit is configured to judge whether the number of characters of the text group to be processed is greater than a maximum number, where the maximum number is a number of characters corresponding to a longest length of a displayed subtitle;
the fifth determining unit is configured to determine, if the second determining unit determines that the number of characters of the text group to be processed is greater than the maximum number, the maximum number of characters before the text group to be processed as the subtitle text;
and if the second judging unit judges that the number of the characters of the text group to be processed is not more than the maximum number, triggering the fourth determining unit to execute the step of determining the text group to be processed as the subtitle text.
10. An apparatus for artificial intelligence based subtitle generation, the apparatus comprising a processor and a memory:
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to execute the artificial intelligence based subtitle generating method of any one of claims 1-5 according to instructions in the program code.
11. A computer-readable storage medium for storing program code, which when executed by a processor, is configured to perform the artificial intelligence based subtitle generating method of any one of claims 1-5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910740413.6A CN110381389B (en) | 2018-11-14 | 2018-11-14 | Subtitle generating method and device based on artificial intelligence |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811355311.4A CN109379641B (en) | 2018-11-14 | 2018-11-14 | Subtitle generating method and device |
CN201910740413.6A CN110381389B (en) | 2018-11-14 | 2018-11-14 | Subtitle generating method and device based on artificial intelligence |
Related Parent Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811355311.4A Division CN109379641B (en) | 2018-11-14 | 2018-11-14 | Subtitle generating method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110381389A CN110381389A (en) | 2019-10-25 |
CN110381389B true CN110381389B (en) | 2022-02-25 |
Family
ID=65389096
Family Applications (4)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910741161.9A Active CN110418208B (en) | 2018-11-14 | 2018-11-14 | Subtitle determining method and device based on artificial intelligence |
CN201910740413.6A Active CN110381389B (en) | 2018-11-14 | 2018-11-14 | Subtitle generating method and device based on artificial intelligence |
CN201910740405.1A Active CN110381388B (en) | 2018-11-14 | 2018-11-14 | Subtitle generating method and device based on artificial intelligence |
CN201811355311.4A Active CN109379641B (en) | 2018-11-14 | 2018-11-14 | Subtitle generating method and device |
Family Applications Before (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910741161.9A Active CN110418208B (en) | 2018-11-14 | 2018-11-14 | Subtitle determining method and device based on artificial intelligence |
Family Applications After (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910740405.1A Active CN110381388B (en) | 2018-11-14 | 2018-11-14 | Subtitle generating method and device based on artificial intelligence |
CN201811355311.4A Active CN109379641B (en) | 2018-11-14 | 2018-11-14 | Subtitle generating method and device |
Country Status (1)
Country | Link |
---|---|
CN (4) | CN110418208B (en) |
Families Citing this family (32)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111797632B (en) * | 2019-04-04 | 2023-10-27 | 北京猎户星空科技有限公司 | Information processing method and device and electronic equipment |
CN112037768B (en) * | 2019-05-14 | 2024-10-22 | 北京三星通信技术研究有限公司 | Speech translation method, device, electronic equipment and computer readable storage medium |
CN110379413B (en) * | 2019-06-28 | 2022-04-19 | 联想(北京)有限公司 | Voice processing method, device, equipment and storage medium |
CN110400580B (en) * | 2019-08-30 | 2022-06-17 | 北京百度网讯科技有限公司 | Audio processing method, apparatus, device and medium |
CN110648653A (en) * | 2019-09-27 | 2020-01-03 | 安徽咪鼠科技有限公司 | Subtitle realization method, device and system based on intelligent voice mouse and storage medium |
CN110933485A (en) * | 2019-10-21 | 2020-03-27 | 天脉聚源(杭州)传媒科技有限公司 | Video subtitle generating method, system, device and storage medium |
CN110660393B (en) * | 2019-10-31 | 2021-12-03 | 广东美的制冷设备有限公司 | Voice interaction method, device, equipment and storage medium |
CN110992960A (en) * | 2019-12-18 | 2020-04-10 | Oppo广东移动通信有限公司 | Control method, control device, electronic equipment and storage medium |
CN112750425B (en) | 2020-01-22 | 2023-11-03 | 腾讯科技(深圳)有限公司 | Speech recognition method, device, computer equipment and computer readable storage medium |
CN111639233B (en) * | 2020-05-06 | 2024-05-17 | 广东小天才科技有限公司 | Learning video subtitle adding method, device, terminal equipment and storage medium |
CN111353038A (en) * | 2020-05-25 | 2020-06-30 | 深圳市友杰智新科技有限公司 | Data display method and device, computer equipment and storage medium |
CN111832279B (en) * | 2020-07-09 | 2023-12-05 | 抖音视界有限公司 | Text partitioning method, apparatus, device and computer readable medium |
CN111986654B (en) * | 2020-08-04 | 2024-01-19 | 云知声智能科技股份有限公司 | Method and system for reducing delay of voice recognition system |
CN111916053B (en) * | 2020-08-17 | 2022-05-20 | 北京字节跳动网络技术有限公司 | Voice generation method, device, equipment and computer readable medium |
CN111968657B (en) * | 2020-08-17 | 2022-08-16 | 北京字节跳动网络技术有限公司 | Voice processing method and device, electronic equipment and computer readable medium |
CN112188241A (en) * | 2020-10-09 | 2021-01-05 | 上海网达软件股份有限公司 | Method and system for real-time subtitle generation of live stream |
CN112272277B (en) * | 2020-10-23 | 2023-07-18 | 岭东核电有限公司 | Voice adding method and device in nuclear power test and computer equipment |
CN113886612A (en) * | 2020-11-18 | 2022-01-04 | 北京字跳网络技术有限公司 | Multimedia browsing method, device, equipment and medium |
CN112686018B (en) * | 2020-12-23 | 2024-08-23 | 中国科学技术大学 | Text segmentation method, device, equipment and storage medium |
CN113099292A (en) * | 2021-04-21 | 2021-07-09 | 湖南快乐阳光互动娱乐传媒有限公司 | Multi-language subtitle generating method and device based on video |
CN112995736A (en) * | 2021-04-22 | 2021-06-18 | 南京亿铭科技有限公司 | Speech subtitle synthesis method, apparatus, computer device, and storage medium |
CN113225618A (en) * | 2021-05-06 | 2021-08-06 | 阿里巴巴新加坡控股有限公司 | Video editing method and device |
CN113343675B (en) * | 2021-06-30 | 2024-09-06 | 北京搜狗科技发展有限公司 | Subtitle generation method and device and subtitle generation device |
CN113596579B (en) * | 2021-07-29 | 2023-04-07 | 北京字节跳动网络技术有限公司 | Video generation method, device, medium and electronic equipment |
CN113660432B (en) * | 2021-08-17 | 2024-05-28 | 安徽听见科技有限公司 | Translation subtitle making method and device, electronic equipment and storage medium |
CN113938758A (en) * | 2021-12-08 | 2022-01-14 | 沈阳开放大学 | Method for quickly adding subtitles in video editor |
CN114268829B (en) * | 2021-12-22 | 2024-01-16 | 中电金信软件有限公司 | Video processing method, video processing device, electronic equipment and computer readable storage medium |
CN114554238B (en) * | 2022-02-23 | 2023-08-11 | 北京有竹居网络技术有限公司 | Live broadcast voice simultaneous transmission method, device, medium and electronic equipment |
CN114626359A (en) * | 2022-03-24 | 2022-06-14 | 阳光保险集团股份有限公司 | Text display method and device of audio data |
CN115831120B (en) * | 2023-02-03 | 2023-06-16 | 北京探境科技有限公司 | Corpus data acquisition method and device, electronic equipment and readable storage medium |
CN116471436B (en) * | 2023-04-12 | 2024-05-31 | 央视国际网络有限公司 | Information processing method and device, storage medium and electronic equipment |
CN116612781B (en) * | 2023-07-20 | 2023-09-29 | 深圳市亿晟科技有限公司 | Visual processing method, device and equipment for audio data and storage medium |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104143331A (en) * | 2013-05-24 | 2014-11-12 | 腾讯科技(深圳)有限公司 | Method and system for adding punctuations |
CN105244022A (en) * | 2015-09-28 | 2016-01-13 | 科大讯飞股份有限公司 | Audio and video subtitle generation method and apparatus |
CN105845129A (en) * | 2016-03-25 | 2016-08-10 | 乐视控股(北京)有限公司 | Method and system for dividing sentences in audio and automatic caption generation method and system for video files |
CN106331893A (en) * | 2016-08-31 | 2017-01-11 | 科大讯飞股份有限公司 | Real-time subtitle display method and system |
CN107632980A (en) * | 2017-08-03 | 2018-01-26 | 北京搜狗科技发展有限公司 | Voice translation method and device, the device for voiced translation |
CN107657947A (en) * | 2017-09-20 | 2018-02-02 | 百度在线网络技术(北京)有限公司 | Method of speech processing and its device based on artificial intelligence |
Family Cites Families (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6697564B1 (en) * | 2000-03-03 | 2004-02-24 | Siemens Corporate Research, Inc. | Method and system for video browsing and editing by employing audio |
KR100521914B1 (en) * | 2002-04-24 | 2005-10-13 | 엘지전자 주식회사 | Method for managing a summary of playlist information |
AU2003241204B2 (en) * | 2002-06-24 | 2009-02-05 | Lg Electronics Inc. | Recording medium having data structure including navigation control information for managing reproduction of video data recorded thereon and recording and reproducing methods and apparatuses |
CN100547670C (en) * | 2004-03-17 | 2009-10-07 | Lg电子株式会社 | Be used to reproduce recording medium, the method and apparatus of text subtitle stream |
DE202010018551U1 (en) * | 2009-03-12 | 2017-08-24 | Google, Inc. | Automatically deliver content associated with captured information, such as information collected in real-time |
US20150318020A1 (en) * | 2014-05-02 | 2015-11-05 | FreshTake Media, Inc. | Interactive real-time video editor and recorder |
US9898773B2 (en) * | 2014-11-18 | 2018-02-20 | Microsoft Technology Licensing, Llc | Multilingual content based recommendation system |
CN106878805A (en) * | 2017-02-06 | 2017-06-20 | 广东小天才科技有限公司 | Mixed language subtitle file generation method and device |
CN110444196B (en) * | 2018-05-10 | 2023-04-07 | 腾讯科技(北京)有限公司 | Data processing method, device and system based on simultaneous interpretation and storage medium |
-
2018
- 2018-11-14 CN CN201910741161.9A patent/CN110418208B/en active Active
- 2018-11-14 CN CN201910740413.6A patent/CN110381389B/en active Active
- 2018-11-14 CN CN201910740405.1A patent/CN110381388B/en active Active
- 2018-11-14 CN CN201811355311.4A patent/CN109379641B/en active Active
Also Published As
Publication number | Publication date |
---|---|
CN110381388A (en) | 2019-10-25 |
CN109379641B (en) | 2022-06-03 |
CN110418208A (en) | 2019-11-05 |
CN110381389A (en) | 2019-10-25 |
CN109379641A (en) | 2019-02-22 |
CN110381388B (en) | 2021-04-13 |
CN110418208B (en) | 2021-07-27 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110381389B (en) | Subtitle generating method and device based on artificial intelligence | |
CN109344291B (en) | Video generation method and device | |
CN110598046B (en) | Artificial intelligence-based identification method and related device for title party | |
WO2021036644A1 (en) | Voice-driven animation method and apparatus based on artificial intelligence | |
CN110570840B (en) | Intelligent device awakening method and device based on artificial intelligence | |
CN108833969A (en) | A kind of clipping method of live stream, device and equipment | |
KR101870849B1 (en) | Information transmission method and transmission apparatus | |
CN110544488A (en) | Method and device for separating multi-person voice | |
CN109783798A (en) | Method, apparatus, terminal and the storage medium of text information addition picture | |
CN106203235B (en) | Living body identification method and apparatus | |
CN111816168B (en) | A model training method, a voice playback method, a device and a storage medium | |
CN109815363A (en) | Generation method, device, terminal and the storage medium of lyrics content | |
CN111324409B (en) | Artificial intelligence-based interaction method and related device | |
CN111491123A (en) | Video background processing method and device and electronic equipment | |
CN111314771B (en) | Video playing method and related equipment | |
CN110784762B (en) | Video data processing method, device, equipment and storage medium | |
CN110111795B (en) | Voice processing method and terminal equipment | |
CN109725798B (en) | Intelligent role switching method and related device | |
CN110750198A (en) | Expression sending method and mobile terminal | |
CN108491471B (en) | Text information processing method and mobile terminal | |
CN111611369A (en) | Interactive method based on artificial intelligence and related device | |
CN116708899B (en) | Video processing method, device and storage medium applied to virtual image synthesis | |
CN111723783B (en) | Content identification method and related device | |
CN116453005A (en) | Video cover extraction method and related device | |
CN112489619A (en) | Voice processing method, terminal device and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |