CN115086753A - Live video stream processing method and device, electronic equipment and storage medium


Info

Publication number
CN115086753A
CN115086753A
Authority
CN
China
Prior art keywords: text, live video, video stream, information, text message
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110282768.2A
Other languages
Chinese (zh)
Inventor
李秋平
刘坚
王明轩
李磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Youzhuju Network Technology Co Ltd
Original Assignee
Beijing Youzhuju Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by Beijing Youzhuju Network Technology Co Ltd
Priority to CN202110282768.2A
Publication of CN115086753A
Legal status: Pending

Classifications

    All H04N codes below fall under H04N 21/00 (Selective content distribution, e.g. interactive television or video on demand [VOD]); the G10L code falls under G10L 15/00 (Speech recognition).
    • H04N 21/435: Processing of additional data, e.g. decrypting of additional data, reconstructing software from modules extracted from the transport stream
    • H04N 21/4307: Synchronising the rendering of multiple content streams or additional data on devices, e.g. synchronisation of audio on a mobile phone with the video output on the TV screen
    • H04N 21/4394: Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams
    • H04N 21/4884: Data services, e.g. news ticker, for displaying subtitles
    • G10L 15/26: Speech to text systems

Abstract

The embodiment of the disclosure discloses a live video stream processing method and apparatus, an electronic device, and a storage medium. The method comprises the following steps: acquiring a live video stream, and acquiring an audio stream in the live video stream; performing speech recognition on the audio stream to obtain one or more pieces of text information corresponding to the audio stream and time information corresponding to each piece of text information; generating a subtitle corresponding to each piece of text information according to the text information and its corresponding time information; temporally aligning the subtitles with the video frames of the live video stream according to the time information; and adding the subtitle corresponding to each piece of text information into the live video stream to obtain a live video stream with subtitles added. This achieves the purpose of adding subtitles to a live stream. The entire process is executed by a single electronic device, the processing chain is simple, and the reliability of the whole live video stream processing procedure can be improved.

Description

Live video stream processing method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of information technologies, and in particular, to a method and an apparatus for processing a live video stream, an electronic device, and a storage medium.
Background
With the continuous development of live video technology, users' demand for live video streams keeps growing. However, existing live video streams carry no subtitles, so a user watching a live video stream cannot clearly follow its content, which degrades the user experience.
Disclosure of Invention
In order to solve the technical problems or at least partially solve the technical problems, embodiments of the present disclosure provide a method and an apparatus for processing a live video stream, an electronic device, and a storage medium.
The embodiment of the disclosure provides a method for processing a live video stream, which includes:
acquiring a live video stream, and acquiring an audio stream in the live video stream;
performing speech recognition on the audio stream to obtain one or more pieces of text information corresponding to the audio stream and time information corresponding to each piece of text information;
generating a subtitle corresponding to each piece of text information according to the text information and its corresponding time information;
temporally aligning the subtitles with the video frames of the live video stream according to the time information;
and adding the subtitle corresponding to each piece of text information into the live video stream to obtain a live video stream with subtitles added.
The embodiment of the present disclosure further provides a device for processing a live video stream, including:
the acquisition module is used for acquiring a live video stream and acquiring an audio stream in the live video stream;
the speech recognition module is used for performing speech recognition on the audio stream to obtain one or more pieces of text information corresponding to the audio stream and time information corresponding to each piece of text information;
the generating module is used for generating a subtitle corresponding to each piece of text information according to the text information and its corresponding time information;
the alignment module is used for temporally aligning the subtitles with the video frames of the live video stream according to the time information;
and the adding module is used for adding the subtitle corresponding to each piece of text information into the live video stream to obtain a live video stream with subtitles added.
An embodiment of the present disclosure further provides an electronic device, which includes:
one or more processors;
storage means for storing one or more programs;
when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the live video stream processing method described above.
The disclosed embodiments also provide a computer-readable storage medium on which a computer program is stored, the computer program, when executed by a processor, implementing the live video stream processing method described above.
The disclosed embodiments also provide a computer program product comprising a computer program or instructions which, when executed by a processor, implement the live video stream processing method described above.
Compared with the prior art, the technical scheme provided by the embodiment of the disclosure has at least the following advantages:
the technical scheme provided by the embodiment of the disclosure includes acquiring a live video stream and acquiring an audio stream in the live video stream; performing voice recognition on the audio stream to obtain one or more text messages corresponding to the audio stream and time information corresponding to each text message in the one or more text messages; generating a subtitle corresponding to each text message according to each text message and the time information corresponding to each text message; performing temporal alignment on the subtitles and the video frames of the live video stream according to the time information; and adding the subtitle corresponding to each text message into the live video stream to obtain the live video stream added with the subtitle, so that a user watching the live video stream can clearly understand the content of the live video stream. The whole process can be completed by one electronic device without being completed by matching of a plurality of devices, the whole process is simple in link, the cost of caption distribution for the live video stream is low, and the reliability of the whole live video stream processing process can be improved.
Drawings
The above and other features, advantages and aspects of various embodiments of the present disclosure will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numbers refer to the same or similar elements. It should be understood that the drawings are schematic and that elements and features are not necessarily drawn to scale.
Fig. 1 is a schematic view of a usage scenario of a processing method of a live video stream according to an embodiment of the present disclosure;
fig. 2 is a schematic diagram of a usage scenario of another live video stream processing method provided in an embodiment of the present disclosure;
fig. 3 is a schematic diagram of a usage scenario of still another live video stream processing method according to an embodiment of the present disclosure;
fig. 4 is a schematic diagram of a usage scenario of another method for processing a live video stream according to an embodiment of the present disclosure;
fig. 5 is a flowchart of a method for processing a live video stream according to an embodiment of the present disclosure;
fig. 6 is a flowchart of another method for processing a live video stream according to an embodiment of the present disclosure;
fig. 7 is a schematic diagram of a subtitle provided by an embodiment of the present disclosure;
fig. 8 is a screenshot of a certain video frame of a live video added with subtitles according to an embodiment of the present disclosure;
fig. 9 is a schematic diagram of a display interface for displaying each text message and a target text corresponding to each text message according to an embodiment of the present disclosure;
fig. 10 is a flowchart of another method for processing a live video stream according to an embodiment of the present disclosure;
fig. 11 is a schematic structural diagram of a device for processing a live video stream in an embodiment of the present disclosure;
fig. 12 is a schematic structural diagram of an electronic device in an embodiment of the disclosure.
Detailed Description
In order that the above objects, features and advantages of the present disclosure may be more clearly understood, aspects of the present disclosure will be further described below. It should be noted that the embodiments and features of the embodiments of the present disclosure may be combined with each other without conflict.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure, but the present disclosure may be practiced in other ways than those described herein; it is to be understood that the embodiments disclosed in the specification are only a few embodiments of the present disclosure, and not all embodiments.
Fig. 1 is a schematic view of a usage scenario of a live video stream processing method according to an embodiment of the present disclosure. The processing method of the live video stream provided by the present disclosure can be applied in the application environment shown in fig. 1. Referring to fig. 1, the live video stream processing system includes a terminal 1, a server 2, a server 3, a server 6, and a terminal 7, where terminal 1 and server 2, server 2 and server 3, server 3 and server 6, and server 6 and terminal 7 are each communicatively connected through a network. Terminal 1 is the anchor's terminal; it uploads the live video stream, which at this point carries no subtitles, to server 2. Server 3 pulls the live video stream from server 2, executes the live video stream processing method provided by the present disclosure to turn it into a live video stream with subtitles, and then pushes the subtitled stream to server 6, which in turn can send it to terminal 7. Terminal 7 is a viewer's terminal.
Fig. 2 is a schematic diagram of a usage scenario of another live video stream processing method provided in an embodiment of the present disclosure. The processing method of the live video stream provided by the present disclosure can be applied to the application environment shown in fig. 2. In contrast to fig. 1, the server 2 is absent from fig. 2, and the terminal 1 and the server 3 are communicatively connected via a network. The terminal 1 uploads the live video stream to the server 3, and the live video stream does not carry subtitles at the moment. The server 3 executes the processing method of the live video stream provided by the present disclosure with respect to the live video stream without subtitles acquired from the terminal 1.
Fig. 3 is a schematic diagram of a usage scenario of another live video stream processing method provided in an embodiment of the present disclosure. The processing method of the live video stream provided by the present disclosure can be applied to the application environment shown in fig. 3. In fig. 3, the server 3 is replaced with the terminal 4, and the other connection relationship is maintained as compared with fig. 1. Namely, the terminal 4 pulls the live video stream without the caption from the server 2, executes the processing method of the live video stream provided by the present disclosure to process it into the live video stream with the caption, and then pushes the live video stream with the caption to the server 6.
Fig. 4 is a schematic diagram of a usage scenario of another live video stream processing method provided in an embodiment of the present disclosure. The processing method of the live video stream provided by the present disclosure can be applied in the application environment shown in fig. 4. Compared with fig. 1, fig. 4 further includes a terminal 5, which is communicatively connected with server 3 through a network. Server 3 processes the live video stream without subtitles into a live video stream with subtitles using the processing method provided by the present disclosure. Terminal 5 acquires the subtitled live video stream from server 3 for manual verification, and sends the manually verified subtitled stream back to server 3. Server 3 then pushes the manually verified subtitled live video stream to server 6.
Alternatively, server 2 in fig. 1, server 3 in fig. 2, server 2 in fig. 3, and server 2 in fig. 4 are live servers, also referred to as origin servers. The servers 6 in figs. 1, 2, 3, and 4 are all CDN (Content Delivery Network) cache servers. In essence, CDN technology is used to complete the transfer of the live video stream from terminal 1 to terminal 7, which can further reduce the time that transfer takes.
In each of the above usage scenarios, each server may be implemented by an independent server or a server cluster composed of a plurality of servers. The terminal may include, but is not limited to, a smart phone, a palm top computer, a tablet computer, a wearable device with a display screen, a desktop computer, a notebook computer, a kiosk, etc.
Fig. 5 is a flowchart of a live video stream processing method according to an embodiment of the present disclosure, where this embodiment is applicable to a case where subtitles need to be added to a live video stream, and the method may be executed by a live video stream processing apparatus, where the apparatus may be implemented in a software and/or hardware manner, and the apparatus may be configured in an electronic device, such as a terminal, specifically including but not limited to a smart phone, a palm computer, a tablet computer, a wearable device with a display screen, a desktop, a notebook computer, an all-in-one machine, and the like. Alternatively, the embodiment may be applicable to a case where the server performs processing on a live video stream, and the method may be executed by a live video stream processing apparatus, which may be implemented in software and/or hardware, and may be configured in an electronic device, such as a server. As shown in fig. 5, the method may specifically include:
s110, acquiring a live video stream, and acquiring an audio stream in the live video stream.
There are various ways to implement this step, and the present disclosure does not limit this. Illustratively, the implementation method of the step includes: and acquiring the live video stream according to the address information of the live video stream (namely the source stream address of the live video stream), and acquiring the audio stream in the live video stream. Alternatively, the address of the live video stream may be a URL (Uniform Resource Locator).
Exemplarily, referring to fig. 1, when a user needs to make server 3 execute the live video stream processing method provided by the present disclosure, the user enters the address information of the live video stream into server 3 (optionally, the address information is in a format such as rtmp), and server 3 obtains the live video stream from server 2 according to that address information, then decodes and separates the live video stream to obtain the audio stream in it. Optionally, decoding and separation may be performed together, or in two steps: first decode, then separate the decoded live stream.
Separating the live video stream yields an audio stream and a video stream. The audio stream contains only audio information, while the video stream contains both audio and picture information. The video stream obtained after separation is the same as the live video stream before separation, and the audio in the separated audio stream is the same as the audio in the live video stream before separation. The essence of this step is therefore to extract the audio information in the live video stream to form an audio stream.
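By way of a non-authoritative illustration of this step, the sketch below drives the FFmpeg command-line tool from Python to pull a live stream from its source stream address and decode only its audio track; the rtmp URL, output path, and PCM codec are assumptions made for the example, not values prescribed by the disclosure.

```python
import subprocess

def extract_audio_stream(source_url: str, out_path: str) -> None:
    """Pull a live stream and decode its audio track to a PCM WAV file.

    source_url: source stream address of the live video stream
                (an rtmp:// URL is assumed here).
    out_path:   file the extracted audio stream is written to.
    """
    # -vn drops the video track; the live video stream itself is untouched,
    # matching the point above that separation only extracts the audio.
    subprocess.run(
        ["ffmpeg",
         "-i", source_url,        # the live video stream
         "-vn",                   # keep only the audio information
         "-acodec", "pcm_s16le",  # decode to 16-bit PCM
         out_path],
        check=True,
    )

# Hypothetical usage:
# extract_audio_stream("rtmp://example.com/live/room1", "audio.wav")
```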
S120, performing speech recognition on the audio stream to obtain one or more pieces of text information corresponding to the audio stream and time information corresponding to each piece of text information.
One piece of text information is the speech recognition result of one sentence or one paragraph. The time information corresponding to a piece of text information is its timestamp, comprising the start time information and end time information, within the live video stream or the audio stream, of the audio to which the text corresponds.
There are various ways to implement this step, and the present disclosure does not limit them. Optionally, if the live video stream includes a plurality of video frames, the separated audio stream includes a plurality of audio frames, with a one-to-one correspondence between video frames and audio frames, and corresponding video and audio frames carry the same timestamp. The step can then be implemented as follows: divide the audio stream into multiple units to be recognized, each unit consisting of N audio frames (N being a positive integer); perform speech recognition on each unit to be recognized to form multiple text units; splice all text units and break the spliced text into sentences; integrate the text units into one or more pieces of text information, one sentence or paragraph per piece; and determine the time information corresponding to each piece of text information based on the timestamps of the audio frames corresponding to it.
For example, assume that every 100 audio frames form one unit to be recognized, and that a certain audio stream is divided into 3 units to be recognized: unit 1, unit 2, and unit 3. The speech recognition result (i.e., the text unit) of one unit to be recognized may contain half a sentence, a sentence and a half, two sentences, and so on. Illustratively, the recognition result of unit 1 is "It is March in spring, with flowers in full bloom. The spring flower art show at Pujiang Country Park", which is one and a half sentences. The recognition result of unit 2 is "opens today. The 430,000-square-meter wonderland garden appears full of spring", which is also one and a half sentences. The recognition result of unit 3 is "The magnificent flower carpet woven from tens of thousands of tulips of various colors is as splendid as a peacock fanning its tail", which is one sentence. After the texts of all text units are spliced and broken into sentences, the passage "It is March in spring, with flowers in full bloom. The spring flower art show at Pujiang Country Park opens today. The 430,000-square-meter wonderland garden appears full of spring. The magnificent flower carpet woven from tens of thousands of tulips of various colors is as splendid as a peacock fanning its tail." is obtained. The recognition results of the three units can then be integrated, one sentence per piece, into four pieces of text information: text information 1 is "It is March in spring, with flowers in full bloom"; text information 2 is "The spring flower art show at Pujiang Country Park opens today"; text information 3 is "The 430,000-square-meter wonderland garden appears full of spring"; and text information 4 is "The magnificent flower carpet woven from tens of thousands of tulips of various colors is as splendid as a peacock fanning its tail". Finally, determine which audio frames each piece of text information corresponds to, take the timestamp of the first corresponding audio frame as its start time information, and take the timestamp of the last corresponding audio frame as its end time information. Exemplarily, assume text information 1 corresponds to the 1st through 53rd audio frames; then the timestamp of the 1st audio frame is the start time information of text information 1, and the timestamp of the 53rd audio frame is its end time information.
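The following is a minimal sketch of the splicing and sentence-breaking logic described above, assuming each text unit arrives with the timestamps of its first and last audio frames; the data shapes and the naive punctuation-based sentence breaking are illustrative assumptions, not the algorithm the disclosure mandates.

```python
import re
from dataclasses import dataclass

@dataclass
class TextUnit:
    text: str        # recognition result of one unit to be recognized
    start_ts: float  # timestamp of the unit's first audio frame
    end_ts: float    # timestamp of the unit's last audio frame

def integrate(units: list[TextUnit]) -> list[dict]:
    """Splice all text units, break them into sentences, and attach to each
    sentence the time information of the audio frames it came from."""
    # Record, for every character of the spliced text, the unit it came from.
    pieces, owner = [], []
    for i, u in enumerate(units):
        pieces.append(u.text)
        owner.extend([i] * len(u.text))
    full_text = "".join(pieces)

    # Naive sentence breaking on terminal punctuation (illustrative only).
    results = []
    for m in re.finditer(r"[^.!?]+[.!?]?", full_text):
        sentence = m.group().strip()
        if not sentence:
            continue
        results.append({
            "text": sentence,
            # Start/end time information comes from the first/last audio
            # frame of the units that contributed to this sentence.
            "start": units[owner[m.start()]].start_ts,
            "end": units[owner[m.end() - 1]].end_ts,
        })
    return results
```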
Further, there are various ways to ensure that corresponding video and audio frames carry the same timestamp, and the present application does not limit them. Illustratively, the timestamps of the video frames may be generated while the live video stream is being formed, increasing continuously as live video frames are produced; then, when the audio stream is obtained from the live video stream, each audio frame is given the same timestamp as its corresponding video frame.
Alternatively, the timestamps of the video frames may be generated while the live video stream is being acquired in S110, increasing continuously as live video frames are acquired; again, when the audio stream is obtained from the live video stream, each audio frame is given the same timestamp as its corresponding video frame.
Or, when the audio stream is obtained from the live video stream, timestamps may be added simultaneously to every video frame and every audio frame of the separated live video stream, with corresponding video and audio frames receiving identical timestamps that increase in the playback order of the live video frames.
S130, generating a subtitle corresponding to each piece of text information according to the text information and its corresponding time information.
Here, the caption is understood as a caption picture, that is, text information is stored in the form of a picture.
S140, temporally aligning the subtitles with the video frames of the live video stream according to the time information.
There are various ways to implement this step. For example, a specific implementation comprises: aligning the timestamp corresponding to each piece of text information with the timestamps of the video frames in the live video stream, so as to obtain the video frames in the live video stream corresponding to each piece of text information.
Further, aligning the timestamp corresponding to each piece of text information with the timestamps of the video frames in the live video stream may comprise: performing the alignment according to the timestamp corresponding to each piece of text information and the absolute time of the live video stream.
The absolute time of the live video stream may be the recording time of the first video frame during recording of the live video stream; or the time at which the first video frame is acquired when S110 is executed; or the time at which the first audio frame is obtained from the first video frame while the audio stream is being extracted by decoding and separating the live video stream.
Based on the absolute time of the live video stream, the timestamp of each video frame in the live video stream can be obtained; therefore, the timestamp corresponding to each piece of text information can be aligned with the timestamps of the video frames in the live video stream.
Illustratively, if the start time of a certain piece of text information is 20:50:02 and its end time is 20:50:50, and the timestamp of the 5th video frame in the live video stream is 20:50:02 while the timestamp of the 73rd video frame is 20:50:50, then the 5th through 73rd video frames of the live video stream correspond to that piece of text information.
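A small sketch of this alignment, assuming the live video stream's frames carry monotonically increasing timestamps held in a list; the representation is a hypothetical simplification of the stream.

```python
import bisect

def frames_for_text(frame_timestamps: list[float],
                    start: float, end: float) -> range:
    """Return the indices of the video frames whose timestamps fall within
    [start, end], i.e. the frames one piece of text information covers."""
    lo = bisect.bisect_left(frame_timestamps, start)  # first frame at or after start
    hi = bisect.bisect_right(frame_timestamps, end)   # one past the last frame at or before end
    return range(lo, hi)

# With the timestamps of the example above, a piece of text information
# spanning 20:50:02 to 20:50:50 would map to the 5th through 73rd frames.
```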
S150, adding the subtitle corresponding to each piece of text information into the live video stream to obtain the live video stream with subtitles added.
Illustratively, if the 5th through 73rd video frames of a live video stream correspond to a piece of text information, the subtitle generated from that text information is added to the 5th through 73rd video frames, yielding a live video stream with subtitles added.
Further, since the video stream includes both audio information and picture information, adding the subtitle corresponding to any piece of text information to the video frames corresponding to it may comprise: adding the subtitle to the picture of each corresponding video frame to obtain the subtitled live video stream. Illustratively, adding the subtitle generated from text information 1 to the 5th through 73rd video frames comprises superimposing that subtitle, in the form of a picture, onto the picture of each of the 5th through 73rd video frames.
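As a hedged illustration of this superimposition, the sketch below uses the Pillow imaging library to paste one subtitle picture onto one frame picture; the bottom-centred placement and the 40-pixel margin are assumptions of the example, not values fixed by the disclosure.

```python
from PIL import Image

def burn_subtitle(frame: Image.Image, subtitle: Image.Image) -> Image.Image:
    """Superimpose a subtitle picture onto the picture of one video frame.

    The subtitle image is assumed to be RGBA, so its alpha channel can
    serve as the paste mask and the frame shows through around the text.
    """
    out = frame.copy()
    x = (out.width - subtitle.width) // 2   # centre horizontally
    y = out.height - subtitle.height - 40   # assumed bottom margin
    out.paste(subtitle, (x, y), subtitle)   # alpha channel as mask
    return out
```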
The essence of S140 and S150 is to merge the subtitles with the live video stream.
In the above technical solution, a live video stream is acquired along with its audio stream; speech recognition is performed on the audio stream to obtain one or more pieces of text information and the time information corresponding to each piece; a subtitle is generated for each piece of text information from the text and its time information; and the subtitle corresponding to each piece of text information is added into the live video stream to obtain a subtitled live video stream. The whole process can be completed by a single electronic device without requiring several devices to cooperate; the processing chain is simple, the cost of configuring subtitles for the live video stream is low, and the reliability of the whole live video stream processing procedure can be improved.
This technical solution is particularly suitable for scenarios such as live broadcasts of large conferences and events, industry and academic conferences, entertainment celebrities, and e-commerce.
The technical solution supports relay broadcasting and multi-terminal distribution. Multi-terminal distribution means that the entity executing the method sends the subtitled live video stream to multiple servers, each of which pushes the subtitled stream to its corresponding terminals.
It should be noted that, in practice, because executing the live video stream processing method takes time, the moment the live video stream is acquired in S110 is often earlier than the moment the subtitled live video stream is obtained in S150; that is, there is a delay between the two.
Fig. 6 is a flowchart of another live video stream processing method according to an embodiment of the present disclosure. Fig. 6 is a specific example of fig. 5. Referring to fig. 6, the method may specifically include:
s210, acquiring a live video stream, and acquiring an audio stream in the live video stream.
S220, performing speech recognition on the audio stream to obtain one or more pieces of text information corresponding to the audio stream and time information corresponding to each piece of text information.
S230, translating each piece of text information into a target text in a target language.
The target language may be English, Korean, Chinese, Japanese, etc.
For example, if the anchor speaks Chinese during the live broadcast, the speech in both the live video stream and the audio stream is Chinese. If the target language is English, the text information in S220 is presented in Chinese and the target text in English.
S240, generating a subtitle corresponding to each piece of text information according to the text information, the target text corresponding to it, and its time information.
Illustratively, fig. 7 is a schematic diagram of a subtitle provided by an embodiment of the present disclosure. Referring to fig. 7, the subtitle picture includes two sections of text with the same meaning: one presented in Chinese and the other in English.
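A sketch of how such a bilingual subtitle picture might be rendered with Pillow; the canvas size, font, and layout are illustrative assumptions (a real deployment would load a font that covers both languages rather than the default bitmap font).

```python
from PIL import Image, ImageDraw, ImageFont

def render_bilingual_subtitle(original: str, translation: str,
                              width: int = 1280) -> Image.Image:
    """Render one subtitle picture holding the recognized text and its
    target-language translation on two centred lines, on a transparent
    background so it can later be pasted onto video frames."""
    font = ImageFont.load_default()  # assumption: replace with a CJK-capable font
    img = Image.new("RGBA", (width, 80), (0, 0, 0, 0))
    draw = ImageDraw.Draw(img)
    for row, line in enumerate((original, translation)):
        w = draw.textlength(line, font=font)
        draw.text((int((width - w) // 2), 10 + row * 30), line,
                  font=font, fill=(255, 255, 255, 255))
    return img
```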
S250, temporally aligning the subtitles with the video frames of the live video stream according to the time information.
S260, adding the subtitle corresponding to each piece of text information into the live video stream to obtain the live video stream with subtitles added.
Illustratively, fig. 8 is a screenshot of a certain video frame of a live video added with subtitles according to an embodiment of the present disclosure. Referring to fig. 8, the video frame includes a picture in which a person is speaking, and the subtitle represents the content that the person is speaking. The user can understand the video content by means of subtitles.
By adding the step of translating each piece of text information into a target text in a target language, and generating the subtitle corresponding to each piece of text information from the text information, its target text, and its time information, the above technical solution can meet the viewing needs of users speaking different languages, giving the live video a wider audience.
On the basis of the foregoing technical solution, optionally, after S230, the method further includes: displaying each piece of text information and the target text corresponding to it; and, in response to a text information modification instruction and/or a target text modification instruction, modifying the text information and/or the target text the instruction points to. This applies when the entity executing the live video stream processing method is a terminal with a display screen, such as the terminal 4 shown in fig. 3. The text information modification instruction and/or target text modification instruction are generated based on user operations. The essence of this is to manually proofread the text information recognized in S220 and the target text translated in S230, so as to improve their accuracy.
Fig. 9 is a schematic diagram of a display interface for displaying each piece of text information and its corresponding target text according to an embodiment of the present disclosure. Illustratively, referring to fig. 9, the display interface includes four regions: region 1 displays the text information recognized in S220, region 2 displays the target text translated in S230, region 3 displays the playing progress of the audio stream or the live video stream, and region 4 is used to configure the relevant parameters of the live process. The relevant parameters include, but are not limited to, the recognition language, the translation language, the source stream address of the live video stream, and the push stream address of the live video stream. Optionally, the relevant parameters may also include the live delay, the subtitle style, and the like.
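For illustration only, the relevant parameters of region 4 could be collected in a configuration object like the hypothetical sketch below; the field names and defaults mirror the parameters listed above but are assumptions, not an interface defined by the disclosure.

```python
from dataclasses import dataclass

@dataclass
class LiveSubtitleConfig:
    recognition_language: str          # e.g. "zh", the language to recognize
    translation_language: str          # e.g. "en", the target language
    source_stream_url: str             # source stream address of the live video stream
    push_stream_url: str               # push stream address for the subtitled stream
    live_delay_seconds: int = 30       # optional live delay, e.g. to allow proofreading
    subtitle_style: str = "bilingual"  # optional subtitle style
```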
On the basis of the foregoing technical solution, optionally, after S230, the method further includes: sending each piece of text information and its corresponding target text to a first device; and receiving modified text information and/or modified target text from the first device. This applies when the entity executing the live video stream processing method is a server without a display screen, the first device being an electronic device with a display screen; as shown in fig. 4, the executing entity in this case is the server 3 and the first device is the terminal 5. The essence of this is to make it convenient to manually proofread the text information recognized in S220 and the target text translated in S230, so as to improve their accuracy.
On the basis of the foregoing technical solutions, optionally, if the text information recognized in S220 and the target text translated in S230 are manually proofread, S240 may include: if a piece of text information and its corresponding target text have both been modified, generating the subtitle for that text information according to the modified text information, the modified target text, and the corresponding time information. This approach suits the case where both the text information and its target text have problems.
On the basis of the foregoing technical solutions, optionally, if the text information recognized in S220 and the target text translated in S230 are manually proofread, S240 may include: if only the target text corresponding to a piece of text information has been modified, generating the subtitle for that text information according to the original text information, the modified target text, and the corresponding time information. This approach suits the case where the text information is fine but its target text has problems.
Fig. 10 is a flowchart of another method for processing a live video stream according to an embodiment of the present disclosure. Fig. 10 is a specific example of fig. 5. Referring to fig. 10, the method may specifically include:
s310, acquiring a live video stream, and acquiring an audio stream in the live video stream.
S320, performing speech recognition on a plurality of consecutive audio frames in the audio stream to obtain the text information corresponding to the non-silent frames among those audio frames and the time information corresponding to that text information, where the time information includes the timestamps of the non-silent frames corresponding to the text information.
A silent frame is an audio frame corresponding to a video frame produced while the anchor is not speaking; it therefore contains no content requiring speech recognition.
Optionally, this step may be implemented as follows: divide the audio stream into multiple units to be recognized, each consisting of N audio frames; eliminate the silent frames in each unit to be recognized; perform speech recognition on each unit after elimination to form multiple text units; splice all text units and break the spliced text into sentences; integrate the text units into one or more pieces of text information, one sentence or paragraph per piece; and determine the time information corresponding to each piece of text information based on the timestamps of the audio frames corresponding to it.
The essence of this step is to extract the non-silent frames in the audio stream and perform speech recognition only on them.
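A minimal sketch of one way to eliminate silent frames, using a simple per-frame energy threshold as the silence criterion; the threshold value and the frame representation are assumptions, since the disclosure does not prescribe a particular silence detector.

```python
import numpy as np

def non_silent_indices(frames: list[np.ndarray],
                       energy_threshold: float = 1e-4) -> list[int]:
    """Return the indices of the non-silent audio frames.

    Each frame is a 1-D array of PCM samples scaled to [-1, 1]; a frame
    whose mean energy falls below the (assumed) threshold is treated as
    silent and skipped, so speech recognition runs only on frames that
    may actually contain speech.
    """
    keep = []
    for i, samples in enumerate(frames):
        energy = float(np.mean(samples.astype(np.float64) ** 2))
        if energy >= energy_threshold:
            keep.append(i)
    return keep
```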
S330, generating a subtitle corresponding to each piece of text information according to the text information and its corresponding time information.
S340, aligning the timestamps of the non-silent frames corresponding to each piece of text information with the timestamps of the video frames in the live video stream, to obtain the video frames in the live video stream corresponding to each piece of text information.
There are various ways to implement this step; illustratively, the timestamps of the non-silent frames corresponding to each piece of text information may be aligned with the timestamps of the video frames in the live video stream according to those timestamps and the absolute time of the live video stream.
S350, adding the subtitle corresponding to each piece of text information into the video frames corresponding to it, to obtain the live video stream with subtitles added.
There are various ways to implement this step; illustratively, the subtitle corresponding to a piece of text information may be added to the picture of each video frame corresponding to it to obtain the subtitled live video stream.
In the above technical solution, speech recognition is performed on consecutive audio frames of the audio stream to obtain the text information corresponding to the non-silent frames and its time information, where the time information includes the timestamps of the corresponding non-silent frames. In essence, the non-silent frames of the audio stream are extracted and speech recognition is performed only on them, which reduces the time consumed by recognizing the whole audio stream and thus reduces the overall processing time of the live video stream.
On the basis of the foregoing technical solutions, optionally, after acquiring the audio stream in the live video stream, the live video stream processing method further includes: converting the format of the audio stream into a format that automatic speech recognition can accept. The purpose is to ensure that speech recognition on the audio stream succeeds, and thus that the processing method provided by the present disclosure can be executed successfully.
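Illustratively, the conversion could be performed with FFmpeg as sketched below, targeting 16 kHz mono 16-bit PCM, a format most automatic speech recognition engines accept; these target parameters are common conventions assumed for the example, not requirements of the disclosure.

```python
import subprocess

def convert_for_asr(in_path: str, out_path: str) -> None:
    """Convert extracted audio into a format a typical ASR engine accepts."""
    subprocess.run(
        ["ffmpeg", "-i", in_path,
         "-ar", "16000",        # resample to 16 kHz
         "-ac", "1",            # downmix to mono
         "-sample_fmt", "s16",  # 16-bit samples
         out_path],
        check=True,
    )
```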
On the basis of the foregoing technical solutions, optionally, the live video stream processing method further includes: sending the subtitled live video stream to a second device, the second device being used to send the subtitled live video stream to user terminals. The purpose is to ensure that the user terminals (i.e., the viewers' terminals) finally receive the live video stream with subtitles added.
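A hedged sketch of this push step, forwarding the subtitled stream to the second device over RTMP with FFmpeg; the push stream address and the choice to remux without re-encoding are assumptions made for the example.

```python
import subprocess

def push_subtitled_stream(local_source: str, push_url: str) -> None:
    """Push the subtitled live video stream to the second device
    (e.g. a CDN ingest point) over RTMP."""
    subprocess.run(
        ["ffmpeg",
         "-re", "-i", local_source,  # read the subtitled stream at native rate
         "-c", "copy",               # already encoded upstream; just remux
         "-f", "flv", push_url],     # flv is the usual container for rtmp
        check=True,
    )
```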
Fig. 11 is a schematic structural diagram of a device for processing a live video stream in an embodiment of the present disclosure. The processing apparatus for live video stream provided by the embodiment of the present disclosure may be configured in a client, or may be configured in a server, and the processing apparatus for live video stream specifically includes:
an obtaining module 410, configured to obtain a live video stream, and obtain an audio stream in the live video stream;
a speech recognition module 420, configured to perform speech recognition on the audio stream to obtain one or more pieces of text information corresponding to the audio stream and time information corresponding to each piece of text information in the one or more pieces of text information;
a generating module 430, configured to generate a subtitle corresponding to each piece of text information according to each piece of text information and time information corresponding to each piece of text information;
an alignment module 440, configured to perform temporal alignment on the subtitles and the video frames of the live video stream according to the time information;
and an adding module 450, configured to add the subtitle corresponding to each piece of text information to the live video stream, so as to obtain the live video stream to which the subtitle is added.
Further, the apparatus further comprises a translation module, configured to, after speech recognition is performed on the audio stream to obtain the one or more pieces of text information corresponding to the audio stream and the time information corresponding to each piece,
translate each piece of text information into a target text in a target language;
a generating module 430, configured to generate a subtitle corresponding to each piece of text information according to each piece of text information, a target text corresponding to each piece of text information, and time information corresponding to each piece of text information.
Further, the device further comprises a first proofreading module, configured to display each piece of text information and a target text corresponding to each piece of text information after each piece of text information is translated into the target text in the target language;
and responding to a text information modification instruction and/or a target text modification instruction, and modifying the text information pointed by the text information modification instruction and/or modifying the target text pointed by the target text modification instruction.
Further, the device further comprises a second proofreading module, configured to send each piece of text information and the target text corresponding to each piece of text information to the first device after each piece of text information is translated into the target text in the target language;
modified text information and/or modified target text is received from the first device.
Further, the generating module 430 is configured to generate a subtitle corresponding to any text information according to the modified text information of any text information, the modified target text corresponding to any text information, and the time information corresponding to any text information if any text information in each text information and the target text corresponding to any text information are modified.
Further, the generating module 430 is configured to generate a subtitle corresponding to any text information according to any text information, the modified target text corresponding to any text information, and the time information corresponding to any text information if the target text corresponding to any text information in each text information is modified.
Further, the speech recognition module 420 is configured to perform speech recognition on a plurality of consecutive audio frames in the audio stream, so as to obtain text information corresponding to non-silent frames in the plurality of audio frames and time information corresponding to the text information, where the time information corresponding to the text information includes a timestamp of the non-silent frame corresponding to the text information.
Further, the aligning module 440 is configured to align a timestamp of a non-silent frame corresponding to each text information with a timestamp of a video frame in the live video stream, so as to obtain video frames in the live video stream corresponding to each text information;
the adding module 450 is configured to add a subtitle corresponding to any text information in each text information to a video frame corresponding to the text information, so as to obtain a live video stream to which the subtitle is added.
Further, the aligning module 440 is configured to align the timestamp of the non-silent frame corresponding to each text information with the timestamp of the video frame in the live video stream according to the timestamp of the non-silent frame corresponding to each text information and the absolute time of the live video stream.
Further, the adding module 450 is configured to add a subtitle corresponding to any text information in each text information to a picture of a video frame corresponding to the text information, so as to obtain a live video stream added with subtitles.
Further, the obtaining module 410 is configured to obtain the live video stream according to the address information of the live video stream, and obtain an audio stream in the live video stream.
Further, the device also comprises a format conversion module, which is used for converting the format of the audio stream into a format which can be recognized by an automatic voice recognition technology after the audio stream in the live video stream is acquired.
Further, the device further comprises a stream pushing module, which is used for sending the live video stream added with the subtitles to a second device, and the second device is used for sending the live video stream added with the subtitles to a user terminal.
The processing apparatus for live video stream provided in the embodiment of the present disclosure may execute steps executed by a client or a server in the processing method for live video stream provided in the embodiment of the present disclosure, and the steps and the beneficial effects are not described herein again.
Fig. 12 is a schematic structural diagram of an electronic device in an embodiment of the present disclosure. Referring now specifically to fig. 12, a schematic block diagram of an electronic device 1000 suitable for use in implementing embodiments of the present disclosure is shown. The electronic device 1000 in the embodiments of the present disclosure may include, but is not limited to, mobile terminals such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet), a PMP (portable multimedia player), a vehicle-mounted terminal (e.g., a car navigation terminal), a wearable electronic device, and the like, and fixed terminals such as a digital TV, a desktop computer, a smart home device, and the like. The electronic device shown in fig. 12 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 12, the electronic device 1000 may include a processing means (e.g., a central processing unit, a graphics processor, etc.) 1001 that may perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 1002 or a program loaded from a storage device 1008 into a Random Access Memory (RAM) 1003, so as to implement the live video stream processing method of the embodiments described in the present disclosure. The RAM 1003 also stores various programs and data necessary for the operation of the electronic device 1000. The processing device 1001, the ROM 1002, and the RAM 1003 are connected to each other by a bus 1004. An input/output (I/O) interface 1005 is also connected to the bus 1004.
Generally, the following devices may be connected to the I/O interface 1005: input devices 1006 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; an output device 1007 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage devices 1008 including, for example, magnetic tape, hard disk, and the like; and a communication device 1009. The communications apparatus 1009 may allow the electronic device 1000 to communicate wirelessly or by wire with other devices to exchange information. While fig. 12 illustrates an electronic device 1000 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may be alternatively implemented or provided.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer readable medium, the computer program containing program code for performing the method illustrated in the flow chart, thereby implementing the method of processing a live video stream as described above. In such an embodiment, the computer program may be downloaded and installed from a network through the communication means 1009, or installed from the storage means 1008, or installed from the ROM 1002. The computer program, when executed by the processing device 1001, performs the above-described functions defined in the methods of the embodiments of the present disclosure.
It should be noted that the computer readable medium of the present disclosure may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. By contrast, in the present disclosure, a computer readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, in which computer readable program code is carried. Such a propagated data signal may take many forms, including, but not limited to, an electromagnetic signal, an optical signal, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can send, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: an electrical wire, an optical cable, RF (radio frequency), or any suitable combination of the foregoing.
In some embodiments, clients and servers may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with digital information communication in any form or medium (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed network.
The computer readable medium may be included in the electronic device, or it may exist separately without being assembled into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to:
acquiring a live video stream, and acquiring an audio stream in the live video stream;
performing voice recognition on the audio stream to obtain one or more text messages corresponding to the audio stream and time information corresponding to each text message in the one or more text messages;
generating a subtitle corresponding to each text message according to each text message and the time information corresponding to each text message;
temporally aligning the subtitles with video frames of the live video stream according to the time information;
adding the subtitle corresponding to each text message to the live video stream, to obtain a live video stream with the subtitles added.
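To make these steps concrete, the following Python sketch outlines one possible realization of the pipeline. It is only an illustration: the ffmpeg invocations, the recognize() stub, the stream and output URLs, and the hard-burned SRT output are assumptions introduced here, not the implementation disclosed in the embodiments above.

```python
# A minimal sketch of the pipeline above, under the stated assumptions.
# Requires the ffmpeg CLI (with libass for the subtitles filter);
# recognize() is a stand-in for a real speech recognition engine.
import subprocess
from dataclasses import dataclass

@dataclass
class TextInfo:
    text: str     # one recognized text message
    start: float  # start time in seconds, relative to the stream
    end: float    # end time in seconds

def extract_audio(stream_url: str, wav_path: str) -> None:
    """Acquire the live stream and demux/transcode its audio track."""
    subprocess.run(["ffmpeg", "-y", "-i", stream_url, "-vn",
                    "-ac", "1", "-ar", "16000", wav_path], check=True)

def recognize(wav_path: str) -> list[TextInfo]:
    """Speech-recognition stub: a real system returns per-utterance
    text plus the time information used for alignment."""
    return [TextInfo("hello everyone", 0.0, 1.4),
            TextInfo("welcome to the stream", 1.6, 3.2)]

def to_srt(items: list[TextInfo]) -> str:
    """Generate one subtitle cue per text message; the cue timing is
    what later aligns the subtitle with the video frames."""
    def ts(t: float) -> str:
        h, rem = divmod(t, 3600)
        m, s = divmod(rem, 60)
        return f"{int(h):02}:{int(m):02}:{int(s):02},{int(s % 1 * 1000):03}"
    return "\n".join(f"{i}\n{ts(c.start)} --> {ts(c.end)}\n{c.text}\n"
                     for i, c in enumerate(items, 1))

def burn_in(stream_url: str, srt_path: str, out_url: str) -> None:
    """Add the subtitles to the live video stream (hard-burned here;
    a separate subtitle track would be an equally valid choice)."""
    subprocess.run(["ffmpeg", "-i", stream_url,
                    "-vf", f"subtitles={srt_path}",
                    "-c:a", "copy", "-f", "flv", out_url], check=True)

if __name__ == "__main__":
    extract_audio("rtmp://example.com/live/in", "audio.wav")
    with open("live.srt", "w", encoding="utf-8") as f:
        f.write(to_srt(recognize("audio.wav")))
    burn_in("rtmp://example.com/live/in", "live.srt",
            "rtmp://example.com/live/out")
```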
Optionally, when the one or more programs are executed by the electronic device, the electronic device may further perform other steps described in the above embodiments.
Computer program code for carrying out operations of the present disclosure may be written in one or more programming languages or any combination thereof, including object oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented by software or by hardware. In some cases, the name of a unit does not constitute a limitation on the unit itself.
The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
It is noted that, in this document, relational terms such as "first" and "second" are used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed, or elements inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The foregoing descriptions are merely exemplary embodiments of the present disclosure, presented so that those skilled in the art can understand or practice the present disclosure. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (16)

1. A method for processing a live video stream, the method comprising:
acquiring a live video stream, and acquiring an audio stream in the live video stream;
performing voice recognition on the audio stream to obtain one or more text messages corresponding to the audio stream and time information corresponding to each text message in the one or more text messages;
generating a subtitle corresponding to each text message according to each text message and the time information corresponding to each text message;
temporally aligning the subtitles with video frames of the live video stream according to the time information; and
adding the subtitle corresponding to each text message to the live video stream, to obtain a live video stream with the subtitles added.
2. The method of claim 1, wherein after performing speech recognition on the audio stream to obtain one or more text messages corresponding to the audio stream and time information corresponding to each text message in the one or more text messages, the method further comprises:
translating each text message into a target text of a target language;
correspondingly, generating a subtitle corresponding to each text message according to each text message and the time information corresponding to each text message includes:
generating the subtitle corresponding to each text message according to the text message, the target text corresponding to the text message, and the time information corresponding to the text message.
3. The method of claim 2, wherein after translating each text message into a target text in a target language, the method further comprises:
displaying each piece of text information and a target text corresponding to each piece of text information;
in response to a text information modification instruction and/or a target text modification instruction, modifying the text information pointed to by the text information modification instruction and/or the target text pointed to by the target text modification instruction.
4. The method of claim 2, wherein after translating each text message into a target text in a target language, the method further comprises:
sending each piece of text information and the target text corresponding to each piece of text information to a first device; and
receiving modified text information and/or a modified target text from the first device.
5. The method according to claim 3 or 4, wherein generating the subtitle corresponding to each text message according to each text message, the target text corresponding to each text message, and the time information corresponding to each text message comprises:
if any text information and the target text corresponding to that text information are both modified, generating the subtitle corresponding to that text information according to the modified text information, the modified target text, and the time information corresponding to that text information.
6. The method according to claim 3 or 4, wherein generating the subtitle corresponding to each text message according to each text message, the target text corresponding to each text message, and the time information corresponding to each text message comprises:
if the target text corresponding to any text information is modified, generating the subtitle corresponding to that text information according to the text information, the modified target text, and the time information corresponding to that text information.
7. The method of claim 1, wherein performing speech recognition on the audio stream to obtain one or more text messages corresponding to the audio stream and time information corresponding to each text message in the one or more text messages comprises:
performing voice recognition on a plurality of consecutive audio frames in the audio stream to obtain text information corresponding to non-silent frames among the audio frames and time information corresponding to the text information, wherein the time information corresponding to the text information comprises timestamps of the non-silent frames corresponding to the text information.
8. The method of claim 7, wherein temporally aligning the subtitles with video frames of the live video stream according to the time information, and adding the subtitle corresponding to each text message to the live video stream to obtain the live video stream with the subtitles added, comprises:
aligning the timestamp of the non-silent frame corresponding to each text message with the timestamps of the video frames in the live video stream, to obtain the video frames corresponding to each text message in the live video stream; and
adding the subtitle corresponding to any text information to the video frames corresponding to that text information, to obtain the live video stream with the subtitles added.
9. The method of claim 8, wherein aligning the timestamp of the non-silent frame corresponding to each text message with the timestamp of the video frame in the live video stream comprises:
aligning the timestamp of the non-silent frame corresponding to each text message with the timestamps of the video frames in the live video stream according to the timestamp of the non-silent frame corresponding to each text message and the absolute time of the live video stream.
10. The method of claim 8, wherein adding the subtitle corresponding to any one of the text messages to the video frame corresponding to that text message, to obtain the live video stream with the subtitles added, comprises:
adding the subtitle corresponding to any text information to the picture of the video frame corresponding to that text information, to obtain the live video stream with the subtitles added.
11. The method of claim 1, wherein acquiring a live video stream and acquiring an audio stream in the live video stream comprises:
acquiring the live video stream according to address information of the live video stream, and acquiring the audio stream from the live video stream.
12. The method of claim 1 or 11, wherein after acquiring the audio stream in the live video stream, the method further comprises:
converting the format of the audio stream into a format recognizable by automatic speech recognition technology.
13. The method of claim 1, further comprising:
sending the live video stream with the subtitles added to a second device, wherein the second device is configured to send the live video stream with the subtitles added to a user terminal.
14. A device for processing a live video stream, comprising:
an acquisition module, configured to acquire a live video stream and acquire an audio stream in the live video stream;
a voice recognition module, configured to perform voice recognition on the audio stream to obtain one or more text messages corresponding to the audio stream and time information corresponding to each text message in the one or more text messages;
a generating module, configured to generate a subtitle corresponding to each text message according to each text message and the time information corresponding to each text message;
an alignment module, configured to temporally align the subtitles with video frames of the live video stream according to the time information; and
an adding module, configured to add the subtitle corresponding to each text message to the live video stream to obtain a live video stream with the subtitles added.
15. An electronic device, characterized in that the electronic device comprises:
one or more processors;
storage means for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-13.
16. A computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the method according to any one of claims 1-13.
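A few illustrative sketches of the claimed steps follow; none of them is the claimed implementation. First, claims 2-6 pair each recognized text message with a target-language translation. A bilingual subtitle cue can then share one set of time information, as in this minimal sketch (the SRT-style layout and the example strings are assumptions):

```python
# Sketch of a bilingual cue per claims 2-6: the recognized text and
# its target-language text render as two lines under one timing.
def bilingual_cue(index: int, start: str, end: str,
                  text: str, target_text: str) -> str:
    return f"{index}\n{start} --> {end}\n{text}\n{target_text}\n"

print(bilingual_cue(1, "00:00:00,000", "00:00:01,400",
                    "大家好", "hello everyone"))
```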
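Claims 7-9 align the timestamp of the non-silent frame behind each text message with the timestamps of the video frames, optionally anchored to the absolute time of the live stream. A minimal sketch of that mapping, assuming timestamps in seconds and an in-memory list of video-frame presentation times (a real muxer would work on container timestamps):

```python
# Sketch of the alignment in claims 7-9. The frame list and the
# absolute-time anchor are assumptions for illustration only.
import bisect

def align(text_ts: float, stream_start_abs: float,
          video_pts: list[float]) -> int:
    """Map a non-silent frame's absolute timestamp onto the index of
    the nearest video frame: subtracting the stream's absolute start
    time yields a relative time comparable with the video PTS values."""
    rel = text_ts - stream_start_abs
    i = bisect.bisect_left(video_pts, rel)
    if i == 0:
        return 0
    if i == len(video_pts):
        return len(video_pts) - 1
    # pick whichever neighbouring frame is closer in time
    return i if video_pts[i] - rel < rel - video_pts[i - 1] else i - 1

# Example: a 25 fps stream that started at absolute time 100.0 s.
pts = [k / 25 for k in range(250)]
print(align(100.02, 100.0, pts))  # -> 0  (0.02 s in, frame 0 is closest)
print(align(101.55, 100.0, pts))  # -> 39 (1.55 s in, pts[39] = 1.56)
```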
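Finally, claim 12's format conversion typically means transcoding the demuxed audio into whatever the speech recognizer accepts. 16 kHz mono 16-bit PCM WAV is assumed here as the target; the claim itself does not fix a format:

```python
# Sketch of claim 12: convert the audio stream into a format an
# automatic-speech-recognition engine can consume. The target format
# is an assumption; requires the ffmpeg CLI on PATH.
import subprocess

def to_asr_format(src: str, dst: str = "asr_input.wav") -> str:
    subprocess.run(
        ["ffmpeg", "-y", "-i", src,
         "-vn",                    # drop any video track
         "-acodec", "pcm_s16le",   # 16-bit little-endian PCM
         "-ac", "1",               # mono
         "-ar", "16000",           # 16 kHz sample rate
         dst],
        check=True,
    )
    return dst
```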
CN202110282768.2A 2021-03-16 2021-03-16 Live video stream processing method and device, electronic equipment and storage medium Pending CN115086753A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110282768.2A CN115086753A (en) 2021-03-16 2021-03-16 Live video stream processing method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115086753A (en) 2022-09-20

Family

ID=83246316

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110282768.2A Pending CN115086753A (en) 2021-03-16 2021-03-16 Live video stream processing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115086753A (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2002374494A (en) * 2001-06-14 2002-12-26 Fuji Electric Co Ltd Generation system and retrieving method for video contents file
EP2106121A1 (en) * 2008-03-27 2009-09-30 Mundovision MGI 2000, S.A. Subtitle generation methods for live programming
CN105828216A (en) * 2016-03-31 2016-08-03 北京奇艺世纪科技有限公司 Live broadcast video subtitle synthesis system and method
CN106156012A (en) * 2016-06-28 2016-11-23 乐视控股(北京)有限公司 A kind of method for generating captions and device
CN108401192A (en) * 2018-04-25 2018-08-14 腾讯科技(深圳)有限公司 Video stream processing method, device, computer equipment and storage medium
CN109754783A (en) * 2019-03-05 2019-05-14 百度在线网络技术(北京)有限公司 Method and apparatus for determining the boundary of audio sentence
CN110047488A (en) * 2019-03-01 2019-07-23 北京彩云环太平洋科技有限公司 Voice translation method, device, equipment and control equipment
CN111639233A (en) * 2020-05-06 2020-09-08 广东小天才科技有限公司 Learning video subtitle adding method and device, terminal equipment and storage medium
CN112331188A (en) * 2019-07-31 2021-02-05 武汉Tcl集团工业研究院有限公司 Voice data processing method, system and terminal equipment

Similar Documents

Publication Publication Date Title
CN110677711B (en) Video dubbing method and device, electronic equipment and computer readable medium
KR102067446B1 (en) Method and system for generating caption
CN111064987B (en) Information display method and device and electronic equipment
US11227620B2 (en) Information processing apparatus and information processing method
CN114205665B (en) Information processing method, device, electronic equipment and storage medium
JP2009517976A (en) Interactive TV without trigger
CN112073307A (en) Mail processing method and device, electronic equipment and computer readable medium
CN112073753B (en) Method, device, equipment and medium for publishing multimedia data
CN110545472B (en) Video data processing method and device, electronic equipment and computer readable medium
CN113886612A (en) Multimedia browsing method, device, equipment and medium
CN112291502A (en) Information interaction method, device and system and electronic equipment
CN110753259B (en) Video data processing method and device, electronic equipment and computer readable medium
CN110149528B (en) Process recording method, device, system, electronic equipment and storage medium
US10491934B2 (en) Transmission device, transmission method, reception device, and reception method
CN115086753A (en) Live video stream processing method and device, electronic equipment and storage medium
CN115150631A (en) Subtitle processing method, subtitle processing device, electronic equipment and storage medium
CN113891108A (en) Subtitle optimization method and device, electronic equipment and storage medium
CN113891168A (en) Subtitle processing method, subtitle processing device, electronic equipment and storage medium
CN115209215A (en) Video processing method, device and equipment
CN113885741A (en) Multimedia processing method, device, equipment and medium
US20180115790A1 (en) Cloud streaming service system, image cloud streaming service method using application code, and device therefor
CN113139090A (en) Interaction method, interaction device, electronic equipment and computer-readable storage medium
CN110798731A (en) Video data processing method and device, electronic equipment and computer readable medium
CN113766278B (en) Audio playing method, audio playing device and audio playing system
CN115474065A (en) Subtitle processing method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination