CN114079797A - Live subtitle generation method and device, server, live client and live system - Google Patents


Info

Publication number
CN114079797A
Authority
CN
China
Prior art keywords
live
subtitles
recognition result
subtitle
video
Prior art date
2020-08-14
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010818029.6A
Other languages
Chinese (zh)
Inventor
胡琨
叶婷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
2020-08-14
Publication date
2022-02-22
Application filed by Alibaba Group Holding Ltd
Priority to CN202010818029.6A
Publication of CN114079797A
Legal status: Pending

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/21Server components or server architectures
    • H04N21/218Source of audio or video content, e.g. local disk arrays
    • H04N21/2187Live feed
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/233Processing of audio elementary streams
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/4302Content synchronisation processes, e.g. decoder synchronisation
    • H04N21/4305Synchronising client clock from received content stream, e.g. locking decoder clock with encoder clock, extraction of the PCR packets
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/439Processing of audio elementary streams
    • H04N21/4394Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/85Assembly of content; Generation of multimedia applications
    • H04N21/854Content authoring
    • H04N21/8547Content authoring involving timestamps for synchronizing content
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00Details of television systems
    • H04N5/222Studio circuitry; Studio devices; Studio equipment
    • H04N5/262Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects ; Cameras specially adapted for the electronic generation of special effects
    • H04N5/278Subtitling

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Databases & Information Systems (AREA)
  • Computer Security & Cryptography (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The application discloses a live subtitle generating method and device, a server side, a live client side and a live system.

Description

Live subtitle generation method and device, server, live client and live system
Technical Field
The present application relates to, but is not limited to, speech recognition technology, and in particular to a live subtitle generating method and apparatus, a server, a live client, and a live broadcast system.
Background
Live broadcasts often require subtitles, and sometimes also subtitle translation (foreign-language subtitles).
To ensure subtitle accuracy, generating subtitles for a live broadcast has required fully manual real-time work, such as stenography and simultaneous interpretation with Chinese and English subtitles typed in by hand. Although such methods can ensure the accuracy of the generated subtitles, their real-time performance cannot meet the requirements of live broadcasting.
In some related technologies, speech recognition can also be used to automatically generate subtitles for a user's video, but the accuracy of the generated subtitles cannot be guaranteed; particularly in live scenes, the accuracy of machine speech recognition still needs to be improved.
Disclosure of Invention
The present application provides a live subtitle generating method and apparatus, a server, a live client, and a live broadcast system, which can ensure both the accuracy and the real-time performance of the generated subtitles.
An embodiment of the present application provides a live subtitle generating method, which includes the following steps:
the server side pushes the collected live broadcast video to the receiving end, and carries out voice recognition on the collected live broadcast audio to obtain a voice recognition result;
generating subtitles of live video according to a voice recognition result;
and pushing the generated subtitles to a receiving end so that the receiving end combines the live video and the subtitles to obtain the live video with the subtitles.
In an illustrative example, the pushing the captured live video to the receiving end includes:
and pushing the acquired live video to the receiving end through a Content Delivery Network (CDN).
In an exemplary embodiment, before pushing the generated subtitles to the receiving end, the method further includes:
and calibrating the generated live video subtitle according to preset important words or historical calibration results.
In one illustrative example, each syllable of the speech recognition results includes one or more recognition results;
generating the subtitles of the live video according to the speech recognition result includes:
for a syllable including one recognition result, taking that recognition result as the speech recognition result of the syllable;
for a syllable including a plurality of recognition results, taking the recognition result with the highest probability as the speech recognition result of the syllable, or manually selecting one recognition result as the speech recognition result of the syllable;
and forming the determined speech recognition results into the subtitles of the live video.
In one illustrative example, the method further comprises:
translating the collected live broadcast audio into required foreign language subtitles;
calibrating the generated foreign language subtitles of the live video;
the step of pushing the generated subtitles to the receiving end further comprises: and pushing the generated foreign language subtitles to the receiving end.
In one illustrative example, the translating the captured live audio into the desired foreign language subtitles comprises:
translating into the desired foreign language based on the voice of the live audio;
or, translating into the desired foreign language based on the speech recognition result.
The present application also provides a computer-readable storage medium storing computer-executable instructions for performing the live subtitle generating method described above.
The present application further provides an apparatus for generating live subtitles, including a memory and a processor, where the memory stores instructions executable by the processor for performing the steps of the live subtitle generating method described above.
The application further provides a live caption generating method, which includes:
a receiving end receives live broadcast video and subtitles from a server end;
based on the timestamp, the received live video and the subtitle are synchronized, and the synchronized live video and the synchronized subtitle are combined to obtain a video with the subtitle;
and displaying the obtained video with the subtitles to a user.
In an exemplary embodiment, when the subtitle arrives at the receiving end later than the live video, the method further includes:
the receiving end processes the received subtitles according to the pre-configuration information;
wherein the pre-configuration information comprises: discarding the subtitles, or displaying the subtitles that arrive with a delay.
The present application further provides a computer-readable storage medium storing computer-executable instructions for performing the live caption generating method described above.
The present application further provides a device for generating live subtitles, including a memory and a processor, where the memory stores the following instructions executable by the processor: for executing the steps of the live subtitle generating method described above.
The present application further provides a server, including: a live video processing module, a voice recognition module, a subtitle generating module, and a subtitle pushing module; wherein:
the live video processing module is used for pushing the collected live video to a receiving end;
the voice recognition module is used for carrying out voice recognition on the collected live broadcast audio to obtain a voice recognition result;
the subtitle generating module is used for generating subtitles of live video according to the voice recognition result;
and the subtitle pushing module is used for pushing the generated subtitles to a receiving end so that the receiving end combines the live video and the subtitles to obtain the live video with the subtitles.
In one illustrative example, the server further includes: a first calibration module, configured to calibrate the generated live video subtitles according to preset important words or historical calibration results.
In one illustrative example, each syllable of the speech recognition result includes one or more recognition results;
the subtitle generating module is configured to: for a syllable including one recognition result, take that recognition result as the speech recognition result of the syllable; for a syllable including a plurality of recognition results, take the recognition result with the highest probability, or manually select one recognition result, as the speech recognition result of the syllable; and form the determined speech recognition results into the subtitles of the live video.
In one illustrative example, the server further includes a translation module and a second calibration module; wherein:
the translation module is used for translating the acquired live broadcast audio into required foreign language subtitles;
the second calibration module is used for calibrating the generated foreign language subtitles of the live video; correspondingly, the subtitle pushing module is further configured to: and pushing the foreign language subtitles to the receiving end.
The present application further provides a live client, including: a receiving module, a merging processing module, and a display module; wherein:
the receiving module is used for receiving live broadcast video and subtitles from a server;
the merging processing module is used for synchronizing the received live video and subtitles based on the timestamps, and merging the synchronized live video and subtitles to obtain a video with subtitles;
and the display module is used for displaying the obtained video with the subtitles to a user.
In one illustrative example, when the subtitles arrive at the receiving end later than the live video, the display module is configured to:
processing the received subtitles according to the pre-configuration information;
wherein the pre-configuration information comprises: discarding the subtitles, or displaying the subtitles that arrive with a delay.
The present application further provides a live broadcast system, including: a collecting end, a processing server, and one or more receiving ends, where the collecting end is configured to obtain the live video and live audio of a live broadcast site; wherein:
the processing server includes any one of the servers described above, and the receiving end includes the live client described above.
According to the embodiments of the present application, the inherent time difference of live broadcasting is exploited: while the live video is being transmitted, the live audio is processed to quickly generate the corresponding subtitles, which ensures both the accuracy and the real-time performance of the generated subtitles.
Further, calibrating the subtitles generated for the live video improves their accuracy still further.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
The accompanying drawings are included to provide a further understanding of the claimed subject matter and are incorporated in and constitute a part of this specification, illustrate embodiments of the subject matter and together with the description serve to explain the principles of the subject matter and not to limit the subject matter.
Fig. 1 is a schematic flow chart of a live subtitle generating method in an embodiment of the present application;
fig. 2 is a schematic flow chart of a live subtitle generating method in another embodiment of the present application;
fig. 3 is a schematic structural diagram of a server in an embodiment of the present application;
fig. 4 is a schematic structural diagram of a live broadcast client in an embodiment of the present application;
fig. 5 is a schematic diagram illustrating a composition architecture of a live broadcast system in an embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application more apparent, embodiments of the present application will be described in detail below with reference to the accompanying drawings. It should be noted that the embodiments and features of the embodiments in the present application may be arbitrarily combined with each other without conflict.
In one exemplary configuration of the present application, a computing device includes one or more processors (CPUs), input/output interfaces, a network interface, and memory.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.
The steps illustrated in the flowcharts of the figures may be performed in a computer system, for example as a set of computer-executable instructions. Also, although a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in an order different from the one shown here.
In research on live broadcast processing, the inventors of the present application found that when a Content Delivery Network (CDN) is used to deliver live video streams, as is common, the stream arrives at the user end 10 to 30 seconds or more behind the real-time stream, depending on the live broadcast scheme. The inventors propose that if the audio stream can be processed with speech recognition to generate subtitles within this time window, then both the accuracy and the real-time performance of the generated subtitles can be guaranteed.
Fig. 1 is a schematic flowchart of a live subtitle generating method in an embodiment of the present application, and as shown in fig. 1, the method includes:
step 100: the server side pushes the collected live broadcast video to the receiving end, and carries out voice recognition on the collected live broadcast audio to obtain a voice recognition result.
In an illustrative example, the collected live video may be pushed to the receiving end through the CDN. The server may be, for example, a streaming media server capable of processing the received real-time audio and video signals, and the receiving end may be a live client capable of playing real-time audio and video with subtitles.
In an exemplary embodiment, step 100 may be preceded by:
recording the video information of the live broadcast site (for example, using a high-definition camera as the collecting end), and collecting the audio information of the live broadcast site (for example, using a high-definition microphone as the collecting end). Further, the collected audio information may be processed to remove noise.
Note that the live video in step 100 includes audio information.
In an exemplary instance, performing speech recognition on the collected live audio to obtain a speech recognition result may include:
performing speech recognition on the collected audio based on speech recognition technology to obtain a speech recognition result, where each syllable of the speech recognition result may include one or more recognition results.
In one illustrative example, the obtained speech recognition results are the recognition results exceeding a preset confidence threshold.
As can be seen from step 100, the embodiment of the present application exploits the time difference inherent in live broadcasting: while the live video is in transit, the live audio is processed to quickly generate the corresponding subtitles.
Step 101: the server generates the subtitles of the live video according to the speech recognition result.
In one illustrative example, step 101 may include:
for a syllable whose speech recognition result includes one recognition result, taking that recognition result as the speech recognition result of the syllable; for a syllable whose speech recognition result includes a plurality of recognition results, taking the recognition result with the highest probability as the speech recognition result of the syllable, or manually selecting one recognition result as the default speech recognition result of the syllable;
and forming the determined speech recognition results into the subtitles of the live video.
Through step 101, the raw results of the speech recognition are calibrated: recognition errors are corrected, or a more accurate recognition result is determined, which ensures the high quality, i.e., the accuracy, of the generated subtitles. A minimal sketch of this per-syllable selection is given below.
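For ease of understanding, the following is a minimal sketch, in Python, of this per-syllable selection step. The Candidate structure, the confidence threshold, and the manual-selection hook are illustrative assumptions of this sketch, not data layouts prescribed by the present application:

    from dataclasses import dataclass
    from typing import Callable, Optional

    @dataclass
    class Candidate:
        text: str    # recognized text for one syllable
        prob: float  # probability assigned by the recognizer

    def build_subtitle(
        syllables: list[list[Candidate]],
        min_conf: float = 0.0,
        manual_pick: Optional[Callable[[int, list[Candidate]], Candidate]] = None,
    ) -> str:
        """Choose one recognition result per syllable and join them into a subtitle.

        syllables: one candidate list per syllable, as produced in step 100.
        min_conf: preset confidence threshold; weaker candidates are dropped.
        manual_pick: optional hook modeling the manual-selection path of step 101.
        """
        chosen = []
        for i, cands in enumerate(syllables):
            cands = [c for c in cands if c.prob >= min_conf]
            if not cands:
                continue
            if len(cands) == 1:
                pick = cands[0]                          # single result: use it directly
            elif manual_pick is not None:
                pick = manual_pick(i, cands)             # manually selected default result
            else:
                pick = max(cands, key=lambda c: c.prob)  # highest probability wins
            chosen.append(pick.text)
        return "".join(chosen)

For instance, with candidates [[Candidate("五", 0.98)], [Candidate("新", 0.55), Candidate("星", 0.45)]] the function returns "五新"; the Chinese characters are illustrative and echo the calibration example given further below.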
In a practical application scenario, speech recognition in the related art can reach an accuracy of about 90% or more once a full sentence has been recognized; the step of correcting recognition errors or determining a more accurate recognition result then completes a rapid calibration of the speech recognition result, which can raise the accuracy to 99% or more. Taking a time difference of about 10 to 30 seconds for the live video stream to reach the user end as an example, reserving a processing time window of 10 seconds is sufficient to calibrate the speech recognition result. In practical applications, the size of the processing time window may be determined by network testing.
Step 102: the server pushes the generated subtitles to the receiving end, so that the receiving end combines the live video and the subtitles to obtain the live video with subtitles.
In an exemplary embodiment, the server pushes the generated subtitles, i.e., the calibrated subtitles, to the receiving end immediately, so that the receiving end synchronizes the received live video and subtitles based on the timestamps and merges them to obtain the video with subtitles.
In an exemplary embodiment, before pushing the generated subtitles to the receiving end, the live subtitle generating method in the embodiment of the present application may further include:
and calibrating the generated live video subtitles according to preset important words or historical calibration results. Therefore, the calibration accuracy is further improved, and the calibration speed is also improved. For example, for a person who is new or has a nonstandard mandarin speech, there may be recognition errors in some words expressed by the speech, for example, the speech includes "ma teacher just puts forth five new", wherein the word "five new" is not very common and is likely to be recognized as "five stars", then, by manually inputting the correct speech recognition result, i.e., "five new" in the embodiment of the present application, the subsequent calibration may use the important word manually input as an option to complete the calibration process quickly. Or, the preset important word is that the five new words are input in advance, so that the weight of the five new words is higher than that of the five stars in the live broadcast, and the effect of improving the accuracy is achieved.
In an exemplary embodiment, the live subtitle generating method according to the embodiment of the present application may further include:
the server side translates the collected live broadcast audio into required foreign language subtitles;
calibrating the generated foreign language subtitles of the live video;
accordingly, step 102 further comprises: and pushing the generated foreign language subtitles to a receiving end.
In an illustrative example, the server translating the collected live audio into the required foreign-language subtitles may include:
translating directly from the speech of the live audio into the required foreign language; or translating into the required foreign language based on the speech recognition result.
In one illustrative example, the foreign language may include, but is not limited to: Chinese, English, Japanese, French, German, and the like.
The live subtitle generating method described above thus ensures both the accuracy and the real-time performance of the generated subtitles.
The application also provides a computer-readable storage medium storing computer-executable instructions for executing the live caption generating method of any one of the above.
The present application further provides an apparatus for generating live subtitles, including a memory and a processor, where the memory stores instructions executable by the processor for performing the steps of any one of the live subtitle generating methods described above.
Fig. 2 is a schematic flowchart of a live subtitle generating method in another embodiment of the present application. As shown in fig. 2, the method includes:
Step 200: the receiving end receives the live video and the subtitles from the server end.
Step 201: the receiving end synchronizes the received live video and subtitles based on the timestamps, and merges the synchronized live video and subtitles to obtain the video with subtitles.
Step 202: the obtained video with subtitles is displayed to the user. A minimal sketch of the merge in step 201 follows.
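As a non-limiting illustration of the merge in step 201, the following sketch keeps subtitle cues sorted by start time and looks up the cue that covers each video frame's presentation timestamp. That cue times and frame timestamps share the sender's clock (e.g. the timestamps carried in the pushed streams) is an assumption of this sketch:

    import bisect

    class SubtitleMerger:
        """Merge live video frames and subtitle cues by timestamp (receiving end)."""

        def __init__(self) -> None:
            self._starts = []  # sorted cue start times
            self._cues = []    # (start, end, text) tuples, parallel to _starts

        def add_cue(self, start: float, end: float, text: str) -> None:
            """Insert a cue pushed by the server, keeping the list sorted."""
            i = bisect.bisect(self._starts, start)
            self._starts.insert(i, start)
            self._cues.insert(i, (start, end, text))

        def text_for(self, pts: float) -> str:
            """Return the subtitle to overlay on the frame at timestamp pts, or ''."""
            i = bisect.bisect(self._starts, pts) - 1
            if i >= 0:
                start, end, text = self._cues[i]
                if start <= pts <= end:
                    return text
            return ""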
This live subtitle generating method likewise ensures both the accuracy and the real-time performance of the generated subtitles.
In an exemplary embodiment, if the subtitles arrive at the receiving end later than the live video, for example because calibration at the server end took too long, the live subtitle generating method may further include:
the receiving end processing the received subtitles according to the preconfiguration information, which in an exemplary embodiment may include, for example: discarding the subtitles, i.e., not displaying the subtitles that arrive late, or directly displaying the subtitles that arrive late.
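Continuing the sketch above, the preconfiguration information may be modeled as a simple policy. The policy names mirror the two options described here; everything else is an illustrative assumption:

    from enum import Enum

    class LatePolicy(Enum):
        DISCARD = "discard"  # do not display subtitles that arrive late
        DISPLAY = "display"  # display subtitles that arrive late anyway

    def on_cue_arrival(merger: SubtitleMerger, start: float, end: float,
                       text: str, current_pts: float,
                       policy: LatePolicy = LatePolicy.DISCARD) -> None:
        """Apply the preconfigured handling when a subtitle cue reaches the client."""
        if start < current_pts:          # the cue's video has already played
            if policy is LatePolicy.DISCARD:
                return                   # drop it; the video plays without subtitles
            # DISPLAY: shift the cue forward so the late subtitle is still shown
            end = current_pts + (end - start)
            start = current_pts
        merger.add_cue(start, end, text)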
The present application further provides a computer-readable storage medium storing computer-executable instructions for performing the live subtitle generating method shown in fig. 2.
The present application further provides an apparatus for generating live subtitles, including a memory and a processor, where the memory stores instructions executable by the processor for performing the steps of the live subtitle generating method shown in fig. 2.
In an exemplary embodiment, for the video content with subtitles, certain information may be displayed in a special way according to user settings, for example: keywords closely related to the video theme or content may be displayed with highlighting, decorative text effects, and the like.
In an illustrative example, the presentation form, language, etc. of the subtitles may be selectable by the user, thereby establishing a feedback mechanism between the user and the system.
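As an aid to understanding only, one possible shape for such user-selectable settings and keyword emphasis is sketched below; the field names and the markup are illustrative assumptions, not prescribed by the present application:

    from dataclasses import dataclass, field

    @dataclass
    class SubtitlePrefs:
        language: str = "zh"                  # which subtitle track to show
        highlight_keywords: list = field(default_factory=list)  # terms to emphasize
        decorated: bool = False               # decorative text styling on/off

    def render_line(line: str, prefs: SubtitlePrefs) -> str:
        """Emphasize configured keywords, here with simple illustrative markup."""
        for kw in prefs.highlight_keywords:
            line = line.replace(kw, f"<em>{kw}</em>")
        return line

Feeding the user's choices back into such a structure is one way to realize the feedback mechanism between the user and the system mentioned above.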
Fig. 3 is a schematic structural diagram of a server according to an embodiment of the present application. As shown in fig. 3, the server at least includes: a live video processing module, a voice recognition module, a subtitle generating module, and a subtitle pushing module; wherein:
the live video processing module is used for pushing the collected live video to a receiving end;
the voice recognition module is used for carrying out voice recognition on the collected live broadcast audio to obtain a voice recognition result;
the subtitle generating module is used for generating subtitles of live video according to the voice recognition result;
and the subtitle pushing module is used for pushing the generated subtitles to a receiving end so that the receiving end combines the live video and the subtitles to obtain the live video with the subtitles.
In one illustrative example, the live video processing module may be configured to push the collected live video to the receiving end through the CDN. The server may be, for example, a streaming media server capable of processing the received real-time audio and video signals.
In an exemplary embodiment, the collected audio is subjected to speech recognition to obtain a speech recognition result; wherein each syllable of the speech recognition result comprises one or more recognition results.
In the embodiment of the present application, the server exploits the time difference of live broadcasting: while the live video is in transit, the live audio is processed to quickly generate the corresponding subtitles.
In one illustrative example, the subtitle generating module may be configured to: for a syllable whose speech recognition result includes one recognition result, take that recognition result as the speech recognition result of the syllable; for a syllable whose speech recognition result includes a plurality of recognition results, take the recognition result with the highest probability, or manually select one recognition result, as the default speech recognition result of the syllable; and form the determined speech recognition results into the subtitles of the live video.
In the embodiment of the present application, the server calibrates the recognition results of the speech recognition, correcting recognition errors or determining a more accurate recognition result, which ensures the high quality, i.e., the accuracy, of the generated subtitles.
In an exemplary embodiment, the server may further include: a first calibration module;
and the first calibration module is set to calibrate the generated live video subtitles according to preset important words or historical calibration results.
In an exemplary embodiment, the server may further include: a translation module;
the translation module is configured to translate the collected live audio into the required foreign-language subtitles; the server may further include a second calibration module, configured to calibrate the generated foreign-language subtitles of the live video; correspondingly, the subtitle pushing module is further configured to push the foreign-language subtitles to the receiving end.
In one illustrative example, translating the collected live audio into the required foreign-language subtitles includes: translating directly from the speech of the live audio into the required foreign language; or translating into the required foreign language based on the speech recognition result.
This server thus ensures both the accuracy and the real-time performance of the generated subtitles.
Fig. 4 is a schematic structural diagram of a live client in another embodiment of the present application. As shown in fig. 4, the live client at least includes: a receiving module, a merging processing module, and a display module; wherein:
the receiving module is used for receiving live broadcast video and subtitles from a server;
the merging processing module is used for synchronizing the received live video and subtitles based on the timestamps, and merging the synchronized live video and subtitles to obtain a video with subtitles;
and the display module is used for displaying the obtained video with the subtitles to a user.
In an illustrative example, if the subtitles arrive at the receiving end later than the live video, for example because calibration at the server end took too long, the display module is configured to:
process the received subtitles according to the preconfiguration information, which in an exemplary embodiment may include, for example: discarding the subtitles, i.e., not displaying the subtitles that arrive late (i.e., displaying only the video), or directly displaying the subtitles that arrive late.
This live client thus ensures both the accuracy and the real-time performance of the generated subtitles.
The present application further provides a live broadcast system, including: a collecting end, a processing server, and one or more receiving ends, where the collecting end is configured to obtain the live video and live audio of a live broadcast site; the processing server includes the server described above, and the receiving end includes the live client described above.
Fig. 5 is a schematic diagram of the composition architecture of a live broadcast system in an embodiment of the present application. As shown in fig. 5, in this embodiment the live broadcast system includes a collecting end, a processing server, and receiving ends (only one receiving end is shown in fig. 5); wherein:
the acquisition terminal is used for acquiring live broadcast video and live broadcast audio of a live broadcast site;
the processing server includes a live video processing module, and the live video processing module includes: a video distribution module and a CDN; the video distribution module is configured to push the collected live video to the receiving end through the CDN.
In this embodiment, the webcast is delivered through the CDN; because of the network delay, it arrives 10 to 30 seconds or more behind the real-time stream.
The processing server further includes: a speech recognition module, a subtitle generating module, a first calibration module, a translation module, a second calibration module, and a subtitle pushing module; wherein:
the speech recognition module is configured to perform speech recognition on the collected live audio to obtain a speech recognition result; more specifically, each syllable recognized from the speech may include one or more recognition results;
the subtitle generating module is configured to generate the subtitles of the live video based on the speech recognition result; more specifically, for a syllable whose speech recognition result includes one recognition result, that recognition result is taken as the speech recognition result of the syllable; for a syllable whose speech recognition result includes a plurality of recognition results, the recognition result with the highest probability is selected, or one recognition result is manually selected, as the default speech recognition result of the syllable; and the determined speech recognition results are formed into the subtitles of the live video;
the first calibration module is set to calibrate the generated live video subtitles according to preset important words or historical calibration results;
the translation module is configured to translate the speech of the live audio into the required foreign-language subtitles;
the second calibration module is used for calibrating the generated foreign language subtitles of the live video;
and the subtitle pushing module is configured to push the calibrated subtitles and the foreign-language subtitles to the receiving end.
In this embodiment, there are 9 to 29 seconds of editing time for processing the live audio stream, and the pushed subtitles and foreign-language subtitles incur a network delay of approximately 1 second.
In the live broadcast system of this embodiment, the webcast is distributed through the CDN, which, because of network delay, introduces a delay of 10 to 30 seconds or more relative to the real-time stream. Exploiting this time difference, the live broadcast system of the embodiment of the present application combines speech recognition and calibration, so that the subtitles are obtained while the live video is still being transmitted; the subtitles therefore reach the receiving end before (or at least at the same time as) the video, which supports the requirement of live broadcasting with real-time subtitles for users at the receiving end. Meanwhile, combined with machine translation, it supports live broadcasting with simultaneously interpreted foreign-language subtitles for viewers with some foundation in a foreign language, which greatly lowers the threshold of international live broadcasting. This live broadcast system thus ensures both the accuracy and the real-time performance of the generated subtitles.
Although the embodiments disclosed in the present application are described above, the descriptions are only for the convenience of understanding the present application, and are not intended to limit the present application. It will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the disclosure as defined by the appended claims.

Claims (19)

1. A live subtitle generating method comprises the following steps:
the server side pushes the collected live broadcast video to the receiving end, and carries out voice recognition on the collected live broadcast audio to obtain a voice recognition result;
generating subtitles of live video according to a voice recognition result;
and pushing the generated subtitles to a receiving end so that the receiving end combines the live video and the subtitles to obtain the live video with the subtitles.
2. The live subtitle generating method of claim 1, wherein the pushing the captured live video to a receiving end comprises:
and pushing the acquired live video to the receiving end through a Content Delivery Network (CDN).
3. The live-broadcast subtitle generating method of claim 1, before pushing the generated subtitle to a receiving end, further comprising:
and calibrating the generated live video subtitle according to preset important words or historical calibration results.
4. The live subtitle generating method of claim 1 or 3, wherein each syllable of the speech recognition result includes one or more recognition results;
the generating of the subtitle of the live video according to the voice recognition result comprises the following steps:
for a syllable including one recognition result, taking that recognition result as the voice recognition result of the syllable;
for a syllable including a plurality of recognition results, taking the recognition result with the highest probability as the voice recognition result of the syllable, or manually selecting one recognition result as the voice recognition result of the syllable;
and forming the determined voice recognition result into a subtitle of the live video.
5. The live subtitle generating method of claim 1, the method further comprising:
translating the collected live broadcast audio into required foreign language subtitles;
calibrating the generated foreign language subtitles of the live video;
the step of pushing the generated subtitles to the receiving end further comprises: and pushing the generated foreign language subtitles to the receiving end.
6. The live subtitle generating method of claim 5, wherein the translating the captured live audio into the desired foreign subtitle comprises:
translating into the desired foreign language based on the voice of the live audio;
or, translating into the desired foreign language based on the speech recognition result.
7. A computer-readable storage medium storing computer-executable instructions for performing the live subtitle generating method of any one of claims 1-6.
8. An apparatus for implementing live subtitle generation, comprising a memory and a processor, wherein the memory stores instructions executable by the processor for performing the steps of the live subtitle generating method of any one of claims 1-6.
9. A live subtitle generating method comprises the following steps:
a receiving end receives live broadcast video and subtitles from a server end;
based on the timestamp, the received live video and the subtitle are synchronized, and the synchronized live video and the synchronized subtitle are combined to obtain a video with the subtitle;
and displaying the obtained video with the subtitles to a user.
10. The live-broadcast subtitle generating method of claim 9, when the subtitle arrives at the receiving end later than the live-broadcast video, further comprising:
the receiving end processes the received subtitles according to the pre-configuration information;
wherein the pre-configuration information comprises: discarding the subtitles, or displaying the subtitles that arrive with a delay.
11. A computer-readable storage medium storing computer-executable instructions for performing the live subtitle generating method of claim 9 or 10.
12. An apparatus for implementing live subtitle generation, comprising a memory and a processor, wherein the memory stores instructions executable by the processor for performing the steps of the live subtitle generating method of claim 9 or 10.
13. A server, comprising: a live video processing module, a voice recognition module, a subtitle generating module, and a subtitle pushing module; wherein:
the live video processing module is used for pushing the collected live video to a receiving end;
the voice recognition module is used for carrying out voice recognition on the collected live broadcast audio to obtain a voice recognition result;
the subtitle generating module is used for generating subtitles of live video according to the voice recognition result;
and the subtitle pushing module is used for pushing the generated subtitles to a receiving end so that the receiving end combines the live video and the subtitles to obtain the live video with the subtitles.
14. The server according to claim 13, further comprising: and the first calibration module is set to calibrate the generated live video subtitles according to preset important words or historical calibration results.
15. The server according to claim 13 or 14, wherein each syllable of the speech recognition result comprises one or more recognition results;
the subtitle generating module is configured to: for a syllable including one recognition result, take that recognition result as the voice recognition result of the syllable; for a syllable including a plurality of recognition results, take the recognition result with the highest probability, or manually select one recognition result, as the voice recognition result of the syllable; and form the determined voice recognition results into subtitles of the live video.
16. The server according to claim 13, further comprising: a translation module and a second calibration module; wherein:
the translation module is used for translating the acquired live broadcast audio into required foreign language subtitles;
the second calibration module is used for calibrating the generated foreign language subtitles of the live video; correspondingly, the subtitle pushing module is further configured to: and pushing the foreign language subtitles to the receiving end.
17. A live client, comprising: a receiving module, a merging processing module, and a display module; wherein:
the receiving module is used for receiving live broadcast video and subtitles from a server;
the merging processing module is used for synchronizing the received live video and subtitles based on the timestamps, and merging the synchronized live video and subtitles to obtain a video with subtitles;
and the display module is used for displaying the obtained video with the subtitles to a user.
18. The live client of claim 17, wherein when the subtitles arrive at the receiving end later than the live video, the display module is configured to:
processing the received subtitles according to the pre-configuration information;
wherein the pre-configuration information comprises: discarding the subtitles, or displaying the subtitles that arrive with a delay.
19. A live system, comprising: a collecting end, a processing server, and more than one receiving end, wherein the collecting end is configured to obtain the live video and live audio of a live broadcast site; and wherein:
the processing server comprises the server side of any one of claims 13-16, and the receiving side comprises the live client side of claim 17 or 18.
CN202010818029.6A 2020-08-14 2020-08-14 Live subtitle generation method and device, server, live client and live system Pending CN114079797A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010818029.6A CN114079797A (en) 2020-08-14 2020-08-14 Live subtitle generation method and device, server, live client and live system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010818029.6A CN114079797A (en) 2020-08-14 2020-08-14 Live subtitle generation method and device, server, live client and live system

Publications (1)

Publication Number Publication Date
CN114079797A true CN114079797A (en) 2022-02-22

Family

ID=80279427

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010818029.6A Pending CN114079797A (en) 2020-08-14 2020-08-14 Live subtitle generation method and device, server, live client and live system

Country Status (1)

Country Link
CN (1) CN114079797A (en)

Patent Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1870728A (en) * 2005-05-23 2006-11-29 北京大学 Method and system for automatic subtilting
CN102970618A (en) * 2012-11-26 2013-03-13 河海大学 Video on demand method based on syllable identification
CN103578465A (en) * 2013-10-18 2014-02-12 威盛电子股份有限公司 Speech recognition method and electronic device
CN104581221A (en) * 2014-12-25 2015-04-29 广州酷狗计算机科技有限公司 Video live broadcasting method and device
CN107305768A (en) * 2016-04-20 2017-10-31 上海交通大学 Easy wrongly written character calibration method in interactive voice
CN110870004A (en) * 2017-07-10 2020-03-06 沃克斯边界公司 Syllable-based automatic speech recognition
CN107741928A (en) * 2017-10-13 2018-02-27 四川长虹电器股份有限公司 A kind of method to text error correction after speech recognition based on field identification
CN108184135A (en) * 2017-12-28 2018-06-19 泰康保险集团股份有限公司 Method for generating captions and device, storage medium and electric terminal
CN207854084U (en) * 2018-01-17 2018-09-11 科大讯飞股份有限公司 A kind of caption display system
CN108182937A (en) * 2018-01-17 2018-06-19 出门问问信息科技有限公司 Keyword recognition method, device, equipment and storage medium
CN108428446A (en) * 2018-03-06 2018-08-21 北京百度网讯科技有限公司 Audio recognition method and device
CN108597497A (en) * 2018-04-03 2018-09-28 中译语通科技股份有限公司 A kind of accurate synchronization system of subtitle language and method, information data processing terminal
CN108600773A (en) * 2018-04-25 2018-09-28 腾讯科技(深圳)有限公司 Caption data method for pushing, subtitle methods of exhibiting, device, equipment and medium
CN108984529A (en) * 2018-07-16 2018-12-11 北京华宇信息技术有限公司 Real-time court's trial speech recognition automatic error correction method, storage medium and computing device

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115002502A (en) * 2022-07-29 2022-09-02 广州市千钧网络科技有限公司 Data processing method and server

Similar Documents

Publication Publication Date Title
US11252444B2 (en) Video stream processing method, computer device, and storage medium
US11463779B2 (en) Video stream processing method and apparatus, computer device, and storage medium
US10341694B2 (en) Data processing method and live broadcasting method and device
CN107911646B (en) Method and device for sharing conference and generating conference record
CN112437337A (en) Method, system and equipment for realizing live broadcast real-time subtitles
WO2019205886A1 (en) Method and apparatus for pushing subtitle data, subtitle display method and apparatus, device and medium
US8229748B2 (en) Methods and apparatus to present a video program to a visually impaired person
US20130204605A1 (en) System for translating spoken language into sign language for the deaf
CN112616062B (en) Subtitle display method and device, electronic equipment and storage medium
US20160066055A1 (en) Method and system for automatically adding subtitles to streaming media content
EP3742742A1 (en) Method, apparatus and system for synchronously playing message stream and audio/video stream
CN112601101A (en) Subtitle display method and device, electronic equipment and storage medium
US20120105719A1 (en) Speech substitution of a real-time multimedia presentation
CN112601102A (en) Method and device for determining simultaneous interpretation of subtitles, electronic equipment and storage medium
US20160098395A1 (en) System and method for separate audio program translation
KR20160146527A (en) Method and Apparatus For Sharing Multimedia Contents
FR3025925A1 (en) METHOD FOR CONTROLLING PRESENTATION MODES OF SUBTITLES
CN114079797A (en) Live subtitle generation method and device, server, live client and live system
KR20150137383A (en) Apparatus and service method for providing many languages of digital broadcasting using real time translation
CN116708892A (en) Sound and picture synchronous detection method, device, equipment and storage medium
CN113891108A (en) Subtitle optimization method and device, electronic equipment and storage medium
KR102200827B1 (en) Mothod and server for generating re-transmission broadcast data including subtitles
KR20150116191A (en) Subtitling broadcast device using fingerprint data
CN114125331A (en) Subtitle adding system
CN115474066A (en) Subtitle processing method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination