CN114079797A - Live subtitle generation method and device, server, live client and live system - Google Patents


Info

Publication number
CN114079797A
Authority
CN
China
Prior art keywords
live
subtitles
recognition result
subtitle
video
Prior art date
2020-08-14
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010818029.6A
Other languages
Chinese (zh)
Inventor
胡琨
叶婷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
2020-08-14
Publication date
2022-02-22
Application filed by Alibaba Group Holding Ltd
Priority to CN202010818029.6A
Publication of CN114079797A
Legal status: Pending

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/21Server components or server architectures
    • H04N21/218Source of audio or video content, e.g. local disk arrays
    • H04N21/2187Live feed
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/233Processing of audio elementary streams
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/4302Content synchronisation processes, e.g. decoder synchronisation
    • H04N21/4305Synchronising client clock from received content stream, e.g. locking decoder clock with encoder clock, extraction of the PCR packets
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/439Processing of audio elementary streams
    • H04N21/4394Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/85Assembly of content; Generation of multimedia applications
    • H04N21/854Content authoring
    • H04N21/8547Content authoring involving timestamps for synchronizing content
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00Details of television systems
    • H04N5/222Studio circuitry; Studio devices; Studio equipment
    • H04N5/262Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects ; Cameras specially adapted for the electronic generation of special effects
    • H04N5/278Subtitling

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Databases & Information Systems (AREA)
  • Computer Security & Cryptography (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The application discloses a live subtitle generating method and device, a server side, a live client side and a live system.

Description

Live subtitle generation method and device, server, live client and live system
Technical Field
The present application relates to, but is not limited to, speech recognition technology, and in particular to a live subtitle generating method and apparatus, a server, a live client, and a live broadcast system.
Background
Live broadcasts often require subtitles, and sometimes also subtitle translation (foreign-language subtitles).
To ensure subtitle accuracy, generating subtitles for a live broadcast has required fully manual real-time work, such as stenography and simultaneous interpretation with Chinese and English subtitles typed in by hand. Although such methods can ensure the accuracy of the generated subtitles, their real-time performance cannot meet the requirements of live broadcasting.
In some related technologies, speech recognition can also be used to automatically generate subtitles for a user's video, but the accuracy of the generated subtitles cannot be guaranteed; particularly in live scenes, the accuracy of machine speech recognition still needs to be improved.
Disclosure of Invention
The present application provides a live subtitle generating method and apparatus, a server, a live client, and a live broadcast system, which can ensure both the accuracy and the real-time performance of the generated subtitles.
An embodiment of the present application provides a live subtitle generating method, which includes the following steps:
the server side pushes the collected live broadcast video to the receiving end, and carries out voice recognition on the collected live broadcast audio to obtain a voice recognition result;
generating subtitles of live video according to a voice recognition result;
and pushing the generated subtitles to a receiving end so that the receiving end combines the live video and the subtitles to obtain the live video with the subtitles.
In an illustrative example, the pushing the captured live video to the receiving end includes:
and pushing the acquired live video to the receiving end through a Content Delivery Network (CDN).
In an exemplary embodiment, before pushing the generated subtitles to the receiving end, the method further includes:
and calibrating the generated live video subtitle according to preset important words or historical calibration results.
In one illustrative example, each syllable of the speech recognition results includes one or more recognition results;
generating the subtitles of the live video according to the speech recognition result includes:
for a syllable including one recognition result, taking that recognition result as the speech recognition result of the syllable;
for a syllable including a plurality of recognition results, taking the recognition result with the highest probability as the speech recognition result of the syllable, or manually selecting one recognition result as the speech recognition result of the syllable;
and forming the determined speech recognition results into the subtitles of the live video.
In one illustrative example, the method further comprises:
translating the collected live broadcast audio into required foreign language subtitles;
calibrating the generated foreign language subtitles of the live video;
the step of pushing the generated subtitles to the receiving end further comprises: and pushing the generated foreign language subtitles to the receiving end.
In one illustrative example, the translating the captured live audio into the desired foreign language subtitles comprises:
translating into the desired foreign language based on the voice of the live audio;
or, translating into the desired foreign language based on the speech recognition result.
The present application also provides a computer-readable storage medium storing computer-executable instructions for performing the live subtitle generating method described above.
The present application further provides an apparatus for generating live subtitles, including a memory and a processor, where the memory stores instructions executable by the processor for performing the steps of the live subtitle generating method described above.
The application further provides a live caption generating method, which includes:
a receiving end receives live broadcast video and subtitles from a server end;
based on the timestamp, the received live video and the subtitle are synchronized, and the synchronized live video and the synchronized subtitle are combined to obtain a video with the subtitle;
and displaying the obtained video with the subtitles to a user.
In an exemplary embodiment, when the subtitle arrives at the receiving end later than the live video, the method further includes:
the receiving end processes the received subtitles according to the pre-configuration information;
wherein the pre-configuration information comprises: discarding the subtitles, or displaying the subtitles that arrive with a delay.
The present application further provides a computer-readable storage medium storing computer-executable instructions for performing the live caption generating method described above.
The present application further provides a device for generating live subtitles, including a memory and a processor, where the memory stores the following instructions executable by the processor: for executing the steps of the live subtitle generating method described above.
The present application further provides a server, including: a live video processing module, a voice recognition module, a subtitle generating module, and a subtitle pushing module; wherein:
the live video processing module is used for pushing the collected live video to a receiving end;
the voice recognition module is used for carrying out voice recognition on the collected live broadcast audio to obtain a voice recognition result;
the subtitle generating module is used for generating subtitles of live video according to the voice recognition result;
and the subtitle pushing module is used for pushing the generated subtitles to a receiving end so that the receiving end combines the live video and the subtitles to obtain the live video with the subtitles.
In one illustrative example, the server further includes: a first calibration module, configured to calibrate the generated live video subtitles according to preset important words or historical calibration results.
In one illustrative example, each syllable of the speech recognition result includes one or more recognition results;
the subtitle generating module is configured to: for a syllable including one recognition result, take that recognition result as the speech recognition result of the syllable; for a syllable including a plurality of recognition results, take the recognition result with the highest probability, or manually select one recognition result, as the speech recognition result of the syllable; and form the determined speech recognition results into the subtitles of the live video.
In one illustrative example, the server further includes a translation module and a second calibration module; wherein:
the translation module is used for translating the acquired live broadcast audio into required foreign language subtitles;
the second calibration module is used for calibrating the generated foreign language subtitles of the live video; correspondingly, the subtitle pushing module is further configured to: and pushing the foreign language subtitles to the receiving end.
The present application further provides a live client, including: a receiving module, a merging processing module, and a display module; wherein:
the receiving module is used for receiving live broadcast video and subtitles from a server;
the merging processing module is used for synchronizing the received live video and subtitles based on the timestamps, and merging the synchronized live video and subtitles to obtain a video with subtitles;
and the display module is used for displaying the obtained video with the subtitles to a user.
In one illustrative example, when the subtitles arrive at the receiving end later than the live video, the display module is configured to:
processing the received subtitles according to the pre-configuration information;
wherein the pre-configuration information comprises: discarding the subtitles, or displaying the subtitles that arrive with a delay.
The present application further provides a live broadcast system, including: a collecting end, a processing server, and one or more receiving ends, where the collecting end is configured to obtain the live video and live audio of a live broadcast site; wherein:
the processing server includes any one of the servers described above, and the receiving end includes the live client described above.
According to the embodiments of the present application, the inherent time difference of live broadcasting is exploited: while the live video is being transmitted, the live audio is processed to quickly generate the corresponding subtitles, which ensures both the accuracy and the real-time performance of the generated subtitles.
Further, calibrating the subtitles generated for the live video improves their accuracy still further.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
The accompanying drawings are included to provide a further understanding of the claimed subject matter and are incorporated in and constitute a part of this specification, illustrate embodiments of the subject matter and together with the description serve to explain the principles of the subject matter and not to limit the subject matter.
Fig. 1 is a schematic flow chart of a live subtitle generating method in an embodiment of the present application;
fig. 2 is a schematic flow chart of a live subtitle generating method in another embodiment of the present application;
fig. 3 is a schematic structural diagram of a server in an embodiment of the present application;
fig. 4 is a schematic structural diagram of a live broadcast client in an embodiment of the present application;
fig. 5 is a schematic diagram illustrating a composition architecture of a live broadcast system in an embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application more apparent, embodiments of the present application will be described in detail below with reference to the accompanying drawings. It should be noted that the embodiments and features of the embodiments in the present application may be arbitrarily combined with each other without conflict.
In one exemplary configuration of the present application, a computing device includes one or more processors (CPUs), input/output interfaces, a network interface, and memory.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.
The steps illustrated in the flowcharts of the figures may be performed in a computer system, for example as a set of computer-executable instructions. Also, although a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in an order different from the one shown here.
In research on live broadcast processing, the inventors of the present application found that when a Content Delivery Network (CDN) is used to deliver live video streams, as is common, the stream arrives at the user end 10 to 30 seconds or more behind the real-time stream, depending on the live broadcast scheme. The inventors propose that if the audio stream can be processed with speech recognition to generate subtitles within this time window, then both the accuracy and the real-time performance of the generated subtitles can be guaranteed.
Fig. 1 is a schematic flowchart of a live subtitle generating method in an embodiment of the present application, and as shown in fig. 1, the method includes:
step 100: the server side pushes the collected live broadcast video to the receiving end, and carries out voice recognition on the collected live broadcast audio to obtain a voice recognition result.
In an illustrative example, the collected live video may be pushed to the receiving end through the CDN. The server may be, for example, a streaming media server capable of processing the received real-time audio and video signals, and the receiving end may be a live client capable of playing real-time audio and video with subtitles.
In an exemplary embodiment, step 100 may be preceded by:
recording the video information of the live broadcast site (for example, using a high-definition camera as the collecting end), and collecting the audio information of the live broadcast site (for example, using a high-definition microphone as the collecting end). Further, the collected audio information may be processed to remove noise.
Note that the live video in step 100 includes audio information.
In an exemplary instance, performing speech recognition on the collected live audio to obtain a speech recognition result may include:
performing speech recognition on the collected audio based on speech recognition technology to obtain a speech recognition result, where each syllable of the speech recognition result may include one or more recognition results.
In one illustrative example, the obtained speech recognition results are the recognition results exceeding a preset confidence threshold.
As can be seen from step 100, the embodiment of the present application exploits the time difference inherent in live broadcasting: while the live video is in transit, the live audio is processed to quickly generate the corresponding subtitles.
Step 101: the server generates the subtitles of the live video according to the speech recognition result.
In one illustrative example, step 101 may include:
for a syllable whose speech recognition result includes one recognition result, taking that recognition result as the speech recognition result of the syllable; for a syllable whose speech recognition result includes a plurality of recognition results, taking the recognition result with the highest probability as the speech recognition result of the syllable, or manually selecting one recognition result as the default speech recognition result of the syllable;
and forming the determined speech recognition results into the subtitles of the live video.
Through step 101, the raw results of the speech recognition are calibrated: recognition errors are corrected, or a more accurate recognition result is determined, which ensures the high quality, i.e., the accuracy, of the generated subtitles. A minimal sketch of this per-syllable selection is given below.
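For ease of understanding, the following is a minimal sketch, in Python, of this per-syllable selection step. The Candidate structure, the confidence threshold, and the manual-selection hook are illustrative assumptions of this sketch, not data layouts prescribed by the present application:

    from dataclasses import dataclass
    from typing import Callable, Optional

    @dataclass
    class Candidate:
        text: str    # recognized text for one syllable
        prob: float  # probability assigned by the recognizer

    def build_subtitle(
        syllables: list[list[Candidate]],
        min_conf: float = 0.0,
        manual_pick: Optional[Callable[[int, list[Candidate]], Candidate]] = None,
    ) -> str:
        """Choose one recognition result per syllable and join them into a subtitle.

        syllables: one candidate list per syllable, as produced in step 100.
        min_conf: preset confidence threshold; weaker candidates are dropped.
        manual_pick: optional hook modeling the manual-selection path of step 101.
        """
        chosen = []
        for i, cands in enumerate(syllables):
            cands = [c for c in cands if c.prob >= min_conf]
            if not cands:
                continue
            if len(cands) == 1:
                pick = cands[0]                          # single result: use it directly
            elif manual_pick is not None:
                pick = manual_pick(i, cands)             # manually selected default result
            else:
                pick = max(cands, key=lambda c: c.prob)  # highest probability wins
            chosen.append(pick.text)
        return "".join(chosen)

For instance, with candidates [[Candidate("五", 0.98)], [Candidate("新", 0.55), Candidate("星", 0.45)]] the function returns "五新"; the Chinese characters are illustrative and echo the calibration example given further below.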
In a practical application scenario, speech recognition in the related art can reach an accuracy of about 90% or more once a full sentence has been recognized; the step of correcting recognition errors or determining a more accurate recognition result then completes a rapid calibration of the speech recognition result, which can raise the accuracy to 99% or more. Taking a time difference of about 10 to 30 seconds for the live video stream to reach the user end as an example, reserving a processing time window of 10 seconds is sufficient to calibrate the speech recognition result. In practical applications, the size of the processing time window may be determined by network testing.
Step 102: the server pushes the generated subtitles to the receiving end, so that the receiving end combines the live video and the subtitles to obtain the live video with subtitles.
In an exemplary embodiment, the server pushes the generated subtitles, i.e., the calibrated subtitles, to the receiving end immediately, so that the receiving end synchronizes the received live video and subtitles based on the timestamps and merges them to obtain the video with subtitles.
In an exemplary embodiment, before pushing the generated subtitles to the receiving end, the live subtitle generating method in the embodiment of the present application may further include:
and calibrating the generated live video subtitles according to preset important words or historical calibration results. Therefore, the calibration accuracy is further improved, and the calibration speed is also improved. For example, for a person who is new or has a nonstandard mandarin speech, there may be recognition errors in some words expressed by the speech, for example, the speech includes "ma teacher just puts forth five new", wherein the word "five new" is not very common and is likely to be recognized as "five stars", then, by manually inputting the correct speech recognition result, i.e., "five new" in the embodiment of the present application, the subsequent calibration may use the important word manually input as an option to complete the calibration process quickly. Or, the preset important word is that the five new words are input in advance, so that the weight of the five new words is higher than that of the five stars in the live broadcast, and the effect of improving the accuracy is achieved.
In an exemplary embodiment, the live subtitle generating method according to the embodiment of the present application may further include:
the server side translates the collected live broadcast audio into required foreign language subtitles;
calibrating the generated foreign language subtitles of the live video;
accordingly, step 102 further comprises: and pushing the generated foreign language subtitles to a receiving end.
In an illustrative example, the server translating the collected live audio into the required foreign-language subtitles may include:
translating directly from the speech of the live audio into the required foreign language; or translating into the required foreign language based on the speech recognition result.
In one illustrative example, the foreign language may include, but is not limited to: Chinese, English, Japanese, French, German, and the like.
The live subtitle generating method described above thus ensures both the accuracy and the real-time performance of the generated subtitles.
The application also provides a computer-readable storage medium storing computer-executable instructions for executing the live caption generating method of any one of the above.
The present application further provides an apparatus for generating live subtitles, including a memory and a processor, where the memory stores instructions executable by the processor for performing the steps of any one of the live subtitle generating methods described above.
Fig. 2 is a schematic flowchart of a live subtitle generating method in another embodiment of the present application. As shown in fig. 2, the method includes:
Step 200: the receiving end receives the live video and the subtitles from the server end.
Step 201: the receiving end synchronizes the received live video and subtitles based on the timestamps, and merges the synchronized live video and subtitles to obtain the video with subtitles.
Step 202: the obtained video with subtitles is displayed to the user. A minimal sketch of the merge in step 201 follows.
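As a non-limiting illustration of the merge in step 201, the following sketch keeps subtitle cues sorted by start time and looks up the cue that covers each video frame's presentation timestamp. That cue times and frame timestamps share the sender's clock (e.g. the timestamps carried in the pushed streams) is an assumption of this sketch:

    import bisect

    class SubtitleMerger:
        """Merge live video frames and subtitle cues by timestamp (receiving end)."""

        def __init__(self) -> None:
            self._starts = []  # sorted cue start times
            self._cues = []    # (start, end, text) tuples, parallel to _starts

        def add_cue(self, start: float, end: float, text: str) -> None:
            """Insert a cue pushed by the server, keeping the list sorted."""
            i = bisect.bisect(self._starts, start)
            self._starts.insert(i, start)
            self._cues.insert(i, (start, end, text))

        def text_for(self, pts: float) -> str:
            """Return the subtitle to overlay on the frame at timestamp pts, or ''."""
            i = bisect.bisect(self._starts, pts) - 1
            if i >= 0:
                start, end, text = self._cues[i]
                if start <= pts <= end:
                    return text
            return ""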
This live subtitle generating method likewise ensures both the accuracy and the real-time performance of the generated subtitles.
In an exemplary embodiment, if the subtitles arrive at the receiving end later than the live video, for example because calibration at the server end took too long, the live subtitle generating method may further include:
the receiving end processing the received subtitles according to the preconfiguration information, which in an exemplary embodiment may include, for example: discarding the subtitles, i.e., not displaying the subtitles that arrive late, or directly displaying the subtitles that arrive late.
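Continuing the sketch above, the preconfiguration information may be modeled as a simple policy. The policy names mirror the two options described here; everything else is an illustrative assumption:

    from enum import Enum

    class LatePolicy(Enum):
        DISCARD = "discard"  # do not display subtitles that arrive late
        DISPLAY = "display"  # display subtitles that arrive late anyway

    def on_cue_arrival(merger: SubtitleMerger, start: float, end: float,
                       text: str, current_pts: float,
                       policy: LatePolicy = LatePolicy.DISCARD) -> None:
        """Apply the preconfigured handling when a subtitle cue reaches the client."""
        if start < current_pts:          # the cue's video has already played
            if policy is LatePolicy.DISCARD:
                return                   # drop it; the video plays without subtitles
            # DISPLAY: shift the cue forward so the late subtitle is still shown
            end = current_pts + (end - start)
            start = current_pts
        merger.add_cue(start, end, text)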
The present application further provides a computer-readable storage medium storing computer-executable instructions for performing the live subtitle generating method shown in fig. 2.
The present application further provides an apparatus for generating live subtitles, including a memory and a processor, where the memory stores instructions executable by the processor for performing the steps of the live subtitle generating method shown in fig. 2.
In an exemplary embodiment, for the video content with subtitles, certain information may be displayed in a special way according to user settings, for example: keywords closely related to the video theme or content may be displayed with highlighting, decorative text effects, and the like.
In an illustrative example, the presentation form, language, etc. of the subtitles may be selectable by the user, thereby establishing a feedback mechanism between the user and the system.
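As an aid to understanding only, one possible shape for such user-selectable settings and keyword emphasis is sketched below; the field names and the markup are illustrative assumptions, not prescribed by the present application:

    from dataclasses import dataclass, field

    @dataclass
    class SubtitlePrefs:
        language: str = "zh"                  # which subtitle track to show
        highlight_keywords: list = field(default_factory=list)  # terms to emphasize
        decorated: bool = False               # decorative text styling on/off

    def render_line(line: str, prefs: SubtitlePrefs) -> str:
        """Emphasize configured keywords, here with simple illustrative markup."""
        for kw in prefs.highlight_keywords:
            line = line.replace(kw, f"<em>{kw}</em>")
        return line

Feeding the user's choices back into such a structure is one way to realize the feedback mechanism between the user and the system mentioned above.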
Fig. 3 is a schematic structural diagram of a server according to an embodiment of the present application. As shown in fig. 3, the server at least includes: a live video processing module, a voice recognition module, a subtitle generating module, and a subtitle pushing module; wherein:
the live video processing module is used for pushing the collected live video to a receiving end;
the voice recognition module is used for carrying out voice recognition on the collected live broadcast audio to obtain a voice recognition result;
the subtitle generating module is used for generating subtitles of live video according to the voice recognition result;
and the subtitle pushing module is used for pushing the generated subtitles to a receiving end so that the receiving end combines the live video and the subtitles to obtain the live video with the subtitles.
In one illustrative example, the live video processing module may be configured to push the collected live video to the receiving end through the CDN. The server may be, for example, a streaming media server capable of processing the received real-time audio and video signals.
In an exemplary embodiment, the collected audio is subjected to speech recognition to obtain a speech recognition result; wherein each syllable of the speech recognition result comprises one or more recognition results.
In the embodiment of the present application, the server exploits the time difference of live broadcasting: while the live video is in transit, the live audio is processed to quickly generate the corresponding subtitles.
In one illustrative example, the subtitle generating module may be configured to: for a syllable whose speech recognition result includes one recognition result, take that recognition result as the speech recognition result of the syllable; for a syllable whose speech recognition result includes a plurality of recognition results, take the recognition result with the highest probability, or manually select one recognition result, as the default speech recognition result of the syllable; and form the determined speech recognition results into the subtitles of the live video.
In the embodiment of the present application, the server calibrates the recognition results of the speech recognition, correcting recognition errors or determining a more accurate recognition result, which ensures the high quality, i.e., the accuracy, of the generated subtitles.
In an exemplary embodiment, the server may further include: a first calibration module;
and the first calibration module is set to calibrate the generated live video subtitles according to preset important words or historical calibration results.
In an exemplary embodiment, the server may further include: a translation module;
the translation module is configured to translate the collected live audio into the required foreign-language subtitles; the server may further include a second calibration module, configured to calibrate the generated foreign-language subtitles of the live video; correspondingly, the subtitle pushing module is further configured to push the foreign-language subtitles to the receiving end.
In one illustrative example, translating the collected live audio into the required foreign-language subtitles includes: translating directly from the speech of the live audio into the required foreign language; or translating into the required foreign language based on the speech recognition result.
This server thus ensures both the accuracy and the real-time performance of the generated subtitles.
Fig. 4 is a schematic structural diagram of a live client in another embodiment of the present application. As shown in fig. 4, the live client at least includes: a receiving module, a merging processing module, and a display module; wherein:
the receiving module is used for receiving live broadcast video and subtitles from a server;
the merging processing module is used for synchronizing the received live video and subtitles based on the timestamps, and merging the synchronized live video and subtitles to obtain a video with subtitles;
and the display module is used for displaying the obtained video with the subtitles to a user.
In an illustrative example, if the subtitles arrive at the receiving end later than the live video, for example because calibration at the server end took too long, the display module is configured to:
process the received subtitles according to the preconfiguration information, which in an exemplary embodiment may include, for example: discarding the subtitles, i.e., not displaying the subtitles that arrive late (i.e., displaying only the video), or directly displaying the subtitles that arrive late.
This live client thus ensures both the accuracy and the real-time performance of the generated subtitles.
The present application further provides a live broadcast system, including: a collecting end, a processing server, and one or more receiving ends, where the collecting end is configured to obtain the live video and live audio of a live broadcast site; the processing server includes the server described above, and the receiving end includes the live client described above.
Fig. 5 is a schematic diagram of the composition architecture of a live broadcast system in an embodiment of the present application. As shown in fig. 5, in this embodiment the live broadcast system includes a collecting end, a processing server, and receiving ends (only one receiving end is shown in fig. 5); wherein:
the acquisition terminal is used for acquiring live broadcast video and live broadcast audio of a live broadcast site;
the processing server includes a live video processing module, and the live video processing module includes: a video distribution module and a CDN; the video distribution module is configured to push the collected live video to the receiving end through the CDN.
In this embodiment, the webcast is delivered through the CDN; because of the network delay, it arrives 10 to 30 seconds or more behind the real-time stream.
The processing server further includes: a speech recognition module, a subtitle generating module, a first calibration module, a translation module, a second calibration module, and a subtitle pushing module; wherein:
the speech recognition module is configured to perform speech recognition on the collected live audio to obtain a speech recognition result; more specifically, each syllable recognized from the speech may include one or more recognition results;
the subtitle generating module is configured to generate the subtitles of the live video based on the speech recognition result; more specifically, for a syllable whose speech recognition result includes one recognition result, that recognition result is taken as the speech recognition result of the syllable; for a syllable whose speech recognition result includes a plurality of recognition results, the recognition result with the highest probability is selected, or one recognition result is manually selected, as the default speech recognition result of the syllable; and the determined speech recognition results are formed into the subtitles of the live video;
the first calibration module is set to calibrate the generated live video subtitles according to preset important words or historical calibration results;
the translation module is configured to translate the speech of the live audio into the required foreign-language subtitles;
the second calibration module is used for calibrating the generated foreign language subtitles of the live video;
and the subtitle pushing module is configured to push the calibrated subtitles and the foreign-language subtitles to the receiving end.
In this embodiment, there are 9 to 29 seconds of editing time for processing the live audio stream, and the pushed subtitles and foreign-language subtitles incur a network delay of approximately 1 second.
In the live broadcast system of this embodiment, the webcast is distributed through the CDN, which, because of network delay, introduces a delay of 10 to 30 seconds or more relative to the real-time stream. Exploiting this time difference, the live broadcast system of the embodiment of the present application combines speech recognition and calibration, so that the subtitles are obtained while the live video is still being transmitted; the subtitles therefore reach the receiving end before (or at least at the same time as) the video, which supports the requirement of live broadcasting with real-time subtitles for users at the receiving end. Meanwhile, combined with machine translation, it supports live broadcasting with simultaneously interpreted foreign-language subtitles for viewers with some foundation in a foreign language, which greatly lowers the threshold of international live broadcasting. This live broadcast system thus ensures both the accuracy and the real-time performance of the generated subtitles.
Although the embodiments disclosed in the present application are described above, the descriptions are only for the convenience of understanding the present application, and are not intended to limit the present application. It will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the disclosure as defined by the appended claims.

Claims (19)

1. A live subtitle generating method comprises the following steps:
the server side pushes the collected live broadcast video to the receiving end, and carries out voice recognition on the collected live broadcast audio to obtain a voice recognition result;
generating subtitles of live video according to a voice recognition result;
and pushing the generated subtitles to a receiving end so that the receiving end combines the live video and the subtitles to obtain the live video with the subtitles.
2. The live subtitle generating method of claim 1, wherein the pushing the captured live video to a receiving end comprises:
and pushing the acquired live video to the receiving end through a Content Delivery Network (CDN).
3. The live-broadcast subtitle generating method of claim 1, before pushing the generated subtitle to a receiving end, further comprising:
and calibrating the generated live video subtitle according to preset important words or historical calibration results.
4. The live subtitle generating method of claim 1 or 3, wherein each syllable of the speech recognition result includes one or more recognition results;
the generating of the subtitle of the live video according to the voice recognition result comprises the following steps:
for a syllable including one recognition result, taking that recognition result as the voice recognition result of the syllable;
for a syllable including a plurality of recognition results, taking the recognition result with the highest probability as the voice recognition result of the syllable, or manually selecting one recognition result as the voice recognition result of the syllable;
and forming the determined voice recognition result into a subtitle of the live video.
5. The live subtitle generating method of claim 1, the method further comprising:
translating the collected live broadcast audio into required foreign language subtitles;
calibrating the generated foreign language subtitles of the live video;
the step of pushing the generated subtitles to the receiving end further comprises: and pushing the generated foreign language subtitles to the receiving end.
6. The live subtitle generating method of claim 5, wherein the translating the captured live audio into the desired foreign subtitle comprises:
translating into the desired foreign language based on the voice of the live audio;
or, translating into the desired foreign language based on the speech recognition result.
7. A computer-readable storage medium storing computer-executable instructions for performing the live subtitle generating method of any one of claims 1-6.
8. An apparatus for implementing live subtitle generation, comprising a memory and a processor, wherein the memory stores instructions executable by the processor for performing the steps of the live subtitle generating method of any one of claims 1-6.
9. A live subtitle generating method comprises the following steps:
a receiving end receives live broadcast video and subtitles from a server end;
based on the timestamp, the received live video and the subtitle are synchronized, and the synchronized live video and the synchronized subtitle are combined to obtain a video with the subtitle;
and displaying the obtained video with the subtitles to a user.
10. The live-broadcast subtitle generating method of claim 9, when the subtitle arrives at the receiving end later than the live-broadcast video, further comprising:
the receiving end processes the received subtitles according to the pre-configuration information;
wherein the pre-configuration information comprises: discarding the subtitles, or displaying the subtitles that arrive with a delay.
11. A computer-readable storage medium storing computer-executable instructions for performing the live subtitle generating method of claim 9 or 10.
12. An apparatus for implementing live subtitle generation, comprising a memory and a processor, wherein the memory stores instructions executable by the processor for performing the steps of the live subtitle generating method of claim 9 or 10.
13. A server, comprising: a live video processing module, a voice recognition module, a subtitle generating module, and a subtitle pushing module; wherein:
the live video processing module is used for pushing the collected live video to a receiving end;
the voice recognition module is used for carrying out voice recognition on the collected live broadcast audio to obtain a voice recognition result;
the subtitle generating module is used for generating subtitles of live video according to the voice recognition result;
and the subtitle pushing module is used for pushing the generated subtitles to a receiving end so that the receiving end combines the live video and the subtitles to obtain the live video with the subtitles.
14. The server according to claim 13, further comprising: and the first calibration module is set to calibrate the generated live video subtitles according to preset important words or historical calibration results.
15. The server according to claim 13 or 14, wherein each syllable of the speech recognition result comprises one or more recognition results;
the subtitle generating module is configured to: for a syllable including one recognition result, take that recognition result as the voice recognition result of the syllable; for a syllable including a plurality of recognition results, take the recognition result with the highest probability, or manually select one recognition result, as the voice recognition result of the syllable; and form the determined voice recognition results into subtitles of the live video.
16. The server according to claim 13, further comprising: a translation module and a second calibration module; wherein:
the translation module is used for translating the acquired live broadcast audio into required foreign language subtitles;
the second calibration module is used for calibrating the generated foreign language subtitles of the live video; correspondingly, the subtitle pushing module is further configured to: and pushing the foreign language subtitles to the receiving end.
17. A live client, comprising: a receiving module, a merging processing module, and a display module; wherein:
the receiving module is used for receiving live broadcast video and subtitles from a server;
the merging processing module is used for synchronizing the received live video and subtitles based on the timestamps, and merging the synchronized live video and subtitles to obtain a video with subtitles;
and the display module is used for displaying the obtained video with the subtitles to a user.
18. The live client of claim 17, wherein when the subtitles arrive at the receiving end later than the live video, the display module is configured to:
processing the received subtitles according to the pre-configuration information;
wherein the pre-configuration information comprises: discarding the subtitles, or displaying the subtitles that arrive with a delay.
19. A live system, comprising: a collecting end, a processing server, and more than one receiving end, wherein the collecting end is configured to obtain the live video and live audio of a live broadcast site; and wherein:
the processing server comprises the server side of any one of claims 13-16, and the receiving side comprises the live client side of claim 17 or 18.
CN202010818029.6A 2020-08-14 2020-08-14 Live subtitle generation method and device, server, live client and live system Pending CN114079797A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010818029.6A CN114079797A (en) 2020-08-14 2020-08-14 Live subtitle generation method and device, server, live client and live system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010818029.6A CN114079797A (en) 2020-08-14 2020-08-14 Live subtitle generation method and device, server, live client and live system

Publications (1)

Publication Number Publication Date
CN114079797A true CN114079797A (en) 2022-02-22

Family

ID=80279427

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010818029.6A Pending CN114079797A (en) 2020-08-14 2020-08-14 Live subtitle generation method and device, server, live client and live system

Country Status (1)

Country Link
CN (1) CN114079797A (en)

Patent Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1870728A (en) * 2005-05-23 2006-11-29 北京大学 Method and system for automatic subtilting
CN102970618A (en) * 2012-11-26 2013-03-13 河海大学 Video on demand method based on syllable identification
CN103578465A (en) * 2013-10-18 2014-02-12 威盛电子股份有限公司 Speech recognition method and electronic device
CN104581221A (en) * 2014-12-25 2015-04-29 广州酷狗计算机科技有限公司 Video live broadcasting method and device
CN107305768A (en) * 2016-04-20 2017-10-31 上海交通大学 Easy wrongly written character calibration method in interactive voice
CN110870004A (en) * 2017-07-10 2020-03-06 沃克斯边界公司 Syllable-based automatic speech recognition
CN107741928A (en) * 2017-10-13 2018-02-27 四川长虹电器股份有限公司 A kind of method to text error correction after speech recognition based on field identification
CN108184135A (en) * 2017-12-28 2018-06-19 泰康保险集团股份有限公司 Method for generating captions and device, storage medium and electric terminal
CN207854084U (en) * 2018-01-17 2018-09-11 科大讯飞股份有限公司 A kind of caption display system
CN108182937A (en) * 2018-01-17 2018-06-19 出门问问信息科技有限公司 Keyword recognition method, device, equipment and storage medium
CN108428446A (en) * 2018-03-06 2018-08-21 北京百度网讯科技有限公司 Audio recognition method and device
CN108597497A (en) * 2018-04-03 2018-09-28 中译语通科技股份有限公司 A kind of accurate synchronization system of subtitle language and method, information data processing terminal
CN108600773A (en) * 2018-04-25 2018-09-28 腾讯科技(深圳)有限公司 Caption data method for pushing, subtitle methods of exhibiting, device, equipment and medium
CN108984529A (en) * 2018-07-16 2018-12-11 北京华宇信息技术有限公司 Real-time court's trial speech recognition automatic error correction method, storage medium and computing device

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115002502A (en) * 2022-07-29 2022-09-02 广州市千钧网络科技有限公司 Data processing method and server

Similar Documents

Publication Publication Date Title
US11252444B2 (en) Video stream processing method, computer device, and storage medium
US11463779B2 (en) Video stream processing method and apparatus, computer device, and storage medium
US10341694B2 (en) Data processing method and live broadcasting method and device
CN107911646B (en) Method and device for sharing conference and generating conference record
CN112437337A (en) Method, system and equipment for realizing live broadcast real-time subtitles
WO2019205886A1 (en) Method and apparatus for pushing subtitle data, subtitle display method and apparatus, device and medium
US8229748B2 (en) Methods and apparatus to present a video program to a visually impaired person
US20130204605A1 (en) System for translating spoken language into sign language for the deaf
CN112616062B (en) Subtitle display method and device, electronic equipment and storage medium
US20160066055A1 (en) Method and system for automatically adding subtitles to streaming media content
EP3742742A1 (en) Method, apparatus and system for synchronously playing message stream and audio/video stream
CN112601101A (en) Subtitle display method and device, electronic equipment and storage medium
US20120105719A1 (en) Speech substitution of a real-time multimedia presentation
CN112601102A (en) Method and device for determining simultaneous interpretation of subtitles, electronic equipment and storage medium
US20160098395A1 (en) System and method for separate audio program translation
KR20160146527A (en) Method and Apparatus For Sharing Multimedia Contents
FR3025925A1 (en) METHOD FOR CONTROLLING PRESENTATION MODES OF SUBTITLES
CN114079797A (en) Live subtitle generation method and device, server, live client and live system
KR20150137383A (en) Apparatus and service method for providing many languages of digital broadcasting using real time translation
CN116708892A (en) Sound and picture synchronous detection method, device, equipment and storage medium
CN113891108A (en) Subtitle optimization method and device, electronic equipment and storage medium
KR102200827B1 (en) Mothod and server for generating re-transmission broadcast data including subtitles
KR20150116191A (en) Subtitling broadcast device using fingerprint data
CN114125331A (en) Subtitle adding system
CN115474066A (en) Subtitle processing method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination