CN111479124A - Real-time playing method and device - Google Patents
- Publication number
- CN111479124A (application number CN202010313403.7A)
- Authority
- CN
- China
- Prior art keywords
- live
- information
- stream
- live broadcast
- playing
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- H04N21/2187—Live feed (H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]; H04N21/218—Source of audio or video content)
- G10L15/26—Speech to text systems (G10L15/00—Speech recognition)
- H04N21/233—Processing of audio elementary streams
- H04N21/235—Processing of additional data, e.g. scrambling of additional data or processing content descriptors
- H04N21/23614—Multiplexing of additional data and video streams
- H04N21/2368—Multiplexing of audio and video streams
Abstract
The embodiment of the invention provides a real-time playing method and a real-time playing device, wherein the method comprises the following steps: acquiring a live video stream and a live audio stream collected in real time; performing voice recognition processing on the live audio stream to obtain subtitle information corresponding to the live audio stream; merging the live video stream, the live audio stream and the subtitle information to obtain a merged media stream; and sending the merged media stream to a preset playing terminal, which plays the obtained merged media stream in real time. The playing terminal thus obtains the merged media stream with the subtitle information, so that listeners can better understand the speaker's lecture content based on the subtitles, improving the effectiveness of the lecture.
Description
Technical Field
The present invention relates to the field of multimedia technology, and in particular to a real-time playing method and a real-time playing device.
Background
In large lectures, classes and similar scenarios, a large conference hall is generally used so that more listeners can attend. Because the venue is large, several display screens are usually set up for real-time relay, so that listeners throughout the hall can see the speaker's real-time movements and the lecture content.
However, because the venue is large and crowded, listeners seated toward the back may still be unable to hear the speaker and the lecture content clearly, which lowers the effectiveness of the lecture. It is also difficult for those listeners to interact with the speaker in a timely and effective way. In particular, when students attend class in a large lecture hall, students in the back rows struggle to hear the teacher clearly, and the teacher finds it hard to interact with the students, keep track of how the class is going, and answer their questions interactively.
Disclosure of Invention
In view of the above problems, embodiments of the present invention are proposed to provide a real-time playing method and a corresponding real-time playing apparatus that overcome or at least partially solve the above problems.
In order to solve the above problem, an embodiment of the present invention discloses a real-time playing method, where the method includes:
Acquiring live broadcast video streams and live broadcast audio streams acquired in real time;
Performing voice recognition processing on the live audio stream to obtain subtitle information corresponding to the live audio stream;
Merging the live video stream, the live audio stream and the subtitle information to obtain a merged media stream;
Sending the merged media stream to a preset playing terminal; the playing terminal is used for playing the obtained merged media stream in real time.
Optionally, the step of performing voice recognition processing on the live audio stream to obtain subtitle information corresponding to the live audio stream includes:
Performing voice recognition on the live audio stream to acquire at least one text message and a timestamp corresponding to the text message;
And generating subtitle information corresponding to the live audio stream by adopting the at least one piece of text information and the timestamp corresponding to the text information.
Optionally, the method further comprises:
Determining at least one piece of key information in the at least one piece of text information;
And generating key mark information by adopting the time stamp of the key information.
Optionally, the step of acquiring a live video stream and a live audio stream collected in real time includes:
Acquiring a live broadcast video stream and a live broadcast audio stream sent by a preset live broadcast acquisition client; the live broadcast acquisition client is used for acquiring live broadcast video streams acquired by preset video acquisition equipment in real time and acquiring live broadcast audio streams acquired by preset audio acquisition equipment in real time.
Optionally, the method further comprises:
Receiving interactive information sent by the playing terminal;
Sending the interaction information to the live broadcast acquisition client; the live broadcast acquisition client is used for displaying the interactive information.
Optionally, the step of sending the interaction information to the live broadcast collecting client includes:
Classifying the interaction information to obtain classified statistical information of the interaction information;
And sending the interactive information and the classification statistical information of the interactive information to the live broadcast acquisition client.
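The classification-and-statistics step above could be sketched as follows. This is a minimal illustration only: the patent specifies no taxonomy, so the categories and keyword lists here are hypothetical placeholders for whatever classifier a deployment would actually use.

```python
from collections import Counter

# Hypothetical categories for audience interaction messages; a real system
# might use a trained classifier instead of keyword matching.
CATEGORY_KEYWORDS = {
    "question": ["why", "how", "what", "?"],
    "agreement": ["agree", "+1", "yes"],
    "request": ["please", "repeat", "slower"],
}

def classify_message(text: str) -> str:
    """Assign a message to the first category whose keywords it contains."""
    lowered = text.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(k in lowered for k in keywords):
            return category
    return "other"

def classify_interactions(messages):
    """Return each message with its category, plus per-category counts
    (the 'classification statistical information' sent to the client)."""
    labeled = [(msg, classify_message(msg)) for msg in messages]
    stats = Counter(category for _, category in labeled)
    return labeled, dict(stats)
```

The statistics let the live capture client show the speaker an at-a-glance summary (e.g. how many questions are pending) rather than a raw message feed.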
The embodiment of the invention also discloses a real-time playing device, which comprises:
The acquisition module is used for acquiring live broadcast video streams and live broadcast audio streams which are acquired in real time;
The subtitle generating module is used for carrying out voice recognition processing on the live broadcast audio stream to obtain subtitle information corresponding to the live broadcast audio stream;
A merging module, configured to merge the live video stream, the live audio stream, and the subtitle information to obtain a merged media stream;
The sending module is used for sending the merged media stream to a preset playing terminal; the playing terminal is used for playing the obtained merged media stream in real time.
Optionally, the subtitle generating module includes:
The recognition submodule is used for carrying out voice recognition on the live broadcast audio stream to obtain at least one piece of text information and a timestamp corresponding to the text information;
And the subtitle generating submodule is used for generating subtitle information corresponding to the live audio stream by adopting the at least one text message and the timestamp corresponding to the text message.
Optionally, the apparatus further comprises:
A key information determining module, configured to determine at least one piece of key information in the at least one piece of text information;
And a key mark generating module, configured to generate key mark information by using the timestamp of the key information.
Optionally, the obtaining module includes:
The acquisition submodule is used for acquiring a live broadcast video stream and a live broadcast audio stream sent by a preset live broadcast acquisition client; the live broadcast acquisition client is used for acquiring live broadcast video streams acquired by preset video acquisition equipment in real time and acquiring live broadcast audio streams acquired by preset audio acquisition equipment in real time.
Optionally, the apparatus further comprises:
The interactive receiving module is used for receiving the interactive information sent by the playing terminal;
The interactive transmission module is used for transmitting the interactive information to the live broadcast acquisition client; the live broadcast acquisition client is used for displaying the interactive information.
Optionally, the interactive transmission module includes:
The statistic submodule is used for classifying the interaction information to obtain classified statistic information of the interaction information;
And the interactive sending submodule is used for sending the interactive information and the classification statistical information of the interactive information to the live broadcast acquisition client.
The embodiment of the invention also discloses a real-time playing system which comprises a live broadcast acquisition client, a multimedia processing server and a playing terminal.
The live broadcast acquisition client is used for sending live broadcast video streams and live broadcast audio streams acquired in real time to the multimedia processing server;
The multimedia processing server is used for carrying out voice recognition processing on the live audio stream to obtain subtitle information corresponding to the live audio stream; merging the live video stream, the live audio stream and the subtitle information to obtain a merged media stream; sending the merged media stream to the playing terminal;
The playing terminal is used for playing the obtained merged media stream in real time.
The embodiment of the invention also discloses a device, which comprises:
One or more processors; and
One or more machine-readable media having instructions stored thereon, which when executed by the one or more processors, cause the apparatus to perform one or more methods as described in embodiments of the invention.
One or more machine-readable media having instructions stored thereon, which when executed by one or more processors, cause the processors to perform one or more methods as described in embodiments of the invention.
The embodiment of the invention has the following advantages:
According to the real-time playing method provided by the embodiment of the invention, a live video stream and a live audio stream collected in real time are obtained; voice recognition processing is performed on the live audio stream to obtain subtitle information corresponding to it; the live video stream, the live audio stream and the subtitle information are merged to obtain a merged media stream; and the merged media stream is sent to a preset playing terminal, which plays it in real time. The playing terminal thus obtains the merged media stream with the subtitle information, so that listeners can better understand the speaker's lecture content based on the subtitles, improving the effectiveness of the lecture.
Drawings
FIG. 1 is a flowchart illustrating steps of a real-time playing method according to an embodiment of the present invention;
FIG. 2 is a flow chart of steps in another embodiment of a real-time playing method according to the present invention;
FIG. 3 is a diagram illustrating an embodiment of a real-time playing method according to the present invention;
Fig. 4 is a communication diagram of an embodiment of a real-time playing method according to the present invention;
Fig. 5 is a block diagram of a real-time playing device according to an embodiment of the present invention;
Fig. 6 is a schematic diagram of an embodiment of a real-time playing system according to the present invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
To address the situation in which the speaker's lecture content cannot be heard clearly, the embodiment of the invention captures the speaker's voice, real-time movements and courseware content during the lecture, obtains the live video stream and live audio stream collected in real time, performs voice recognition processing on the live audio stream to obtain the corresponding subtitle information, and sends the merged media stream carrying the subtitle information to the listeners' playing terminals, so that listeners can follow the lecture with the help of the subtitles, improving the effectiveness of the lecture.
Referring to fig. 1, a flowchart illustrating steps of an embodiment of a real-time playing method according to an embodiment of the present invention is shown, which may specifically include the following steps:
Step 101: acquire a live video stream and a live audio stream collected in real time. In the embodiment of the invention, the speaker's lecture can be captured in real time during the lecture. Specifically, the speaker's real-time movements and the displayed courseware content can be captured in real time to obtain a live video stream, while the speaker's voice is captured in real time to obtain a live audio stream.
In a specific implementation, a multimedia processing server may be used to continuously obtain the live video stream and the live audio stream acquired in real time based on a long connection. The live video stream may be a video streaming media captured in real time. The live audio stream may be audio streaming media collected in real-time.
Step 102: perform voice recognition processing on the live audio stream to obtain subtitle information corresponding to the live audio stream.
In the embodiment of the present invention, the live audio stream obtained in real time may be subjected to speech recognition processing, speech information included in the live audio stream is recognized, and the speech information is converted into text, so as to obtain subtitle information corresponding to the live audio stream. The subtitle information may be text information corresponding to speech included in the live audio stream.
In a specific implementation, the multimedia processing server may employ a preset speech recognition model to implement speech recognition processing on the live audio stream. For example, the speech recognition model may be a hidden Markov model, an N-gram language model, a deep learning neural network, etc., which is not limited by the present invention.
Step 103: merge the live video stream, the live audio stream and the subtitle information to obtain a merged media stream. In the embodiment of the present invention, the separate live video stream, live audio stream and subtitle information may be merged to obtain a merged media stream containing all three.
In a specific implementation, the multimedia processing server may encapsulate the live video stream, the live audio stream, and the subtitle information into a same file, so as to obtain the merged media stream.
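One common way to encapsulate the three elements into a single file is stream muxing with a tool such as ffmpeg. The patent does not name any tool, so the following is only an assumed sketch: it builds an ffmpeg command that maps a video input, an audio input and a subtitle file into one MP4 container without re-encoding the media streams.

```python
def build_mux_command(video_path, audio_path, subtitle_path, output_path):
    """Build an ffmpeg command that muxes video, audio, and subtitles
    into a single container (the patent's 'merged media stream').
    File names are illustrative; the paths are caller-supplied."""
    return [
        "ffmpeg",
        "-i", video_path,        # input 0: live video stream
        "-i", audio_path,        # input 1: live audio stream
        "-i", subtitle_path,     # input 2: subtitle information (e.g. SRT)
        "-map", "0:v", "-map", "1:a", "-map", "2:s",
        "-c:v", "copy", "-c:a", "copy",  # copy streams: no re-encode, low latency
        "-c:s", "mov_text",      # subtitle codec suitable for MP4 containers
        output_path,
    ]

cmd = build_mux_command("live.h264", "live.aac", "live.srt", "merged.mp4")
```

In a live setting the same mapping would be applied to a streaming output (e.g. an RTMP or HLS target) rather than a file, but the stream-mapping idea is the same.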
Step 104: send the merged media stream to a preset playing terminal. In the embodiment of the invention, at least one playing terminal can be preset. The merged media stream may be sent to the playing terminal, so that the playing terminal obtains the merged media stream with the subtitle information and plays it in real time. Listeners can watch the speaker's lecture in real time through the playing terminal and, with the help of the subtitle information, better follow the lecture content, improving the effectiveness of the lecture.
The playing terminal may be a large multimedia playing device used by a large number of listeners, such as a television, a projection device with an audio function, and the like, which is not limited in the present invention. The playing terminal may also be a small multimedia playing device for a single or a small number of listeners, such as a mobile phone, a desktop computer, a tablet computer, and the like, which is not limited in the present invention.
In a specific implementation, the multimedia processing server may perform voice recognition on the live audio stream collected in real time to obtain the corresponding subtitle information, merge the live video stream, the live audio stream and the subtitle information into a merged media stream, and send the processed merged media stream to the playing terminal in real time. The playing terminal thus receives the processed merged media stream in real time, so that listeners see the speaker's real-time movements and the displayed courseware content as they happen.
In a specific implementation, the multimedia processing server needs some time to perform voice recognition on the live audio stream and to merge the live video stream, the live audio stream and the subtitle information. The server can therefore pre-buffer the live video stream and live audio stream for a preset duration to ensure stable transmission to the playing terminal. At the same time, to preserve the real-time effect, the preset duration should be short, so that listeners do not receive the merged media stream too late. As examples, the preset duration may be 1 second, 3 seconds or 5 seconds.
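The pre-buffering idea can be sketched as a bounded queue: output lags input by roughly the preset duration, which is the time window the server has for recognition and muxing. This is a simplified assumption-laden sketch (fixed-length chunks, no real clocking); a production server would buffer by timestamps.

```python
from collections import deque

class PreBuffer:
    """Holds roughly `duration_s` seconds of media chunks before forwarding,
    giving the server time for speech recognition and stream merging.
    Hypothetical sketch: assumes every chunk covers `chunk_s` seconds."""

    def __init__(self, duration_s: float, chunk_s: float = 0.5):
        self.capacity = max(1, int(duration_s / chunk_s))
        self.chunks = deque()

    def push(self, chunk):
        """Buffer a chunk; once the buffer is full, return the oldest chunk,
        so output is delayed by the preset duration."""
        self.chunks.append(chunk)
        if len(self.chunks) > self.capacity:
            return self.chunks.popleft()
        return None

buf = PreBuffer(duration_s=1.0, chunk_s=0.5)  # capacity: 2 chunks (~1 s)
```

A short `duration_s` keeps the playback delay small, which matches the patent's requirement that listeners should not receive the merged media stream too late.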
In the embodiment of the present invention, the multimedia processing server may further store the merged media stream, so that after the real-time playing ends, the playing terminal can retrieve the historically played merged media stream for viewing and listeners can review the lecture content.
According to the real-time playing method provided by the embodiment of the invention, a live video stream and a live audio stream collected in real time are obtained; voice recognition processing is performed on the live audio stream to obtain subtitle information corresponding to it; the live video stream, the live audio stream and the subtitle information are merged to obtain a merged media stream; and the merged media stream is sent to a preset playing terminal, which plays it in real time. The playing terminal thus obtains the merged media stream with the subtitle information, so that listeners can better understand the speaker's lecture content based on the subtitles, improving the effectiveness of the lecture.
Referring to fig. 2, a flowchart illustrating steps of another embodiment of a real-time playing method according to the present invention is shown, which may specifically include the following steps:
In the embodiment of the invention, the speaker's real-time movements and the displayed courseware content can be captured in real time during the lecture to obtain a live video stream, while the speaker's voice is captured in real time to obtain a live audio stream.
In a specific implementation, a multimedia processing server may be used to continuously obtain the live video stream and the live audio stream acquired in real time based on a long connection. The live video stream may be a video streaming media captured in real time. The live audio stream may be audio streaming media collected in real-time.
In an embodiment of the present invention, the step of acquiring a live video stream and a live audio stream collected in real time includes:
S11, acquiring a live broadcast video stream and a live broadcast audio stream sent by a preset live broadcast acquisition client; the live broadcast acquisition client is used for acquiring live broadcast video streams acquired by preset video acquisition equipment in real time and acquiring live broadcast audio streams acquired by preset audio acquisition equipment in real time.
In the embodiment of the present invention, the multimedia processing server may obtain the live video stream and the live audio stream from a preset live capture client. The live broadcast acquisition client can be connected with preset video acquisition equipment and preset audio acquisition equipment, acquires live broadcast video streams acquired by the video acquisition equipment in real time, and acquires live broadcast audio streams acquired by the audio acquisition equipment in real time.
In a specific implementation, the live broadcast acquisition client can be connected to the video acquisition device and the audio acquisition device, so that it can receive the live video stream and the live audio stream that these devices send to it. For example, the live broadcast acquisition client may be a desktop or notebook computer, the video acquisition device a camera, and the audio acquisition device a microphone; the client connects to the camera and the microphone and acquires the live video stream captured by the camera and the live audio stream captured by the microphone in real time.
In a specific implementation, the live broadcast acquisition client may itself be equipped with the video acquisition device and the audio acquisition device, so that it can invoke them to collect the live video stream and the live audio stream. For example, the live broadcast acquisition client may be a mobile phone or a tablet computer equipped with a camera and a microphone; the camera then serves as the video acquisition device and the microphone as the audio acquisition device, and the client invokes them to collect the live video stream and the live audio stream.
In the embodiment of the present invention, the live audio stream obtained in real time may undergo voice recognition processing: the speech contained in the stream is recognized and converted into text, yielding at least one piece of text information and a timestamp corresponding to each piece.
In a specific implementation, the multimedia processing server may employ a preset speech recognition model to implement speech recognition processing on the live audio stream. For example, the speech recognition model may be a hidden Markov model, an N-gram language model, a deep learning neural network, etc., which is not limited by the present invention.
In a specific implementation, the multimedia processing server may input the live audio stream into the speech recognition model, which outputs at least one piece of text information for the stream, each with a corresponding timestamp.
Wherein the text information may contain at least one word, or at least one sentence. The time stamp may be used to represent a start time and an end time of the text information.
In the embodiment of the present invention, a subtitle time axis may be generated using the timestamps corresponding to the text information, and each piece of text information is placed at its corresponding position on that axis, thereby generating the subtitle information corresponding to the live audio stream.
In a specific implementation, the subtitle time axis may include the start time and end time of at least one line of subtitles. The timestamps can therefore serve as the subtitle time axis, and the text information corresponding to each timestamp can be added at the corresponding position, generating the subtitle information corresponding to the live audio stream.
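Turning timestamped text into a subtitle timeline can be sketched concretely with the SRT format, where each entry has an index, a start/end time line, and the text. The patent does not name a subtitle format, so SRT here is only an illustrative choice.

```python
def format_srt_time(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def build_subtitles(segments):
    """Turn recognized (text, start, end) segments into SRT subtitle info.
    Each segment's timestamps become one line on the subtitle time axis."""
    lines = []
    for i, (text, start, end) in enumerate(segments, start=1):
        lines.append(str(i))
        lines.append(f"{format_srt_time(start)} --> {format_srt_time(end)}")
        lines.append(text)
        lines.append("")  # blank line separates SRT entries
    return "\n".join(lines)

srt = build_subtitles([("Welcome to the lecture.", 0.0, 2.5)])
```

The resulting text can be fed to the merging step (e.g. as the subtitle input of a muxer) to produce the merged media stream with subtitles.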
In one embodiment of the invention, the method further comprises:
S21, determining at least one piece of key information in the at least one piece of text information;
In the embodiment of the present invention, the speaker's lecture may contain key content that listeners should pay particular attention to. After voice recognition is performed on the live audio stream to obtain at least one piece of text information, at least one piece of key information is identified within it, so that listeners can better focus on the key content of the lecture. The key information is the text information associated with that key content.
In a specific implementation, before presenting key content, a speaker usually gives a verbal cue prompting the audience to pay attention to what follows. The cue phrases in the text information can therefore be identified by keyword matching, so as to determine at least one piece of key information in the at least one piece of text information.
As an example, the cue phrase may be "this topic is an emphasis", "please take note of the following", or "please listen carefully". Accordingly, the keywords may be set to "emphasis", "take note", "listen carefully" and so on, and any text information containing such a keyword is treated as key information.
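The keyword scan can be sketched as follows. The keyword list is hypothetical (the patent only gives examples, and a deployment would configure its own cues), and the segment texts are illustrative.

```python
# Hypothetical cue keywords; a real deployment would configure its own list.
KEY_PROMPTS = ["emphasis", "take note", "listen carefully"]

def find_key_marks(segments):
    """Scan recognized (text, start, end) segments for cue keywords and
    return key-mark records: the matched keyword plus the segment's
    start timestamp, which later marks the merged media stream."""
    marks = []
    for text, start, _end in segments:
        lowered = text.lower()
        for prompt in KEY_PROMPTS:
            if prompt in lowered:
                marks.append({"keyword": prompt, "time": start})
                break  # one mark per segment is enough
    return marks

marks = find_key_marks([
    ("Please take note of the following formula.", 120.0, 124.0),
    ("And now an aside.", 130.0, 133.0),
])
```

Each returned record pairs a cue with a timestamp, which is exactly the input needed for step S22 (generating key mark information from the timestamps of the key information).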
S22, generating key mark information by using the time stamp of the key information;
In the embodiment of the present invention, to help listeners focus on key content during real-time playing and to let them review the historically played merged media stream after playing ends, key mark information may be generated using the timestamp of the key information. The key mark information records the time points in the merged media stream that are associated with key information.
In a specific implementation, during real-time playing, the multimedia processing server may perform voice recognition on the live audio stream collected in real time to obtain text information. Meanwhile, the server can check whether the obtained text information contains key information; if it does, key mark information is generated using the timestamp of that key information.
S23, sending the highlight marking information to the playback terminal, where the playback terminal is configured to mark a time point of the merged media stream corresponding to the highlight marking information.
In the embodiment of the present invention, the key mark information may be sent to the playing terminal together with the merged media stream. The playing terminal can use the key mark information to mark the corresponding time point of the merged media stream and prompt the listener to pay close attention to the currently played content.
In a specific implementation, the multimedia processing server may send the key mark information and the merged media stream to the playing terminal together. The playing terminal can determine, based on the key mark information, the corresponding time point of the merged media stream, and display a prompt message when playback reaches that time point, thereby marking the time point and prompting the listener to pay attention to the currently played content.
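The time-point check the playing terminal would perform on each playback tick might look like the following sketch; the half-second tolerance window is an assumption:

```python
def marks_due(key_marks, playback_time, tolerance=0.5):
    """Return the key marks whose time point matches the current playback
    position, so the terminal can display a prompt message for them.

    key_marks: list of dicts with a "time_point" field (seconds).
    playback_time: current position in the merged media stream (seconds).
    """
    return [m for m in key_marks
            if abs(m["time_point"] - playback_time) <= tolerance]
```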
In a specific implementation, the playing terminal may display a time axis while playing the merged media stream. The time axis shows the time information of the merged media stream. The playing terminal may determine, based on the key mark information, the time point of the merged media stream corresponding to the key mark information, and highlight and mark that time point on the time axis, so as to prompt the listener to watch the lecture content near that time point.
In a specific implementation, after the real-time playing ends, the historically played merged media stream and the key mark information are stored. When playing back the historically played merged media stream, a listener can quickly jump to the corresponding time point in the merged media stream based on the key mark information, so as to quickly review the key content in the merged media stream.
In the embodiment of the present invention, the live video stream, the live audio stream, and the subtitle information may be merged to obtain a merged media stream containing all three.
In a specific implementation, the multimedia processing server may encapsulate the live video stream, the live audio stream, and the subtitle information into a same file, so as to obtain the merged media stream.
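One plausible way to encapsulate the three parts into a single file is to mux them with ffmpeg; the stream file names, the MP4 container, and the use of ffmpeg itself are illustrative assumptions, since the embodiment does not name a tool or format:

```python
import subprocess

def build_mux_command(video_path, audio_path, subtitle_path, out_path):
    """Build an ffmpeg command that encapsulates the live video stream,
    the live audio stream, and the subtitle information into one file.
    Audio and video are stream-copied; subtitles are converted to the
    mov_text codec that MP4 containers accept."""
    return ["ffmpeg",
            "-i", video_path, "-i", audio_path, "-i", subtitle_path,
            "-c:v", "copy", "-c:a", "copy", "-c:s", "mov_text",
            "-map", "0:v", "-map", "1:a", "-map", "2:s",
            out_path]

# Example invocation (requires ffmpeg installed and real input files):
# subprocess.run(build_mux_command("live.h264", "live.aac",
#                                  "live.srt", "merged.mp4"), check=True)
```

For a true live pipeline the server would more likely mux segments continuously (e.g. into an HLS/DASH playlist) rather than a single file, but the per-file command shows the encapsulation step the text describes.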
In the embodiment of the invention, at least one playing terminal can be preset. The merged media stream may be sent to the playing terminal, so that the playing terminal obtains the merged media stream with the subtitle information and plays it in real time, thereby realizing real-time playing of the multimedia stream with the subtitle information. The listener can watch the speaker's real-time lecture content through the playing terminal and better understand it with the help of the subtitle information, improving the listener's listening efficiency.
The playing terminal may be a large multimedia playing device serving many listeners, such as a television or a projection device with an audio function, which is not limited in the present invention. The playing terminal may also be a small multimedia playing device for a single listener or a small number of listeners, such as a mobile phone, a desktop computer, or a tablet computer, which is likewise not limited in the present invention.
In a specific implementation, the multimedia processing server may perform voice recognition on the live audio stream acquired in real time to obtain subtitle information corresponding to the live audio stream, merge the live video stream, the live audio stream, and the subtitle information to obtain a merged media stream, and send the processed merged media stream to the playing terminal in real time. The playing terminal thus acquires the processed merged media stream in real time, allowing the listener to watch the speaker's real-time activity and the courseware content displayed by the speaker.
In a specific implementation, the multimedia processing server needs a certain amount of time to perform voice recognition on the live audio stream and to merge the live video stream, the live audio stream, and the subtitle information. Therefore, the multimedia processing server may pre-cache the live video stream and the live audio stream for a preset duration to ensure stable transmission to the playing terminal. Meanwhile, to preserve the real-time playing effect, the preset duration should be short, so that the listener does not receive the merged media stream too late. As an example of the present invention, the preset duration may be 1 second, 3 seconds, 5 seconds, or the like.
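The pre-caching described above can be sketched as a bounded buffer that withholds roughly the preset duration of media chunks before forwarding them, giving the server time to finish recognition and merging; the per-chunk duration is an assumption:

```python
import collections

class PreCacheBuffer:
    """Hold roughly `preset_seconds` of media chunks before forwarding."""

    def __init__(self, preset_seconds=3.0, chunk_seconds=0.5):
        # Number of chunks that together cover the preset duration.
        self.capacity = int(preset_seconds / chunk_seconds)
        self.chunks = collections.deque()

    def push(self, chunk):
        """Buffer a chunk; return the oldest chunk once the buffer is
        full (ready to forward), otherwise None (still pre-caching)."""
        self.chunks.append(chunk)
        if len(self.chunks) > self.capacity:
            return self.chunks.popleft()
        return None
```

With `preset_seconds=3.0` the terminal lags the capture by about three seconds, which trades a small delay for uninterrupted playback while recognition catches up.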
In one embodiment of the invention, the method further comprises:
S31, receiving the interactive information sent by the playing terminal;
In the embodiment of the present invention, while the playing terminal acquires and plays the merged media stream in real time, a listener can interact with the speaker based on the lecture content, for example by raising questions, making suggestions, or expressing reactions. The listener can send interactive information through the playing terminal, and the multimedia processing server receives the interactive information sent by the playing terminal.
In a specific implementation, the interactive information may take the form of text, pictures, voice, video, and the like, which is not limited in the present invention. The playing terminal may have a built-in input device, such as a camera, a microphone, or a touch screen; alternatively, the playing terminal may be connected to an input device, such as a keyboard or a mouse. The playing terminal can acquire the interactive information entered by a listener through the input device and send it to the multimedia processing server, so that the multimedia processing server receives the interactive information sent by the playing terminal.
S32, sending the interaction information to the live broadcast acquisition client; the live broadcast acquisition client is used for displaying the interactive information.
In the embodiment of the present invention, after the interactive information is acquired, it may be sent to the live broadcast acquisition client. The live broadcast acquisition client can display the interactive information, so that the speaker can check it and reply as needed, realizing real-time interaction between the speaker and the listeners.
In an embodiment of the present invention, the step of sending the interaction information to the live broadcast collecting client includes:
S41, classifying the interaction information to obtain classified statistical information of the interaction information;
In the embodiment of the invention, to help the speaker understand the feedback from the listeners more clearly, the interaction information can be classified, and interaction information of the same type can be collected and counted, thereby obtaining the classified statistical information of the interaction information.
The classification statistical information may be statistical information corresponding to at least one category of the interaction information. For example, the classification statistics may be: 31 pieces of interaction information belong to the "formula" category, and 20 pieces of interaction information belong to the "calculation process" category.
In a specific implementation, text analysis may be performed on the acquired interaction information in real time, and the interaction information may be classified based on the result of the text analysis. Specifically, the interactive information may be classified based on the keywords it contains. For example, the keyword "formula" may be set, so that the interactive messages "how is this formula interpreted" and "I do not understand this formula" are classified into one category. The number of pieces of interaction information in each category acquired in real time can then be counted to obtain the classified statistical information.
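A minimal sketch of the keyword-based classification and counting, assuming a hypothetical keyword-to-category table (the text only names "formula" and "calculation process" as examples):

```python
from collections import Counter

# Illustrative mapping from keyword to category name.
CATEGORY_KEYWORDS = {"formula": "formula",
                     "calculation": "calculation process"}

def classify_interactions(messages):
    """Group interactive messages by keyword and count each category.

    messages: list of message texts. A message matching several keywords
    is counted only under the first matching category.
    """
    counts = Counter()
    for msg in messages:
        for keyword, category in CATEGORY_KEYWORDS.items():
            if keyword in msg:
                counts[category] += 1
                break
    return dict(counts)
```

A production system would likely replace substring matching with the text-analysis step the embodiment mentions (e.g. a trained classifier), but the counting structure stays the same.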
And S42, sending the interactive information and the classification statistical information of the interactive information to the live broadcast acquisition client.
In the embodiment of the present invention, after the classification statistical information of the interaction information is determined, it may be sent to the live broadcast acquisition client together with the interaction information. The speaker can thus check the interactive information and the classified statistical information in real time at the live broadcast acquisition client, gain a more intuitive and rapid understanding of how the listeners are following the lecture based on the classified statistical information, and interact with them in real time.
Optionally, a listener may send an interactive message at any moment while the merged media stream is being played in real time. Therefore, the current time point of the merged media stream when the interactive message is sent can be used as the time information of that message. During classification, based on the keywords in the interactive information and its time information, interactive messages with the same keywords and similar time information are grouped into one category and counted to obtain the classified statistical information. Thus, after the lecture ends, the speaker can also review which time periods of the historical lecture drew the listeners' attention and the corresponding question types, and interact with the listeners accordingly.
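Grouping by keyword and time proximity might be sketched by bucketing each message's time point into fixed windows; the one-minute window length and the keyword list are assumptions:

```python
def classify_with_time(messages, window_seconds=60.0):
    """Group messages that share a keyword AND fall into the same time
    window of the merged media stream, then count each group.

    messages: list of (text, time_point_seconds) tuples.
    Returns a dict mapping (keyword, window_index) to a count.
    """
    keywords = ["formula", "calculation"]  # illustrative
    stats = {}
    for text, t in messages:
        for kw in keywords:
            if kw in text:
                # Messages within the same window share a bucket.
                bucket = (kw, int(t // window_seconds))
                stats[bucket] = stats.get(bucket, 0) + 1
                break
    return stats
```

Buckets with high counts identify the time periods (and question types) that confused the audience most, which is exactly what the paragraph above says the speaker reviews after the lecture.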
As an example of the present invention, fig. 3 is a schematic diagram of a real-time playing method according to an embodiment of the present invention. The live capture client 301 may be connected to a video capture device 302, an audio capture device 303, and a multimedia processing server 304. The multimedia processing server 304 is also connected to a play terminal 305. The multimedia processing server 304, the live broadcast acquisition client 301, the video acquisition device 302, the audio acquisition device 303 and the playing terminal 305 can adopt a C/S (client/server) architecture, so that the data transmission speed among different devices is improved. The live capture client 301 can obtain a live video stream captured by the video capture device 302 in real time, and obtain a live audio stream captured by the audio capture device 303 in real time. Thereafter, the live capture client 301 can send the live video stream and the live audio stream to the multimedia processing server 304. The multimedia processing server 304 may obtain a live video stream and a live audio stream collected in real time, perform voice recognition processing on the live audio stream to obtain subtitle information corresponding to the live audio stream, merge the live video stream, the live audio stream, and the subtitle information to obtain a merged media stream, and then the multimedia processing server 304 may send the merged media stream to the play terminal 305. The playing terminal 305 may play the obtained merged media stream in real time, so as to realize real-time playing of the merged media stream with the subtitle information.
As another example of the present invention, fig. 4 is a communication schematic diagram of a real-time playing method according to an embodiment of the present invention. The real-time playing method can comprise the following steps:
1. The live broadcast acquisition client sends the live video stream and the live audio stream acquired in real time to the multimedia processing server.
Then, the multimedia processing server can perform voice recognition on the live audio stream to acquire at least one piece of text information and the timestamp corresponding to the text information; generate subtitle information corresponding to the live audio stream by using the at least one piece of text information and the corresponding timestamp; and merge the live video stream, the live audio stream, and the subtitle information to obtain a merged media stream.
Meanwhile, the multimedia processing server can also determine at least one piece of key information in the at least one piece of text information; and generating key mark information by adopting the time stamp of the key information.
2. The playing terminal sends a real-time playing request to the multimedia processing server;
3. The multimedia processing server sends a real-time combined media stream to the playing terminal;
4. The playing terminal sends interactive information to the multimedia processing server;
Then, the multimedia processing server classifies the interactive information to obtain classified statistical information of the interactive information;
5. The multimedia processing server sends the interactive information and the classification statistical information of the interactive information to the live broadcast acquisition client.
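The subtitle-generation step in the flow above (text information plus timestamps to subtitle information) can be sketched as an SRT renderer; the SRT format is an illustrative choice, since the embodiment does not name a subtitle format:

```python
def to_srt(segments):
    """Render recognized text segments as SRT subtitle information.

    segments: list of (text, start_seconds, end_seconds) tuples produced
    by speech recognition on the live audio stream.
    """
    def fmt(t):
        # SRT timestamps look like HH:MM:SS,mmm.
        h, rem = divmod(int(t), 3600)
        m, s = divmod(rem, 60)
        ms = int(round((t - int(t)) * 1000))
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    lines = []
    for i, (text, start, end) in enumerate(segments, 1):
        lines.append(f"{i}\n{fmt(start)} --> {fmt(end)}\n{text}\n")
    return "\n".join(lines)
```

Because each cue carries the timestamp of its text information, the playing terminal can align the subtitles with the live audio stream inside the merged media stream.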
According to the real-time playing method provided by the embodiment of the present invention, a live video stream and a live audio stream acquired in real time are obtained; voice recognition is performed on the live audio stream to obtain at least one piece of text information and the timestamp corresponding to the text information; subtitle information corresponding to the live audio stream is generated using the at least one piece of text information and the corresponding timestamp; the live video stream, the live audio stream, and the subtitle information are merged to obtain a merged media stream; and the merged media stream is sent to a preset playing terminal, which plays it in real time. The playing terminal thus acquires the merged media stream with the subtitle information. The listener can better understand the speaker's lecture content based on the subtitle information in the merged media stream, improving the listener's listening efficiency.
It should be noted that, for simplicity of description, the method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the illustrated order of acts, as some steps may occur in other orders or concurrently in accordance with the embodiments of the present invention. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred and that no particular act is required to implement the invention.
Referring to fig. 5, a block diagram of a real-time playing device according to an embodiment of the present invention is shown, and the real-time playing device specifically includes the following modules:
An obtaining module 501, configured to obtain a live video stream and a live audio stream acquired in real time;
A subtitle generating module 502, configured to perform voice recognition processing on the live audio stream to obtain subtitle information corresponding to the live audio stream;
A merging module 503, configured to merge the live video stream, the live audio stream, and the subtitle information to obtain a merged media stream;
A sending module 504, configured to send the merged media stream to a preset playback terminal; the playing terminal is used for playing the obtained merged media stream in real time.
Optionally, the subtitle generating module includes:
The recognition submodule is used for carrying out voice recognition on the live broadcast audio stream to obtain at least one piece of text information and a timestamp corresponding to the text information;
And the subtitle generating submodule is used for generating subtitle information corresponding to the live audio stream by adopting the at least one text message and the timestamp corresponding to the text message.
Optionally, the apparatus further comprises:
A key information determining module, configured to determine at least one piece of key information in the at least one piece of text information;
And a key mark generating module, configured to generate key mark information by using the timestamp of the key information.
Optionally, the obtaining module includes:
The acquisition submodule is used for acquiring a live broadcast video stream and a live broadcast audio stream sent by a preset live broadcast acquisition client; the live broadcast acquisition client is used for acquiring live broadcast video streams acquired by preset video acquisition equipment in real time and acquiring live broadcast audio streams acquired by preset audio acquisition equipment in real time.
Optionally, the apparatus further comprises:
The interactive receiving module is used for receiving the interactive information sent by the playing terminal;
The interactive transmission module is used for transmitting the interactive information to the live broadcast acquisition client; the live broadcast acquisition client is used for displaying the interactive information.
Optionally, the interactive transmission module includes:
The statistic submodule is used for classifying the interaction information to obtain classified statistic information of the interaction information;
And the interactive sending submodule is used for sending the interactive information and the classification statistical information of the interactive information to the live broadcast acquisition client.
For the device embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, refer to the partial description of the method embodiment.
Referring to fig. 6, a schematic diagram of an embodiment of a real-time playing system according to an embodiment of the present invention is shown, where the system includes a live broadcast acquisition client 601, a multimedia processing server 602, and a playing terminal 603.
The live broadcast acquisition client 601 is configured to send a live broadcast video stream and a live broadcast audio stream acquired in real time to the multimedia processing server 602;
The multimedia processing server 602 is configured to perform speech recognition processing on the live audio stream to obtain subtitle information corresponding to the live audio stream; merging the live video stream, the live audio stream and the subtitle information to obtain a merged media stream; sending the merged media stream to the play terminal 603;
The playing terminal 603 is configured to play the obtained merged media stream in real time.
In the embodiment of the present invention, the live broadcast acquisition client may be connected to a preset video acquisition device and a preset audio acquisition device, and acquire a live broadcast video stream acquired by the video acquisition device in real time, acquire a live broadcast audio stream acquired by the audio acquisition device in real time, and send the live broadcast video stream and the live broadcast audio stream acquired in real time to the multimedia processing server 602.
In the embodiment of the present invention, the multimedia processing server 602 continuously obtains the live video stream and the live audio stream collected in real time by the live broadcast acquisition client 601, performs voice recognition processing on the live audio stream acquired in real time, recognizes the voice information contained in it, and converts that voice information into text to obtain the subtitle information corresponding to the live audio stream. It then merges the live video stream, the live audio stream, and the subtitle information into a merged media stream containing all three, and sends the merged media stream to the playing terminal 603.
In the embodiment of the present invention, the playing terminal 603 may play the obtained merged media stream in real time, thereby playing the merged media stream with the subtitle information in real time. The listener can watch the speaker's real-time lecture content through the playing terminal and better understand it with the help of the subtitle information, improving the listener's listening efficiency.
An embodiment of the present invention further provides an apparatus, including:
One or more processors; and
One or more machine-readable media having instructions stored thereon, which when executed by the one or more processors, cause the apparatus to perform methods as described in embodiments of the invention.
Embodiments of the invention also provide one or more machine-readable media having instructions stored thereon, which when executed by one or more processors, cause the processors to perform the methods described in embodiments of the invention.
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the embodiments of the invention.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or terminal that comprises the element.
The real-time playing method and the real-time playing device provided by the invention are described in detail, and the principle and the implementation mode of the invention are explained by applying specific examples, and the description of the embodiments is only used for helping to understand the method and the core idea of the invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.
Claims (10)
1. A real-time playing method, the method comprising:
Acquiring live broadcast video streams and live broadcast audio streams acquired in real time;
Performing voice recognition processing on the live audio stream to obtain subtitle information corresponding to the live audio stream;
Merging the live video stream, the live audio stream and the subtitle information to obtain a merged media stream;
Sending the merged media stream to a preset playing terminal; the playing terminal is used for playing the obtained merged media stream in real time.
2. The method according to claim 1, wherein the step of performing speech recognition processing on the live audio stream to obtain subtitle information corresponding to the live audio stream comprises:
Performing voice recognition on the live audio stream to acquire at least one text message and a timestamp corresponding to the text message;
And generating subtitle information corresponding to the live audio stream by adopting the at least one piece of text information and the timestamp corresponding to the text information.
3. The method of claim 2, further comprising:
Determining at least one piece of key information in the at least one piece of text information;
Generating key mark information by adopting the time stamp of the key information;
And sending the key mark information to the playing terminal, wherein the playing terminal is used for marking the time point of the merged media stream corresponding to the key mark information.
4. The method of claim 1, wherein the step of obtaining live video streams and live audio streams captured in real time comprises:
Acquiring a live broadcast video stream and a live broadcast audio stream sent by a preset live broadcast acquisition client; the live broadcast acquisition client is used for acquiring live broadcast video streams acquired by preset video acquisition equipment in real time and acquiring live broadcast audio streams acquired by preset audio acquisition equipment in real time.
5. The method of claim 1 or 4, further comprising:
Receiving interactive information sent by the playing terminal;
Sending the interaction information to the live broadcast acquisition client; the live broadcast acquisition client is used for displaying the interactive information.
6. The method of claim 5, wherein the step of sending the interaction information to the live capture client comprises:
Classifying the interaction information to obtain classified statistical information of the interaction information;
And sending the interactive information and the classification statistical information of the interactive information to the live broadcast acquisition client.
7. A real-time playback apparatus, the apparatus comprising:
The acquisition module is used for acquiring live broadcast video streams and live broadcast audio streams which are acquired in real time;
The subtitle generating module is used for carrying out voice recognition processing on the live broadcast audio stream to obtain subtitle information corresponding to the live broadcast audio stream;
A merging module, configured to merge the live video stream, the live audio stream, and the subtitle information to obtain a merged media stream;
The sending module is used for sending the merged media stream to a preset playing terminal; the playing terminal is used for playing the obtained merged media stream in real time.
8. A real-time playing system is characterized by comprising a live broadcast acquisition client, a multimedia processing server and a playing terminal;
The live broadcast acquisition client is used for sending live broadcast video streams and live broadcast audio streams acquired in real time to the multimedia processing server;
The multimedia processing server is used for carrying out voice recognition processing on the live audio stream to obtain subtitle information corresponding to the live audio stream; merging the live video stream, the live audio stream and the subtitle information to obtain a merged media stream; sending the merged media stream to the playing terminal;
The playing terminal is used for playing the obtained merged media stream in real time.
9. An apparatus, comprising:
One or more processors; and
One or more machine-readable media having instructions stored thereon that, when executed by the one or more processors, cause the apparatus to perform the method of one or more of claims 1-6.
10. One or more machine readable media having instructions stored thereon that, when executed by one or more processors, cause the processors to perform the method of one or more of claims 1-6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010313403.7A CN111479124A (en) | 2020-04-20 | 2020-04-20 | Real-time playing method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111479124A true CN111479124A (en) | 2020-07-31 |
Family
ID=71755460
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010313403.7A Pending CN111479124A (en) | 2020-04-20 | 2020-04-20 | Real-time playing method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111479124A (en) |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106571137A (en) * | 2016-10-28 | 2017-04-19 | 努比亚技术有限公司 | Terminal voice dotting control device and method |
CN106993229A (en) * | 2017-03-02 | 2017-07-28 | 合网络技术(北京)有限公司 | Interactive attribute methods of exhibiting and device |
US20180041783A1 (en) * | 2016-08-05 | 2018-02-08 | Alibaba Group Holding Limited | Data processing method and live broadcasting method and device |
CN108366286A (en) * | 2018-01-30 | 2018-08-03 | 河南职业技术学院 | Barrage display methods and barrage display device |
CN108401192A (en) * | 2018-04-25 | 2018-08-14 | 腾讯科技(深圳)有限公司 | Video stream processing method, device, computer equipment and storage medium |
CN109275012A (en) * | 2018-09-11 | 2019-01-25 | 张家港市鸿嘉数字科技有限公司 | Live streaming barrage display method and device
CN109408639A (en) * | 2018-10-31 | 2019-03-01 | 广州虎牙科技有限公司 | Barrage classification method, device, equipment and storage medium
CN111010614A (en) * | 2019-12-26 | 2020-04-14 | 北京奇艺世纪科技有限公司 | Method, device, server and medium for displaying live caption |
- 2020-04-20: CN202010313403.7A (publication CN111479124A) filed in China; status: active, Pending
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112217946A (en) * | 2020-09-02 | 2021-01-12 | 贵阳朗玛信息技术股份有限公司 | Audio carousel manufacturing method and device |
CN112511910A (en) * | 2020-11-23 | 2021-03-16 | 浪潮天元通信信息系统有限公司 | Real-time subtitle processing method and device |
CN112616062A (en) * | 2020-12-11 | 2021-04-06 | 北京有竹居网络技术有限公司 | Subtitle display method and device, electronic equipment and storage medium |
CN112616062B (en) * | 2020-12-11 | 2023-03-10 | 北京有竹居网络技术有限公司 | Subtitle display method and device, electronic equipment and storage medium |
CN113099282A (en) * | 2021-03-30 | 2021-07-09 | 腾讯科技(深圳)有限公司 | Data processing method, device and equipment |
CN114339292A (en) * | 2021-12-31 | 2022-04-12 | 安徽听见科技有限公司 | Method, device, storage medium and equipment for auditing and intervening live stream |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112399133B (en) | Conference sharing method and device | |
CN111479124A (en) | Real-time playing method and device | |
CN109275046B (en) | Teaching data labeling method based on double video acquisition | |
US8630854B2 (en) | System and method for generating videoconference transcriptions | |
US10225625B2 (en) | Caption extraction and analysis | |
US9247205B2 (en) | System and method for editing recorded videoconference data | |
US20150296181A1 (en) | Augmenting web conferences via text extracted from audio content | |
CN112653902B (en) | Speaker recognition method and device and electronic equipment | |
US20180012599A1 (en) | Metatagging of captions | |
CN112423081B (en) | Video data processing method, device and equipment and readable storage medium | |
US20120259924A1 (en) | Method and apparatus for providing summary information in a live media session | |
Fink et al. | Social-and interactive-television applications based on real-time ambient-audio identification | |
CN116368785A (en) | Intelligent query buffering mechanism | |
US20220264193A1 (en) | Program production apparatus, program production method, and recording medium | |
CN114341866A (en) | Simultaneous interpretation method, device, server and storage medium | |
CN114390306A (en) | Live broadcast interactive abstract generation method and device | |
US10593366B2 (en) | Substitution method and device for replacing a part of a video sequence | |
CN117768597A (en) | Guide broadcasting method, device, equipment and storage medium | |
TW202228096A (en) | Smart language learning method and system thereof combining image recognition and speech recognition | |
US10657202B2 (en) | Cognitive presentation system and method | |
CN108495163B (en) | Video barrage reading device, system, method and computer readable storage medium | |
Meza et al. | Between science popularization and motivational infotainment: Visual production, discursive patterns and viewer perception of TED Talks videos | |
CN115714847A (en) | Method, device, equipment and medium for showing speaker in conference | |
Fogarolli et al. | Searching information in a collection of video-lectures | |
CN113986443B (en) | Interactive recording and broadcasting method, device and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | | Application publication date: 20200731 |