CN116112718A - Multi-person mic-connecting method and device - Google Patents

Multi-person mic-connecting method and device

Info

Publication number
CN116112718A
Authority
CN
China
Prior art keywords
audio
stream
current time
user
mic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211636748.1A
Other languages
Chinese (zh)
Inventor
张捷
彭卓腾
文峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Bilibili Technology Co Ltd
Original Assignee
Shanghai Bilibili Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Bilibili Technology Co Ltd
Priority to CN202211636748.1A
Publication of CN116112718A
Legal status: Pending

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/20: Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N 21/21: Server components or server architectures
    • H04N 21/218: Source of audio or video content, e.g. local disk arrays
    • H04N 21/2187: Live feed
    • H04N 21/40: Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/43: Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N 21/4302: Content synchronisation processes, e.g. decoder synchronisation
    • H04N 21/4307: Synchronising the rendering of multiple content streams or additional data on devices, e.g. synchronisation of audio on a mobile phone with the video output on the TV screen
    • H04N 21/47: End-user applications
    • H04N 21/478: Supplemental services, e.g. displaying phone caller identification, shopping application
    • H04N 21/4788: Supplemental services communicating with other users, e.g. chatting
    • H04R: LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R 1/00: Details of transducers, loudspeakers or microphones
    • H04R 1/08: Mouthpieces; Microphones; Attachments therefor

Abstract

The application provides a multi-person mic-connecting (co-streaming) method and device. The method comprises: acquiring multimedia streams corresponding to at least two mic-connected users, wherein the multimedia streams comprise audio streams; determining the audio attribute of the current time according to the audio streams of the mic-connected users, wherein the audio attribute of the current time indicates the mic-connected user who is currently speaking; writing the audio attribute of the current time into the corresponding time of a target video stream, and mixing the audio streams and the target video stream to obtain a mixed stream; and pushing the mixed stream to the client, wherein the video stream in the mixed stream instructs the client to determine the audio attribute of the current time and mark it in the current mic-connecting interface. By writing the audio attribute into the target video stream, the audio attribute is delivered to the client together with the audio stream, so the audio attribute suffers no delay, the currently speaking mic-connected user can be indicated accurately, and the user experience of the mic-connected users is guaranteed.

Description

Multi-person mic-connecting method and device
Technical Field
The application relates to the field of computer technology, and in particular to a multi-person mic-connecting method. The application also relates to a multi-person mic-connecting apparatus, a computing device, and a computer-readable storage medium.
Background
With the rapid development of computer and internet technologies, online services have grown quickly, and more and more users interact online to meet the demands of daily life, study, and work. For example, to satisfy viewers' desire for interaction while watching a live broadcast, an anchor can interact with other anchors or with audience members by connecting mics, improving the live-broadcast experience; as another example, multiple users can join an online conference by connecting mics, so that the conference can be held anytime and anywhere, reducing meeting costs.
In the prior art, in a multi-person mic-connecting scene, the client plays a mixed stream of the audio and video of multiple users. The audio attributes are stored on the server side, and when the client receives the mixed stream of the mic-connected users, it establishes a signaling channel with the server to acquire the corresponding audio attributes and render the various effects.
However, in this approach the audio attributes are obtained over a signaling channel to the server, by polling requests or by server push. The audio attributes received by the client therefore cannot be exactly aligned in time with the audio stream, so the audio attributes and the audio stream are never fully synchronized, which degrades the experience of the mic-connected users.
Disclosure of Invention
In view of the foregoing, embodiments of the present application provide a multi-person mic-connecting method. The application also relates to a multi-person mic-connecting apparatus, a computing device, and a computer-readable storage medium, so as to solve the technical problem in the prior art that the audio attributes are delayed and cannot be fully synchronized with the audio stream.
According to a first aspect of embodiments of the present application, there is provided a multi-person mic-connecting method applied to a content distribution network, comprising:
acquiring multimedia streams corresponding to at least two mic-connected users, wherein the multimedia streams comprise audio streams;
determining the audio attribute of the current time according to the audio streams of the mic-connected users, wherein the audio attribute of the current time indicates the mic-connected user who is currently speaking;
writing the audio attribute of the current time into the corresponding time of a target video stream, and mixing the audio streams and the target video stream to obtain a mixed stream;
pushing the mixed stream to the client, wherein the video stream in the mixed stream instructs the client to determine the audio attribute of the current time and mark it in the current mic-connecting interface.
According to a second aspect of embodiments of the present application, there is provided a multi-person mic-connecting method applied to a client, comprising:
receiving a mixed stream pushed by a content distribution network, wherein the mixed stream comprises an audio stream and a target video stream, the target video stream carries the audio attribute of the current time written by the content distribution network, the audio attribute of the current time is determined based on the audio streams of the mic-connected users, and the audio attribute of the current time indicates the mic-connected user who is currently speaking;
obtaining the target video stream from the mixed stream, and extracting the audio attribute of the current time from the target video stream;
marking the audio attribute of the current time in the current mic-connecting interface.
According to a third aspect of embodiments of the present application, there is provided a content distribution network comprising:
a first acquisition module configured to acquire multimedia streams corresponding to at least two mic-connected users, wherein the multimedia streams comprise audio streams;
a determining module configured to determine the audio attribute of the current time according to the audio streams of the mic-connected users, wherein the audio attribute of the current time indicates the mic-connected user who is currently speaking;
a mixing module configured to write the audio attribute of the current time into the corresponding time of a target video stream, and to mix the audio streams and the target video stream to obtain a mixed stream;
a pushing module configured to push the mixed stream to the client, wherein the video stream in the mixed stream instructs the client to determine the audio attribute of the current time and mark it in the current mic-connecting interface.
According to a fourth aspect of an embodiment of the present application, there is provided a client, including:
a receiving module configured to receive a mixed stream pushed by the content distribution network, wherein the mixed stream comprises an audio stream and a target video stream, the target video stream carries the audio attribute of the current time written by the content distribution network, the audio attribute of the current time is determined based on the audio streams of the mic-connected users, and the audio attribute of the current time indicates the mic-connected user who is currently speaking;
an extraction module configured to obtain the target video stream from the mixed stream and to extract the audio attribute of the current time from the target video stream;
a marking module configured to mark the audio attribute of the current time in the current mic-connecting interface.
According to a fifth aspect of embodiments of the present application, there is provided a multi-person mic-connecting system comprising a content distribution network and a client;
the content distribution network is configured to acquire multimedia streams corresponding to at least two mic-connected users, wherein the multimedia streams comprise audio streams; determine the audio attribute of the current time according to the audio streams of the mic-connected users, wherein the audio attribute of the current time indicates the mic-connected user who is currently speaking; write the audio attribute of the current time into the corresponding time of a target video stream, and mix the audio streams and the target video stream to obtain a mixed stream; and push the mixed stream to the client;
the client is configured to receive the mixed stream pushed by the content distribution network; obtain the target video stream from the mixed stream and extract the audio attribute of the current time from the target video stream; and mark the audio attribute of the current time in the current mic-connecting interface.
According to a sixth aspect of embodiments of the present application, there is provided a computing device comprising:
a memory and a processor;
the memory is configured to store computer-executable instructions, and the processor is configured to execute the computer-executable instructions to implement the operation steps of any of the multi-person mic-connecting methods described above.
According to a seventh aspect of embodiments of the present application, there is provided a computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the steps of any of the multi-person mic-connecting methods described above.
According to the multi-person mic-connecting method provided by the embodiments of the application, the content distribution network can acquire the multimedia streams corresponding to at least two mic-connected users, wherein the multimedia streams comprise audio streams; determine the audio attribute of the current time according to the audio streams of the mic-connected users, wherein the audio attribute of the current time indicates the mic-connected user who is currently speaking; write the audio attribute of the current time into the corresponding time of a target video stream, and mix the audio streams and the target video stream to obtain a mixed stream; and push the mixed stream to the client, wherein the video stream in the mixed stream instructs the client to determine the audio attribute of the current time and mark it in the current mic-connecting interface.
In this way, the audio streams of at least two mic-connected users are analyzed to determine the audio attribute of the current time, which indicates the mic-connected user who is currently speaking. The audio attribute of the current time is then written into the corresponding time of the target video stream, and the audio streams and the target video stream are packed and mixed together and pushed to the client. When the client receives the mixed stream, it can obtain both the audio stream to be played and the audio attribute of the current time simply by parsing the mixed stream, and mark the currently speaking mic-connected user in the current mic-connecting interface for the viewers. Because the audio attribute of the current time is written into the target video stream, it is delivered to the client together with the audio stream; the attribute suffers no delay and stays fully synchronized with the audio stream, ensuring that the speaker marked in the mic-connecting interface is the user who is actually speaking, and thus guaranteeing the user experience of the mic-connected users.
Drawings
FIG. 1 is a schematic diagram of a dual-user mic-connecting process according to an embodiment of the present application;
FIG. 2 is a flowchart of a multi-person mic-connecting method according to an embodiment of the present application;
FIG. 3a is a flowchart of another multi-person mic-connecting method according to an embodiment of the present application;
FIG. 3b is a schematic diagram of a mic-connecting interface according to an embodiment of the present application;
FIG. 4 is a schematic diagram of the multi-person mic-connecting method applied to a two-user audio-only mic-connecting scene according to an embodiment of the present application;
FIG. 5 is a schematic structural diagram of a content distribution network according to an embodiment of the present application;
FIG. 6 is a schematic structural diagram of a client according to an embodiment of the present application;
FIG. 7 is a schematic diagram of a multi-person mic-connecting system according to an embodiment of the present application;
FIG. 8 is a block diagram of a computing device according to an embodiment of the present application.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application. The application can, however, be embodied in many ways other than those described herein, and those skilled in the art can make similar generalizations without departing from the spirit of the application; the application is therefore not limited to the specific embodiments disclosed below.
The terminology used in one or more embodiments of the application is for the purpose of describing particular embodiments only and is not intended to be limiting. As used in one or more embodiments of this application and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used in one or more embodiments of the present application refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It should be understood that, although the terms first, second, etc. may be used in one or more embodiments of the present application to describe various information, the information should not be limited by these terms. These terms are only used to distinguish information of the same type from one another. For example, a first may also be referred to as a second and, similarly, a second may be referred to as a first, without departing from the scope of one or more embodiments of the present application. Depending on the context, the word "if" as used herein may be interpreted as "when", "upon", or "in response to determining".
First, the terms used in one or more embodiments of the present application are explained.
Anchor: the user who initiates a live broadcast; viewers in the anchor's live room can see the live picture and hear the live sound.
Mic-connected user: a user who interacts with the initiator by connecting mics. In a conference scene, the mic-connected users are the participants, whose pictures can be seen and whose speech can be heard; in a live-broadcast scene, a mic-connected user is a user who interacts with the anchor by connecting mics, and the audience in the live room can see and hear both sides during the session. A mic-connected user may be another anchor or an ordinary viewer in the live room, and there may be one or several of them.
Stream: an audio/video stream, typically containing both video pictures and audio, but possibly containing only video or only audio.
Live stream: the stream distributed to the live room, i.e., what the audience actually sees. In the non-mic-connected state it is the same as the anchor's stream, i.e., the anchor's live picture/sound; in the mic-connected state it is the mixed stream containing the pictures/sounds of both the anchor and the mic-connected users.
Mixed stream: the multiple audio/video streams of the anchor and the mic-connected users mixed together into a single audio/video stream, which is distributed to the live room as the live stream for the audience.
Mock video: video that conforms to a video standard but has no valid video content, used only to simulate real video so that SEI can be used to convey information.
SEI: supplemental enhancement information, part of the coded bitstream; additional information defined in video standards (e.g., H.264/H.265) that supports the insertion of user-defined data.
dBFS: decibels relative to full scale, a volume representation commonly used in digital audio systems, typically ranging from -96 to 0 dBFS, with 0 representing the maximum volume.
GOP: group of pictures, a common term in video coding that refers to the interval between two I-frames. An I-frame (intra picture) is typically the first frame of each GOP; it is moderately compressed and serves as a reference point for random access.
When several people are connected by mic, the client listens to the mixed stream of the multiple speakers and, by itself, only knows that someone is speaking; it can tell neither who is speaking nor how loud the speech is. For this reason, the audio attributes corresponding to the audio (volume, identifier of the speaker, and the like) can be stored on the server side during mixing; when the client finally renders the session, it obtains the audio attributes from the server and, while playing the audio stream, renders effects such as a speaker icon, a volume bar, or an avatar effect.
Taking a live-broadcast scenario as an example, FIG. 1 is a schematic diagram of a dual-user mic-connecting process according to an embodiment of the present application. As shown in FIG. 1, the anchor side pushes the anchor stream to the mixing module of the content distribution network, and the mic-connected user side pushes the mic-connected user stream to the same mixing module. The mixing module mixes the anchor stream and the mic-connected user stream to generate a live stream, determines the audio attributes corresponding to the two streams, and stores the audio attributes on the server. The content distribution network pushes the live stream to the live room for users to watch or listen to; the live room queries the server for the corresponding audio attributes and, combining them, renders information such as the current speaker and the volume in the mic-connecting interface.
As can be seen from the above, the audio attributes are obtained by establishing a signaling channel with the server, either by polling requests or by server push, whereas the audio stream is delivered via a content delivery network (CDN). A typical delay for an audio stream distributed via the CDN is about 3-10 seconds, while a typical delay for signaling over the signaling channel is about a hundred milliseconds to two seconds. The audio attributes received by the client therefore cannot be exactly aligned in time with the audio stream, which causes visible desynchronization; for example, one person is speaking while the icon of the previous speaker is still displayed.
Therefore, the embodiments of the present application provide a multi-person mic-connecting method in which the audio attribute of the current time is written into the video stream, and the video stream and the audio stream are packed and mixed together before being distributed to the client. When the client receives the mixed stream, parsing it yields the audio stream to be played together with the audio attribute of the current time, already fully synchronized; this guarantees that the sound the audience hears matches the speaker icons/volume bars they see and improves the user experience.
The present application provides a multi-person mic-connecting method, and also relates to a multi-person mic-connecting apparatus, a computing device, and a computer-readable storage medium, which are described in detail in the following embodiments.
FIG. 2 shows a flowchart of a multi-person mic-connecting method according to an embodiment of the present application, applied to a content distribution network and specifically comprising the following steps 202-208:
step 202: and acquiring multimedia streams corresponding to at least two wheat-connected users, wherein the multimedia streams comprise audio streams.
Specifically, the headset users are users needing interaction, including headset initiating users and headset participating users. In an online conference scene, a wheat-connected user is each participant participating in the online conference; in a live broadcast scene, a wheat-connecting user is a host for initiating the wheat connection, and spectators or other host players participating in the wheat connection.
In addition, the multimedia stream is multimedia data of each communication user, and needs to be transmitted to other communication user terminals or user terminals of audiences for playing, and the multimedia stream at least comprises an audio stream, wherein the audio stream is audio data needed to be played by a subsequent user terminal.
It should be noted that, the multimedia streams of each wheat-connected user may be pushed to the content distribution network, and mixed by the content distribution network, and then pushed to each user end for playing after obtaining the mixed stream, where the user end may be a client end of the wheat-connected user participating in the wheat connection, or may be a client end of a watching user watching the wheat connection.
Taking 2 users connected with wheat in a live broadcast scene as an example, a main broadcasting end initiating the wheat connection collects multimedia data of the main broadcasting of the wheat connection, generates a multimedia stream, and pushes the multimedia stream to a mixed stream module of a content distribution network; the communication user terminal collects the multimedia data of the communication user, generates multimedia streams, pushes the multimedia streams to the mixed stream module of the content distribution network, and the content distribution network can acquire the multimedia streams corresponding to the two communication users so as to process the multimedia streams subsequently.
Step 204: determine the audio attribute of the current time according to the audio streams of the mic-connected users, wherein the audio attribute of the current time indicates the mic-connected user who is currently speaking.
It should be noted that audio has several associated attributes, such as the sender of the audio, i.e., the user identifier of the corresponding mic-connected user, and the volume information of the audio. For the audio stream of each mic-connected user, the audio attribute of the current time, i.e., the user identifier and volume information of the corresponding user, can be determined from the audio stream. The audio attribute of the current time can thus include the volume information of the audio and the user identifier of the corresponding mic-connected user, so as to indicate who is currently speaking and at what volume.
In practical applications, a sound-information acquisition sub-module may be added to the mixing module of the content distribution network; it is responsible for extracting the required audio attributes, such as the corresponding user identifiers and volume information, from the input audio streams. The volume information may use any common representation, for example dBFS, which is widely used in digital audio systems (0 dB is the maximum value and -96 dB the minimum), or a more intuitive numeric scale such as 0 to 100.
In an optional implementation of this embodiment, the audio attributes may be sampled from the audio streams at a configured acquisition frequency. That is, determining the audio attribute of the current time according to the audio stream of each mic-connected user may be implemented as follows:
acquiring a set insertion frequency, and determining a corresponding information-acquisition frequency according to the set insertion frequency;
determining the corresponding information-acquisition times according to the information-acquisition frequency;
when an information-acquisition time is reached, determining the audio attribute of the current time according to the audio stream of each mic-connected user.
Specifically, the set insertion frequency is a preset frequency at which the audio attribute of the current time is inserted into the target video stream, e.g., once per second or 10 times per second; the information-acquisition frequency is the frequency at which the audio attribute of the current time is extracted from the audio streams.
In practice, the set insertion frequency determines the information-acquisition frequency. The two can be identical, e.g., one insertion per second and one acquisition per second. Alternatively, one can be a multiple of the other: with an insertion frequency of once per second but an acquisition frequency of 10 times per second, only one of every 10 acquired audio attributes is inserted into the target video stream; with an insertion frequency of 10 times per second but an acquisition frequency of once per second, each acquired audio attribute is inserted into the target video stream 10 times.
It should be noted that if an audio attribute is extracted from the audio stream but never inserted into the target video stream, the extraction is wasted work and squanders processing resources; it is therefore preferable for the set insertion frequency and the information-acquisition frequency to be the same.
In a specific implementation, the sound-information acquisition sub-module obtains the set insertion frequency and determines the corresponding information-acquisition frequency from it; it then derives the information-acquisition times from that frequency and, whenever one of them is reached, determines the audio attribute of the current time from the audio stream of each mic-connected user. For example, with an acquisition frequency of once per second, the acquisition times are at 1 second, 2 seconds, 3 seconds, and so on; the current time is monitored, and whenever an acquisition time is reached the audio attribute of the current time is extracted from the audio stream of each mic-connected user.
In the embodiments of the application, the set insertion frequency can be adjusted as needed, e.g., once per second or 10 times per second. It can also be linked to the information-acquisition frequency of the sound-information acquisition sub-module, i.e., the set insertion frequency determines the acquisition frequency: with an insertion frequency of once per second, the sub-module likewise acquires the audio attribute only once per second, reducing system overhead, as sketched below.
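To make the linkage between the two frequencies concrete, the following is a minimal sketch of how the sound-information acquisition sub-module could derive its acquisition schedule from the set insertion frequency. The patent does not specify an implementation, so the class name, method names, and the use of a monotonic clock are all illustrative assumptions.

```python
import time

class SoundInfoCollector:
    """Illustrative sketch of the sound-information acquisition sub-module.

    The set insertion frequency drives the information-acquisition
    frequency (kept identical here, as the text recommends), so no audio
    attribute is computed only to be discarded.
    """

    def __init__(self, insert_hz: float = 1.0):
        self.insert_hz = insert_hz
        self.acquire_hz = insert_hz        # linked: one acquisition per insertion
        self._next_t = time.monotonic()    # first information-acquisition time

    def acquisition_due(self) -> bool:
        """Return True whenever an information-acquisition time is reached."""
        if time.monotonic() >= self._next_t:
            self._next_t += 1.0 / self.acquire_hz
            return True
        return False
```

A mixing loop would call acquisition_due() once per incoming audio frame and, when it fires, extract the audio attribute of the current time from each mic-connected user's stream.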
In an optional implementation of this embodiment, the volume information of the currently transmitted audio may be computed from the phonemes in the audio stream to determine the corresponding audio attribute. That is, determining the audio attribute of the current time according to the audio stream of each mic-connected user may be implemented as follows:
acquiring, from a target audio stream, the phoneme information of the currently transmitted audio and the user identifier of the target mic-connected user corresponding to the target audio stream, wherein the target audio stream is the audio stream of any mic-connected user;
determining the volume information of the currently transmitted audio according to the phoneme information;
taking the volume information and the user identifier of the target mic-connected user as the audio attribute of the current time.
Specifically, a phoneme is the smallest unit that makes up a syllable, the smallest linear speech segment divided from the perspective of sound quality, and a concretely existing physical phenomenon. For the target audio stream, the phoneme information of the currently transmitted audio can be obtained from it and the volume information of the current time computed from that phoneme information; the volume information combined with the corresponding user identifier then forms the audio attribute of the current time. In the same way, the audio stream of every mic-connected user can be taken in turn as the target audio stream, its current volume computed, and the volume plus the user identifier of the corresponding mic-connected user taken as its audio attribute.
That is, the audio attribute of the current time includes the volume information of every mic-connected user at the current time. For example, the audio attribute of the current time might be: user 1, volume 0; user 2, volume 50; user 3, volume 10; user 4, volume 0. This audio attribute can later be inserted into the target video stream and pushed to the client in synchrony.
It should be noted that the sound-information acquisition sub-module obtains, from the target audio stream, the phoneme information of the currently transmitted audio and the user identifier of the corresponding target mic-connected user; determines the volume information of the currently transmitted audio from the phoneme information; and then takes the volume information and the user identifier as the audio attribute of the current time. The audio attribute of the current time thus conveys the volume of every mic-connected user at that moment, indicating who is speaking and at what volume. It can be inserted into the target video stream and delivered to the client together with the audio stream, so the client obtains the audio stream to be played and the audio attribute of the current time simultaneously, with no delay on the attribute, as illustrated below.
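Before looking at how the volume itself is computed, it helps to fix a concrete shape for one such audio-attribute snapshot. The patent prescribes no serialization; the JSON layout and the field names below are purely hypothetical, chosen to mirror the user 1-4 example above.

```python
import json

# Hypothetical snapshot for the example above; "ts_ms", "uid", and "vol"
# are invented field names, not part of the patent.
snapshot = {
    "ts_ms": 0,  # time within the stream to which this snapshot belongs
    "speakers": [
        {"uid": "user1", "vol": 0},
        {"uid": "user2", "vol": 50},
        {"uid": "user3", "vol": 10},
        {"uid": "user4", "vol": 0},
    ],
}
payload = json.dumps(snapshot).encode("utf-8")  # bytes to be carried in SEI
```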
Volume represents the intensity of sound, also called loudness or energy: the subjective impression the human ear has of how strong the heard sound is. Its objective measure is the amplitude of the sound signal, which originates in the sound pressure produced when an object vibrates and transmits its vibration energy through a medium.
In a specific implementation, the volume information of the currently transmitted audio can be estimated from the signal amplitudes within a phoneme, for example by either of the following two methods: taking the sum of the absolute signal values of all phonemes in the currently transmitted audio as its volume information; or taking the sum of the squared signal values of all phonemes in the currently transmitted audio, taking the base-10 logarithm, and multiplying by 10. Of course, the volume information may also be computed in other ways, which this embodiment does not limit. Both methods are sketched below.
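A sketch of both calculation methods follows, assuming the "phonemes" arrive as a short frame of PCM sample values; the function names and the optional 0-100 mapping are illustrative, not taken from the patent.

```python
import math

def volume_abs_sum(samples):
    """Method 1: sum of the absolute signal values in the frame."""
    return sum(abs(s) for s in samples)

def volume_log_energy(samples):
    """Method 2: 10 * log10 of the summed squared signal values."""
    energy = sum(s * s for s in samples)
    return 10 * math.log10(energy) if energy > 0 else float("-inf")

def dbfs_to_percent(dbfs, floor=-96.0):
    """Optionally map a dBFS-style value (floor..0) onto the 0-100 scale."""
    return max(0.0, min(100.0, (dbfs - floor) * 100.0 / -floor))
```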
According to this embodiment, the phoneme information of the currently transmitted audio is obtained from the target audio stream, and the volume information of the currently transmitted audio is determined from it, yielding the audio attribute of the current time. The audio attribute can then be inserted into the target video stream and delivered to the client in synchrony, so the attribute suffers no delay and stays fully synchronized with the audio stream, ensuring that the currently speaking mic-connected user marked in the mic-connecting interface is the user who is actually speaking, and thus guaranteeing the user experience.
In an optional implementation of this embodiment, after determining the audio attribute of the current time according to the audio stream of each mic-connected user, the method further includes:
creating a virtual video stream and taking it as the target video stream, wherein the virtual video stream is a video stream that contains no valid video content.
It should be noted that when the multimedia streams contain only audio streams, i.e., in an audio-only mic-connecting scene, a virtual video stream can be created in order to carry the audio attribute of the current time of the audio streams, and this virtual video stream serves as the target video stream into which the acquired audio attribute is inserted.
The virtual video stream has no valid video content but conforms to a video standard; the basic requirements are standards compliance (e.g., H.264/H.265) and the lowest possible bitrate. In other words, the virtual video stream needs no real, meaningful picture content; it only needs to accept user-defined supplemental information, so that the audio attribute of the current time can be inserted into it and carried to the client within the mixed stream.
For example, the virtual video stream may be a mock video. In testing, a mock object is created to simulate the behavior of an object that is hard to construct or obtain; it is a stand-in for the real object. Likewise, a mock video is a video that conforms to a video standard but has no valid content, used only to simulate real video so that SEI can be used to convey information.
In this embodiment, a video-creation sub-module is added to the mixing module of the content distribution network. In an audio-only mic-connecting scene, this sub-module generates a video stream with no valid content that nevertheless conforms to the video standard (e.g., H.264/H.265) at the lowest possible bitrate; for instance, a pure black picture at a resolution of 32x32 (or lower) can be encoded to achieve an extremely low bitrate. Thus, even in an audio-only scene a virtual video stream is created automatically, the audio attribute of the current time is conveyed through the video SEI, the attribute suffers no delay and stays fully synchronized with the audio stream the client receives, and the experience of the mic-connected users is guaranteed.
In an optional implementation of this embodiment, creating the virtual video stream is implemented as follows:
pre-encoding virtual video data of a set duration;
cyclically copying the virtual video data to obtain the virtual video stream.
It should be noted that video encoding is an expensive operation, so creating a virtual video stream from scratch would consume considerable performance; but the virtual video stream is never shown to the user and needs no real picture content. To save performance, a segment of virtual video data of a set duration can therefore be encoded once in advance and then copied cyclically, over and over, to obtain the virtual video stream.
In practice the set duration can be chosen freely, e.g., 1 second or 2 seconds, or it can equal one GOP of the video, i.e., the interval between two I-frames. The video-creation sub-module pre-encodes a segment of black video of the set duration; generating the mock video in real time then reduces to simply copying that segment in a loop, which greatly reduces the cost of producing the virtual video stream, as sketched below.
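The loop-copy idea can be sketched as follows. The pre-encoded frames are assumed to be one self-contained GOP of black 32x32 video produced once at startup (the encoder call itself is omitted); the class and field names are illustrative.

```python
class MockVideoSource:
    """Illustrative video-creation sub-module for audio-only scenes."""

    def __init__(self, pre_encoded_frames, frame_interval_ms):
        # One GOP of coded frames, I-frame first, encoded once in advance.
        self.frames = pre_encoded_frames
        self.interval = frame_interval_ms
        self._idx = 0
        self._pts = 0

    def next_frame(self):
        """Emit the next coded frame by cyclically copying the GOP.

        No per-frame encoding happens here; the caller only has to
        re-stamp the timestamp and pack the frame into the mixed stream.
        """
        data = self.frames[self._idx]
        self._idx = (self._idx + 1) % len(self.frames)
        pts = self._pts
        self._pts += self.interval
        return pts, data
```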
In an optional implementation of this embodiment, the multimedia streams further comprise video streams; after determining the audio attribute of the current time according to the audio stream of each mic-connected user, the method further includes:
acquiring a video stream from the multimedia streams;
taking the acquired video stream as the target video stream.
It should be noted that the multi-person mic-connecting method provided by this embodiment is applicable not only to audio-only scenes but also to audio/video mic-connecting scenes. In an audio/video scene, the multimedia streams obtained by the mixing module of the content distribution network contain video streams in addition to the audio streams, and a video stream from the multimedia streams can be used directly as the target video stream. No virtual video stream needs to be created; the audio attribute of the audio streams is conveyed through a video stream already present in the multimedia streams, which is more convenient and saves processing resources, while still avoiding any delay on the audio attribute and keeping it fully synchronized with the audio stream at the client.
Step 206: write the audio attribute of the current time into the corresponding time of the target video stream, and mix the audio streams and the target video stream to obtain a mixed stream.
It should be noted that a sound-information insertion sub-module is also added to the mixing module of the content distribution network. It writes the audio attribute of the current time into the corresponding time of the target video stream, so that the target video stream carries the audio attribute of the current time of the audio streams; a mixing sub-module within the mixing module then mixes the audio streams and the target video stream into a mixed stream. Here the audio stream means the audio stream of each mic-connected user, or a combined audio stream obtained by mixing them, and the target video stream carries the audio attribute of the current time.
In practice the mixing sub-module of the content distribution network is updated so that, besides mixing the audio streams of all mic-connected users into one stream, it also supports mixing the target video stream into the output. When the target video stream is a created virtual video stream, it is already encoded, so it requires no extra decoding or re-encoding and only needs to be packed into the mixed stream, with essentially no additional performance cost.
In this embodiment the mixed stream simultaneously contains the audio streams of all mic-connected users and the audio attribute of the current time that corresponds to them, so the two are fully synchronized; the currently speaking mic-connected user is indicated accurately, no delay is introduced, and the user experience is guaranteed.
In an optional implementation of this embodiment, writing the audio attribute of the current time into the corresponding time of the target video stream may be implemented as follows:
writing the audio attribute of the current time into the supplemental enhancement information (SEI) of the target video stream at the corresponding time.
It should be noted that SEI is the video's supplemental enhancement information and supports custom insertion, so the audio attribute of the current time can be written into the target video stream via the video SEI. That is, the sound-information insertion sub-module is responsible for inserting the audio attribute of the current time into the SEI of the target video stream at the corresponding time. By exploiting the fact that SEI accepts user-defined data, the target video stream is made to carry the audio attribute of the current time of the audio streams, which is thereby delivered to the client synchronously, avoiding any delay; a sketch of such an SEI message follows.
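As one concrete possibility, the snapshot bytes could be wrapped in an H.264 user_data_unregistered SEI message (NAL unit type 6, payload type 5), the standard container for user-defined data. The sketch below builds such a NAL unit by hand; a production mixer would normally rely on its encoder or muxer's own SEI facilities, and the all-zero UUID is a placeholder.

```python
PAYLOAD_UUID = bytes(16)  # placeholder 16-byte identifier for our payload

def build_sei_nal(payload: bytes) -> bytes:
    """Wrap `payload` in an H.264 user_data_unregistered SEI NAL unit."""
    body = PAYLOAD_UUID + payload
    sei = bytearray([0x06, 0x05])    # NAL header (type 6 = SEI), payload type 5
    size = len(body)
    while size >= 255:               # payload size, 0xFF-escaped per the spec
        sei.append(0xFF)
        size -= 255
    sei.append(size)
    sei += body
    sei.append(0x80)                 # rbsp_stop_one_bit + byte alignment
    # Insert emulation-prevention bytes (00 00 0x -> 00 00 03 0x) after the header.
    out, zeros = bytearray(sei[:1]), 0
    for b in sei[1:]:
        if zeros >= 2 and b <= 3:
            out.append(0x03)
            zeros = 0
        out.append(b)
        zeros = zeros + 1 if b == 0 else 0
    return bytes(out)
```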
Step 208: push the mixed stream to the client, wherein the video stream in the mixed stream instructs the client to determine the audio attribute of the current time and mark it in the current mic-connecting interface.
In practice, once the mixing module of the content distribution network has produced the mixed stream, it pushes the mixed stream to the client. Because the mixed stream contains both the audio stream to be played and the target video stream carrying the audio attribute of the current time of that audio stream, the client can parse the target video stream in the mixed stream to obtain that audio attribute and mark it in the current mic-connecting interface.
Specifically, the audio attribute of the current time may include the volume information of the audio and the user identifier of the corresponding mic-connected user. After parsing the audio attribute of the current time, the client can locate the target user avatar of that mic-connected user in the current mic-connecting interface according to the user identifier, and add to the avatar a speaking icon and/or a volume icon corresponding to the volume information.
It should be noted that the audio attribute of the current time includes the volume of every mic-connected user at that moment: some volumes indicate that the corresponding user is speaking and others that the user is not. For example, given the attribute: user 1, volume 0; user 2, volume 50; user 3, volume 10; user 4, volume 0, users 1 and 4 are not speaking, and user 3's volume is so low that it is probably noise rather than speech. The currently speaking mic-connected user can therefore be determined from the audio attribute of the current time and marked accordingly.
In a specific implementation, the client can check whether each mic-connected user's volume exceeds a volume threshold; if so, that user is currently speaking, and the client finds the user's target avatar and adds to it a speaking icon and/or a volume icon corresponding to the volume information. More than one user can be speaking at once: every user whose volume exceeds the threshold is treated as currently speaking, and each is marked with the specific volume.
The speaking icon indicates that the mic-connected user is currently speaking and can be predefined, e.g., a small loudspeaker icon, a bold border, an avatar effect, or another graphic mark; the volume icon represents the volume level at the current time, e.g., a volume bar, a numeric value, or another graphic mark that conveys magnitude.
By way of example, in a live-broadcast scenario the real-time audio attributes are already carried in the mixed stream. When a viewer opens the live room, the viewer-side player extracts the SEI information of the target video stream from the mixed stream and obtains the audio attribute of the current time in real time, learning the volume of every mic-connected user, and hence who is currently speaking and how loudly. It can then draw real-time effects for the anchor/mic-connected users, such as avatar effects, loudspeaker icons, or volume bars, for a better experience.
With the multi-person mic-connecting method described above, the audio streams of at least two mic-connected users are analyzed to determine the audio attribute of the current time, which indicates the mic-connected user who is currently speaking. The audio attribute of the current time is written into the corresponding time of the target video stream, and the audio streams and the target video stream are packed, mixed, and pushed to the client together. When the client receives the mixed stream, it obtains the audio stream to be played and the audio attribute of the current time simultaneously just by parsing the mixed stream, and marks the currently speaking user in the current mic-connecting interface for the viewers. Because the audio attribute of the current time is written into the target video stream, it reaches the client together with the audio stream; the attribute suffers no delay and is fully synchronized with the audio stream, so the speaker marked in the mic-connecting interface matches the user actually speaking, guaranteeing the experience of the mic-connected users.
FIG. 3a shows a flowchart of another multi-person mic-connecting method according to an embodiment of the present application, applied to a client and specifically comprising the following steps 302-306:
Step 302: receive the mixed stream pushed by the content distribution network, wherein the mixed stream comprises an audio stream and a target video stream, the target video stream carries the audio attribute of the current time written by the content distribution network, the audio attribute of the current time is determined based on the audio streams of the mic-connected users, and the audio attribute of the current time indicates the mic-connected user who is currently speaking.
It should be noted that the target video stream may be a virtual video stream created by the content distribution network, or a video stream contained in a multimedia stream pushed by a mic-connected user. The content distribution network obtains the audio attribute of the current time from the audio streams in the multimedia streams pushed by the mic-connected users and writes it into the SEI of the target video stream, so the mixed stream the client receives contains both the audio stream to be played and the target video stream carrying the audio attribute of the current time.
Step 304: obtain the target video stream from the mixed stream, and extract the audio attribute of the current time from the target video stream.
It should be noted that the client parses the target video stream and extracts the audio attribute of the current time from its SEI, for example as sketched below.
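Assuming the mixer used the user_data_unregistered SEI layout sketched earlier, the client-side extraction could look like the following. The Annex-B start-code handling is deliberately simplified; a real player would take SEI messages from its demuxer's callback instead of scanning raw bytes.

```python
import json

def extract_snapshots(annexb: bytes):
    """Yield audio-attribute snapshots from user_data_unregistered SEI NALs."""
    for nal in annexb.split(b"\x00\x00\x01"):
        if not nal or (nal[0] & 0x1F) != 6:        # keep only SEI NAL units
            continue
        rbsp = nal[1:].replace(b"\x00\x00\x03", b"\x00\x00")  # strip emulation bytes
        if not rbsp or rbsp[0] != 0x05:            # payload type 5 only
            continue
        i, size = 1, 0
        while rbsp[i] == 0xFF:                     # 0xFF-escaped payload size
            size += 255
            i += 1
        size += rbsp[i]
        i += 1
        body = rbsp[i:i + size]
        yield json.loads(body[16:])                # skip the 16-byte UUID
```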
Step 306: mark the audio attribute of the current time in the current mic-connecting interface.
In an optional implementation of this embodiment, the audio attribute includes the volume information and the user identifier of the corresponding mic-connected user, and marking the audio attribute of the current time in the current mic-connecting interface may be implemented as follows:
determining, according to the user identifier, the target user avatar corresponding to the mic-connected user in the current mic-connecting interface;
adding, on the target user avatar, a speaking icon and/or a volume icon corresponding to the volume information.
It should be noted that the speaking icon indicates that the mic-connected user is currently speaking and can be predefined, e.g., a small loudspeaker icon, a bold border, an avatar effect, or another graphic mark; the volume icon represents the volume level at the current time, e.g., a volume bar, a numeric value, or another graphic mark that conveys magnitude.
For example, FIG. 3b is a schematic diagram of a mic-connecting interface according to an embodiment of the present application. As shown in FIG. 3b, the mic-connected users are user 1, user 2, user 3, and user 4, whose four avatars are shown in sequence. Suppose the audio attribute determined by the client is: user 1, volume 0; user 2, volume 50; user 3, volume 10; user 4, volume 0. Only user 2's volume exceeds the volume threshold, so the currently speaking mic-connected user is user 2; a small loudspeaker icon is added to user 2's avatar, together with a volume bar whose current height indicates a volume of 50. A sketch of this rendering step follows.
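The rendering decision for the example above can be sketched as follows; the ui object, its methods, and the threshold value of 20 stand in for whatever view layer and tuning the client actually uses.

```python
SPEAK_THRESHOLD = 20  # assumed threshold on the 0-100 volume scale

def render_snapshot(ui, snapshot):
    """Mark the currently speaking users in the mic-connecting interface."""
    for spk in snapshot["speakers"]:
        avatar = ui.avatar_for(spk["uid"])     # locate the target user avatar
        if spk["vol"] > SPEAK_THRESHOLD:       # e.g. user2 at volume 50
            avatar.show_speaker_icon()         # small loudspeaker icon
            avatar.set_volume_bar(spk["vol"])  # bar height reflects the volume
        else:                                  # users 1/4 at 0, user 3 at 10 (noise)
            avatar.hide_speaker_icon()
```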
With the multi-person mic-connecting method described above, the client receives the mixed stream pushed by the content distribution network, comprising the audio stream and the target video stream, where the target video stream carries the audio attribute of the current time written by the content distribution network. After receiving the mixed stream, the client obtains the audio stream to be played and the audio attribute of the current time simultaneously just by parsing the mixed stream, marks the attribute in the current mic-connecting interface, and renders the corresponding effects, indicating the currently speaking mic-connected user to the viewers. Because the content distribution network writes the audio attribute of the current time into the target video stream, the attribute reaches the client together with the audio stream with no delay; the attribute the client obtains is fully synchronized with the audio stream, and the speaker marked in the mic-connecting interface matches the user actually speaking, guaranteeing the experience of the mic-connected users.
The multi-person mic-linking method provided by the present application is described below with reference to fig. 4, taking its application in a dual-user pure-audio mic-linking scene as an example. Fig. 4 is a schematic diagram of the processing procedure of a multi-person mic-linking method applied to a dual-user pure-audio mic-linking scene according to an embodiment of the present application, which specifically includes the following steps:
Taking a live-broadcast scene as an example, a typical dual-user mic-linking scene in pure-audio live broadcast consists of an anchor and a mic-linking user, where the anchor stream pushed by the anchor and the mic-linking user stream pushed by the mic-linking user each enter the mixing module.
The sound information acquisition sub-module in the mixing module extracts the audio attribute of the current time from the anchor stream and the mic-linking user stream; the video creation sub-module creates a mock video stream; the sound information insertion sub-module inserts the extracted audio attribute of the current time into the created mock video stream to obtain the target video stream; and the mixing sub-module mixes the anchor stream, the mic-linking user stream and the target video stream into a live stream and pushes it to the live room for the audience to listen to.
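One tick of such a mixing module might look like the following toy sketch, assuming 16-bit mono PCM frames and a JSON stand-in for the SEI insertion step (both assumptions; real mixing and muxing would go through an encoder and muxer):

```python
import json
import math
import struct

def measure_volume(pcm: bytes) -> int:
    """RMS of 16-bit little-endian PCM, mapped onto an assumed 0-100 scale."""
    samples = struct.unpack(f"<{len(pcm) // 2}h", pcm)
    rms = math.sqrt(sum(s * s for s in samples) / max(len(samples), 1))
    return min(100, int(100 * rms / 32768))

def mix_audio(a: bytes, b: bytes) -> bytes:
    """Average two equal-length PCM frames (toy two-party mixer)."""
    sa = struct.unpack(f"<{len(a) // 2}h", a)
    sb = struct.unpack(f"<{len(b) // 2}h", b)
    return struct.pack(f"<{len(sa)}h", *((x + y) // 2 for x, y in zip(sa, sb)))

def mix_once(anchor_pcm: bytes, guest_pcm: bytes, mock_frame: bytes, pts_ms: int) -> dict:
    """Measure both volumes, tag the mock video frame with the audio attribute,
    and package audio and video under one presentation timestamp."""
    attrs = [{"uid": "anchor", "volume": measure_volume(anchor_pcm)},
             {"uid": "guest", "volume": measure_volume(guest_pcm)}]
    tagged_video = mock_frame + json.dumps(attrs).encode()  # stand-in for SEI insertion
    return {"pts": pts_ms, "audio": mix_audio(anchor_pcm, guest_pcm), "video": tagged_video}
```

Because the attribute rides inside the same packaged output as the mixed audio, both reach the client under the same timestamp, which is the synchronization property the method relies on.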
According to the multi-person mic-linking method provided by the embodiments of the present application, the content distribution network can analyze the audio streams of at least two mic-linking users and determine the audio attribute of the current time, which indicates the mic-linking user currently speaking. It then writes the audio attribute of the current time into the target video stream at the corresponding time, packages and mixes the audio stream and the target video stream, and pushes the result to the client. When the client receives the mixed stream, it can obtain, by directly parsing it, both the audio stream to be played and the audio attribute of the current time, and indicate that attribute in the current mic-linking interface so that the mic-linking user currently speaking is pointed out to the user watching the session. In this way, by writing the audio attribute of the current time into the target video stream, the audio attribute corresponding to the audio stream is delivered to the client together with the audio stream itself, which avoids any delay of the audio attribute and keeps it fully synchronized with the audio stream. The mic-linking user indicated as speaking in the current mic-linking interface is thus guaranteed to match the mic-linking user actually speaking, ensuring a good user experience.
Corresponding to the above method embodiment, the present application further provides a content distribution network embodiment, and fig. 5 shows a schematic structural diagram of a content distribution network according to an embodiment of the present application. As shown in fig. 5, the apparatus includes:
a first acquisition module 502 configured to acquire multimedia streams corresponding to at least two mic-linking users, where the multimedia streams include audio streams;
a determining module 504 configured to determine the audio attribute of the current time according to the audio stream of each mic-linking user, where the audio attribute of the current time is used to indicate the mic-linking user currently speaking;
a mixing module 506 configured to write the audio attribute of the current time into the target video stream at the corresponding time, and mix the audio stream and the target video stream to obtain a mixed stream;
and a pushing module 508 configured to push the mixed stream to the client, where the video stream in the mixed stream is used to instruct the client to determine the audio attribute of the current time and indicate it in the current mic-linking interface.
Optionally, the apparatus further includes a creation module configured to:
create a virtual video stream and take the virtual video stream as the target video stream, where the virtual video stream is a video stream that contains no valid video content.
Optionally, the creation module is further configured to:
pre-encode virtual video data of a set duration;
and cyclically copy the virtual video data to obtain the virtual video stream, as sketched below.
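A minimal sketch of this pre-encode-then-loop idea, assuming a placeholder byte string stands in for the output of a one-off offline encode and that presentation timestamps (PTS) are in milliseconds; the PTS-shifting detail is an illustrative choice, not taken from the patent:

```python
from itertools import count

def precoded_segment(duration_s: float, fps: int = 30):
    """Stand-in for a pre-encoded segment: a list of (pts_ms, frame_bytes).
    A real implementation would run an encoder once, offline."""
    frame = b"\x00" * 128            # placeholder for one encoded black frame
    step = 1000 // fps
    return [(i * step, frame) for i in range(int(duration_s * fps))]

def loop_virtual_stream(segment, start_pts_ms: int = 0):
    """Yield an endless virtual video stream by cyclically copying the segment,
    shifting PTS on each pass so timestamps stay monotonically increasing.
    Assumes the segment has at least two evenly spaced frames."""
    span = segment[-1][0] + (segment[1][0] - segment[0][0])  # segment duration
    for n in count():
        for pts, frame in segment:
            yield start_pts_ms + n * span + pts, frame

stream = loop_virtual_stream(precoded_segment(2.0))
first_five = [next(stream) for _ in range(5)]  # five frames, 33 ms apart
```

Encoding once and copying thereafter keeps the per-session cost of the carrier video negligible, which is the point of using a virtual stream in a pure-audio scene.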
Optionally, the multimedia stream further includes a video stream, and the apparatus further includes a second acquisition module configured to:
acquire a video stream from the multimedia stream;
and take the acquired video stream as the target video stream.
Optionally, the mixing module 506 is further configured to:
write the audio attribute of the current time into the supplemental enhancement information (SEI) of the target video stream at the corresponding time, for example as sketched below.
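On the writing side, an SEI message of this kind could be assembled as follows; the UUID mirrors the one assumed in the client-side parsing sketch above, and emulation-prevention bytes are deliberately omitted, so this is an illustration rather than a spec-complete encoder:

```python
import json

# Hypothetical UUID matching the one the client scans for (an assumption).
AUDIO_ATTR_UUID = bytes.fromhex("00112233445566778899aabbccddeeff")

def build_audio_attr_sei(attrs: list) -> bytes:
    """Build a user-data-unregistered SEI NAL unit (H.264, Annex-B) carrying
    the audio attribute as JSON. Payload sizes use the standard 0xFF chaining;
    emulation-prevention bytes are not inserted."""
    payload = AUDIO_ATTR_UUID + json.dumps(attrs).encode()
    size_bytes = b"\xff" * (len(payload) // 255) + bytes([len(payload) % 255])
    # 0x06 = SEI NAL unit type; 0x05 = payload_type "user data unregistered";
    # trailing 0x80 = rbsp_stop_one_bit plus alignment.
    return b"\x00\x00\x00\x01\x06\x05" + size_bytes + payload + b"\x80"

sei = build_audio_attr_sei([{"uid": "user2", "volume": 50}])
# The mixer would splice `sei` in front of the video frame whose PTS equals the
# acquisition time of the attribute, keeping attribute and audio aligned.
```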
Optionally, the determining module 504 is further configured to:
acquire a set insertion frequency, and determine a corresponding information acquisition frequency according to the set insertion frequency;
determine the corresponding information acquisition times according to the information acquisition frequency;
and, whenever an information acquisition time is reached, determine the audio attribute of the current time according to the audio stream of each mic-linking user, as in the sketch below.
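A small sketch of that schedule, under the assumption that attributes are sampled exactly as often as they are inserted (a 1:1 mapping the text leaves open):

```python
import itertools
import time

def acquisition_times(insert_hz: float, start: float):
    """Derive acquisition instants from the set insertion frequency."""
    period = 1.0 / insert_hz
    return (start + n * period for n in itertools.count())

def run(insert_hz: float, sample_attrs, ticks: int):
    """Sleep until each acquisition instant, then sample the audio attribute."""
    times = acquisition_times(insert_hz, time.monotonic())
    for _ in range(ticks):
        t = next(times)
        time.sleep(max(0.0, t - time.monotonic()))
        yield t, sample_attrs()  # e.g. measure each mic-linking user's volume now

# Example: sample ten times per second, three ticks in total.
for t, attrs in run(10.0, lambda: [{"uid": "user2", "volume": 50}], 3):
    print(round(t, 2), attrs)
```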
Optionally, the determining module 504 is further configured to:
acquire, from a target audio stream, phoneme information of the currently transmitted audio and the user identifier of the target mic-linking user corresponding to the target audio stream, where the target audio stream is the audio stream of any mic-linking user;
determine volume information of the currently transmitted audio according to the phoneme information;
and take the volume information and the user identifier of the target mic-linking user as the audio attribute of the current time.
According to the content distribution network provided by this embodiment of the present application, the audio streams of at least two mic-linking users can be analyzed to determine the audio attribute of the current time, which indicates the mic-linking user currently speaking. The audio attribute of the current time is then written into the target video stream at the corresponding time, the audio stream and the target video stream are packaged and mixed, and the result is pushed to the client. When the client receives the mixed stream, it can obtain, by directly parsing it, both the audio stream to be played and the audio attribute of the current time, and indicate that attribute in the current mic-linking interface so that the mic-linking user currently speaking is pointed out to the user watching the session. In this way, by writing the audio attribute of the current time into the target video stream, the audio attribute corresponding to the audio stream is delivered to the client together with the audio stream itself, which avoids any delay of the audio attribute, keeps it fully synchronized with the audio stream, and guarantees that the mic-linking user indicated as speaking in the current mic-linking interface matches the mic-linking user actually speaking, ensuring a good user experience.
The above is an exemplary scheme of the content distribution network of this embodiment. It should be noted that the technical solution of the content distribution network and the technical solution of the multi-person mic-linking method described above belong to the same concept; for details of the content distribution network not described here, reference may be made to the description of the multi-person mic-linking method above.
Corresponding to the above method embodiment, the present application further provides a client embodiment, and fig. 6 shows a schematic structural diagram of a client according to an embodiment of the present application. As shown in fig. 6, the apparatus includes:
a receiving module 602 configured to receive the mixed stream pushed by the content distribution network, where the mixed stream includes an audio stream and a target video stream, the target video stream carries the audio attribute of the current time written by the content distribution network, the audio attribute of the current time is determined based on the audio stream of each mic-linking user, and the audio attribute of the current time is used to indicate the mic-linking user currently speaking;
an extracting module 604 configured to obtain the target video stream from the mixed stream and extract the audio attribute of the current time from the target video stream;
and an identification module 606 configured to indicate the audio attribute of the current time in the current mic-linking interface.
Optionally, the audio attribute includes volume information and the user identifier of the corresponding mic-linking user, and the identification module 606 is further configured to:
determine, according to the user identifier, the target user avatar corresponding to that mic-linking user in the current mic-linking interface;
and add a speaking icon and/or a volume icon corresponding to the volume information on the target user avatar.
The client provided by this embodiment of the present application can receive the mixed stream pushed by the content distribution network, where the mixed stream includes an audio stream and a target video stream and the target video stream carries the audio attribute of the current time written by the content distribution network. After receiving the mixed stream, the client can therefore obtain, by directly parsing it, both the audio stream to be played and the audio attribute of the current time, indicate that attribute in the current mic-linking interface, and render the corresponding effects so that the mic-linking user currently speaking is pointed out to the user watching the session. In this way, by writing the audio attribute of the current time into the target video stream, the content distribution network delivers the audio attribute to the client at the same time as the audio stream, which avoids any delay of the audio attribute, keeps it fully synchronized with the audio stream, and guarantees that the mic-linking user indicated as speaking in the current mic-linking interface matches the mic-linking user actually speaking, ensuring a good user experience.
The above is an exemplary scheme of the client of this embodiment. It should be noted that the technical solution of the client and the technical solution of the multi-person mic-linking method described above belong to the same concept; for details of the client not described here, reference may be made to the description of the multi-person mic-linking method above.
Corresponding to the above method embodiments, the present application further provides a multi-person mic-linking system embodiment, and fig. 7 shows a schematic structural diagram of a multi-person mic-linking system according to an embodiment of the present application. As shown in fig. 7, the multi-person mic-linking system includes a content distribution network 702 and a client 704;
the content distribution network 702 is configured to acquire multimedia streams corresponding to at least two mic-linking users, where the multimedia streams include audio streams; determine the audio attribute of the current time according to the audio stream of each mic-linking user, where the audio attribute of the current time is used to indicate the mic-linking user currently speaking; write the audio attribute of the current time into the target video stream at the corresponding time, and mix the audio stream and the target video stream to obtain a mixed stream; and push the mixed stream to the client;
the client 704 is configured to receive the mixed stream pushed by the content distribution network; obtain the target video stream from the mixed stream and extract the audio attribute of the current time from the target video stream; and indicate the audio attribute of the current time in the current mic-linking interface.
In the multi-person mic-linking system provided by this embodiment of the present application, the content distribution network can analyze the audio streams of at least two mic-linking users and determine the audio attribute of the current time, which indicates the mic-linking user currently speaking; it then writes the audio attribute of the current time into the target video stream at the corresponding time, packages and mixes the audio stream and the target video stream, and pushes the result to the client. When the client receives the mixed stream, it can obtain, by directly parsing it, both the audio stream to be played and the audio attribute of the current time, and indicate that attribute in the current mic-linking interface so that the mic-linking user currently speaking is pointed out to the user watching the session. In this way, by writing the audio attribute of the current time into the target video stream, the audio attribute corresponding to the audio stream is delivered to the client together with the audio stream itself, which avoids any delay of the audio attribute, keeps it fully synchronized with the audio stream, and guarantees that the mic-linking user indicated as speaking in the current mic-linking interface matches the mic-linking user actually speaking, ensuring a good user experience.
The above is an exemplary scheme of the multi-person mic-linking system of this embodiment. It should be noted that the technical solution of the multi-person mic-linking system and the technical solution of the multi-person mic-linking method described above belong to the same concept; for details of the multi-person mic-linking system not described here, reference may be made to the description of the multi-person mic-linking method above.
FIG. 8 illustrates a block diagram of a computing device provided according to an embodiment of the present application. The components of computing device 800 include, but are not limited to, a memory 810 and a processor 820. The processor 820 is coupled to the memory 810 through a bus 830, and a database 850 is used to store data.
Computing device 800 also includes an access device 840 that enables computing device 800 to communicate via one or more networks 860. Examples of such networks include the public switched telephone network (PSTN), a local area network (LAN), a wide area network (WAN), a personal area network (PAN), or a combination of communication networks such as the Internet. The access device 840 may include one or more of any type of wired or wireless network interface, such as a network interface card (NIC), for example an IEEE 802.11 wireless local area network (WLAN) interface, a Worldwide Interoperability for Microwave Access (WiMAX) interface, an Ethernet interface, a universal serial bus (USB) interface, a cellular network interface, a Bluetooth interface, or a near field communication (NFC) interface.
In one embodiment of the present application, the above-described components of computing device 800, as well as other components not shown in FIG. 8, may also be connected to each other, such as by a bus. It should be understood that the block diagram of the computing device illustrated in FIG. 8 is for exemplary purposes only and is not intended to limit the scope of the present application. Those skilled in the art may add or replace other components as desired.
Computing device 800 may be any type of stationary or mobile computing device including a mobile computer or mobile computing device (e.g., tablet, personal digital assistant, laptop, notebook, netbook, etc.), mobile phone (e.g., smart phone), wearable computing device (e.g., smart watch, smart glasses, etc.), or other type of mobile device, or a stationary computing device such as a desktop computer or PC. Computing device 800 may also be a mobile or stationary server.
The processor 820 is configured to execute computer-executable instructions to implement the operational steps of any of the multi-person mic-linking methods described above.
The above is an exemplary scheme of the computing device of this embodiment. It should be noted that the technical solution of the computing device and the technical solution of the multi-person mic-linking method described above belong to the same concept; for details of the computing device not described here, reference may be made to the description of the multi-person mic-linking method above.
An embodiment of the present application further provides a computer-readable storage medium storing computer-executable instructions that, when executed by a processor, implement the operational steps of any of the multi-person mic-linking methods described above.
The above is an exemplary scheme of the computer-readable storage medium of this embodiment. It should be noted that the technical solution of the storage medium and the technical solution of the multi-person mic-linking method described above belong to the same concept; for details of the storage medium not described here, reference may be made to the description of the multi-person mic-linking method above.
The foregoing describes specific embodiments of the present application. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
The computer instructions include computer program code, which may be in source code form, object code form, an executable file, some intermediate form, or the like. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so forth.
It should be noted that, for the sake of simplicity of description, the foregoing method embodiments are all expressed as a series of combinations of actions, but it should be understood by those skilled in the art that the present application is not limited by the order of actions described, as some steps may be performed in other order or simultaneously in accordance with the present application. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily all necessary for the present application.
In the foregoing embodiments, the descriptions of the embodiments are emphasized, and for parts of one embodiment that are not described in detail, reference may be made to the related descriptions of other embodiments.
The above-disclosed preferred embodiments of the present application are provided only as an aid to the elucidation of the present application. Alternative embodiments are not intended to be exhaustive or to limit the invention to the precise form disclosed. Obviously, many modifications and variations are possible in light of the teaching of this application. The embodiments were chosen and described in order to best explain the principles of the invention and the practical application, to thereby enable others skilled in the art to best understand and utilize the invention. This application is to be limited only by the claims and the full scope and equivalents thereof.

Claims (14)

1. A multi-person mic-linking method, applied to a content distribution network, comprising:
acquiring multimedia streams corresponding to at least two mic-linking users, wherein the multimedia streams comprise audio streams;
determining the audio attribute of the current time according to the audio stream of each mic-linking user, wherein the audio attribute of the current time is used for indicating the mic-linking user who is currently speaking;
writing the audio attribute of the current time into the target video stream at the corresponding time, and mixing the audio stream and the target video stream to obtain a mixed stream;
and pushing the mixed stream to a client, wherein the video stream in the mixed stream is used for instructing the client to determine the audio attribute of the current time and indicate it in the current mic-linking interface.
2. The multi-person mic-linking method according to claim 1, further comprising, after determining the audio attribute of the current time according to the audio stream of each mic-linking user:
creating a virtual video stream, and taking the virtual video stream as the target video stream, wherein the virtual video stream is a video stream that contains no valid video content.
3. The multi-person mic-linking method according to claim 2, wherein creating the virtual video stream comprises:
pre-encoding virtual video data of a set duration;
and cyclically copying the virtual video data to obtain the virtual video stream.
4. The multi-person mic-linking method according to claim 1, wherein the multimedia stream further comprises a video stream, and the method further comprises, after determining the audio attribute of the current time according to the audio stream of each mic-linking user:
acquiring a video stream from the multimedia stream;
and taking the acquired video stream as the target video stream.
5. The multi-person mic-linking method according to any one of claims 1 to 4, wherein writing the audio attribute of the current time into the target video stream at the corresponding time comprises:
writing the audio attribute of the current time into the supplemental enhancement information (SEI) of the target video stream according to the corresponding time.
6. The multi-person mic-linking method according to any one of claims 1 to 4, wherein determining the audio attribute of the current time according to the audio stream of each mic-linking user comprises:
acquiring a set insertion frequency, and determining a corresponding information acquisition frequency according to the set insertion frequency;
determining corresponding information acquisition times according to the information acquisition frequency;
and, when an information acquisition time is reached, determining the audio attribute of the current time according to the audio stream of each mic-linking user.
7. The multi-person mic-linking method according to any one of claims 1 to 4, wherein determining the audio attribute of the current time according to the audio stream of each mic-linking user comprises:
acquiring, from a target audio stream, phoneme information of the currently transmitted audio and the user identifier of the target mic-linking user corresponding to the target audio stream, wherein the target audio stream is the audio stream of any mic-linking user;
determining volume information of the currently transmitted audio according to the phoneme information;
and taking the volume information and the user identifier of the target mic-linking user as the audio attribute of the current time.
8. A multi-person mic-linking method, applied to a client, comprising:
receiving a mixed stream pushed by a content distribution network, wherein the mixed stream comprises an audio stream and a target video stream, the target video stream carries the audio attribute of the current time written by the content distribution network, the audio attribute of the current time is determined based on the audio stream of each mic-linking user, and the audio attribute of the current time is used for indicating the mic-linking user who is currently speaking;
acquiring the target video stream from the mixed stream, and extracting the audio attribute of the current time from the target video stream;
and indicating the audio attribute of the current time in the current mic-linking interface.
9. The multi-person mic-linking method according to claim 8, wherein the audio attribute comprises volume information and the user identifier of the corresponding mic-linking user, and indicating the audio attribute of the current time in the current mic-linking interface comprises:
determining, according to the user identifier, the target user avatar of the corresponding mic-linking user in the current mic-linking interface;
and adding a speaking icon and/or a volume icon corresponding to the volume information on the target user avatar.
10. A content distribution network, comprising:
a first acquisition module configured to acquire multimedia streams corresponding to at least two mic-linking users, wherein the multimedia streams comprise audio streams;
a determining module configured to determine the audio attribute of the current time according to the audio stream of each mic-linking user, wherein the audio attribute of the current time is used for indicating the mic-linking user who is currently speaking;
a mixing module configured to write the audio attribute of the current time into the target video stream at the corresponding time, and mix the audio stream and the target video stream to obtain a mixed stream;
and a pushing module configured to push the mixed stream to a client, wherein the video stream in the mixed stream is used for instructing the client to determine the audio attribute of the current time and indicate it in the current mic-linking interface.
11. A client, comprising:
a receiving module configured to receive a mixed stream pushed by a content distribution network, wherein the mixed stream comprises an audio stream and a target video stream, the target video stream carries the audio attribute of the current time written by the content distribution network, the audio attribute of the current time is determined based on the audio stream of each mic-linking user, and the audio attribute of the current time is used for indicating the mic-linking user who is currently speaking;
an extracting module configured to acquire the target video stream from the mixed stream and extract the audio attribute of the current time from the target video stream;
and an identification module configured to indicate the audio attribute of the current time in the current mic-linking interface.
12. A multi-person mic-linking system, comprising a content distribution network and a client;
wherein the content distribution network is configured to acquire multimedia streams corresponding to at least two mic-linking users, wherein the multimedia streams comprise audio streams; determine the audio attribute of the current time according to the audio stream of each mic-linking user, wherein the audio attribute of the current time is used for indicating the mic-linking user who is currently speaking; write the audio attribute of the current time into the target video stream at the corresponding time, and mix the audio stream and the target video stream to obtain a mixed stream; and push the mixed stream to the client;
and the client is configured to receive the mixed stream pushed by the content distribution network; acquire the target video stream from the mixed stream and extract the audio attribute of the current time from the target video stream; and indicate the audio attribute of the current time in the current mic-linking interface.
13. A computing device, comprising:
a memory and a processor;
wherein the memory is configured to store computer-executable instructions, and the processor is configured to execute the computer-executable instructions to implement the operational steps of the multi-person mic-linking method of any one of claims 1 to 7 or claims 8 to 9.
14. A computer-readable storage medium, storing computer-executable instructions that, when executed by a processor, implement the operational steps of the multi-person mic-linking method of any one of claims 1 to 7 or claims 8 to 9.