CN112019874A - Mic-linked live streaming method and related equipment - Google Patents

Mic-linked live streaming method and related equipment

Info

Publication number
CN112019874A
CN112019874A (application CN202010940695.7A, filed CN202010940695A)
Authority
CN
China
Prior art keywords
target
live broadcast
audio
terminal
tone
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010940695.7A
Other languages
Chinese (zh)
Other versions
CN112019874B (en)
Inventor
黄杰雄
黄青
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Cubesili Information Technology Co Ltd
Original Assignee
Guangzhou Huaduo Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Huaduo Network Technology Co Ltd filed Critical Guangzhou Huaduo Network Technology Co Ltd
Priority to CN202111073971.5A priority Critical patent/CN113784163B/en
Priority to CN202010940695.7A priority patent/CN112019874B/en
Publication of CN112019874A publication Critical patent/CN112019874A/en
Application granted granted Critical
Publication of CN112019874B publication Critical patent/CN112019874B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/21Server components or server architectures
    • H04N21/218Source of audio or video content, e.g. local disk arrays
    • H04N21/2187Live feed
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/233Processing of audio elementary streams
    • H04N21/2335Processing of audio elementary streams involving reformatting operations of audio signals, e.g. by converting from one coding standard to another
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00Reducing energy consumption in communication networks
    • Y02D30/70Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Databases & Information Systems (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The application provides a mic-linked live streaming method and related equipment, comprising the following steps: during mic-linked live streaming by multiple terminals, if any terminal triggers the voice-changing live broadcast mode, acquiring in real time the original audio input by the anchor through that terminal and the target timbre selected by the anchor on that terminal; performing timbre conversion on the original timbre in the original audio based on the target timbre to obtain converted target audio; and mixing the target audio with the acquired original audio input by the other mic-linked terminals to obtain mixed audio, and sending the mixed audio to all mic-linked terminals and to the viewer clients that have entered the mic-linked live room. In this scheme, the server performs timbre conversion on the original audio input in real time by the terminal that triggered the voice-changing live broadcast mode to obtain the target audio, which is then mixed for the viewers entering the live room. Mic-linked live streaming in this manner can improve the user's live-viewing experience and increase user stickiness to the live streaming platform.

Description

Mic-linked live streaming method and related equipment
Technical Field
The present application relates to the technical field of network live streaming, and in particular to a mic-linked live streaming method and related equipment.
Background
With the development of internet live streaming technology, live broadcast formats have multiplied, giving users an ever wider range of choices. How to increase user stickiness has become an important issue for the operation of internet live streaming platforms.
In the prior art, users are attracted by having multiple anchors live stream together over a mic link. However, because the mic-linked environment is monotonous and the live content is dull, plain mic-linked live streaming gives users a poor viewing experience, and neither user activity nor user stickiness to the live streaming platform is high.
Disclosure of Invention
In view of this, embodiments of the present application provide a mic-linked live streaming method and related equipment, so as to solve the prior-art problems that users have a poor live-viewing experience and that user activity and stickiness to the live streaming platform are low.
In order to achieve the above purpose, the embodiments of the present application provide the following technical solutions:
a first aspect of the present application shows a live wheat-connecting method, which is applicable to a server, and the method includes:
in the process of carrying out live broadcasting with wheat by a plurality of terminals, if any terminal triggers a variable-sound live broadcasting mode, acquiring original audio input by a main broadcast in real time based on the terminal and target tone selected by the main broadcast based on the terminal, wherein the original audio comprises voice content and original tone;
performing tone conversion on the original tone in the original audio based on the target tone to obtain a converted target audio, wherein the target audio is composed of the target tone and the voice content;
and mixing the target audio with the acquired original audio input by other connected microphone terminals to obtain mixed audio, and sending the mixed audio to all connected microphone terminals and a spectator end entering a connected microphone live broadcast room.
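The three claimed steps can be sketched as a minimal server-side routine. This is an illustrative sketch only, assuming audio is represented as simple content/timbre records; the helper names (`convert_timbre`, `mix_streams`, `handle_mic_linked_audio`) are our own and do not come from the patent.

```python
def convert_timbre(audio, target_timbre):
    # Stand-in for the timbre conversion network: keep the speech
    # content, replace the original timbre with the target timbre.
    return {"content": audio["content"], "timbre": target_timbre}

def mix_streams(streams):
    # Stand-in for the stream-mixing step: bundle every terminal's
    # audio into the single stream sent to all participants.
    return streams

def handle_mic_linked_audio(streams, voice_changed, target_timbres):
    """streams: {terminal_id: audio}; voice_changed: ids of terminals that
    triggered the voice-changing mode; target_timbres: {terminal_id: timbre}."""
    processed = {}
    for tid, audio in streams.items():
        if tid in voice_changed:
            processed[tid] = convert_timbre(audio, target_timbres[tid])
        else:
            processed[tid] = audio
    # The mixed audio goes to every mic-linked terminal and viewer client.
    return mix_streams(list(processed.values()))
```

Only terminals that triggered the mode are converted; everyone else's original audio enters the mix untouched, matching the claim.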
Optionally, during mic-linked live streaming by multiple terminals, the method further comprises:
determining whether the current mic-linked live room is in a mic-linked live PK mode;
if so, when the mic-linked live PK mode ends, acquiring live broadcast data of all terminals taking part in the mic-linked live PK;
determining a target terminal and other terminals based on the live broadcast data of all the terminals, wherein the target terminal indicates the terminal with the highest live-broadcast popularity, and the other terminals are the terminals whose popularity is lower than the highest;
sending the target terminal the permission to switch the live broadcast mode of the other terminals;
if a mode switching request sent by the target terminal for switching the live broadcast mode of the other terminals is received, switching the live broadcast mode of the other terminals to the voice-changing live broadcast mode based on the mode switching request, so that the other terminals stay in the voice-changing live broadcast mode for a preset duration, wherein the mode switching request carries the target timbre corresponding to the voice-changing live broadcast mode;
acquiring, in real time, the original audio input by the other terminals;
performing timbre conversion on that original audio based on the target timbre carried in the mode switching request to obtain converted target audio;
and mixing the target audio, the acquired original audio input by the target terminal and the acquired original audio input by the mic-linked terminals not taking part in the live PK, and sending the mixed audio to all mic-linked terminals and to the viewer clients that have entered the mic-linked live room.
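The winner-selection step above can be illustrated with a small helper. A per-terminal popularity score (e.g. accumulated gift value) stands in for the patent's "live broadcast data", and the function name is hypothetical:

```python
def pick_target_terminal(live_data):
    """live_data: {terminal_id: popularity score gathered when the
    mic-linked live PK ends, e.g. accumulated gift value}."""
    # The target terminal is the one with the highest score (the PK
    # winner); every other PK participant is an "other terminal".
    target = max(live_data, key=live_data.get)
    others = [tid for tid in live_data if tid != target]
    return target, others
```

The server would then grant `target` the permission to send a mode switching request that puts `others` into the voice-changing live broadcast mode for the preset duration.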
Optionally, during mic-linked live streaming by multiple terminals, the method further comprises:
acquiring user data of a viewer client that initiates a mic-link request in the mic-linked live room;
judging the live-room permission type of the viewer client based on its user data;
if the viewer client has special permission, acquiring in real time original audio input through the viewer client and a target timbre selected on the viewer client;
performing timbre conversion on the original audio based on the target timbre to obtain converted target audio;
and mixing the target audio with the acquired audio input by the other mic-linked terminals to obtain mixed audio, and sending the mixed audio to all mic-linked terminals and to the viewer clients that have entered the mic-linked live room.
Optionally, the method further comprises:
if the viewer client has ordinary permission, acquiring in real time original audio input through the viewer client and a target timbre selected on the viewer client;
performing timbre conversion on the original audio based on the target timbre to obtain converted target audio;
and sending the target audio to that viewer client.
Optionally, performing timbre conversion on the original timbre in the original audio based on the target timbre to obtain converted target audio comprises:
performing timbre conversion on the original audio using a timbre conversion network to obtain converted target audio, wherein the target audio is composed of the target timbre and the speech content;
the timbre conversion network is constructed in advance from a speech content recognition model, a speech speaker recognition model, a timbre conversion model and a vocoder model. The original audio is input into the speech content recognition model for processing to obtain a content feature matrix; audio bearing the target timbre is input into the speech speaker recognition model for processing to obtain a speaker information feature matrix; the combined matrix of the content feature matrix and the speaker information feature matrix is taken as the input of the timbre conversion model and processed by it to obtain acoustic features, wherein the timbre conversion model is built from separation-gate convolution layers, a bidirectional long short-term memory network and fully connected layers; and the acoustic features are input into the vocoder model for audio conversion to obtain the converted target audio, which is composed of the target timbre and the speech content.
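The data flow through the four models can be sketched with stub linear layers. The matrix sizes (80-dim mel frames, 64-dim content features, 16-dim speaker embedding) are illustrative assumptions, and the random matrices merely stand in for trained weights; only the shapes and the order of the stages reflect the description above.

```python
import numpy as np

rng = np.random.default_rng(0)
W_content = rng.standard_normal((80, 64))   # speech content recognition model (stub)
W_speaker = rng.standard_normal((80, 16))   # speech speaker recognition model (stub)
W_convert = rng.standard_normal((80, 80))   # timbre conversion model (stub for the
                                            # gated conv + BiLSTM + FC layers)
W_vocoder = rng.standard_normal((80, 200))  # vocoder model (stub)

def convert(original_mel, target_ref_mel):
    """original_mel: (T, 80) frames of the anchor's audio;
    target_ref_mel: (S, 80) reference frames bearing the target timbre."""
    content = original_mel @ W_content                   # (T, 64) content feature matrix
    speaker = (target_ref_mel @ W_speaker).mean(axis=0)  # (16,) speaker embedding
    combined = np.concatenate(
        [content, np.tile(speaker, (len(content), 1))], axis=1)  # (T, 80) combined matrix
    acoustic = combined @ W_convert                      # (T, 80) acoustic features
    return acoustic @ W_vocoder                          # (T, 200) waveform frames
```

Averaging the speaker features over time into one embedding is a common design choice for speaker encoders; the patent does not specify how the two matrices are combined beyond concatenation.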
A second aspect of the present application shows a mic-linked live streaming device, the device comprising:
a first acquisition module, configured to acquire, during mic-linked live streaming by multiple terminals, when any terminal triggers the voice-changing live broadcast mode, original audio input in real time by the anchor through that terminal and a target timbre selected by the anchor on that terminal, wherein the original audio comprises speech content and an original timbre;
a timbre conversion network, configured to perform timbre conversion on the original timbre in the original audio based on the target timbre to obtain converted target audio, the target audio being composed of the target timbre and the speech content;
and a first sending module, configured to mix the target audio with the acquired original audio input by the other mic-linked terminals to obtain mixed audio, and to send the mixed audio to the other mic-linked terminals and to the viewer clients that have entered the mic-linked live room.
Optionally, a speech content recognition model, a speech speaker recognition model, a timbre conversion model and a vocoder model are constructed in advance to form the timbre conversion network;
the speech content recognition model is used to process the input original audio to obtain a content feature matrix;
the speech speaker recognition model is used to process audio bearing the target timbre to obtain a speaker information feature matrix;
the timbre conversion model is used to process the input combined matrix of the content feature matrix and the speaker information feature matrix to obtain acoustic features, and is built from separation-gate convolution layers, a bidirectional long short-term memory network and fully connected layers;
and the vocoder model is used to convert the acoustic features into the target audio, the target audio being composed of the target timbre and the speech content.
Optionally, the device further comprises:
a second acquisition module, configured to determine, during mic-linked live streaming by multiple terminals, that the mic-linked live room is in the mic-linked live PK mode, to acquire the live broadcast data of all terminals taking part in the mic-linked live PK when the mic-linked live PK mode ends, and to acquire the original audio input by the other terminals in real time;
a first determining module, configured to determine a target terminal and other terminals based on the live broadcast data of all the terminals, wherein the target terminal indicates the terminal with the highest live-broadcast popularity, and the other terminals are the terminals whose popularity is lower than the highest;
a second sending module, configured to send the target terminal the permission to switch the live broadcast mode of the other terminals;
a switching module, configured to switch, if a mode switching request sent by the target terminal for switching the live broadcast mode of the other terminals is received, the live broadcast mode of the other terminals to the voice-changing live broadcast mode based on the mode switching request, so that the other terminals stay in the voice-changing live broadcast mode for a preset duration, the mode switching request carrying the target timbre corresponding to the voice-changing live broadcast mode;
the timbre conversion network, further configured to perform timbre conversion on the original audio based on the target timbre carried in the mode switching request to obtain converted target audio;
and a third sending module, configured to mix the target audio, the acquired original audio input by the target terminal and the acquired original audio input by the mic-linked terminals not taking part in the live PK, and to send the mixed audio to all mic-linked terminals and to the viewer clients that have entered the mic-linked live room.
A third aspect of the present application shows an electronic device, comprising a processor and a memory, wherein the memory is used to store the program code and data for voice timbre conversion, and the processor is used to call the program instructions in the memory to execute the mic-linked live streaming method shown in the first aspect of the present application.
A fourth aspect of the present application shows a storage medium, comprising a stored program, wherein, when the program runs, the device on which the storage medium resides is controlled to execute the mic-linked live streaming method shown in the first aspect of the present application.
The embodiments of the present application provide a mic-linked live streaming method and device, the method comprising: during mic-linked live streaming by multiple terminals, if any terminal triggers the voice-changing live broadcast mode, acquiring in real time the original audio input by the anchor through that terminal and the target timbre selected by the anchor on that terminal, wherein the original audio comprises speech content and an original timbre; performing timbre conversion on the original timbre in the original audio based on the target timbre to obtain converted target audio, wherein the target audio is composed of the target timbre and the speech content; and mixing the target audio with the acquired original audio input by the other mic-linked terminals to obtain mixed audio, and sending the mixed audio to all mic-linked terminals and to the viewer clients that have entered the mic-linked live room. In the embodiment of the application, the server performs timbre conversion on the original audio input in real time by the terminal that triggered the voice-changing live broadcast mode to obtain the target audio, and then mixes the converted target audio with the acquired original audio input by the other mic-linked terminals, for viewing by the viewers and anchors in the live room. Mic-linked live streaming in this manner can improve the user's live-viewing experience and increase user stickiness to the live streaming platform.
Drawings
In order to illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only embodiments of the present application; for those skilled in the art, other drawings can be obtained from the provided drawings without creative effort.
FIG. 1 is a diagram of an application architecture of multiple terminals and a server provided herein;
fig. 2 is a schematic flow chart of a mic-linked live streaming method according to an embodiment of the present application;
fig. 3 is an architecture diagram of a timbre conversion network according to an embodiment of the present application;
fig. 4 is a schematic flow chart of another mic-linked live streaming method according to an embodiment of the present application;
fig. 5 is a schematic view of a mic-linked live PK mode of two terminals according to an embodiment of the present application;
fig. 6 is a schematic flow chart of another mic-linked live streaming method according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of a mic-linked live streaming device according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of another mic-linked live streaming device according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of another mic-linked live streaming device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the drawings in the embodiments of the present application. It is obvious that the described embodiments are only a part of the embodiments of the present application, not all of them. All other embodiments obtained by a person skilled in the art from the embodiments given herein without creative effort shall fall within the protection scope of the present application.
In this application, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element introduced by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises that element.
In the embodiment of the application, the server performs timbre conversion on the original audio input in real time by the terminal that triggered the voice-changing live broadcast mode to obtain the target audio, and then mixes the converted target audio with the acquired original audio input by the other mic-linked terminals, for viewing by the viewers and anchors in the live room. Mic-linked live streaming in this manner can improve the user's live-viewing experience and increase user stickiness to the live streaming platform.
Fig. 1 is a diagram illustrating the application architecture of a plurality of terminals and a server according to the present application.
The plurality of terminals include a terminal 12, a terminal 13, a terminal 14, a terminal 15, and a terminal 16.
When anchor a performs live broadcasting through the terminal 12, the terminal 12 used for live broadcasting is the anchor terminal; when anchor b performs live broadcasting through the terminal 13, the terminal 13 used for live broadcasting is the anchor terminal.
The process of realizing mic-linked live streaming based on this application architecture includes the following steps:
Anchor a, on the terminal 12, performs mic-linked live streaming with the terminal 13 via the server 11.
The number of terminals taking part in the mic-linked live streaming is at least 2.
When anchor a triggers the voice-changing live broadcast mode on the terminal 12, the terminal 12 displays the user operation interface corresponding to the voice-changing live broadcast mode, which comprises a target timbre selection module and an original audio input module.
The target timbre selection module is used to display selectable timbres, and the original audio input module is connected with the microphone of the terminal 12 and is used to receive the original audio input by the anchor through the microphone.
Anchor a selects audio via the target timbre selection module of the terminal 12, thereby determining the corresponding target timbre, and inputs original audio through the microphone of the terminal 12 via the original audio input module. The terminal 12 sends the original audio and the target timbre to the server 11.
The server 11 acquires the original audio input by anchor a through the terminal 12 and the selected target timbre.
Wherein the original audio comprises speech content and an original timbre.
The server 11 performs timbre conversion on the original timbre in the original audio based on the target timbre to obtain converted target audio, where the target audio is composed of the target timbre and the speech content.
The server 11 acquires the original audio input by the mic-linked terminal 13, and mixes the converted target audio with the original audio input by the terminal 13 to obtain mixed audio.
Viewers watch the live content of the live room of the terminals 12 and 13 through the terminal 14; at this time, the terminal 14 used for watching is a viewer client, and when other users watch the live room through the terminal 15, the terminal 16 or other terminals, those terminals likewise serve as viewer clients.
The server 11 sends the mixed audio to the terminal 12, the terminal 13 and the viewer client terminal 14 that has entered the mic-linked live room.
In the embodiment of the application, the server performs timbre conversion on the original audio input in real time by the terminal that triggered the voice-changing live broadcast mode to obtain the target audio, and then mixes the converted target audio with the acquired original audio input by the other mic-linked terminals, for viewing by the viewers and anchors in the live room. Mic-linked live streaming in this manner can improve the user's live-viewing experience and increase user stickiness to the live streaming platform.
Based on the processing architecture disclosed in the embodiment of the present application, referring to fig. 2, a flowchart of a mic-linked live streaming method according to an embodiment of the present application is shown. The method is applied to a server and includes:
Step S201: during mic-linked live streaming by multiple terminals, judging whether any terminal triggers the voice-changing live broadcast mode; if so, executing step S202, and if not, continuing the mic-linked live streaming.
In the specific implementation of step S201, during mic-linked live streaming by multiple terminals, the server determines in real time whether any of the mic-linked terminals has triggered the voice-changing live broadcast mode. If at least one terminal has triggered it, a terminal is performing mic-linked live streaming with a changed voice, and step S202 is executed; if no terminal has triggered it, all mic-linked terminals continue the mic-linked live streaming as before.
Step S202: acquiring, in real time, the original audio input by the anchor through the terminal and the target timbre selected by the anchor on the terminal.
In step S202, the original audio includes speech content and an original timbre.
Optionally, the anchor inputs the original audio in real time through the microphone of the terminal and selects the target timbre on the user operation interface of the terminal, and the terminal sends the original audio and the target timbre to the server.
In the specific implementation of step S202, the server acquires the speech content and the original timbre in the original audio, as well as the target timbre.
It should be noted that the original audio refers to the voice input by the anchor in real time through the terminal microphone during the terminal's live broadcast.
Step S203: performing timbre conversion on the original timbre in the original audio based on the target timbre to obtain converted target audio.
In step S203, the target audio is composed of the target timbre and the speech content.
In the specific implementation of step S203, the original timbre in the original audio is converted so that the converted timbre is the same as the target timbre; the target audio is thus composed of the target timbre and the speech content of the original audio.
Step S204: mixing the target audio with the acquired original audio input by the other mic-linked terminals to obtain mixed audio, and sending the mixed audio to the other mic-linked terminals and to the viewer clients that have entered the multi-terminal mic-linked live room.
In the specific implementation of step S204, the target audio and the acquired channels of audio, such as the original audio input by the other mic-linked terminals, are combined by the stream-mixing technique into a single audio stream, i.e. the mixed audio, which is sent to all mic-linked terminals and to the viewer clients in the mic-linked live room.
It should be noted that the mic-linked terminals include the mic-link inviter and the invitees, among them the anchor terminal whose voice is changed.
Optionally, when the mixed audio is obtained, it also needs to be aligned with the video picture of the terminal by the stream-mixing technique to form an audio-video stream, which is sent to the other mic-linked terminals and to the viewer clients that have entered the multi-terminal mic-linked live room.
The audio stream means that audio can be output stably and continuously to the other mic-linked terminals and to the viewer clients in the multi-terminal mic-linked live rooms; the audio-video stream likewise means that audio and video can be output stably and continuously to them.
The stream-mixing technique refers to a technique of combining multiple channels of audio and video data.
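As a concrete illustration of the mixing step (not the patent's actual implementation), equal-length 16-bit PCM frames from each terminal can be summed sample by sample and clipped. A production stream-mixing pipeline would additionally handle resampling, buffering and the audio-video alignment described above.

```python
def mix_pcm(frames):
    """frames: a list of equal-length int16 sample lists, one per
    mic-linked terminal. Returns one mixed channel of audio."""
    mixed = [sum(samples) for samples in zip(*frames)]
    # Clip to the int16 range to avoid wrap-around distortion.
    return [max(-32768, min(32767, s)) for s in mixed]
```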
In the embodiment of the application, the server performs timbre conversion on the original audio input in real time by the terminal that triggered the voice-changing live broadcast mode to obtain the target audio, and then mixes the converted target audio with the acquired original audio input by the other mic-linked terminals, for viewing by the viewers and anchors in the live room. Mic-linked live streaming in this manner can improve the user's live-viewing experience and increase user stickiness to the live streaming platform.
Optionally, based on the mic-linked live streaming method shown above, performing timbre conversion in step S203 on the original timbre in the original audio based on the target timbre to obtain converted target audio includes:
performing timbre conversion on the original audio using a timbre conversion network to obtain converted target audio.
It should be noted that the target audio is composed of the target timbre and the speech content.
In specific implementation, the input original audio and target timbre are processed by the timbre conversion network disclosed in the embodiment of the present application: the original timbre in the original audio is converted into the target timbre, and the target audio composed of the target timbre and the speech content is finally output.
The timbre conversion network is constructed in advance from a speech content recognition model, a speech speaker recognition model, a timbre conversion model and a vocoder model.
It should be noted that the process of constructing the timbre conversion network in advance from these four models includes the following steps:
step S11: and training based on the first data set to obtain a speech content recognition model and a speech speaker recognition model.
It is noted that the first data set is a high quality speech data set.
Alternatively, the first data set may be the open-source speech data set LibriSpeech, a high-quality speech data set containing more than 2,400 timbres with a total duration of 1,000 hours or more.
In the process of implementing step S11, when training the speech content recognition model, first, the audio content expressed by each sentence of speech in the first data set is extracted. Then, the audio content is classified into different categories according to a preset rule. Finally, the different categories of audio content are trained using a recurrent neural network model to obtain the speech content recognition model. The speech content recognition model can accurately recognize the audio content from any human tone, i.e. the audio content of the original audio input by the user.
In training the speech speaker recognition model, first, the tone of each sentence of audio in the first data set is extracted. Then, each tone is trained using a recurrent neural network model to obtain the speech speaker recognition model. The speech speaker recognition model is used to accurately recognize speaker information, i.e. the tone of the audio, from any audio.
It should be noted that, if the type of the audio content is English, the preset rule may be set to the phonemes of English pronunciation; if the type of the audio content is Chinese, the preset rule may be set to the initials and finals of pinyin, with the finals carrying tones. If the type of the audio content is another foreign language or a dialect, the preset rule may be set according to the pronunciation of that language or dialect, which is not limited in the embodiment of the present application.
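A tiny hypothetical sketch of such a preset rule; the lookup tables contain only illustrative entries and are not a real phoneme or pinyin inventory:

```python
# Hypothetical lookup tables standing in for the "preset rule"; the entries
# are illustrative only, not a real phoneme or pinyin inventory.
ENGLISH_PHONEMES = {"cat": ["k", "ae", "t"]}
PINYIN_UNITS = {"ma1": ("m", "a1")}  # initial "m" plus tonal final "a1"

def classify_unit(token, language):
    """Bucket one token of audio content according to the preset rule."""
    if language == "en":
        return ENGLISH_PHONEMES.get(token, [])
    if language == "zh":
        return PINYIN_UNITS.get(token, ())
    raise ValueError("unsupported language: " + language)
```

The per-language branch mirrors the text's point that the rule itself changes with the language, while unknown languages are rejected rather than silently misclassified.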
The matrix dimension of the voice content feature matrix corresponding to each sentence of audio is T × 256, where T is the length of the audio. Each row of the T × 256 matrix represents the phoneme content of the audio of duration T at one moment.
The matrix dimension of the speech information feature matrix of each tone is 1 × 256. In the embodiment of the application, the speech information feature matrix is replicated along the time axis according to the audio length T, so as to obtain a speech information feature matrix of dimension T × 256.
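The replication described here can be sketched with plain Python lists (a stand-in for the tensor tiling operation a real system would use; the dimensions match the text, the values are hypothetical):

```python
def tile_speaker_embedding(embedding, T):
    """Copy a 1 x D speaker-information vector T times to get a T x D matrix."""
    return [list(embedding) for _ in range(T)]

content_matrix = [[0.1] * 256 for _ in range(5)]  # T = 5 content frames
speaker_vector = [0.5] * 256                      # 1 x 256 speaker embedding
tiled = tile_speaker_embedding(speaker_vector, len(content_matrix))
```

After tiling, the speaker matrix has the same T × 256 shape as the content matrix, so the two can be combined row by row before entering the tone conversion model.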
It should be noted that the recurrent neural network model is a type of neural network model in which the connections between some neurons form a directed cycle; this gives the recurrent neural network model an internal state, or memory, which equips it to model dynamic sequences.
In the embodiment of the present application, the speech content recognition model and the speech speaker recognition model may be constructed using a recurrent neural network model, or using other neural network models, machine learning models, and the like; this is not limited in the embodiment of the present application.
Step S12: the vocoder model is trained based on the second data set.
It should be noted that the second data set refers to a high quality audio data set.
Optionally, the second data set may be the open-source audio data set LibriTTS, a high-quality audio data set containing more than 2,400 timbres with a total audio duration of 500 hours or more.
In the process of implementing step S12, when training the vocoder model, 20-dimensional acoustic features of each sentence of speech in the second data set are extracted first. The vocoder model is then fully trained on the 20-dimensional acoustic features of each sentence of speech to obtain the trained vocoder model.
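As an illustrative stand-in for a real 20-dimensional acoustic feature extractor (which would typically compute mel-spectral or MFCC features), the per-frame bucketing below shows only the shape of the operation: one frame of samples in, one 20-dimensional vector out. It is an assumption for illustration, not the patent's feature extractor:

```python
def frame_features(frame, dims=20):
    """Average absolute sample values over `dims` equal buckets."""
    bucket = max(1, len(frame) // dims)
    feats = []
    for i in range(dims):
        chunk = frame[i * bucket:(i + 1) * bucket]
        feats.append(sum(abs(s) for s in chunk) / len(chunk) if chunk else 0.0)
    return feats

frame = list(range(-100, 100))    # 200 hypothetical samples in one frame
features = frame_features(frame)  # one 20-dimensional feature vector
```

Whatever the real extractor, the invariant the vocoder relies on is the same: every analysis frame maps to a fixed-length (here 20-dimensional) vector.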
Step S13: and training based on the second data set to obtain a tone conversion model.
It should be noted that, the process of obtaining the tone conversion model based on the second data set training specifically includes the following steps:
step S21: and inputting the second data set into the speech content recognition model to obtain a speech content characteristic matrix corresponding to the audio, and inputting the second data set into the speech speaker recognition model to obtain a speech information characteristic matrix corresponding to the audio.
In the process of implementing step S21 specifically, each sentence of audio in the second data set is identified by using the trained speech content identification model, so as to extract a speech content feature matrix corresponding to each sentence of audio; and recognizing each sentence of audio in the second data set by using the trained speaker recognition model so as to extract the corresponding voice information characteristic matrix of each sentence of audio.
Step S22: constructing an initial tone conversion model based on the split-gate convolution layers, the bidirectional long short-term memory networks, and the fully connected layer.
In the process of implementing step S22, an initial tone conversion model is built using N split-gate convolution layers, M bidirectional long short-term memory networks, and 1 fully connected layer.
The N split-gate convolution layers run from split-gate convolution layer 1 through split-gate convolution layer N; the M bidirectional long short-term memory networks run from bidirectional long short-term memory network 1 through bidirectional long short-term memory network M.
N and M are positive integers of 1 or more.
Optionally, as the values of N and M increase, the amount of computation required by the tone conversion model increases; moreover, once N and M grow beyond a certain value, the conversion effect of the tone conversion model declines. In order to achieve a better recognition and conversion effect while keeping the amount of computation reasonable, multiple experiments on the conversion effect and the computation cost of the tone conversion model are required. After such experiments, weighing the conversion computation cost against the conversion effect, it is preferable to set N to 4 and M to 2.
In practical applications, the setting of N and M may also be performed according to the experience of a skilled person.
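As a structural illustration, the stack described above (N split-gate convolution layers, then M bidirectional LSTMs, then one fully connected layer) can be sketched as a simple sequential composition. The layer names and the pass-through placeholder bodies are assumptions for illustration only; real split-gate convolutions and BiLSTMs would be built in a deep-learning framework:

```python
N_CONV, N_BILSTM = 4, 2   # the values the text settles on after experiments

def make_layer(name):
    def layer(x):
        return x + [name]   # record which layer the data passed through
    return layer

layers = ([make_layer("split_gate_conv_%d" % (i + 1)) for i in range(N_CONV)]
          + [make_layer("bilstm_%d" % (i + 1)) for i in range(N_BILSTM)]
          + [make_layer("fully_connected")])

def forward(features):
    for layer in layers:
        features = layer(features)
    return features

trace = forward([])   # the order the combined feature matrix flows through
```

Running the sketch makes the data flow of steps S23 through S25 explicit: the convolution stage first, the BiLSTM stage second, the fully connected layer last.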
Step S23: inputting the voice content feature matrix and the speech information feature matrix into the split-gate convolution layers of the initial tone conversion model for feature learning to obtain a first feature matrix.
In the process of implementing step S23, feature learning is performed on the input speech content feature matrix and the speech information feature matrix in sequence by using N split-gate convolution layers, so as to obtain a first feature matrix.
Step S24: training the first feature matrix using the bidirectional long short-term memory networks to obtain a second feature matrix.
In the embodiment of the present application, before training on the data output by the split-gate convolution layers, the bidirectional long short-term memory networks need to be trained in advance; it should be noted that the bidirectional long short-term memory network is a type of neural network model.
In the process of implementing step S24, the data output by the split-gate convolution layers is input into the first of the M bidirectional long short-term memory networks for training; the trained first feature matrix is then input into the next bidirectional long short-term memory network, and so on, until the M-th network processes the output of the previous network, so as to obtain the second feature matrix.
It should be noted that the bidirectional long short-term memory network can provide each node of the output layer with complete past and future context information about the input sequence.
Step S25: and the full connection layer performs nonlinear combination on the second feature matrix and outputs the predicted acoustic features of the timbre of the target person.
In the process of implementing step S25, the fully connected layer performs a nonlinear combination on the second feature matrix obtained after each sentence of audio has passed through the split-gate convolution layers and the bidirectional long short-term memory networks, and outputs the predicted acoustic features of the target person's timbre.
It should be noted that each node of the fully connected layer is connected to all nodes of the previous layer for integrating the extracted features.
Step S26: judging whether the absolute difference between the predicted acoustic features of the target person's timbre and the target acoustic features is within a preset range. If the absolute difference is within the preset range, step S27 is executed; if the absolute difference is outside the preset range, step S28 is executed.
In the process of step S26, it is determined whether the absolute difference between the predicted acoustic features of the target person's timbre and the target acoustic features, i.e. the loss function, is within a preset range. If it is within the preset range, step S27 is executed; if it is outside the preset range, step S28 is executed.
Step S27: and determining the current initial tone conversion model as the tone conversion model.
Step S28: and carrying out iterative calculation on the absolute difference value until the absolute difference value is within a preset range, and obtaining a trained tone conversion model.
In the embodiment of the present application, before performing iterative computation on the absolute difference, a learning rate for adjusting the absolute difference, a training batch size batch _ size, and the number of iterations are preset.
In the process of specifically implementing step S28, the adaptive moment estimation algorithm (Adam) is used to train, at the given learning rate, on batch_size speech content feature matrices and speech information feature matrices at a time; each batch is input into the initial tone conversion model for iterative training, and it is judged whether the loss function has converged to its minimum value. If the loss function has not converged to the minimum value, batches of batch_size speech content feature matrices and speech information feature matrices continue to be input into the initial tone conversion model for iterative training until the loss function converges to the minimum value, thereby determining the final tone conversion model.
Note that the learning rate indicates the magnitude of each update to the weights of the tone conversion model.
The training batch size batch_size is the number of speech content feature matrices and speech information feature matrices required for each training step of the tone conversion model.
The number of iterations refers to the number of times the entire training set, taken batch by batch, is input into the tone conversion model for training.
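The convergence loop of steps S26 through S28 can be sketched as follows; a single scalar weight, a plain sign-gradient step, and the fixed hyperparameters stand in for the real model and the Adam optimizer, so this is an illustrative assumption rather than the patent's training procedure:

```python
def train(target, learning_rate=0.1, tolerance=0.01, max_iterations=1000):
    """Iterate until the absolute-difference loss is within the preset range."""
    weight = 0.0                        # stands in for the predicted feature
    for step in range(max_iterations):
        loss = abs(weight - target)     # the absolute-difference loss
        if loss <= tolerance:           # within the preset range: converged
            return weight, step
        # The gradient of |w - t| is the sign of (w - t); step against it.
        weight -= learning_rate * (1.0 if weight > target else -1.0)
    return weight, max_iterations

final_weight, steps = train(target=1.0)
```

The structure mirrors the text: compute the loss, check it against the preset range, return the current model if it passes (step S27), otherwise keep iterating (step S28) up to the preset number of iterations.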
Step S14: and constructing a tone conversion network based on the trained voice content recognition model, the trained voice speaker recognition model, the trained tone conversion model and the trained vocoder model.
In the process of implementing step S14, the tone conversion network is constructed from the trained speech content recognition model, the trained speech speaker recognition model, the trained tone conversion model, and the trained vocoder model.
The embodiment of the present application further discloses the structure of the constructed tone conversion network, as shown in fig. 3.
The process of performing tone conversion on the original audio based on the established tone conversion network to obtain the converted target audio is as follows:
The original audio is input into the tone conversion network shown in fig. 3, so that the speech content recognition model in the network processes the input original audio to obtain the voice content feature matrix; the target tone is input into the network shown in fig. 3, so that the speech speaker recognition model in the network processes the target tone to obtain the speech information feature matrix; the combined matrix of the content feature matrix and the speech information feature matrix is input into the tone conversion model shown in fig. 3, which processes the combined matrix to obtain the acoustic features; finally, the acoustic features are input into the vocoder model of the tone conversion network shown in fig. 3, so that the vocoder model converts the acoustic features into the target audio composed of the target tone and the speech content.
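The four-stage inference path just described can be sketched as a chain of placeholder functions; the dictionary-based "features" only make the data flow visible and are not the real feature matrices (the in-reality shapes are noted in the comments):

```python
def content_model(original_audio):
    return {"content": original_audio["speech"]}        # T x 256 in reality

def speaker_model(target_tone_audio):
    return {"speaker": target_tone_audio["timbre"]}     # 1 x 256 in reality

def conversion_model(content_feat, speaker_feat):
    # Combine the two feature matrices and predict acoustic features.
    return {"acoustic": (content_feat["content"], speaker_feat["speaker"])}

def vocoder(acoustic_feat):
    speech, timbre = acoustic_feat["acoustic"]
    return {"speech": speech, "timbre": timbre}         # the target audio

def convert(original_audio, target_tone_audio):
    content = content_model(original_audio)
    speaker = speaker_model(target_tone_audio)
    acoustic = conversion_model(content, speaker)
    return vocoder(acoustic)

target_audio = convert({"speech": "hello", "timbre": "anchor"},
                       {"timbre": "target_F"})
```

The output keeps the speech content of the original audio while carrying the target timbre, which is exactly the property the text claims for the converted target audio.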
In the embodiment of the application, the pre-constructed tone conversion network performs tone conversion in real time on the original audio input by the anchor based on the terminal, which ensures both the quality of the converted audio and the similarity between the converted audio's tone and the target tone. The converted target audio is mixed with the acquired original audio input by the other microphone-connected terminals for the viewers and anchors in the live broadcast room. Performing microphone-connected live broadcast in this manner can improve the user's live viewing experience and increase user stickiness to the live broadcast platform.
Based on the microphone-connected live broadcast method shown in the embodiment of the present application, referring to fig. 3, a schematic flow diagram of another microphone-connected live broadcast method shown in the embodiment of the present application is provided; the method is applied to a server and includes:
step S401: and determining whether the live broadcasting room with the current microphone is in a microphone live broadcasting PK mode, if so, executing the step S402, and if not, continuously utilizing the live broadcasting mode with the current microphone live broadcasting to carry out live broadcasting.
In the specific implementation process of step S401, during live broadcast by the multiple microphone-connected terminals, the live broadcast mode of the microphone-connected live broadcast room is acquired, and it is determined whether that mode is the microphone-connected live broadcast PK mode; if so, step S402 is executed, and if not, the live broadcast continues in the current microphone-connected live broadcast mode.
Step S402: and when the live broadcasting PK mode is finished, acquiring live broadcasting data of all terminals for carrying out live broadcasting PK.
It should be noted that the live broadcast data is used to indicate the live broadcast preference of the user, and the number of all terminals is a positive integer greater than or equal to 2.
In the process of implementing step S402, when the live broadcast PK mode is finished, the live broadcast preference of each terminal performing live broadcast PK is obtained.
Step S403: and determining a target terminal and other terminals based on the live broadcast data of all the terminals.
In step S403, the target terminal is used to indicate the terminal with the highest live broadcast preference, and the other terminals are terminals with live broadcast preference lower than the highest live broadcast preference.
In the process of implementing step S403, the live broadcast preferences of the microphone-connected terminals performing live broadcast PK are compared, and the terminals are sorted by live broadcast preference. Then, the terminal with the highest live broadcast preference is set as the target terminal, and the terminals whose live broadcast preference is lower than the highest are set as the other terminals.
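A minimal sketch of this ranking step, using hypothetical terminal names and the preference scores from the worked example later in this document (terminal a: 16984, terminal b: 101):

```python
def pick_target(preferences):
    """preferences maps each PK terminal to its live broadcast preference."""
    ranked = sorted(preferences, key=preferences.get, reverse=True)
    return ranked[0], ranked[1:]

# Scores as in the worked example: terminal a 16984, terminal b 101.
target_terminal, other_terminals = pick_target({"a": 16984, "b": 101})
```

Since the number of PK terminals is at least 2, the highest-scoring terminal becomes the target terminal and every remaining terminal falls into the "other terminals" group.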
Step S404: and sending the authority for switching the live broadcast modes of other terminals to the target terminal.
It should be noted that the right to switch the live broadcast mode of the other terminals carries a target tone and switching duration selection instruction.
In the process of implementing step S404, the server sends to the target terminal the right to switch the live broadcast mode of the other terminals, carrying the target tone and switching duration selection instruction.
Optionally, the target terminal selects a target tone and a preset duration based on a target tone and preset duration selection instruction carried by the authority for switching the live broadcast mode of the other terminal; and packaging the selected target tone and the preset duration to generate a mode switching request for switching the live broadcast modes of other terminals, and sending the mode switching request to the server.
Step S405: and judging whether a mode switching request for switching the live mode of the other terminal sent by the target terminal is received, if so, executing the step S406, and if not, continuing to execute the step S405.
In the process of implementing step S405 specifically, it is determined whether a mode switching request for switching the live mode of another terminal is received, if so, step S406 is executed, and if not, step S405 is continuously executed.
It should be noted that the mode switching request carries a target tone and a preset duration selected by the target terminal.
Step S406: and switching the live broadcast mode of other terminals into a sound-changing live broadcast mode based on the mode switching request, so that the other terminals are in the sound-changing live broadcast mode within the preset time length.
In step S406, the mode switching request carries a target tone corresponding to the voice-changing live broadcast mode.
In the process of implementing step S406 specifically, the server switches the live broadcast mode of the terminal whose live broadcast preference is lower than the highest live broadcast preference to the voice-changing live broadcast mode by using the received mode switching request, and makes the duration of the voice-changing live broadcast mode of the terminal whose live broadcast preference is lower than the highest live broadcast preference equal to the preset duration.
Step S407: and acquiring original audio input by other terminals in real time.
Optionally, the anchor of each other terminal inputs the original audio in real time based on the microphone of the terminal itself, and sends the input original audio to the server.
In the process of implementing step S407, the server obtains the original audio input by each other terminal in real time.
Step S408: and performing tone conversion on the original audio based on the target tone carried by the mode switching request to obtain the converted target audio.
It should be noted that the specific implementation content of step S408 is the same as the specific implementation content of step S203 shown in the above embodiments, and reference may be made to each other.
It should be noted that the number of target audios is the same as that of other terminals.
Step S409: and mixing the target audio, the acquired original audio input by the target terminal and the acquired original audio input by other terminals which are connected with the microphones but are not subjected to live broadcast PK, and sending the mixed audio to all terminals connected with the microphones and a spectator end entering a live broadcast room with a plurality of terminals connected with the microphones.
In the process of implementing step S409, a mixed-flow technique is used to mix the multiple channels of audio, namely the target audio, the acquired original audio input by the target terminal, and the acquired original audio input by the other microphone-connected terminals not participating in the live broadcast PK, to generate one audio stream, i.e. the mixed-flow audio. The mixed-flow audio is then sent to all microphone-connected terminals and to the viewer terminals that have entered the live broadcast room.
It should be noted that the specific implementation process of step S409 is the same as the specific implementation process of step S204, and reference may be made to this.
It should be noted that the target terminal may perform mixing not only with the original audio shown above, but also with the target audio after triggering the sound-changing live broadcast mode, which is not limited in this application.
The mixed stream of other terminals connected to the microphone but not live PK may be not only the original audio shown above, but also the target audio after triggering the sound-changing live mode, and the application is not limited thereto.
In the embodiment of the application, when a mode switching request sent by the target terminal for switching the live broadcast mode of the other terminals is received, the live broadcast mode of the other terminals is switched to the voice-changing live broadcast mode based on the mode switching request, so that the other terminals remain in the voice-changing live broadcast mode for the preset duration. The original audio input by the other terminals is acquired in real time, and tone conversion is performed on the original audio based on the target tone carried by the mode switching request to obtain the converted target audio. The converted target audio is then mixed with the acquired original audio input by the other microphone-connected terminals for the viewers and anchors in the live broadcast room. Performing microphone-connected live broadcast in this manner can improve the user's live viewing experience and increase user stickiness to the live broadcast platform.
In order to better explain the live broadcast disclosed in the embodiment of the present application, a specific application example is explained below.
Suppose that anchor 1 performs live broadcast based on terminal a and anchor 2 performs live broadcast based on terminal b.
The server determines that the live broadcast room in which the anchor 1 performs live broadcast based on the terminal a and the anchor 2 based on the terminal b is in the live broadcast PK mode, as shown in fig. 5.
Wherein, the anchor ID of anchor 1 is "Little Orange" and the anchor ID of anchor 2 is "Zhou".
The server acquires the live broadcast data of terminal a and terminal b; the live broadcast data of terminal a indicates an audience preference of 16984, and the live broadcast data of terminal b indicates an audience preference of 101.
The server compares the audience preference 16984 of terminal a with the audience preference 101 of terminal b and sorts the microphone-connected terminals by live broadcast preference. It then determines that terminal a is the target terminal and terminal b is the other terminal.
And the server sends to the target terminal, namely terminal a, the right to switch the live broadcast mode of the other terminal, namely terminal b, carrying the target tone and switching duration selection instruction.
The terminal a selects a target tone F and preset duration for 30 minutes based on a target tone and preset duration selection instruction carried by the authority for switching the live broadcast modes of other terminals; and packaging the selected target timbre F and the preset time length for 30 minutes to generate a mode switching request for switching the live broadcast modes of other terminals, and sending the mode switching request to the server.
When it is determined that the mode switching request sent by terminal a for switching the live broadcast mode of the other terminal has been received, the live broadcast mode of terminal b is switched to the voice-changing live broadcast mode using the received request, so that terminal b remains in the voice-changing live broadcast mode for the preset duration of 30 minutes.
And acquiring the original audio input by the terminal b in real time, and performing tone conversion on the original tone in the original audio to ensure that the converted tone is the same as the target tone, thereby determining that the target audio is formed by the target tone and the voice content of the original audio.
And the target audio converted for terminal b is mixed with the original audio of terminal a and the like using the mixed-flow technique to generate the mixed-flow audio. The mixed-flow audio is then sent to terminal a, terminal b, and the viewer terminals that have entered the microphone-connected live broadcast room.
In the embodiment of the application, performing microphone-connected live broadcast in the above manner can improve the user's live viewing experience and increase user stickiness to the live broadcast platform.
Based on the microphone-connected live broadcast method shown in the embodiment of the present application, referring to fig. 5, a schematic flow diagram of another microphone-connected live broadcast method shown in the embodiment of the present application is provided; the method is applicable to a server and includes:
step S601: and acquiring user data of a viewer end initiating a microphone connecting request in the microphone connecting live broadcast room.
In the process of implementing step S601, the server determines the viewer terminals that trigger the microphone-connection request and obtains the user data of all those viewer terminals.
Step S602: and judging the direct broadcasting room permission type of the audience terminal based on the user data of the audience terminal, if the audience terminal has special permission, executing the step S603 to the step S605, and if the audience terminal has common permission, executing the step S606 to the step S608.
It should be noted that the live broadcast room permission types include the special permission and the ordinary permission.
In the process of implementing step S602, it is determined whether, among the user data of all the viewer terminals, there is a viewer terminal whose user data is greater than or equal to the specific user data. If so, the live broadcast room permission type of each viewer terminal whose user data is greater than or equal to the specific user data is the special permission, and steps S603 to S605 are executed. If not, the live broadcast room permission type of the viewer terminals whose user data is less than the specific user data is the ordinary permission, and steps S606 to S608 are executed.
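The threshold comparison in this step can be sketched as follows; the threshold value and the viewer names are hypothetical, since the text does not specify what the "specific user data" value is:

```python
SPECIAL_THRESHOLD = 10000   # hypothetical "specific user data" value

def room_permission(user_data):
    """Special permission at or above the threshold, ordinary below it."""
    return "special" if user_data >= SPECIAL_THRESHOLD else "ordinary"

permissions = {viewer: room_permission(score)
               for viewer, score in {"v1": 15000, "v2": 300}.items()}
```

The server can then branch per viewer terminal: "special" routes into steps S603 to S605 (mixed into the room's stream), "ordinary" into steps S606 to S608 (converted audio returned to the viewer alone).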
Step S603: and acquiring original audio input by the audience in real time based on the audience and target timbre selected by the audience based on the audience.
Step S604: and performing tone conversion on the original audio based on the target tone to obtain the converted target audio.
Step S605: and mixing the target audio with the acquired audio input by other terminals connected with the microphone to obtain mixed audio, and sending the mixed audio to all the terminals connected with the microphone and a spectator end entering a live broadcasting room connected with the microphone.
It should be noted that the specific implementation process of step S603 to step S605 is the same as the specific implementation process of step S202 to step S204, and can be referred to each other.
Step S606: and acquiring original audio input by the audience in real time based on the audience and target timbre selected by the audience based on the audience.
Step S607: and performing tone conversion on the original audio based on the target tone to obtain the converted target audio.
It should be noted that the specific implementation process of step S606 to step S607 is the same as the specific implementation process of step S202 to step S204, and they can be referred to each other.
Step S608: and transmitting the target audio to the audience.
In the process of implementing step S608, an audio stream is generated based on the target audio, so that the audio can be transmitted stably and continuously to the viewer terminal.
In the embodiment of the application, the live broadcast room permission of the user is determined based on the user data. If the viewer terminal has the special permission, the server performs tone conversion in real time on the original audio input by the viewer terminal in the live broadcast mode to obtain the target audio, and then mixes the converted target audio with the acquired original audio input by the other microphone-connected terminals for the viewers and anchors in the live broadcast room. If the viewer terminal has the ordinary permission, the server performs tone conversion in real time on the original audio input by the viewer terminal in the live broadcast mode to obtain the target audio, for the viewer's own listening. Performing microphone-connected live broadcast in this manner can improve the user's live viewing experience and increase user stickiness to the live broadcast platform.
Corresponding to the live broadcast method disclosed in fig. 2 in the embodiment of the present application, the embodiment of the present application further discloses a schematic structural diagram of a live broadcast device, as shown in fig. 7, the live broadcast device includes:
the first obtaining module 701 is configured to, in a live broadcast process in which multiple terminals are connected to a microphone, obtain an original audio input by a anchor based on a terminal in real time and obtain a target tone selected by the anchor based on the terminal if any terminal triggers a variable-sound live broadcast mode.
It should be noted that the original audio includes the speech content and the original timbre.
And the tone conversion network 702 is configured to perform tone conversion on an original tone in the original audio based on the target tone, so as to obtain a converted target audio.
It should be noted that the target audio is composed of a target tone and a speech content.
The first sending module 703 is configured to mix the target audio with the acquired original audio input by the other terminals connected to the microphone to obtain mixed audio, and send the mixed audio to the other terminals connected to the microphone and to a viewer end entering a live broadcast room with multiple terminals connected to the microphone.
It should be noted that the specific principle and execution process of each unit in the microphone-connected live broadcast apparatus disclosed in the embodiment of the present application are the same as those of the microphone-connected live broadcast method implemented in the present application; reference may be made to the corresponding parts of the microphone-connected live broadcast method disclosed in the embodiment of the present application, which are not described herein again.
In the embodiment of the present application, the server performs timbre conversion in real time on the original audio input by the terminal that triggered the voice-changing live broadcast mode to obtain the target audio, and then mixes the converted target audio with the acquired original audio input by the other mic-connected terminals, so that the result can be heard by the viewers and anchors in the live broadcast room.
Optionally, the timbre conversion network 702 is constructed in advance from a speech content recognition model, a speaker recognition model, a timbre conversion model and a vocoder model.

The speech content recognition model is used to process the input original audio to obtain a content feature matrix.

The speaker recognition model is used to process audio of the target timbre to obtain a speech information feature matrix.

The timbre conversion model is used to process the input combined matrix of the content feature matrix and the speech information feature matrix to obtain acoustic features, and is constructed from gated convolutional layers, a bidirectional long short-term memory network and a fully connected layer.

The vocoder model is used to convert the acoustic features into a target audio composed of the target timbre and the speech content.
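The data flow through the four models can be sketched as follows. All four models here are random stand-ins, since the patent discloses the pipeline but not weights or dimensions; only the tensor shapes and the combination of the content feature matrix with the speaker embedding mirror the description above, and every dimension chosen is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

def content_encoder(audio_frames):
    # stand-in for the speech content recognition model:
    # maps T audio frames to a (T, 8) content feature matrix
    return rng.standard_normal((len(audio_frames), 8))

def speaker_encoder(reference_audio):
    # stand-in for the speaker recognition model:
    # one fixed-size timbre embedding for the target voice
    return rng.standard_normal(4)

def conversion_model(combined):
    # stand-in for the timbre conversion model (gated convolutions,
    # BiLSTM and a fully connected layer in the patent); here a
    # single random linear map to 16 acoustic features per frame
    w = rng.standard_normal((combined.shape[1], 16))
    return combined @ w

def vocoder(acoustic):
    # stand-in for the vocoder: acoustic features -> waveform samples
    return acoustic.mean(axis=1)

def convert_timbre(original_audio, target_reference):
    content = content_encoder(original_audio)            # (T, 8)
    speaker = speaker_encoder(target_reference)          # (4,)
    tiled = np.tile(speaker, (content.shape[0], 1))      # (T, 4)
    combined = np.concatenate([content, tiled], axis=1)  # (T, 12)
    acoustic = conversion_model(combined)                # (T, 16)
    return vocoder(acoustic)                             # (T,)

wave = convert_timbre(original_audio=[0.0] * 50, target_reference=[0.0] * 100)
```

The key structural point is that the frame-level content features carry the speech content while the single tiled speaker embedding carries the target timbre, so the conversion model sees both at every time step.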
In the embodiment of the present application, the pre-constructed timbre conversion network performs timbre conversion in real time on the original audio input by the anchor through the terminal, which ensures both the quality of the converted audio and the similarity between its timbre and the target timbre. The converted target audio is mixed with the acquired original audio input by the other mic-connected terminals, so that the result can be heard by the viewers and anchors in the live broadcast room. Mic-connected live broadcasting in this manner improves the user's viewing experience and increases user stickiness to the live broadcast platform.
Corresponding to the mic-connected live broadcast method disclosed in fig. 4 of the embodiment of the present application, the embodiment further discloses a mic-connected live broadcast apparatus, whose schematic structure is shown in fig. 8. The apparatus includes:

A second obtaining module 801, configured to: in the course of a mic-connected live broadcast among multiple terminals, determine that the mic-connected live broadcast room is in a mic-connected live broadcast PK mode; when the PK mode ends, obtain the live broadcast data of all terminals that took part in the PK; and obtain in real time the original audio input by the other terminals.

A first determining module 802, configured to determine a target terminal and other terminals based on the live broadcast data of all the terminals.

It should be noted that the target terminal is the terminal with the highest live broadcast popularity, and the other terminals are the terminals whose live broadcast popularity is lower than the highest.

A second sending module 803, configured to send the target terminal the right to switch the live broadcast mode of the other terminals; if a mode switching request sent by the target terminal for switching the live broadcast mode of the other terminals is received, the switching module 804 is executed.

A switching module 804, configured to switch the live broadcast mode of the other terminals to the voice-changing live broadcast mode based on the mode switching request, so that the other terminals remain in the voice-changing live broadcast mode for a preset duration.
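The PK settlement and timed mode switch described above can be sketched as follows. This is a minimal illustration assuming the live broadcast data reduces to one popularity score per terminal; the dictionary-based mode store and all names are hypothetical.

```python
import time

def pick_target_terminal(live_data):
    """Return (target, others): the terminal with the highest popularity
    score and the remaining PK participants."""
    target = max(live_data, key=live_data.get)
    others = [t for t in live_data if t != target]
    return target, others

def apply_mode_switch(modes, others, target_timbre, duration_s, now=None):
    """Put every losing terminal into the voice-changing live broadcast
    mode until now + duration_s, using the timbre carried by the
    target terminal's mode switching request."""
    now = time.time() if now is None else now
    for t in others:
        modes[t] = {"mode": "voice_changing",
                    "target_timbre": target_timbre,
                    "expires_at": now + duration_s}
    return modes

# When the PK ends, the winner gains the right to switch the losers'
# live broadcast mode for a preset duration (here 300 s).
live_data = {"anchor_a": 1200, "anchor_b": 950, "anchor_c": 400}
target, others = pick_target_terminal(live_data)
modes = apply_mode_switch({}, others, target_timbre="cartoon",
                          duration_s=300, now=0.0)
```

A real server would also check `expires_at` on each incoming frame and fall back to the normal mode once the preset duration has elapsed.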
It should be noted that the mode switching request carries the target timbre corresponding to the voice-changing live broadcast mode.

The timbre conversion network 702, configured to perform timbre conversion on the original audio based on the target timbre carried in the mode switching request, obtaining a converted target audio.

A third sending module 805, configured to mix the target audio, the acquired original audio input by the target terminal, and the acquired original audio input by the other mic-connected terminals that did not take part in the live broadcast PK, and send the mixed audio to all mic-connected terminals and to the viewer ends that have entered the mic-connected live broadcast room.

It should be noted that the specific principles and execution processes of the units in the mic-connected live broadcast apparatus disclosed in this embodiment of the present application are the same as those of the mic-connected live broadcast method of the present application; reference may be made to the corresponding parts of the method embodiments, which are not repeated here.

In this embodiment of the present application, when a mode switching request sent by the target terminal for switching the live broadcast mode of the other terminals is received, the live broadcast mode of the other terminals is switched to the voice-changing live broadcast mode based on that request, so that the other terminals remain in the voice-changing live broadcast mode for a preset duration. The original audio input by the other terminals is obtained in real time, and timbre conversion is performed on it based on the target timbre carried in the mode switching request to obtain the converted target audio. The converted target audio is then mixed with the acquired original audio input by the other mic-connected terminals, so that the result can be heard by the viewers and anchors in the live broadcast room. Mic-connected live broadcasting in this manner improves the user's viewing experience and increases user stickiness to the live broadcast platform.
Corresponding to the mic-connected live broadcast method disclosed in fig. 6 of the embodiment of the present application, the embodiment further discloses a mic-connected live broadcast apparatus, whose schematic structure is shown in fig. 9. The apparatus includes:

A third obtaining module 901, configured to obtain the user data of a viewer end that initiates a mic-connecting request in the mic-connected live broadcast room.

A determining module 902, configured to determine the live broadcast room permission type of the viewer end based on its user data; if the viewer end holds the special permission, the fourth obtaining module 903 is executed, and if the viewer end holds the normal permission, the fifth obtaining module 905 is executed.

A fourth obtaining module 903, configured to obtain in real time the original audio input through the viewer end and the target timbre selected through the viewer end.

The timbre conversion network 702, configured to perform timbre conversion on the original audio based on the target timbre, obtaining a converted target audio.

A fourth sending module 904, configured to mix the target audio with the acquired audio input by the other mic-connected terminals to obtain mixed audio, and send the mixed audio to all mic-connected terminals and to the viewer ends that have entered the mic-connected live broadcast room.

A fifth obtaining module 905, configured to obtain in real time the original audio input through the viewer end and the target timbre selected through the viewer end.

The timbre conversion network 702, configured to perform timbre conversion on the original audio based on the target timbre, obtaining a converted target audio.

A fifth sending module 906, configured to send the target audio to the viewer end.
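The permission-dependent fan-out performed by the fourth and fifth sending modules can be sketched as follows. This is a minimal illustration under the assumption that audio frames are lists of samples and permission types are plain strings; all names are hypothetical.

```python
def route_converted_audio(permission, target_audio, other_audio_frames,
                          viewer_id, mic_terminals, room_viewers, mix):
    """Decide who receives the converted audio based on the viewer's
    live broadcast room permission type.

    Returns {recipient_id: audio} describing the fan-out.
    """
    if permission == "special":
        # mix with the other mic-connected terminals' audio and
        # broadcast to every terminal and every viewer in the room
        mixed = mix([target_audio] + other_audio_frames)
        recipients = set(mic_terminals) | set(room_viewers)
        return {r: mixed for r in recipients}
    # normal permission: the converted audio is heard by that viewer alone
    return {viewer_id: target_audio}

# toy mixer: element-wise sum of aligned frames
mix = lambda frames: [sum(s) for s in zip(*frames)]

out = route_converted_audio("normal", [1, 2], [[3, 4]], "viewer_1",
                            ["t1", "t2"], ["viewer_1", "viewer_2"], mix)
```

The two branches correspond exactly to the fourth sending module 904 (special permission, broadcast) and the fifth sending module 906 (normal permission, echo back to the requester).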
It should be noted that the specific principles and execution processes of the units in the mic-connected live broadcast apparatus disclosed in this embodiment of the present application are the same as those of the mic-connected live broadcast method of the present application; reference may be made to the corresponding parts of the method embodiments, which are not repeated here.

In this embodiment of the present application, the live broadcast room permission type of a viewer end is determined from its user data. If the viewer end holds the special permission, the server performs timbre conversion in real time on the original audio input through the viewer end in the voice-changing live broadcast mode to obtain the target audio, and then mixes the converted target audio with the acquired original audio input by the other mic-connected terminals, so that the result can be heard by the viewers and anchors in the live broadcast room. If the viewer end holds only the normal permission, the server likewise converts the timbre of the original audio in real time, but sends the resulting target audio back to that viewer end alone. Mic-connected live broadcasting in this manner improves the user's viewing experience and increases user stickiness to the live broadcast platform.
An embodiment of the present application provides an electronic device. The electronic device includes a processor and a memory; the memory is used to store program code and data for speech timbre conversion, and the processor is used to call the program instructions in the memory to execute the mic-connected live broadcast method of the above embodiments.

An embodiment of the present application provides a storage medium. The storage medium includes a stored program; when the program runs, the device on which the storage medium resides is controlled to execute the mic-connected live broadcast method shown in the embodiments of the present application.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, the system or system embodiments are substantially similar to the method embodiments and therefore are described in a relatively simple manner, and reference may be made to some of the descriptions of the method embodiments for related points. The above-described system and system embodiments are only illustrative, wherein the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A mic-connected live broadcast method, applied to a server, the method comprising:

in the course of a mic-connected live broadcast among multiple terminals, if any terminal triggers a voice-changing live broadcast mode, obtaining in real time an original audio input by an anchor through the terminal and a target timbre selected by the anchor through the terminal, wherein the original audio comprises speech content and an original timbre;

performing timbre conversion on the original timbre in the original audio based on the target timbre to obtain a converted target audio, wherein the target audio is composed of the target timbre and the speech content;

and mixing the target audio with the acquired original audio input by the other mic-connected terminals to obtain mixed audio, and sending the mixed audio to all mic-connected terminals and to viewer ends that have entered the mic-connected live broadcast room.
2. The method according to claim 1, wherein during the mic-connected live broadcast among the multiple terminals, the method further comprises:

determining whether the current mic-connected live broadcast room is in a mic-connected live broadcast PK mode;

if so, when the mic-connected live broadcast PK mode ends, obtaining live broadcast data of all terminals taking part in the mic-connected live broadcast PK;

determining a target terminal and other terminals based on the live broadcast data of all the terminals, wherein the target terminal is the terminal with the highest live broadcast popularity, and the other terminals are terminals whose live broadcast popularity is lower than the highest;

sending the target terminal the right to switch the live broadcast mode of the other terminals;

if a mode switching request sent by the target terminal for switching the live broadcast mode of the other terminals is received, switching the live broadcast mode of the other terminals to the voice-changing live broadcast mode based on the mode switching request, so that the other terminals remain in the voice-changing live broadcast mode for a preset duration, wherein the mode switching request carries a target timbre corresponding to the voice-changing live broadcast mode;

obtaining in real time the original audio input by the other terminals;

performing timbre conversion on the original audio based on the target timbre carried in the mode switching request to obtain a converted target audio;

and mixing the target audio, the acquired original audio input by the target terminal, and the acquired original audio input by other mic-connected terminals not taking part in the live broadcast PK, and sending the mixed audio to all mic-connected terminals and to viewer ends that have entered the mic-connected live broadcast room.
3. The method according to claim 1, wherein during the mic-connected live broadcast among the multiple terminals, the method further comprises:

obtaining user data of a viewer end that initiates a mic-connecting request in the mic-connected live broadcast room;

determining the live broadcast room permission type of the viewer end based on the user data of the viewer end;

if the viewer end holds a special permission, obtaining in real time an original audio input through the viewer end and a target timbre selected through the viewer end;

performing timbre conversion on the original audio based on the target timbre to obtain a converted target audio;

and mixing the target audio with the acquired audio input by the other mic-connected terminals to obtain mixed audio, and sending the mixed audio to all mic-connected terminals and to viewer ends that have entered the mic-connected live broadcast room.
4. The method according to claim 3, further comprising:

if the viewer end holds a normal permission, obtaining in real time an original audio input through the viewer end and a target timbre selected through the viewer end;

performing timbre conversion on the original audio based on the target timbre to obtain a converted target audio;

and sending the target audio to the viewer end.
5. The method according to claim 1, wherein performing timbre conversion on the original timbre in the original audio based on the target timbre to obtain a converted target audio comprises:

performing timbre conversion on the original audio by using a timbre conversion network to obtain a converted target audio, wherein the target audio is composed of the target timbre and the speech content;

wherein the timbre conversion network is constructed in advance from a speech content recognition model, a speaker recognition model, a timbre conversion model and a vocoder model: the original audio is input into the speech content recognition model for processing to obtain a content feature matrix; audio of the target timbre is input into the speaker recognition model for processing to obtain a speech information feature matrix; the combined matrix of the content feature matrix and the speech information feature matrix is taken as the input of the timbre conversion model and processed to obtain acoustic features, the timbre conversion model being constructed from gated convolutional layers, a bidirectional long short-term memory network and a fully connected layer; and the acoustic features are input into the vocoder model for audio conversion to obtain the converted target audio, the target audio being composed of the target timbre and the speech content.
6. A mic-connected live broadcast apparatus, comprising:

a first obtaining module, configured to: in the course of a mic-connected live broadcast among multiple terminals, when any terminal triggers a voice-changing live broadcast mode, obtain in real time an original audio input by an anchor through the terminal and a target timbre selected by the anchor through the terminal, the original audio comprising speech content and an original timbre;

a timbre conversion network, configured to perform timbre conversion on the original timbre in the original audio based on the target timbre to obtain a converted target audio, the target audio being composed of the target timbre and the speech content;

and a first sending module, configured to mix the target audio with the acquired original audio input by the other mic-connected terminals to obtain mixed audio, and send the mixed audio to the other mic-connected terminals and to viewer ends that have entered the mic-connected live broadcast room.
7. The apparatus according to claim 6, wherein the timbre conversion network is constructed in advance from a speech content recognition model, a speaker recognition model, a timbre conversion model and a vocoder model;

the speech content recognition model is used to process the input original audio to obtain a content feature matrix;

the speaker recognition model is used to process audio of the target timbre to obtain a speech information feature matrix;

the timbre conversion model is used to process the input combined matrix of the content feature matrix and the speech information feature matrix to obtain acoustic features, and is constructed from gated convolutional layers, a bidirectional long short-term memory network and a fully connected layer;

and the vocoder model is used to convert the acoustic features into the target audio, the target audio being composed of the target timbre and the speech content.
8. The apparatus according to claim 6, further comprising:

a second obtaining module, configured to determine, in the course of the mic-connected live broadcast among the multiple terminals, that the mic-connected live broadcast room is in a mic-connected live broadcast PK mode, obtain live broadcast data of all terminals taking part in the mic-connected live broadcast PK when the PK mode ends, and obtain in real time the original audio input by the other terminals;

a first determining module, configured to determine a target terminal and other terminals based on the live broadcast data of all the terminals, wherein the target terminal is the terminal with the highest live broadcast popularity, and the other terminals are terminals whose live broadcast popularity is lower than the highest;

a second sending module, configured to send the target terminal the right to switch the live broadcast mode of the other terminals;

a switching module, configured to, if a mode switching request sent by the target terminal for switching the live broadcast mode of the other terminals is received, switch the live broadcast mode of the other terminals to the voice-changing live broadcast mode based on the mode switching request, so that the other terminals remain in the voice-changing live broadcast mode for a preset duration, the mode switching request carrying a target timbre corresponding to the voice-changing live broadcast mode;

the timbre conversion network, configured to perform timbre conversion on the original audio based on the target timbre carried in the mode switching request to obtain a converted target audio;

and a third sending module, configured to mix the target audio, the acquired original audio input by the target terminal, and the acquired original audio input by other mic-connected terminals not taking part in the live broadcast PK, and send the mixed audio to all mic-connected terminals and to viewer ends that have entered the mic-connected live broadcast room.
9. An electronic device, comprising a processor and a memory, wherein the memory is configured to store program code and data for speech timbre conversion, and the processor is configured to call the program instructions in the memory to execute the mic-connected live broadcast method according to any one of claims 1-5.

10. A storage medium, wherein the storage medium comprises a stored program, and when the program runs, a device on which the storage medium resides is controlled to execute the mic-connected live broadcast method according to any one of claims 1-5.
CN202010940695.7A 2020-09-09 2020-09-09 Mic-connected live broadcast method and related device Active CN112019874B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202111073971.5A CN113784163B (en) 2020-09-09 2020-09-09 Mic-connected live broadcast method and related device
CN202010940695.7A CN112019874B (en) 2020-09-09 2020-09-09 Mic-connected live broadcast method and related device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010940695.7A CN112019874B (en) 2020-09-09 2020-09-09 Mic-connected live broadcast method and related device

Related Child Applications (1)

Application Number Title Priority Date Filing Date
CN202111073971.5A Division CN113784163B (en) 2020-09-09 2020-09-09 Mic-connected live broadcast method and related device

Publications (2)

Publication Number Publication Date
CN112019874A true CN112019874A (en) 2020-12-01
CN112019874B CN112019874B (en) 2022-02-08

Family

ID=73522123

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202111073971.5A Active CN113784163B (en) 2020-09-09 2020-09-09 Mic-connected live broadcast method and related device
CN202010940695.7A Active CN112019874B (en) 2020-09-09 2020-09-09 Mic-connected live broadcast method and related device

Family Applications Before (1)

Application Number Title Priority Date Filing Date
CN202111073971.5A Active CN113784163B (en) 2020-09-09 2020-09-09 Mic-connected live broadcast method and related device

Country Status (1)

Country Link
CN (2) CN113784163B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113411327A (en) * 2021-06-17 2021-09-17 广州方硅信息技术有限公司 Audio adjustment method, system, device, equipment and medium for voice live broadcast
CN113873277A (en) * 2021-09-26 2021-12-31 山西大学 Delay-security folk literature and art promotion system based on network live broadcast platform
CN114125494A (en) * 2021-09-29 2022-03-01 阿里巴巴(中国)有限公司 Content auditing auxiliary processing method and device and electronic equipment
CN114390304A (en) * 2021-12-20 2022-04-22 北京达佳互联信息技术有限公司 Live broadcast sound changing method and device, electronic equipment and storage medium
CN115412773A (en) * 2021-05-26 2022-11-29 武汉斗鱼鱼乐网络科技有限公司 Method, device and system for processing audio data of live broadcast room
WO2023102932A1 (en) * 2021-12-10 2023-06-15 广州虎牙科技有限公司 Audio conversion method, electronic device, program product, and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080097760A1 (en) * 2006-10-23 2008-04-24 Sungkyunkwan University Foundation For Corporate Collaboration User-initiative voice service system and method
CN105933738A (en) * 2016-06-27 2016-09-07 徐文波 Live video streaming method, device and system
CN106303586A (en) * 2016-08-18 2017-01-04 北京奇虎科技有限公司 A kind of method of network direct broadcasting, main broadcaster's end equipment, viewer end equipment
CN110062267A (en) * 2019-05-05 2019-07-26 广州虎牙信息科技有限公司 Live data processing method and apparatus, electronic device and readable storage medium
CN110324653A (en) * 2019-07-31 2019-10-11 广州华多网络科技有限公司 Game interaction exchange method and system, electronic equipment and the device with store function
CN110460867A (en) * 2019-07-31 2019-11-15 广州华多网络科技有限公司 Mic-connected interaction method, mic-connected interaction system, electronic device and storage medium
CN111583944A (en) * 2019-01-30 2020-08-25 北京搜狗科技发展有限公司 Sound changing method and device

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108900920B (en) * 2018-07-20 2020-11-10 广州虎牙信息科技有限公司 Live broadcast processing method, device, equipment and storage medium
CN110071938B (en) * 2019-05-05 2021-12-03 广州虎牙信息科技有限公司 Virtual image interaction method and device, electronic equipment and readable storage medium


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
MEINARD MULLER; SEBASTIAN EWERT; SEBASTIAN KREUZER: "Making chroma features more robust to timbre changes", 2009 IEEE International Conference on Acoustics, Speech and Signal Processing *
NI Suping; ZHANG Jianping; YAN Yonghong; LÜ Shinan: "Analysis of existing techniques for speech timbre conversion", Proceedings of the 7th National Conference on Man-Machine Speech Communication (NCMMSC7) *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115412773A (en) * 2021-05-26 2022-11-29 武汉斗鱼鱼乐网络科技有限公司 Method, device and system for processing audio data of live broadcast room
CN113411327A (en) * 2021-06-17 2021-09-17 广州方硅信息技术有限公司 Audio adjustment method, system, device, equipment and medium for voice live broadcast
CN113411327B (en) * 2021-06-17 2023-02-17 广州方硅信息技术有限公司 Audio adjustment method, system, device, equipment and medium for voice live broadcast
CN113873277A (en) * 2021-09-26 2021-12-31 山西大学 Delay-security folk literature and art promotion system based on network live broadcast platform
CN114125494A (en) * 2021-09-29 2022-03-01 阿里巴巴(中国)有限公司 Content auditing auxiliary processing method and device and electronic equipment
WO2023102932A1 (en) * 2021-12-10 2023-06-15 广州虎牙科技有限公司 Audio conversion method, electronic device, program product, and storage medium
CN114390304A (en) * 2021-12-20 2022-04-22 北京达佳互联信息技术有限公司 Live broadcast sound changing method and device, electronic equipment and storage medium
CN114390304B (en) * 2021-12-20 2023-08-08 北京达佳互联信息技术有限公司 Live broadcast sound changing method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN112019874B (en) 2022-02-08
CN113784163B (en) 2023-06-20
CN113784163A (en) 2021-12-10

Similar Documents

Publication Publication Date Title
CN112019874B (en) Mic-connected live broadcast method and related device
US20100324894A1 (en) Voice to Text to Voice Processing
JP2019216408A (en) Method and apparatus for outputting information
CN108920128B (en) Operation method and system of presentation
CN106796496A (en) Display device and its operating method
CN112735373A (en) Speech synthesis method, apparatus, device and storage medium
CN103491411A (en) Method and device based on language recommending channels
CN113205793B (en) Audio generation method and device, storage medium and electronic equipment
CN106792013A (en) A kind of method, the TV interactive for television broadcast sounds
CN113257218B (en) Speech synthesis method, device, electronic equipment and storage medium
CN110933485A (en) Video subtitle generating method, system, device and storage medium
CN110392273A (en) Method, apparatus, electronic equipment and the storage medium of audio-video processing
WO2021169825A1 (en) Speech synthesis method and apparatus, device and storage medium
CN114025235A (en) Video generation method and device, electronic equipment and storage medium
CN116756285A (en) Virtual robot interaction method, device and storage medium
CN112634886A (en) Interaction method of intelligent equipment, server, computing equipment and storage medium
CN106547731B (en) Method and device for speaking in live broadcast room
JP2001268078A (en) Communication controller, its method, providing medium and communication equipment
CN115171645A (en) Dubbing method and device, electronic equipment and storage medium
KR20220154655A (en) Device, method and computer program for generating voice data based on family relationship
CN115359796A (en) Digital human voice broadcasting method, device, equipment and storage medium
CN113889070A (en) Voice synthesis method and device for voice synthesis
CN116561294A (en) Sign language video generation method and device, computer equipment and storage medium
CN113345452A (en) Voice conversion method, training method, device and medium of voice conversion model
CN112349271A (en) Voice information processing method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description

PB01 Publication

SE01 Entry into force of request for substantive examination

TA01 Transfer of patent application right

Effective date of registration: 20210114

Address after: 510000 3108, 79 Wanbo 2nd Road, Nancun Town, Panyu District, Guangzhou City, Guangdong Province

Applicant after: GUANGZHOU CUBESILI INFORMATION TECHNOLOGY Co.,Ltd.

Address before: 28th floor, block B1, Wanda Plaza, Nancun Town, Panyu District, Guangzhou City, Guangdong Province

Applicant before: GUANGZHOU HUADUO NETWORK TECHNOLOGY Co.,Ltd.

EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20201201

Assignee: GUANGZHOU HUADUO NETWORK TECHNOLOGY Co.,Ltd.

Assignor: GUANGZHOU CUBESILI INFORMATION TECHNOLOGY Co.,Ltd.

Contract record no.: X2021440000054

Denomination of invention: Mic-connected live broadcast method and related device

License type: Common License

Record date: 20210208

GR01 Patent grant