CN114630144A - Audio replacement method, system and device in live broadcast room and computer equipment

Info

Publication number: CN114630144A
Application number: CN202210208196.8A
Authority: CN (China)
Other languages: Chinese (zh)
Other versions: CN114630144B (granted)
Inventor: 曾家乐
Assignee (original and current): Guangzhou Cubesili Information Technology Co., Ltd.
Legal status: Granted; Active

Classifications

    • H04N 21/2335: Processing of audio elementary streams involving reformatting operations of audio signals, e.g. by converting from one coding standard to another
    • G10L 13/02: Methods for producing synthetic speech; speech synthesisers
    • G10L 15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 15/26: Speech-to-text systems
    • G10L 15/30: Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
    • H04N 21/2187: Live feed
    • H04N 21/25866: Management of end-user data
    • H04N 21/4398: Processing of audio elementary streams involving reformatting operations of audio signals

Abstract

The application relates to the technical field of webcasting, and provides an audio replacement method, system, and apparatus for a live broadcast room, and a computer device. The method comprises the following steps: the server, in response to an audio conversion instruction, obtains first speaker information and first audio stream data corresponding to the first speaker information; converts the first audio stream data into first text stream data; determines a first speaker identifier corresponding to the first text stream data according to the first speaker information; and sends the first text stream data and the corresponding first speaker identifier to the viewer clients in the live broadcast room. The viewer client, in response to an audio generation instruction, inputs the first text stream data into an audio generation model corresponding to the first speaker identifier to obtain second audio stream data, and replaces the first audio stream data output in the live broadcast room with the second audio stream data. Compared with the prior art, the method can effectively alleviate audio stuttering and improve viewers' live-watching experience.

Description

Audio replacement method, system and device in live broadcast room and computer equipment
Technical Field
The embodiments of the present application relate to the technical field of webcasting, and in particular to an audio replacement method, system, and apparatus for a live broadcast room, and a computer device.
Background
With the rapid development of communication and streaming-media technology, webcasting has become increasingly popular among users, and many activities that traditionally took place offline can now be carried out in live broadcast rooms, for example: interview shows, gaming sessions, matchmaking events, and the like.
During a webcast, the anchor client collects audio and video stream data in real time and encodes it, then sends the encoded audio/video stream data to the server. A viewer client that has joined the live broadcast room (i.e., the live channel) pulls the encoded audio/video stream data from the server, decodes it, and plays it, so that viewers can watch the live content in the live broadcast room.
In this process, interference from factors such as low network transmission rates, packet loss, and high encoding/decoding complexity often causes audio and video stuttering in the live broadcast room, which harms the viewers' experience and drives viewers away. Moreover, the transmission, encoding, and decoding of real-time audio/video stream data easily overload the server and the clients, and consume a large amount of the viewers' available data.
Disclosure of Invention
The embodiments of the present application provide an audio replacement method, system, and apparatus for a live broadcast room, and a computer device, which can address the technical problems of poor webcast viewing experience and audience loss caused by audio/video stuttering and excessive data consumption. The technical solution is as follows:
in a first aspect, an embodiment of the present application provides a method for replacing audio in a live broadcast room, including:
the server responds to an audio conversion instruction, and obtains first speaker information and first audio stream data corresponding to the first speaker information; converting the first audio stream data into first text stream data; determining a first speaker identifier corresponding to the first text stream data according to the first speaker information, and sending the first text stream data and the first speaker identifier corresponding to the first text stream data to a viewer client in a live broadcast room; the first speaker information is speaker information corresponding to a current speaker in the live broadcast room;
the viewer client responds to an audio generation instruction to obtain the first text stream data and the first speaker identifier corresponding to the first text stream data; and inputs the first text stream data into a pre-trained audio generation model corresponding to the first speaker identifier to obtain second audio stream data; the pre-trained audio generation model corresponding to the first speaker identifier is obtained by training on audio stream training data corresponding to the first speaker identifier and text stream training data corresponding to the first speaker identifier, and the text stream training data corresponding to the first speaker identifier is obtained by converting the audio stream training data corresponding to the first speaker identifier;
the viewer client replaces the first audio stream data output in the live broadcast with the second audio stream data.
In a second aspect, an embodiment of the present application provides an audio replacing system in a live broadcast room, including: a server and a viewer client;
the server is used for responding to an audio conversion instruction and acquiring first speaker information and first audio stream data corresponding to the first speaker information; converting the first audio stream data into first text stream data; determining a first speaker identifier corresponding to the first text stream data according to the first speaker information, and sending the first text stream data and the first speaker identifier corresponding to the first text stream data to the viewer client in a live broadcast room; the first speaker information is the speaker information corresponding to the current speaker in the live broadcast room;
the viewer client is used for responding to an audio generation instruction and acquiring the first text stream data and the first speaker identifier corresponding to the first text stream data; inputting the first text stream data into a pre-trained audio generation model corresponding to the first speaker identifier to obtain second audio stream data; the pre-trained audio generation model corresponding to the first speaker identifier is obtained by training on audio stream training data corresponding to the first speaker identifier and text stream training data corresponding to the first speaker identifier, and the text stream training data corresponding to the first speaker identifier is obtained by converting the audio stream training data corresponding to the first speaker identifier;
the viewer client is further used for replacing the first audio stream data output in the live broadcast room with the second audio stream data.
In a third aspect, an embodiment of the present application provides an audio replacing apparatus in a live broadcast room, including:
the first conversion unit is used for responding to an audio conversion instruction by the server and acquiring first speaker information and first audio stream data corresponding to the first speaker information; converting the first audio stream data into first text stream data; determining a first speaker identifier corresponding to the first text stream data according to the first speaker information, and sending the first text stream data and the first speaker identifier corresponding to the first text stream data to a viewer client in a live broadcast room; the first speaker information is speaker information corresponding to a current speaker in the live broadcast room;
the first generation unit is used for responding to an audio generation instruction by the audience client and acquiring the first text stream data and a first speaker identifier corresponding to the first text stream data; inputting the first text stream data to a pre-trained audio generation model corresponding to the first speaker identifier to obtain second audio stream data; the pre-trained audio generation model corresponding to the first speaker identifier is obtained by training according to audio stream training data corresponding to the first speaker identifier and text stream training data corresponding to the first speaker identifier, and the text stream training data corresponding to the first speaker identifier is obtained by converting the audio stream training data corresponding to the first speaker identifier;
a first replacing unit, configured to have the viewer client replace the first audio stream data output in the live broadcast room with the second audio stream data.
In a fourth aspect, an embodiment of the present application provides a computer device, comprising a processor, a memory, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the method according to the first aspect.
In a fifth aspect, an embodiment of the present application provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the method according to the first aspect.
In the embodiment of the present application, the server, in response to an audio conversion instruction, obtains first speaker information and first audio stream data corresponding to the first speaker information; converts the first audio stream data into first text stream data; determines a first speaker identifier corresponding to the first text stream data according to the first speaker information; and sends the first text stream data and the corresponding first speaker identifier to the viewer clients in the live broadcast room, the first speaker information being the speaker information corresponding to the current speaker in the live broadcast room. The viewer client, in response to an audio generation instruction, obtains the first text stream data and the corresponding first speaker identifier, and inputs the first text stream data into a pre-trained audio generation model corresponding to the first speaker identifier to obtain second audio stream data; that model is trained on audio stream training data corresponding to the first speaker identifier and on text stream training data obtained by converting that audio stream training data. The viewer client then replaces the first audio stream data output in the live broadcast room with the second audio stream data.

In other words, while the server delivers audio stream data to the viewer clients in real time, it can, in response to an audio conversion instruction, obtain the first speaker information (i.e., the speaker information corresponding to the current speaker in the live broadcast room) and the corresponding first audio stream data, convert the first audio stream data into first text stream data, determine the first speaker identifier from the first speaker information, and send the text stream data and the identifier to the viewer clients in the live broadcast room. When a viewer client responds to an audio generation instruction, it can input the first text stream data into the pre-trained audio generation model corresponding to the first speaker identifier to obtain second audio stream data, and replace the first audio stream data currently output in the live broadcast room with it. As a result, in situations such as audio stuttering, viewers can still hear, in time, second audio stream data that imitates the current speaker's voice, which greatly reduces the impact of stuttering on the viewing experience. Furthermore, because the first audio stream data currently output in the live broadcast room has been replaced by the second audio stream data, the server can pause the delivery of audio stream data, saving bandwidth and reducing device load, which further improves the viewing experience and increases viewer retention and watch time.
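To make this flow concrete, the following minimal Python sketch (not part of the original disclosure) shows how the server-side conversion and the viewer-side regeneration could fit together. All class, method, and field names (the ASR engine, synthesize, replace_audio, and so on) are illustrative assumptions, not a prescribed implementation.

```python
# Minimal sketch of the claimed flow; every name below is hypothetical.

class Server:
    def __init__(self, speech_to_text, speaker_registry):
        self.speech_to_text = speech_to_text      # ASR engine (assumed interface)
        self.speaker_registry = speaker_registry  # maps speaker info -> speaker id

    def on_audio_conversion_instruction(self, room, speaker_info, audio_chunk):
        # Convert the current speaker's first audio stream into a text stream.
        text_chunk = self.speech_to_text.transcribe(audio_chunk)
        # Determine the first speaker identifier from the speaker information.
        speaker_id = self.speaker_registry.match(speaker_info)
        # Send text + identifier (far smaller than audio) to the viewer clients.
        room.broadcast_to_viewers({"speaker_id": speaker_id, "text": text_chunk})


class ViewerClient:
    def __init__(self, audio_models, player):
        self.audio_models = audio_models  # one pre-trained TTS model per speaker id
        self.player = player

    def on_audio_generation_instruction(self, message):
        model = self.audio_models[message["speaker_id"]]
        # Generate second audio stream data imitating the speaker's voice.
        second_audio = model.synthesize(message["text"])
        # Replace the (possibly stuttering) first audio stream with it.
        self.player.replace_audio(second_audio)
```

The key design point is that once viewers receive only text plus an identifier, the bandwidth-heavy audio path can be paused entirely, which is what yields the traffic and load savings described above.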
For a better understanding and implementation, the technical solutions of the present application are described in detail below with reference to the accompanying drawings.
Drawings
Fig. 1 is a schematic view of an application scenario of an audio replacement method in a live broadcast room according to an embodiment of the present application;
fig. 2 is a schematic display diagram of a video frame in a mic-linked co-hosting live scene according to an embodiment of the present application;
fig. 3 is a schematic display diagram of a video frame in a multi-user live scene according to an embodiment of the present application;
fig. 4 is a flowchart illustrating an audio replacing method in a live broadcast room according to a first embodiment of the present application;
fig. 5 is a schematic flowchart of S101 in a method for replacing audio in a live broadcast room according to a first embodiment of the present application;
fig. 6 is another flowchart of an audio replacing method in a live broadcast room according to the first embodiment of the present application;
fig. 7 is a schematic flowchart of an audio replacing method in a live broadcast room according to the first embodiment of the present application;
fig. 8 is a flowchart illustrating an audio replacing method in a live broadcast room according to a second embodiment of the present application;
fig. 9 is another flowchart of an audio replacing method in a live broadcast room according to the second embodiment of the present application;
fig. 10 is a schematic flowchart of an audio replacing method in a live broadcast room according to a second embodiment of the present application;
fig. 11 is a schematic structural diagram of an audio substitution system in a live broadcast room according to a third embodiment of the present application;
fig. 12 is a schematic structural diagram of an audio replacing apparatus in a live broadcast room according to a fourth embodiment of the present application;
fig. 13 is a schematic structural diagram of a computer device according to a fifth embodiment of the present application.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present application, as detailed in the appended claims.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this application and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It is to be understood that although the terms first, second, third, etc. may be used herein to describe various information, such information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present application. The word "if" as used herein may be interpreted as "when", "upon", or "in response to determining", depending on the context.
As will be appreciated by those skilled in the art, the terms "client" and "terminal device" as used herein cover both wireless-signal receiver devices, which have only receive capability and no transmit capability, and devices containing receive and transmit hardware capable of two-way communication over a bidirectional communication link. Such a device may include: a cellular or other communication device, with or without a multi-line display; a PCS (Personal Communications Service) terminal, which may combine voice, data processing, facsimile and/or data communication capabilities; a PDA (Personal Digital Assistant), which may include a radio-frequency receiver, a pager, Internet/intranet access, a web browser, a notepad, a calendar and/or a GPS (Global Positioning System) receiver; or a conventional laptop and/or palmtop computer or other device that has and/or includes a radio-frequency receiver. As used herein, a "client" or "terminal device" may be portable, transportable, installed in a vehicle (aeronautical, maritime and/or land-based), or situated and/or configured to operate locally and/or in a distributed fashion at any other location on earth and/or in space. The "client" or "terminal device" used herein may also be a communication terminal, an Internet terminal, or a music/video playing terminal, such as a PDA, an MID (Mobile Internet Device) and/or a mobile phone with a music/video playing function, or may be a smart TV, a set-top box, and the like.
The hardware referred to by the names "server", "client", "service node", etc. is essentially a computer device with the performance of a personal computer, and is a hardware device having necessary components disclosed by the von neumann principle, such as a central processing unit (including an arithmetic unit and a controller), a memory, an input device, an output device, etc., wherein a computer program is stored in the memory, and the central processing unit loads a program stored in an external memory into the internal memory to run, executes instructions in the program, and interacts with the input and output devices, thereby accomplishing specific functions.
It should be noted that the concept of "server" as referred to in this application can be extended to the case of a server cluster. According to the network deployment principle understood by those skilled in the art, the servers should be logically divided, and in physical space, the servers may be independent from each other but can be called through an interface, or may be integrated into one physical computer or a set of computer clusters. Those skilled in the art will appreciate this variation and should not be so limited as to restrict the implementation of the network deployment of the present application.
Referring to fig. 1, fig. 1 is a schematic view of an application scenario of the audio replacement method in a live broadcast room according to an embodiment of the present application. The application scenario includes an anchor client 101, a server 102, and a viewer client 103, where the anchor client 101 and the viewer client 103 interact with each other through the server 102.
The clients referred to in the embodiments of the present application include the anchor client 101 and the viewer client 103.
It is noted that there are many understandings of the concept of "client" in the prior art, such as: it may be understood as an application program installed in a computer device, or may be understood as a hardware device corresponding to a server.
In the embodiments of the present application, the term "client" refers to a hardware device corresponding to a server, and more specifically, refers to a computer device, such as: smart phones, smart interactive tablets, personal computers, and the like.
When the client is a mobile device such as a smart phone and an intelligent interactive tablet, a user can install a matched mobile application program on the client and can also access a Web application program on the client.
When the client is a non-mobile device such as a Personal Computer (PC), the user can install a matching PC application on the client, and similarly can access a Web application on the client.
The mobile application refers to an application program that can be installed in the mobile device, the PC application refers to an application program that can be installed in the non-mobile device, and the Web application refers to an application program that needs to be accessed through a browser.
Specifically, the Web application program may be divided into a mobile version and a PC version according to the difference of the client types, and the page layout modes and the available server support of the two versions may be different.
In the embodiment of the application, the types of live application programs provided to the user are divided into a mobile end live application program, a PC end live application program and a Web end live application program. The user can autonomously select the mode of participating in the live webcast according to different types of the client adopted by the user.
The present application can divide the clients into a main broadcasting client 101 and a spectator client 103, depending on the identity of the user using the clients.
The anchor client 101 is a client that transmits a live video, and is generally a client used by an anchor (i.e., a live anchor user) in live streaming.
The viewer client 103 refers to the end that receives and views the live video, and is typically the client used by a viewer watching the webcast (i.e., a live-broadcast viewer user).
The hardware at which the anchor client 101 and viewer client 103 are directed is essentially a computer device, and in particular, as shown in fig. 1, it may be a type of computer device such as a smart phone, smart interactive tablet, and personal computer. Both the anchor client 101 and the viewer client 103 may access the internet via known network access means to establish a data communication link with the server 102.
Server 102, acting as a business server, may be responsible for further connecting with related audio data servers, video streaming servers, and other servers providing related support, etc., to form a logically associated server cluster for serving related terminal devices, such as anchor client 101 and viewer client 103 shown in fig. 1.
In the embodiment of the present application, the anchor client 101 and the viewer client 103 may join the same live broadcast room (i.e., the same live channel). The live broadcast room is a chat room implemented by means of Internet technology, and generally has audio/video broadcast control functions. The anchor user broadcasts in the live broadcast room through the anchor client 101, and viewers can log in to the server 102 through the viewer client 103 to enter the live broadcast room and watch the broadcast.
In the live broadcast room, interaction between the anchor and the audience can be realized through well-known online interaction modes such as voice, video, and text. Generally, the anchor performs for the audience in the form of an audio/video stream, and economic transactions may also take place during the interaction. Of course, the application of the live broadcast room is not limited to online entertainment; it can also be extended to other relevant scenarios, such as video conferencing, product recommendation and sales, and any other scenario requiring similar interaction.
Specifically, the anchor logs in to the server 102 through the anchor client 101, which triggers the anchor client 101 to load the broadcast-start interface, in which a broadcast control is displayed. The anchor can start the live broadcast by clicking the broadcast control. If the current mode is video live broadcast, the anchor client 101 is triggered to collect audio and video stream data; if the current mode is voice live broadcast, the anchor client 101 is triggered to collect audio stream data.
The video stream data is acquired by a camera establishing data connection with the anchor client 101, and the camera may be a camera of the anchor client 101 or an external camera of the anchor client 101.
Taking a live video mode as an example, the anchor client 101 encodes the acquired audio/video stream data and pushes the encoded audio/video stream data to the server 102.
If a viewer enters the live broadcast room created by the anchor through the viewer client 103, the viewer client 103 is triggered to pull the encoded audio/video stream data from the server 102, decode it, and output it to the live broadcast room interface, so that the viewer can watch the live content in the live broadcast room.
The manner of entering the live room created by the anchor is not limited herein, and the viewer can enter the live room created by the anchor by means of a live room recommendation page, manual search of the live room, sliding the live room interface up and down, and the like.
The embodiment of the present application addresses a multi-user live scene, which is different from a mic-linked co-hosting live scene. A co-hosting scene is based on the server establishing a co-hosting session connection between at least two anchor clients, and the audio and video stream data collected by those anchor clients are output together. Referring to fig. 2, fig. 2 is a schematic display diagram of a video frame in a mic-linked co-hosting live scene according to an embodiment of the present application. As can be seen from fig. 2, two anchors are currently co-hosting: the left side of the video window 21 shows the video frame 211 corresponding to the first co-hosting anchor, and the right side shows the video frame 212 corresponding to the second co-hosting anchor.
The multi-user live scene addressed by the embodiments of the present application refers to a live scene in which multiple users are present in the same physical environment, so that the video frames collected by the anchor client contain images of multiple users. Since webcasting is a process of real-time interaction and communication with the audience, these users may also be called speakers, and the multiple speakers may or may not include the anchor who created the live broadcast room.
For example, in a certain physical environment, a multi-person interview live broadcast is performed, at this time, a video frame collected by the anchor client includes images of a plurality of speakers, and audio stream data collected by the anchor client includes sounds of the plurality of speakers.
Referring to fig. 3, fig. 3 is a schematic display diagram of a video frame in a multi-user live scene according to an embodiment of the present application. The video frame 32 displayed in the video window 31 in fig. 3 includes images of multiple speakers; video frame 32 is a frame captured by one anchor client, not a composited frame from a co-hosting scene.
In any live scene, interference from factors such as low network transmission rates, packet loss, and high encoding/decoding complexity often causes audio and video stuttering in the live broadcast room, which harms the viewers' experience and drives viewers away. On this basis, the embodiment of the present application provides an audio replacement method in a live broadcast room. Referring to fig. 4, fig. 4 is a schematic flowchart of the audio replacement method in a live broadcast room according to the first embodiment of the present application, and the method includes the following steps:
S101: the server responds to the audio conversion instruction, and obtains first speaker information and first audio stream data corresponding to the first speaker information; converts the first audio stream data into first text stream data; determines a first speaker identifier corresponding to the first text stream data according to the first speaker information, and sends the first text stream data and the first speaker identifier corresponding to the first text stream data to the viewer clients in the live broadcast room; the first speaker information is the speaker information corresponding to the current speaker in the live broadcast room.
S102: the viewer client responds to an audio generation instruction to obtain the first text stream data and the first speaker identifier corresponding to the first text stream data; and inputs the first text stream data into a pre-trained audio generation model corresponding to the first speaker identifier to obtain second audio stream data. The pre-trained audio generation model corresponding to the first speaker identifier is obtained by training on audio stream training data corresponding to the first speaker identifier and text stream training data corresponding to the first speaker identifier, and the text stream training data corresponding to the first speaker identifier is obtained by converting the audio stream training data corresponding to the first speaker identifier.
S103: the viewer client replaces the first audio stream data output in the live broadcast with the second audio stream data.
In the present embodiment, the audio replacement method in the live broadcast room is described from the perspective of two execution subjects: the client and the server, where the client includes the anchor client and the viewer client.
Regarding step S101, the server, in response to the audio conversion instruction, acquires the first speaker information and first audio stream data corresponding to the first speaker information; converting the first audio stream data into first text stream data; and determining a first speaker identifier corresponding to the first text stream data according to the first speaker information, and sending the first text stream data and the first speaker identifier corresponding to the first text stream data to a viewer client in the live broadcast room.
Before describing step S101, the flow of audio/video stream data among the anchor client, the server, and the viewer client is described. The anchor client obtains a live broadcast room identifier and the audio/video stream data corresponding to that identifier, i.e., the audio/video stream data collected by the anchor client, and sends both to the server. In response to a join request sent by a viewer client, the server obtains the live broadcast room identifier, the live room interface data corresponding to the identifier, and the audio/video stream data corresponding to the identifier, and sends the interface data and the audio/video stream data to the viewer client. The viewer client receives them, loads the live room interface according to the interface data, and plays the audio/video stream data corresponding to the live broadcast room identifier.
The live broadcast room identifier is a unique identifier corresponding to the live broadcast room. The audio-video stream data includes audio stream data and video stream data.
It can be understood that, in a voice live broadcast room, the anchor client only collects audio stream data. The streaming process is basically the same as described above, except that the viewer client does not output video stream data in the live broadcast room interface, viewers cannot see a video frame, and the display style of the live broadcast room interface may differ somewhat.
In the embodiment of the present application, in order to implement the function of the live broadcast room, the anchor client, the server, and the audience client may normally perform the above-described audio/video stream data transfer or audio stream data transfer.
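As a rough illustration only, the normal delivery path described above might be sketched as follows. The function and method names are hypothetical, and a real deployment would sit on top of a streaming protocol such as RTMP, HLS, or WebRTC.

```python
# Sketch of the normal (non-replacement) delivery path; names are hypothetical.

def anchor_push(anchor, server, room_id):
    # The anchor client collects and encodes audio/video stream data,
    # then pushes it to the server together with the room identifier.
    av_data = anchor.capture_and_encode()
    server.ingest(room_id, av_data)

def viewer_join(viewer, server, room_id):
    # The join request carries the live broadcast room identifier; the
    # server answers with the room interface data and the A/V stream.
    ui_data = server.get_room_interface(room_id)
    viewer.load_interface(ui_data)
    for av_chunk in server.pull_stream(room_id):
        viewer.decode_and_play(av_chunk)
```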
When the server responds to the audio conversion instruction, the server acquires the first speaker information and first audio stream data corresponding to the first speaker information.
The audio conversion instruction at least comprises a live broadcast room identifier, so that when the server responds to the audio conversion instruction, the server can confirm which live broadcast room audio stream data needs to be converted.
The first speaker information is speaker information corresponding to a current speaker in the live broadcast room. And the first audio stream data corresponding to the first speaker information is the audio stream data currently output in the live broadcast room.
In an optional embodiment, in video live broadcast the first speaker information is the face information corresponding to the current speaker in the live broadcast room, and in voice live broadcast it is the mic-seat (speaking-seat) information corresponding to the current speaker in the live broadcast room.
It can be understood that, as the live broadcast proceeds, the current speaker in a multi-user live scene may change constantly. Therefore, before performing audio conversion, the server needs to obtain the first speaker information and the first audio stream data corresponding to it, so that after the conversion it can determine which speaker the resulting first text stream data corresponds to.
Specifically, after the server acquires the first speaker information and first audio stream data corresponding to the first speaker information, the first audio stream data is converted into first text stream data, and a first speaker identifier corresponding to the first text stream data is determined according to the first speaker information.
The speaker identification is a unique identity configured by the server for each speaker in the live broadcast room. Since the speaker is not necessarily a user of the webcast platform, in the embodiment of the present application, the first speaker information is bound to the first speaker identifier.
The server can confirm the corresponding speaker identification according to the first speaker information.
Specifically, the server acquires first speaker information and speaker information corresponding to a plurality of speaker identifications; and if the first speaker information is matched with the speaker information corresponding to any speaker identifier, determining that the first speaker identifier corresponding to the first text stream data is the speaker identifier.
The speaker identifier corresponds to a pre-trained audio generation model; that is, each speaker identifier configured by the server has a corresponding pre-trained audio generation model. If a speaker identifier has merely been configured for a speaker but the corresponding audio generation model has not been trained, the speaker's voice cannot be imitated and audio replacement cannot be performed.
In video live broadcast, the first speaker information is the face information corresponding to the current speaker in the live broadcast room; if the face information corresponding to the current speaker matches the face information corresponding to any speaker identifier, that speaker identifier is the first speaker identifier corresponding to the first speaker information.
In voice live broadcast, the first speaker information is the mic-seat information corresponding to the current speaker in the live broadcast room; if the mic-seat information corresponding to the current speaker matches the mic-seat information corresponding to any speaker identifier in the same live broadcast room, that speaker identifier is the first speaker identifier corresponding to the first speaker information.
And then, the server sends the first text stream data and the first speaker identifier corresponding to the first text stream data to a viewer client in a live broadcast room, and obviously, the live broadcast room is the live broadcast room corresponding to the live broadcast room identifier.
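A sketch of this matching step is given below, assuming face information is represented as a feature vector compared by cosine similarity in the video case, and as a mic-seat number compared directly in the voice case. The threshold value and the registry structure are assumptions, since the patent does not fix a concrete matching algorithm.

```python
import numpy as np

FACE_MATCH_THRESHOLD = 0.8  # assumed value; the patent fixes no threshold

def match_speaker_id(first_speaker_info, registry, mode):
    """Return the speaker identifier whose stored information matches the
    current speaker; the registry fields are assumed structures."""
    if mode == "video":
        # first_speaker_info: a face feature vector extracted from the frame.
        for speaker_id, face_vec in registry.face_vectors.items():
            cos = np.dot(first_speaker_info, face_vec) / (
                np.linalg.norm(first_speaker_info) * np.linalg.norm(face_vec))
            if cos >= FACE_MATCH_THRESHOLD:
                return speaker_id
    else:
        # Voice live broadcast: first_speaker_info is the current speaker's
        # mic-seat number, matched within the same live broadcast room.
        for speaker_id, seat in registry.mic_seats.items():
            if seat == first_speaker_info:
                return speaker_id
    return None  # no pre-trained audio generation model for this speaker
```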
It should be noted that, in an alternative embodiment, the audio conversion instruction is generated after the pre-training of the audio generation model, because only after the pre-training of the audio generation model, the converted first text stream data can be used by the viewer client to generate the second audio stream data, so as to implement the audio replacement.
The following describes in detail how the current speaker in the live broadcast room is determined, and how the first speaker information and the first audio stream data corresponding to the first speaker information are obtained.
In an optional embodiment, the anchor client directly sends the acquired audio/video stream data to the server, and the server determines the first speaker information and the first audio stream data corresponding to the first speaker information according to the audio/video stream data acquired by the anchor client.
Specifically, referring to fig. 5, fig. 5 is a schematic flowchart of S101 in a method for replacing audio in a live broadcast room according to a first embodiment of the present application, where in S101, first audio stream data corresponding to first speaker information and first speaker information is obtained, and the method includes:
S1011: the server receives the audio and video stream data collected by the anchor client; the audio/video stream data includes audio stream data and video stream data, and the video stream data includes multiple frames of video pictures.
S1012: the server carries out face positioning in the video picture to obtain at least one face area, and monitors the speaking action in the face area in the video picture to determine the current speaker and the face area corresponding to the current speaker.
S1013: the server acquires face information corresponding to the current speaker as first speaker information according to the video picture and the face area corresponding to the current speaker.
S1014: the server acquires that the currently received audio stream data is first audio stream data corresponding to the first speaker information.
In step S1012, the server performs face localization in the video pictures in real time to obtain at least one face region. The face localization algorithm adopted by the server is not limited herein; for example, a YOLO (You Only Look Once) neural network algorithm may be used for face localization.
And then, the server monitors the speaking action in the face area and determines the current speaker and the face area corresponding to the current speaker. The speaking action monitoring algorithm specifically adopted by the server is not limited herein, and for example: the server can judge whether the speaker is speaking by monitoring the mouth movement in the face area.
Wherein, the face area is determined by the face position and the face size.
In step S1013, the server acquires, according to the video frame and the face area corresponding to the current speaker, that the face information corresponding to the current speaker is the first speaker information.
Specifically, the server may capture a video frame according to the face position and the face size, and acquire face information corresponding to the current speaker, that is, acquire the first speaker information.
The face information may refer to a face image, or may refer to a feature vector extracted from the face image.
In an optional embodiment, in S1012, after the server performs face localization in the video pictures to obtain at least one face region, the method includes the step of: the server performs living body (liveness) detection in the face regions of the video pictures, obtaining the face regions corresponding to the speakers in the live broadcast room and the number of speakers in the live broadcast room.
The liveness detection algorithm is not limited herein; any liveness detection algorithm may be used, for example an algorithm that determines whether a face is live by monitoring blinking actions.
By performing living body detection in the face area in the video picture, the face area corresponding to the speaker in the live broadcast room and the number of speakers in the live broadcast room can be obtained, and the face displayed in the face area corresponding to the speaker is a real face.
If the number of speakers in the live broadcast room is not less than two, in S1012, the speaking action is monitored in the face area in the video picture, and the current speaker and the face area corresponding to the current speaker are determined, including the steps of: the server monitors the speaking action in the face area corresponding to the speaker in the video picture, and determines the current speaker and the face area corresponding to the current speaker.
At this time, the server only needs to monitor the speaking action in the face area corresponding to the speaker, so that the computing resources can be saved to a certain extent.
If the number of the speakers in the live broadcast room is one, in S1012, the speaking action is monitored in the face area in the video frame, and the current speaker and the face area corresponding to the current speaker are determined, including the steps of: the server determines that a speaker in the live broadcast room is a current speaker, and a face area corresponding to the speaker in the live broadcast room is a face area corresponding to the current speaker.
Because there is only one speaker in the live broadcast room, the speaking action does not need to be monitored, so that the computing resources can be further saved, and the speed of acquiring the face area corresponding to the current speaker is improved.
In this embodiment, the server can determine the current speaker in the live broadcast room in real time by face positioning and detection of a speaking action, and obtain face information corresponding to the current speaker, that is, obtain first speaker information, so that the first audio stream data corresponding to the first speaker information and the first speaker information can be accurately obtained by associating the currently received audio stream data with the first speaker information, that is, associating the first audio stream data with the first speaker information.
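The localization-and-monitoring loop of S1012/S1013 could look roughly as follows; face_detector and mouth_monitor stand in for whatever face localization (e.g. YOLO) and speaking-action monitoring algorithms are chosen, the is_live check stands in for the optional liveness detection, and the frame is assumed to be a NumPy-style image array. All of these interfaces are assumptions for illustration.

```python
def locate_current_speaker(video_frame, face_detector, mouth_monitor):
    """Sketch of S1012/S1013 (or S105/S106 on the anchor client):
    find who is currently speaking in the frame."""
    # Face localization; each region is described by face position and
    # face size: (x, y, width, height).
    face_regions = face_detector.detect(video_frame)

    # Optionally keep only real (live) faces, e.g. via blink detection,
    # so speaking actions are monitored only for actual speakers.
    live_regions = [r for r in face_regions if face_detector.is_live(r)]

    # With a single speaker in the room, no monitoring is needed.
    if len(live_regions) == 1:
        x, y, w, h = live_regions[0]
        return video_frame[y:y + h, x:x + w], live_regions[0]

    for region in live_regions:
        # Monitor mouth movement inside the face region across recent frames.
        if mouth_monitor.is_speaking(video_frame, region):
            x, y, w, h = region
            # Cropping by face position and size yields the face image,
            # i.e. the first speaker information.
            return video_frame[y:y + h, x:x + w], region
    return None, None
```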
In another optional embodiment, the anchor client may determine, in real time, the first speaker information and the first audio stream data corresponding to the first speaker information according to the acquired audio/video stream data, and then send the first speaker information and the first audio stream data corresponding to the first speaker information to the server.
Specifically, referring to fig. 6, fig. 6 is another schematic flow chart of an audio replacing method in a live broadcast room according to a first embodiment of the present application, where before the S101 server responds to an audio conversion instruction, the method includes the steps of:
S104: the anchor client responds to the broadcast instruction and collects audio and video stream data; the audio/video stream data includes audio stream data and video stream data, and the video stream data includes multiple frames of video pictures.
S105: the anchor client carries out face positioning in the video picture to obtain at least one face area, monitors speaking actions in the face area in the video picture and determines a current speaker and the face area corresponding to the current speaker.
S106: and the anchor client acquires the face information corresponding to the current speaker as the first speaker information according to the video picture and the face area corresponding to the current speaker.
S107: the anchor client acquires currently acquired audio stream data as first audio stream data corresponding to the first speaker information.
In step S104, the broadcast instruction is generated in response to triggering of the broadcast control, and the anchor client starts collecting audio/video stream data in real time in response to the broadcast instruction.
Steps S105 to S107 follow the same execution flow as steps S1012 to S1014, except that the execution subject is the anchor client. Related concepts are not explained again; refer to the description of steps S1012 to S1014.
In an optional embodiment, after the anchor client performs face localization in a video picture to obtain at least one face region, the method includes the steps of: and the anchor client performs living body detection in a face area in the video picture to acquire the face area corresponding to the speaker in the live broadcast room and the number of speakers in the live broadcast room.
By performing living body detection in the face area in the video picture, the face area corresponding to the speaker in the live broadcast room and the number of speakers in the live broadcast room can be obtained, and the face displayed in the face area corresponding to the speaker is a real face.
If the number of the speakers in the live broadcast room is not less than two, the speaking action is monitored in the face area in the video picture, and the current speaker and the face area corresponding to the current speaker are determined, which comprises the following steps: the anchor client monitors the speaking action in the face area corresponding to the speaker in the video picture, and determines the current speaker and the face area corresponding to the current speaker.
At this time, the anchor client only needs to monitor the speaking action in the face area corresponding to the speaker, so that the computing resources can be saved to a certain extent.
If the number of the speakers in the live broadcast room is one, the speaking action is monitored in the face area in the video picture, and the current speaker and the face area corresponding to the current speaker are determined, which comprises the following steps: the anchor client determines that a speaker in the live broadcast room is a current speaker, and a face area corresponding to the speaker in the live broadcast room is a face area corresponding to the current speaker.
Because there is only one speaker in the live broadcast room, the speaking action does not need to be monitored, so that the computing resources can be further saved, and the speed of acquiring the face area corresponding to the current speaker is improved.
Regarding step S102, the viewer client, in response to the audio generation instruction, acquires the first text stream data and the first speaker identifier corresponding to the first text stream data; and inputting the first text stream data to a pre-trained audio generation model corresponding to the first speaker identifier to obtain second audio stream data.
The pre-trained audio generation model corresponding to the first speaker identifier is obtained by training according to audio stream training data corresponding to the first speaker identifier and text stream training data corresponding to the first speaker identifier, and the text stream training data corresponding to the first speaker identifier is obtained by converting the audio stream training data corresponding to the first speaker identifier.
The pre-trained audio generation model corresponding to the first speaker identification can imitate the sound of the current speaker and generate second audio stream data according to the first text stream data.
The specific training process on how the audio generation instructions and the audio generation model are generated will be described later.
Regarding step S103, the viewer client replaces the first audio stream data output in the live broadcast room with the second audio stream data.
The first audio stream data output in the live broadcast refers to the audio stream data currently output in the live broadcast.
In an optional embodiment, since the second audio stream data generated from the first text stream data is being output by the viewer client, the viewer client may send an audio delivery pause command to the server, so that the server stops delivering audio stream data in response to that command.
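Steps S102 and S103 on the viewer side, together with the optional pause command, might be sketched as follows; the client interfaces (audio_models, player, send, room_id) are assumptions for illustration, not a prescribed API.

```python
def on_audio_generation_instruction(client, message):
    """Sketch of S102/S103 on the viewer client (assumed interfaces)."""
    # S102: generate second audio stream data with the pre-trained model
    # corresponding to the first speaker identifier.
    model = client.audio_models[message["speaker_id"]]
    second_audio = model.synthesize(message["text"])

    # S103: swap the live audio track for the locally generated one.
    client.player.replace_audio(second_audio)

    # Optional: tell the server it can stop delivering audio stream data,
    # saving bandwidth and reducing load on both ends.
    client.send(("audio_pause_issue", client.room_id))
```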
In the embodiment of the present application, while the server delivers audio stream data to the viewer clients in real time, it responds to an audio conversion instruction by obtaining the first speaker information (i.e., the speaker information corresponding to the current speaker in the live broadcast room) and the first audio stream data corresponding to the first speaker information. The server then converts the first audio stream data into first text stream data, determines the first speaker identifier corresponding to the first text stream data according to the first speaker information, and sends the first text stream data and the corresponding first speaker identifier to the viewer clients in the live broadcast room. When a viewer client responds to an audio generation instruction, it can input the first text stream data into the pre-trained audio generation model corresponding to the first speaker identifier to obtain second audio stream data, and then replace the first audio stream data currently output in the live broadcast room with the second audio stream data. Therefore, in situations such as audio stuttering, viewers can still hear, in time, second audio stream data that imitates the current speaker's voice, which greatly reduces the impact of stuttering on their live viewing experience. Moreover, because the first audio stream data currently output in the live broadcast room has been replaced by the second audio stream data, the server can pause the delivery of audio stream data, saving bandwidth and reducing device load, which further improves the viewing experience and increases viewer retention and watch time.
The following is a description of the training process of the audio generation model. In an alternative embodiment, the audio generation model and the audio identification model form an adversarial neural network model. Referring to fig. 7, before the server responds to the audio conversion instruction in S101, the method includes the following steps:
S108: The server responds to the model training instruction and obtains audio stream training data corresponding to the speaker identifier and text stream training data corresponding to the speaker identifier; the text stream training data corresponding to the speaker identifier is obtained by converting the audio stream training data corresponding to the speaker identifier.
S109: the server carries out countermeasure training on the audio generation model corresponding to the speaker identification and the audio identification model corresponding to the speaker identification according to the audio stream training data corresponding to the speaker identification and the text stream training data corresponding to the speaker identification, and a pre-trained audio generation model corresponding to the speaker identification and a pre-trained audio identification model corresponding to the speaker identification are obtained.
In this embodiment, the training data used for training the audio generation model and the audio identification model includes audio stream training data corresponding to the speaker identifier and text stream training data corresponding to the speaker identifier. Audio stream training data and text stream training data corresponding to different speaker identifiers are used to train the audio generation models and audio identification models corresponding to those identifiers.
Specifically, acquiring the audio stream training data corresponding to the speaker identifier in step S108 includes the following steps. The server receives audio and video stream training data collected by the anchor client; the audio and video stream training data comprises audio stream training data and video stream training data, and the video stream training data includes a plurality of frames of video training pictures. The server performs face positioning in the video training pictures to obtain at least one face area, monitors speaking actions in the face areas in the video training pictures, and determines the current speaker, the face information corresponding to the current speaker, and the audio stream training data corresponding to the current speaker, thereby obtaining face information and audio stream training data corresponding to at least one speaker; the face information corresponding to the current speaker is obtained according to the video training picture and the face area corresponding to the current speaker. The server then configures a speaker identifier for each of the at least one speaker to obtain the face information corresponding to the speaker identifier and the audio stream training data corresponding to the speaker identifier.
It will be appreciated that the server needs to collect a large amount of audio stream training data corresponding to each speaker identifier in order to pre-train the audio generation model. The server therefore continuously determines, through face positioning and speaking-action monitoring, the current speaker, the face information corresponding to the current speaker, and the audio stream training data corresponding to the current speaker, finally obtaining face information and audio stream training data corresponding to at least one speaker. For the specific ways of performing face positioning and speaking-action monitoring, reference is made to the preceding description.
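As one possible illustration of that labeling step, the sketch below credits the audio received with each video training frame to the face whose mouth moves the most in that frame. The per-frame record layout and the mouth-openness measure are assumptions made for the example; the application itself does not prescribe them.

from collections import defaultdict
from typing import Dict, List

def label_audio_by_speaker(frames: List[dict],
                           mouth_open_thresh: float = 0.4
                           ) -> Dict[str, List[bytes]]:
    """frames: per-video-frame records, each assumed to carry
    {'audio': bytes, 'faces': {face_id: mouth_openness}}."""
    clips: Dict[str, List[bytes]] = defaultdict(list)
    for frame in frames:
        # The face whose mouth is most open (above threshold) is treated as
        # the current speaker for this frame.
        face_id, openness = max(frame["faces"].items(),
                                key=lambda kv: kv[1],
                                default=(None, 0.0))
        if face_id is not None and openness >= mouth_open_thresh:
            clips[face_id].append(frame["audio"])
    return clips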
Similarly, the server can also perform liveness detection in the face areas in the video training pictures to acquire the face areas corresponding to the speakers and the number of speakers.
The liveness detection algorithm is not limited here and may be any existing liveness detection algorithm, for example, an algorithm that judges whether a face is live by monitoring blinking motion.
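A common blink-based check of this kind uses the eye aspect ratio (EAR), which drops sharply when the eye closes; a live face blinks within a few seconds, while a printed photo does not. The following sketch assumes six eye landmarks per frame are already available from some face-landmark detector, and its thresholds are illustrative only.

import math
from typing import List, Tuple

def eye_aspect_ratio(eye: List[Tuple[float, float]]) -> float:
    # eye: six (x, y) landmarks ordered around the eye contour.
    d = math.dist
    return (d(eye[1], eye[5]) + d(eye[2], eye[4])) / (2.0 * d(eye[0], eye[3]))

def is_live(ear_per_frame: List[float], blink_thresh: float = 0.2,
            min_blinks: int = 1) -> bool:
    # A live face should produce at least min_blinks dips of the EAR below
    # the threshold over the observation window.
    blinks, closed = 0, False
    for ear in ear_per_frame:
        if ear < blink_thresh and not closed:
            blinks, closed = blinks + 1, True
        elif ear >= blink_thresh:
            closed = False
    return blinks >= min_blinks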
If the number of the speakers is not less than two, the speech action is monitored in the face area in the video training picture, the current speaker, the face information corresponding to the current speaker and the audio stream training data corresponding to the current speaker are determined, and the method comprises the following steps: the server monitors the speaking action in the face area corresponding to the speaker in the video training picture, and determines the current speaker, the face information corresponding to the current speaker and the audio stream training data corresponding to the current speaker.
At this time, the server only needs to monitor the speaking action in the face area corresponding to the speaker, so that the computing resources can be saved to a certain extent.
If the number of speakers in the live broadcast room is one, the speaking action is monitored in a face area in a video training picture, the current speaker, face information corresponding to the current speaker and audio stream training data corresponding to the current speaker are determined, and the method comprises the following steps: the server determines that a speaker in the video training picture is the current speaker, and obtains face information corresponding to the current speaker and audio stream training data corresponding to the current speaker.
Because there is only one speaker in the live broadcast room, no speaking action needs to be monitored, which further saves computing resources and improves the speed of acquiring the face information corresponding to the current speaker.
The following describes in detail the adversarial training process of the audio generation model corresponding to the speaker identifier and the audio identification model corresponding to the speaker identifier. Specifically, S109 includes the following steps:
S1091: The server inputs the text stream training data corresponding to the speaker identifier into the audio generation model corresponding to the speaker identifier and acquires virtual audio stream training data corresponding to the speaker identifier.
S1092: the server iteratively trains the audio identification model corresponding to the speaker identifier according to the audio stream training data corresponding to the speaker identifier, the virtual audio stream training data corresponding to the speaker identifier, a preset first loss function and a preset first model optimization algorithm, and optimizes trainable parameters in the audio identification model corresponding to the speaker identifier until the value of the first loss function meets a preset first training termination condition, so that the currently trained audio identification model corresponding to the speaker identifier is obtained.
S1093: The server modifies the label of the virtual audio stream training data to true, inputs the virtual audio stream training data into the currently trained audio identification model corresponding to the speaker identifier, and acquires the identification result of the virtual audio stream training data.
S1094: and if the identification result of the virtual audio stream training data meets a preset second training termination condition, the server obtains a pre-trained audio generation model corresponding to the speaker identification and a pre-trained audio identification model corresponding to the speaker identification.
S1095: if the identification result of the virtual audio stream training data does not meet the preset second training termination condition, the server obtains a value of a second loss function according to the identification result of the virtual audio stream training data, the label of the virtual audio stream training data and a preset second loss function, and optimizes trainable parameters of the audio generation model corresponding to the speaker identification according to the value of the second loss function and a preset second model optimization algorithm to obtain a currently trained audio generation model corresponding to the speaker identification.
S1096: the server inputs the text stream training data corresponding to the speaker identifier to the currently trained audio generation model corresponding to the speaker identifier, reacquires the virtual audio stream training data corresponding to the speaker identifier, and repeatedly executes the steps of iteratively training the audio identification model corresponding to the speaker identifier and optimizing the trainable parameters of the audio generation model corresponding to the speaker identifier until the identification result of the virtual audio stream training data meets a preset second training termination condition, so as to obtain the pre-trained audio generation model corresponding to the speaker identifier and the pre-trained audio identification model corresponding to the speaker identifier.
In step S1091, the server inputs the text stream training data corresponding to the speaker identifier to the audio generation model corresponding to the speaker identifier, and acquires the virtual audio stream training data corresponding to the speaker identifier.
Here, the audio generation model corresponding to the speaker identifier is the randomly initialized audio generation model.
In step S1092, the label of the audio stream training data corresponding to the speaker identifier is true, and the label of the virtual audio stream training data corresponding to the speaker identifier is false. The server inputs the audio stream training data and the virtual audio stream training data corresponding to the speaker identifier into the audio identification model corresponding to the speaker identifier to obtain their respective identification results, and calculates the value of the first loss function according to these identification results and the preset first loss function. If the value of the first loss function satisfies the preset first training termination condition, the currently trained audio identification model corresponding to the speaker identifier is obtained. If not, the server optimizes the trainable parameters of the audio identification model corresponding to the speaker identifier according to the value of the first loss function and the preset first model optimization algorithm, and repeats the above steps until the value of the first loss function satisfies the preset first training termination condition, thereby obtaining the currently trained audio identification model corresponding to the speaker identifier.
Regarding steps S1093 to S1095, the server modifies the label of the virtual audio stream training data to true, and inputs the virtual audio stream training data to the currently trained audio identification model corresponding to the speaker identifier, so as to obtain the identification result of the virtual audio stream training data. And if the identification result of the virtual audio stream training data meets a preset second training termination condition, the server obtains a pre-trained audio generation model corresponding to the speaker identification and a pre-trained audio identification model corresponding to the speaker identification. If the identification result of the virtual audio stream training data does not meet the preset second training termination condition, the server obtains a value of a second loss function according to the identification result of the virtual audio stream training data, the label of the virtual audio stream training data and a preset second loss function, and optimizes trainable parameters of the audio generation model corresponding to the speaker identification according to the value of the second loss function and a preset second model optimization algorithm to obtain a currently trained audio generation model corresponding to the speaker identification.
In the audio identification model of this embodiment, when the probability that the virtual audio stream training data corresponding to the speaker identifier is judged to be true is near 0.5, the audio generation model and the audio identification model corresponding to the speaker identifier have achieved a relatively good adversarial training effect. Therefore, the preset second training termination condition is an interval around 0.5; when the identification result of the virtual audio stream training data corresponding to the speaker identifier falls within this interval, it satisfies the preset second training termination condition.
When the identification result of the virtual audio stream training data corresponding to the speaker identifier is close to 0, the probability that the audio identification model considers the virtual audio stream training data to be true is close to 0, which means that the virtual audio stream training data generated by the audio generation model corresponding to the speaker identifier is easily recognized as fake, i.e., the generation effect of the audio generation model is poor. Because the label of the virtual audio stream training data has been modified to true, namely 1, the value of the second loss function obtained from the identification result, the label, and the preset second loss function is large, so the trainable parameters of the audio generation model corresponding to the speaker identifier can be substantially optimized based on the value of the second loss function and the preset second model optimization algorithm, yielding the currently trained audio generation model.
When the identification result of the virtual audio stream training data corresponding to the speaker identifier is close to 1, the probability that the audio identification model considers the virtual audio stream training data to be true is close to 1, which means that the identification effect of the audio identification model corresponding to the speaker identifier is poor; therefore, the audio identification model needs to be trained further.
Regarding step S1096, the server inputs the text stream training data corresponding to the speaker identifier to the currently trained audio generation model corresponding to the speaker identifier, re-acquires the virtual audio stream training data corresponding to the speaker identifier, and repeatedly executes the step of iteratively training the audio identification model corresponding to the speaker identifier and the step of optimizing the trainable parameters of the audio generation model corresponding to the speaker identifier until the identification result of the virtual audio stream training data satisfies the preset second training termination condition, so as to obtain the pre-trained audio generation model corresponding to the speaker identifier and the pre-trained audio identification model corresponding to the speaker identifier.
The first loss function, the second loss function, the first model optimization algorithm, and the second model optimization algorithm are not limited herein, and may be any one of the existing loss functions and neural network model optimization algorithms.
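For concreteness, the adversarial loop of S1091 to S1096 can be sketched as follows in PyTorch, with a binary cross-entropy loss playing the role of both preset loss functions and Adam playing both preset model optimization algorithms. The linear generator and discriminator are placeholders; the application leaves the actual architectures open.

import torch
import torch.nn as nn

TEXT_DIM, AUDIO_DIM = 128, 256  # illustrative feature sizes

generator = nn.Sequential(nn.Linear(TEXT_DIM, 512), nn.ReLU(),
                          nn.Linear(512, AUDIO_DIM))          # audio generation model
discriminator = nn.Sequential(nn.Linear(AUDIO_DIM, 512), nn.ReLU(),
                              nn.Linear(512, 1), nn.Sigmoid())  # audio identification model

bce = nn.BCELoss()
opt_d = torch.optim.Adam(discriminator.parameters())  # first model optimization algorithm
opt_g = torch.optim.Adam(generator.parameters())      # second model optimization algorithm

def train_step(text_feats: torch.Tensor, real_audio: torch.Tensor) -> float:
    # S1091: generate virtual audio stream training data from text features.
    fake = generator(text_feats)

    # S1092: train the discriminator; real data is labeled true (1),
    # virtual data false (0).
    opt_d.zero_grad()
    loss_d = (bce(discriminator(real_audio), torch.ones(len(real_audio), 1))
              + bce(discriminator(fake.detach()), torch.zeros(len(fake), 1)))
    loss_d.backward()
    opt_d.step()

    # S1093 to S1095: relabel the virtual data as true (1); a low verdict then
    # yields a large second loss, strongly updating the generator.
    opt_g.zero_grad()
    verdict = discriminator(fake)
    loss_g = bce(verdict, torch.ones(len(fake), 1))
    loss_g.backward()
    opt_g.step()

    # S1094/S1096: training stops once this mean verdict hovers around 0.5.
    return verdict.mean().item()

In use, train_step() would be called repeatedly until its returned mean verdict falls into the interval around 0.5 described above, which corresponds to the preset second training termination condition.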
In this embodiment, an adversarial neural network is formed by the audio generation network corresponding to the speaker identifier and the audio identification network corresponding to the speaker identifier, and the two are jointly trained, so that the voice of the speaker corresponding to the speaker identifier can be better imitated, further improving the viewers' live broadcast experience.
Referring to fig. 8, fig. 8 is a schematic flowchart of an audio replacing method in a live broadcast room according to a second embodiment of the present application, including the following steps:
S201: The server responds to the audio conversion instruction and obtains first speaker information and first audio stream data corresponding to the first speaker information; converts the first audio stream data into first text stream data; and determines a first speaker identifier corresponding to the first text stream data according to the first speaker information, sending the first text stream data and the first speaker identifier corresponding to the first text stream data to the viewer client in the live broadcast room. The first speaker information is the speaker information corresponding to the current speaker in the live broadcast room.
S202: The viewer client responds to the audio/video stutter instruction and judges whether the first text stream data issued by the server can still be received; if yes, the viewer client generates an audio generation instruction. The audio/video stutter instruction is generated when the viewer client judges that the audio/video stream data issued by the server cannot be received or cannot be parsed.
S203: The viewer client responds to the audio generation instruction to obtain the first text stream data and the first speaker identifier corresponding to the first text stream data, and inputs the first text stream data to a pre-trained audio generation model corresponding to the first speaker identifier to obtain second audio stream data. The pre-trained audio generation model corresponding to the first speaker identifier is obtained by training according to audio stream training data corresponding to the first speaker identifier and text stream training data corresponding to the first speaker identifier, and the text stream training data corresponding to the first speaker identifier is obtained by converting the audio stream training data corresponding to the first speaker identifier.
S204: the viewer client replaces the first audio stream data output in the live broadcast with the second audio stream data.
In this embodiment, steps S201, S203, and S204 are the same as steps S101 to S103 in the first embodiment, and specific reference may be made to the related description of the first embodiment. Step S202 illustrates when the viewer client is triggered to generate audio generation instructions.
Specifically, the viewer client responds to the audio/video stutter instruction and judges whether the first text stream data issued by the server can be received; if yes, the viewer client generates the audio generation instruction.
The audio/video stutter instruction is generated when the viewer client judges that the audio/video stream data issued by the server cannot be received or cannot be parsed.
That is, during an audio/video stutter, if the first text stream data issued by the server can still be received, the speaker's voice can be imitated through audio replacement so that speech continues, avoiding an impact on the viewers' live broadcast experience.
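A hedged sketch of this check follows: detect the stutter as a gap in arriving audio/video packets, then fall back to generated audio only while the much lighter text stream is still readable. The readable() probe on the text channel is an assumed interface.

import time

class StreamMonitor:
    """Detects an audio/video stutter as a gap in arriving stream packets."""
    def __init__(self, stall_after_s: float = 2.0):
        self.stall_after_s = stall_after_s
        self.last_packet_at = time.monotonic()

    def on_av_packet(self) -> None:
        self.last_packet_at = time.monotonic()

    def stalled(self) -> bool:
        return time.monotonic() - self.last_packet_at > self.stall_after_s

def should_generate_audio(monitor: StreamMonitor, text_channel) -> bool:
    """Step S202: during a stutter, trigger the audio generation instruction
    only if the text stream is still arriving."""
    return monitor.stalled() and text_channel.readable()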
In an alternative embodiment, step S202 may be replaced by steps S205 to S206, and referring to fig. 9, steps S205 to S206 are specifically as follows:
S205: The viewer client responds to the audio/video stream-saving instruction to obtain the first text stream data; the first text stream data comprises a plurality of pieces of text information and time sequence information corresponding to each piece of text information.
S206: and the audience client determines a first sentence break time according to the time sequence information corresponding to each piece of text information, generates an audio generation instruction and an audio pause issuing instruction when the current time reaches the first sentence break time, and sends the audio pause issuing instruction to the server.
In this embodiment, when the viewer client responds to the audio/video stream-saving instruction, it acquires the first text stream data.
The first text stream data comprises a plurality of pieces of text information and time sequence information corresponding to each piece of text information.
The viewer client then determines the first sentence-break time according to the time sequence information corresponding to each piece of text information; when the current time reaches the first sentence-break time, it generates the audio generation instruction and the audio-pause-issuing instruction and sends the latter to the server, so that the server, in response, stops issuing audio stream data.
In this embodiment, because the audio generation instruction and the audio-pause-issuing instruction are generated when the current time reaches the first sentence-break time, audio replacement takes place in a gap in the current speaker's speech, which further reduces the impact of the audio replacement step on the viewers' live broadcast experience.
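The sentence-break computation of S205 and S206 can be sketched as follows; the start_ms/end_ms timing fields and the 300 ms silence-gap threshold are assumptions made for the example, standing in for whatever time sequence information the first text stream data actually carries.

from typing import List, Optional, TypedDict

class TextPiece(TypedDict):
    text: str
    start_ms: int  # assumed timing fields carried with each piece of text
    end_ms: int

def first_sentence_break_ms(pieces: List[TextPiece],
                            min_gap_ms: int = 300) -> Optional[int]:
    """Return the end time of the first piece followed by a silence gap,
    i.e. a natural point to swap in generated audio."""
    for cur, nxt in zip(pieces, pieces[1:]):
        if nxt["start_ms"] - cur["end_ms"] >= min_gap_ms:
            return cur["end_ms"]
    return pieces[-1]["end_ms"] if pieces else None

# When the playback clock reaches this time, the client would generate the
# audio generation instruction and the audio-pause-issuing instruction.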
In some situations, the viewer client can instead send an audio/video-pause-issuing instruction to the server, so that the server, in response, stops issuing audio and video stream data, further saving data traffic.
In an optional embodiment, the audio/video stream-saving instruction is generated when the viewer client responds to a stream-saving enabling operation instruction, a playback-form switching instruction, a live application background-running instruction, or a device screen-off instruction.
The stream-saving enabling operation instruction may be generated when the viewer client determines that the viewer has turned on stream-saving mode.
The playback-form switching instruction may be generated when the viewer client determines that the viewer plays the live content in a preset playback form, where the preset playback form may be small-window playback, audio-bar playback, and the like, and is not specifically limited here.
The live application background-running instruction may be generated when the viewer client determines that the viewer has switched the live application to run in the background.
The device screen-off instruction may be generated when the viewer client determines that the viewer has turned off the device screen.
In this embodiment, the viewer client is triggered to generate the audio/video stream-saving instruction under a variety of conditions, which saves data traffic and reduces device load, further improving the viewers' live broadcast experience and increasing viewer retention rate and watch duration.
In another alternative embodiment, step S202 may be replaced by steps S207 to S209; referring to fig. 10, steps S207 to S209 are specifically as follows:
S207: The viewer client captures a target video stream through a camera; the target video stream comprises a plurality of frames of target video pictures.
S208: The viewer client performs face positioning in the target video pictures and generates a gaze tracking instruction if a face exists in the target video pictures.
S209: The viewer client responds to the gaze tracking instruction, acquires the gaze dwell position and the display area corresponding to the live broadcast room interface, and judges whether the gaze dwell position is within the display area corresponding to the live broadcast room interface. If not, the viewer client acquires the gaze-away duration; when the gaze-away duration exceeds a preset first duration, it generates an audio generation instruction and an audio/video-pause-issuing instruction and sends the audio/video-pause-issuing instruction to the server.
The camera may be the built-in camera of the viewer client device, and the target video stream is the video stream it captures. The viewer client performs face positioning on the frames of target video pictures in the target video stream and generates a gaze tracking instruction if a face exists in the target video pictures; it then responds to the gaze tracking instruction by acquiring the gaze dwell position and the display area corresponding to the live broadcast room interface.
The display area corresponding to the live broadcast room interface refers to the area of the display screen of the viewer client in which the live broadcast room interface is shown.
The gaze dwell position refers to where the viewer's gaze rests on the display screen of the viewer client; it may be acquired using any existing gaze localization algorithm.
The viewer client then judges whether the gaze dwell position is within the display area corresponding to the live broadcast room interface; if not, it acquires the gaze-away duration, and when the gaze-away duration exceeds the preset first duration, it generates the audio generation instruction and the audio/video-pause-issuing instruction and sends the latter to the server.
That is, if the viewer has not watched the live broadcast room interface for a long time, it is concluded that the viewer is not watching the live broadcast; stream-saving is therefore enabled by default and an audio generation instruction is issued.
In this embodiment, by tracking the viewer's gaze in real time, stream-saving can be enabled by default when the viewer has not watched the live broadcast for a long time, which saves data traffic and reduces device load, further improving the viewers' live broadcast experience and increasing viewer retention rate and watch duration.
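A minimal sketch of the S207 to S209 decision logic follows, assuming some existing gaze localization algorithm feeds in screen coordinates; the five-second default stands in for the preset first duration, which the application leaves unspecified.

import time
from typing import Optional, Tuple

class GazeWatcher:
    def __init__(self, display_rect: Tuple[int, int, int, int],
                 max_away_s: float = 5.0):  # stands in for the preset first duration
        self.rect = display_rect            # (x, y, width, height) of the live room UI
        self.max_away_s = max_away_s
        self.away_since: Optional[float] = None

    def on_gaze(self, x: float, y: float) -> bool:
        """Feed the current gaze dwell position; returns True when the audio
        generation and audio/video-pause-issuing instructions should fire."""
        rx, ry, rw, rh = self.rect
        if rx <= x <= rx + rw and ry <= y <= ry + rh:
            self.away_since = None          # gaze is back on the live room interface
            return False
        if self.away_since is None:
            self.away_since = time.monotonic()
        return time.monotonic() - self.away_since > self.max_away_s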
Referring to fig. 11, fig. 11 is a schematic structural diagram of an audio replacing system in a live broadcast room according to a third embodiment of the present application, where the system 11 includes: a server 111 and a viewer client 112;
the server 111 is configured to respond to an audio conversion instruction, and acquire first speaker information and first audio stream data corresponding to the first speaker information; converting the first audio stream data into first text stream data; determining a first speaker identifier corresponding to the first text stream data according to the first speaker information, and sending the first text stream data and the first speaker identifier corresponding to the first text stream data to the audience client 112 in the live broadcast room; the first speaker information is speaker information corresponding to a current speaker in the live broadcast room;
the viewer client 112 is configured to, in response to an audio generation instruction, obtain the first text stream data and a first speaker identifier corresponding to the first text stream data; inputting the first text stream data to a pre-trained audio generation model corresponding to the first speaker identifier to obtain second audio stream data; the pre-trained audio generation model corresponding to the first speaker identifier is obtained by training according to audio stream training data corresponding to the first speaker identifier and text stream training data corresponding to the first speaker identifier, and the text stream training data corresponding to the first speaker identifier is obtained by converting the audio stream training data corresponding to the first speaker identifier;
the viewer client 112 is configured to replace the first audio stream data output in the live broadcast with the second audio stream data.
The audio replacing system in the live broadcast room and the audio replacing method in the live broadcast room provided by the above embodiments belong to the same concept, and details of implementation processes are shown in the method embodiments and are not described herein again.
Please refer to fig. 12, which is a schematic structural diagram of an audio replacing apparatus in a live broadcast room according to a fourth embodiment of the present application. The apparatus may be implemented as all or part of a computer device in software, hardware, or a combination of both. The apparatus 12 comprises:
a first conversion unit 121, configured for the server to respond to an audio conversion instruction and acquire first speaker information and first audio stream data corresponding to the first speaker information; convert the first audio stream data into first text stream data; and determine a first speaker identifier corresponding to the first text stream data according to the first speaker information, sending the first text stream data and the first speaker identifier corresponding to the first text stream data to a viewer client in a live broadcast room; the first speaker information is speaker information corresponding to a current speaker in the live broadcast room;
a first generating unit 122, configured for the viewer client to respond to an audio generation instruction and acquire the first text stream data and a first speaker identifier corresponding to the first text stream data, and input the first text stream data to a pre-trained audio generation model corresponding to the first speaker identifier to obtain second audio stream data; the pre-trained audio generation model corresponding to the first speaker identifier is obtained by training according to audio stream training data corresponding to the first speaker identifier and text stream training data corresponding to the first speaker identifier, and the text stream training data corresponding to the first speaker identifier is obtained by converting the audio stream training data corresponding to the first speaker identifier;
a first replacing unit 123, configured for the viewer client to replace the first audio stream data output in the live broadcast with the second audio stream data.
It should be noted that, when the audio replacing apparatus in the live broadcast room provided in the foregoing embodiment executes the audio replacing method in the live broadcast room, the division of the functional modules is merely used as an example, and in practical applications, the functions may be distributed by different functional modules as needed, that is, the internal structure of the device is divided into different functional modules, so as to complete all or part of the functions described above. In addition, the audio replacing device in the live broadcast room and the audio replacing method in the live broadcast room provided by the above embodiments belong to the same concept, and details of implementation processes are shown in the method embodiments and are not described herein again.
Please refer to fig. 13, which is a schematic structural diagram of a computer device according to a fifth embodiment of the present application. As shown in fig. 13, the computer device 13 may include: a processor 130, a memory 131, and a computer program 132 stored in the memory 131 and executable on the processor 130, such as: an audio replacement program in the live broadcast room; the steps of the first to second embodiments are implemented when the processor 130 executes the computer program 132.
The processor 130 may include one or more processing cores, among other things. The processor 130 is connected to various parts in the computer device 13 by various interfaces and lines, executes various functions of the computer device 13 and processes data by operating or executing instructions, programs, code sets or instruction sets stored in the memory 131 and calling data in the memory 131, and optionally, the processor 130 may be implemented in at least one hardware form of Digital Signal Processing (DSP), Field-Programmable Gate Array (FPGA), Programmable Logic Array (PLA). The processor 130 may integrate one or more of a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a modem, and the like. Wherein, the CPU mainly processes an operating system, a user interface, an application program and the like; the GPU is used for rendering and drawing contents required to be displayed by the touch display screen; the modem is used to handle wireless communications. It is understood that the modem may not be integrated into the processor 130, but may be implemented by a single chip.
The Memory 131 may include a Random Access Memory (RAM) or a Read-Only Memory (Read-Only Memory). Optionally, the memory 131 includes a non-transitory computer-readable medium. The memory 131 may be used to store instructions, programs, code sets or instruction sets. The memory 131 may include a program storage area and a data storage area, wherein the program storage area may store instructions for implementing an operating system, instructions for at least one function (such as touch instructions, etc.), instructions for implementing the above-described method embodiments, and the like; the storage data area may store data and the like referred to in the above respective method embodiments. The memory 131 may optionally be at least one storage device located remotely from the processor 130.
The embodiment of the present application further provides a computer storage medium, where the computer storage medium may store a plurality of instructions, where the instructions are suitable for being loaded by a processor and executing the method steps of the foregoing embodiment, and a specific execution process may refer to specific descriptions of the foregoing embodiment, which is not described herein again.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules, so as to perform all or part of the functions described above. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. For the specific working processes of the units and modules in the system, reference may be made to the corresponding processes in the foregoing method embodiments, which are not described herein again.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the technical solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus/terminal device and method may be implemented in other ways. For example, the above-described apparatus/terminal device embodiments are merely illustrative, and for example, a module or a unit may be divided into only one type of logic function, and another division manner may be provided in actual implementation, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated modules/units, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer readable storage medium. Based on such understanding, all or part of the flow in the method according to the embodiments of the present invention may also be implemented by a computer program, which may be stored in a computer-readable storage medium and used by a processor to implement the steps of the above-described embodiments of the method. Wherein the computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form, etc.
The present invention is not limited to the above-described embodiments, and various modifications and variations of the present invention are intended to be included within the scope of the claims and the equivalent technology of the present invention if they do not depart from the spirit and scope of the present invention.

Claims (19)

1. A method for audio substitution in a live broadcast room, the method comprising the steps of:
the server responds to an audio conversion instruction, and obtains first speaker information and first audio stream data corresponding to the first speaker information; converting the first audio stream data into first text stream data; determining a first speaker identifier corresponding to the first text stream data according to the first speaker information, and sending the first text stream data and the first speaker identifier corresponding to the first text stream data to a viewer client in a live broadcast room; the first speaker information is speaker information corresponding to a current speaker in the live broadcast room;
the viewer client responds to an audio generation instruction to obtain the first text stream data and a first speaker identifier corresponding to the first text stream data; inputting the first text stream data to a pre-trained audio generation model corresponding to the first speaker identifier to obtain second audio stream data; the pre-trained audio generation model corresponding to the first speaker identifier is obtained by training according to audio stream training data corresponding to the first speaker identifier and text stream training data corresponding to the first speaker identifier, and the text stream training data corresponding to the first speaker identifier is obtained by converting the audio stream training data corresponding to the first speaker identifier;
the viewer client replaces the first audio stream data output in the live broadcast with the second audio stream data.
2. A method of audio substitution in a live broadcast room according to claim 1, characterized in that: under video live broadcast, the first speaker information is the face information corresponding to the current speaker in the live broadcast room, and under voice live broadcast, the first speaker information is the mic sequence information corresponding to the current speaker in the live broadcast room.
3. The method for replacing audio in a live broadcast room according to claim 1, wherein the step of obtaining the first speaker information and the first audio stream data corresponding to the first speaker information comprises the steps of:
the server receives audio and video streaming data collected by the anchor client; the audio and video stream data comprise audio stream data and video stream data; the video stream data comprises a plurality of frames of video pictures;
the server carries out face positioning in the video picture to obtain at least one face area, monitors speaking actions in the face area in the video picture and determines the current speaker and the face area corresponding to the current speaker;
the server acquires the face information corresponding to the current speaker as the first speaker information according to the video picture and the face area corresponding to the current speaker;
and the server acquires the currently received audio stream data as first audio stream data corresponding to the first speaker information.
4. The method as claimed in claim 3, wherein the server performing face positioning in the video picture to obtain at least one face area comprises the following steps:
the server performs liveness detection in the face area in the video picture to acquire the face area corresponding to the speaker in the live broadcast room and the number of speakers in the live broadcast room;
if the number of the speakers in the live broadcast room is not less than two, monitoring the speaking action in the face area in the video picture, and determining the current speaker and the face area corresponding to the current speaker, including the following steps:
the server monitors a speaking action in a face area corresponding to the speaker in the video picture, and determines the current speaker and the face area corresponding to the current speaker;
if the number of the speakers in the live broadcast room is one, monitoring the speaking action in the face area in the video picture, and determining the current speaker and the face area corresponding to the current speaker, including the steps of:
the server determines that a speaker in the live broadcast room is the current speaker, and a face area corresponding to the speaker in the live broadcast room is a face area corresponding to the current speaker.
5. The method for audio substitution in a live broadcast room of claim 1, wherein before the server responds to the audio conversion instruction, the method further comprises the following steps:
the anchor client responds to a broadcast-start instruction and collects audio and video stream data; the audio and video stream data comprise audio stream data and video stream data; the video stream data comprises a plurality of frames of video pictures;
the anchor client carries out face positioning in the video picture to obtain at least one face area, monitors speaking actions in the face area in the video picture and determines the current speaker and the face area corresponding to the current speaker;
the anchor client acquires the face information corresponding to the current speaker as the first speaker information according to the video picture and the face area corresponding to the current speaker;
and the anchor client acquires the currently acquired audio stream data as first audio stream data corresponding to the first speaker information.
6. The audio replacing method in the live broadcast room according to claim 5, wherein the anchor client performing face positioning in the video picture to obtain at least one face area comprises the steps of:
the anchor client performs liveness detection in the face area in the video picture to acquire the face area corresponding to the speaker in the live broadcast room and the number of speakers in the live broadcast room;
if the number of the speakers in the live broadcast room is not less than two, monitoring the speaking action in the face area in the video picture, and determining the current speaker and the face area corresponding to the current speaker, comprising the following steps:
the anchor client monitors a speaking action in a face area corresponding to the speaker in the video picture, and determines the current speaker and the face area corresponding to the current speaker;
if the number of the speakers in the live broadcast room is one, monitoring the speaking action in the face area in the video picture, and determining the current speaker and the face area corresponding to the current speaker, including the steps of:
and the anchor client determines that a speaker in the live broadcast room is the current speaker, and a face area corresponding to the speaker in the live broadcast room is a face area corresponding to the current speaker.
7. The method for replacing audio in a live broadcast room according to any one of claims 1 to 6, wherein the determining a first speaker identifier corresponding to the first text stream data according to the first speaker information comprises:
the server acquires the first speaker information and speaker information corresponding to a plurality of speaker identifications; the speaker identification corresponds to the pre-trained audio generation model;
and if the first speaker information is matched with the speaker information corresponding to any one speaker identifier, determining that the first speaker identifier corresponding to the first text stream data is the speaker identifier.
8. The method of audio substitution in a live broadcast room of any one of claims 1 to 6, wherein the viewer client, in response to the audio generation instruction, is preceded by the steps of:
the viewer client responds to an audio/video stutter instruction and judges whether the first text stream data issued by the server can be received; if yes, the viewer client generates the audio generation instruction; the audio/video stutter instruction is generated when the viewer client judges that the audio/video stream data issued by the server cannot be received or cannot be parsed.
9. The method of audio substitution in a live broadcast room of any one of claims 1 to 6, wherein the viewer client, in response to the audio generation instruction, is preceded by the steps of:
the viewer client responds to an audio/video stream-saving instruction to acquire the first text stream data; the first text stream data comprises a plurality of pieces of text information and time sequence information corresponding to each piece of text information;
and the viewer client determines a first sentence-break time according to the time sequence information corresponding to each piece of text information, generates the audio generation instruction and the audio-pause-issuing instruction when the current time reaches the first sentence-break time, and sends the audio-pause-issuing instruction to the server.
10. The method for replacing audio in a live broadcast room according to claim 9, wherein the audio/video stream-saving instruction is generated when the viewer client responds to a stream-saving enabling operation instruction, a playback-form switching instruction, a live application background-running instruction, or a device screen-off instruction.
11. The method of audio substitution in a live broadcast room of any one of claims 1 to 6, wherein the viewer client, in response to the audio generation instruction, is preceded by the steps of:
the viewer client captures a target video stream through a camera; wherein the target video stream comprises a plurality of frames of target video pictures;
the viewer client performs face positioning in the target video pictures, and generates a gaze tracking instruction if a face exists in the target video pictures;
and the viewer client responds to the gaze tracking instruction, acquires a gaze dwell position and a display area corresponding to the live broadcast room interface, and judges whether the gaze dwell position is within the display area corresponding to the live broadcast room interface; if not, the viewer client acquires a gaze-away duration, and when the gaze-away duration exceeds a preset first duration, generates the audio generation instruction and an audio/video-pause-issuing instruction and sends the audio/video-pause-issuing instruction to the server.
12. The audio replacing method in the live broadcast room according to any one of claims 1 to 6, wherein the audio generation model and the audio identification model constitute an adversarial neural network model, and before the server responds to the audio conversion instruction, the method comprises the following steps:
the server responds to a model training instruction and obtains audio stream training data corresponding to a speaker identifier and text stream training data corresponding to the speaker identifier; the text stream training data corresponding to the speaker identification is obtained by converting audio stream training data corresponding to the speaker identification;
the server performs adversarial training on the audio generation model corresponding to the speaker identifier and the audio identification model corresponding to the speaker identifier according to the audio stream training data corresponding to the speaker identifier and the text stream training data corresponding to the speaker identifier, and obtains the pre-trained audio generation model corresponding to the speaker identifier and the pre-trained audio identification model corresponding to the speaker identifier.
13. The method for replacing audio in a live broadcast room according to claim 12, wherein said obtaining of audio stream training data corresponding to a speaker identifier comprises the steps of:
the server receives audio and video stream training data collected by the anchor client; wherein the audio and video stream training data comprises the audio stream training data and video stream training data; the video stream training data comprises a plurality of frames of video training pictures;
the server carries out face positioning in the video training picture to obtain at least one face area, monitors speaking actions in the face area in the video training picture, determines the current speaker, face information corresponding to the current speaker and audio stream training data corresponding to the current speaker, and obtains at least one piece of face information corresponding to the speaker and at least one piece of audio stream training data corresponding to the speaker; the face information corresponding to the current speaker is obtained according to the video training picture and a face area corresponding to the current speaker;
and the server configures at least one speaker identifier corresponding to the speaker to obtain the face information corresponding to the speaker identifier and the audio stream training data corresponding to the speaker identifier.
14. The method as claimed in claim 13, wherein the server performing face positioning in the video training picture to obtain at least one face area comprises the following steps:
the server performs liveness detection in the face areas in the video training pictures to acquire the face areas corresponding to the speakers and the number of speakers;
if the number of the speakers is not less than two, monitoring a speaking action in the face area in the video training picture, and determining the current speaker, face information corresponding to the current speaker and audio stream training data corresponding to the current speaker, including the steps of:
the server monitors a speech action in a face area corresponding to the speaker in the video training picture, and determines the current speaker, face information corresponding to the current speaker and audio stream training data corresponding to the current speaker;
if the number of the speakers in the live broadcast room is one, monitoring the speaking action in the face area in the video training picture, and determining the current speaker, the face information corresponding to the current speaker and the audio stream training data corresponding to the current speaker, including the steps of:
and the server determines that a speaker in the video training picture is the current speaker, and obtains face information corresponding to the current speaker and audio stream training data corresponding to the current speaker.
15. The method as claimed in claim 12, wherein the server performing adversarial training on the audio generation model corresponding to the speaker identifier and the audio identification model corresponding to the speaker identifier according to audio stream training data corresponding to the speaker identifier and text stream training data corresponding to the speaker identifier, to obtain a pre-trained audio generation model corresponding to the speaker identifier and a pre-trained audio identification model corresponding to the speaker identifier, comprises the steps of:
the server inputs the text stream training data corresponding to the speaker identification into the audio generation model corresponding to the speaker identification, and virtual audio stream training data corresponding to the speaker identification is obtained;
the server iteratively trains the audio identification model corresponding to the speaker identifier according to the audio stream training data corresponding to the speaker identifier, the virtual audio stream training data corresponding to the speaker identifier, a preset first loss function and a preset first model optimization algorithm, and optimizes trainable parameters in the audio identification model corresponding to the speaker identifier until the value of the first loss function meets a preset first training termination condition, so as to obtain a currently trained audio identification model corresponding to the speaker identifier;
the server modifies the label of the virtual audio stream training data to true, inputs the virtual audio stream training data into the currently trained audio identification model corresponding to the speaker identifier, and obtains the identification result of the virtual audio stream training data;
if the identification result of the virtual audio stream training data meets a preset second training termination condition, the server obtains a pre-trained audio generation model corresponding to the speaker identification and a pre-trained audio identification model corresponding to the speaker identification;
if the identification result of the virtual audio stream training data does not meet a preset second training termination condition, the server obtains a value of a second loss function according to the identification result of the virtual audio stream training data, the label of the virtual audio stream training data and a preset second loss function, optimizes trainable parameters of an audio generation model corresponding to the speaker identification according to the value of the second loss function and a preset second model optimization algorithm, and obtains a currently trained audio generation model corresponding to the speaker identification;
the server inputs the text stream training data corresponding to the speaker identifier into the currently trained audio generation model corresponding to the speaker identifier, reacquires the virtual audio stream training data corresponding to the speaker identifier, and repeatedly executes the steps of iteratively training the audio identification model corresponding to the speaker identifier and optimizing the trainable parameters of the audio generation model corresponding to the speaker identifier until the identification result of the virtual audio stream training data meets a preset second training termination condition, so as to obtain the pre-trained audio generation model corresponding to the speaker identifier and the pre-trained audio identification model corresponding to the speaker identifier.
16. An audio replacement system in a live broadcast room, comprising: a server and a viewer client;
the server is configured to respond to an audio conversion instruction and acquire first speaker information and first audio stream data corresponding to the first speaker information; convert the first audio stream data into first text stream data; determine a first speaker identifier corresponding to the first text stream data according to the first speaker information, and send the first text stream data and the first speaker identifier corresponding to the first text stream data to the viewer client in the live broadcast room; the first speaker information is speaker information corresponding to a current speaker in the live broadcast room;
the viewer client is configured to respond to an audio generation instruction and acquire the first text stream data and the first speaker identifier corresponding to the first text stream data; input the first text stream data into the pre-trained audio generation model corresponding to the first speaker identifier to obtain second audio stream data; the pre-trained audio generation model corresponding to the first speaker identifier is trained according to audio stream training data corresponding to the first speaker identifier and text stream training data corresponding to the first speaker identifier, and the text stream training data corresponding to the first speaker identifier is obtained by converting the audio stream training data corresponding to the first speaker identifier;
the viewer client is further configured to replace the first audio stream data output in the live broadcast with the second audio stream data.
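
To make the division of labor in claim 16 concrete, here is a self-contained sketch of the round trip; every function and class name is hypothetical, and the speech recognition and synthesis stages are stubbed with placeholders rather than real models.

from dataclasses import dataclass

@dataclass
class TextPacket:
    speaker_id: str  # the first speaker identifier
    text: str        # the first text stream data

def speech_to_text(audio: bytes) -> str:
    # Stub: a real server would run streaming speech recognition here.
    return "hello everyone"

def lookup_speaker_id(speaker_info: dict) -> str:
    # Stub: map the first speaker information to a stored speaker identifier.
    return speaker_info["uid"]

class StubGenerator:
    # Stands in for the pre-trained audio generation model of one speaker.
    def __init__(self, speaker_id: str):
        self.speaker_id = speaker_id

    def synthesize(self, text: str) -> bytes:
        # Stub: a real model would emit waveform audio in the speaker's voice.
        return f"[{self.speaker_id}] {text}".encode()

# Server side: responds to the audio conversion instruction.
def server_handle_audio_conversion(speaker_info: dict, first_audio: bytes) -> TextPacket:
    text = speech_to_text(first_audio)                 # first audio stream -> first text stream
    packet = TextPacket(lookup_speaker_id(speaker_info), text)
    return packet                                      # broadcast to every viewer client in the room

# Viewer client side: responds to the audio generation instruction.
def client_handle_audio_generation(packet: TextPacket) -> bytes:
    model = StubGenerator(packet.speaker_id)           # load the per-speaker generation model
    return model.synthesize(packet.text)               # second audio stream replaces the first

packet = server_handle_audio_conversion({"uid": "speaker-42"}, b"\x00\x01")
print(client_handle_audio_generation(packet))

One apparent benefit of this split is that heavy speech recognition stays on the server while per-speaker synthesis runs on each viewer client, so only compact text stream data crosses the network between conversion and generation.
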
17. An audio replacement apparatus in a live broadcast room, comprising:
a first conversion unit, configured for a server to respond to an audio conversion instruction and acquire first speaker information and first audio stream data corresponding to the first speaker information; convert the first audio stream data into first text stream data; determine a first speaker identifier corresponding to the first text stream data according to the first speaker information, and send the first text stream data and the first speaker identifier corresponding to the first text stream data to a viewer client in a live broadcast room; the first speaker information is speaker information corresponding to a current speaker in the live broadcast room;
a first generation unit, configured for the viewer client to respond to an audio generation instruction and acquire the first text stream data and the first speaker identifier corresponding to the first text stream data; input the first text stream data into a pre-trained audio generation model corresponding to the first speaker identifier to obtain second audio stream data; the pre-trained audio generation model corresponding to the first speaker identifier is trained according to audio stream training data corresponding to the first speaker identifier and text stream training data corresponding to the first speaker identifier, and the text stream training data corresponding to the first speaker identifier is obtained by converting the audio stream training data corresponding to the first speaker identifier;
a first replacing unit, configured for the viewer client to replace the first audio stream data output in the live broadcast with the second audio stream data.
18. A computer device, comprising: a processor, a memory, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the method according to any one of claims 1 to 15.
19. A computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the method according to any one of claims 1 to 15.
CN202210208196.8A 2022-03-03 2022-03-03 Audio replacement method, system, device, computer equipment and storage medium in live broadcasting room Active CN114630144B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210208196.8A CN114630144B (en) 2022-03-03 2022-03-03 Audio replacement method, system, device, computer equipment and storage medium in live broadcasting room

Publications (2)

Publication Number Publication Date
CN114630144A 2022-06-14
CN114630144B CN114630144B (en) 2024-10-01

Family

ID=81899265

Country Status (1)

Country Link
CN (1) CN114630144B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180218727A1 (en) * 2017-02-02 2018-08-02 Microsoft Technology Licensing, Llc Artificially generated speech for a communication session
CN108566558A (en) * 2018-04-24 2018-09-21 腾讯科技(深圳)有限公司 Video stream processing method, device, computer equipment and storage medium
CN109561273A (en) * 2018-10-23 2019-04-02 视联动力信息技术股份有限公司 The method and apparatus for identifying video conference spokesman
CN109637518A (en) * 2018-11-07 2019-04-16 北京搜狗科技发展有限公司 Virtual newscaster's implementation method and device
CN111182256A (en) * 2018-11-09 2020-05-19 中移(杭州)信息技术有限公司 Information processing method and server
CN112202803A (en) * 2020-10-10 2021-01-08 北京字节跳动网络技术有限公司 Audio processing method, device, terminal and storage medium
CN113630620A (en) * 2020-05-06 2021-11-09 阿里巴巴集团控股有限公司 Multimedia file playing system, related method, device and equipment
CN113938697A (en) * 2021-10-13 2022-01-14 广州方硅信息技术有限公司 Virtual speech method and device in live broadcast room and computer equipment
CN114025186A (en) * 2021-10-28 2022-02-08 广州方硅信息技术有限公司 Virtual voice interaction method and device in live broadcast room and computer equipment

Also Published As

Publication number Publication date
CN114630144B (en) 2024-10-01

Similar Documents

Publication Publication Date Title
US11336959B2 (en) Method and apparatus for enhancing audience engagement via a communication network
CN106303555B (en) Live broadcasting method, device and system based on mixed reality
CN103067776B (en) Program push method, system and intelligent display device, cloud server
US20120039382A1 (en) Experience or "sentio" codecs, and methods and systems for improving QoE and encoding based on QoE experiences
CN114025186A (en) Virtual voice interaction method and device in live broadcast room and computer equipment
CN112714327B (en) Interaction method, device and equipment based on live application program and storage medium
CN108322474B (en) Virtual reality system based on shared desktop, related device and method
CN112423081B (en) Video data processing method, device and equipment and readable storage medium
CN112135155B (en) Audio and video connecting and converging method and device, electronic equipment and storage medium
WO2012021174A2 (en) EXPERIENCE OR "SENTIO" CODECS, AND METHODS AND SYSTEMS FOR IMPROVING QoE AND ENCODING BASED ON QoE EXPERIENCES
CN113453030B (en) Audio interaction method and device in live broadcast, computer equipment and storage medium
CN113965813B (en) Video playing method, system, equipment and medium in live broadcasting room
CN115002554B (en) Live broadcast picture adjustment method, system, device, computer equipment and medium
CN111629222B (en) Video processing method, device and storage medium
CN115314727A (en) Live broadcast interaction method and device based on virtual object and electronic equipment
CN115134621A (en) Live broadcast fight interaction method and device based on main and auxiliary picture display and electronic equipment
CN114268813A (en) Live broadcast picture adjusting method and device and computer equipment
CN113938697A (en) Virtual speech method and device in live broadcast room and computer equipment
CN113286160A (en) Video processing method, video processing device, electronic equipment and storage medium
CN108320331B (en) Method and equipment for generating augmented reality video information of user scene
CN114630144B (en) Audio replacement method, system, device, computer equipment and storage medium in live broadcasting room
CN114222190B (en) Remote control processing and response method and device, equipment, medium and product thereof
CN115065835A (en) Live-broadcast expression display processing method, server, electronic equipment and storage medium
WO2024032111A1 (en) Data processing method and apparatus for online conference, and device, medium and product
CN113409431B (en) Content generation method and device based on movement data redirection and computer equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant