CN113938697A - Virtual speech method and device in live broadcast room and computer equipment - Google Patents

Virtual speech method and device in live broadcast room and computer equipment

Info

Publication number
CN113938697A
Authority
CN
China
Prior art keywords
speech
virtual
live broadcast
broadcast room
user
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111193802.5A
Other languages
Chinese (zh)
Other versions
CN113938697B (en)
Inventor
曾家乐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Cubesili Information Technology Co Ltd
Original Assignee
Guangzhou Cubesili Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Cubesili Information Technology Co Ltd
Priority to CN202111193802.5A
Publication of CN113938697A
Application granted
Publication of CN113938697B
Legal status: Active (current)
Anticipated expiration: not listed

Classifications

    All classifications fall under H (Electricity), H04 (Electric communication technique), H04N (Pictorial communication, e.g. television), H04N 21/00 (Selective content distribution, e.g. interactive television or video on demand [VOD]):
    • H04N 21/2187 Live feed
    • H04N 21/2355 Processing of additional data involving reformatting operations of additional data, e.g. HTML pages
    • H04N 21/2393 Interfacing the upstream path of the transmission network involving handling client requests
    • H04N 21/25866 Management of end-user data
    • H04N 21/4312 Generation of visual interfaces for content selection or interaction involving specific graphical features, e.g. screen layout, special fonts or colors, blinking icons, highlights or animations
    • H04N 21/4355 Processing of additional data involving reformatting operations of additional data, e.g. HTML pages on a television screen
    • H04N 21/4508 Management of client data or end-user data
    • H04N 21/4784 Supplemental services, e.g. receiving rewards
    • H04N 21/4788 Supplemental services communicating with other users, e.g. chatting
    • H04N 21/4882 Data services for displaying messages, e.g. warnings, reminders

Abstract

The application relates to the technical field of webcasting and provides a virtual speech method and apparatus for a live broadcast room, and a computer device. The method comprises the following steps: responding to a live broadcast interaction instruction, parsing the live broadcast interaction instruction, and acquiring an interaction scene identifier, a user identifier and the user name corresponding to the user identifier; acquiring a first virtual speech matched with the interaction scene corresponding to the interaction scene identifier, the first virtual speech being obtained by simulating real speech in that interaction scene; replacing the user name in the first virtual speech with the user name corresponding to the user identifier to obtain a target virtual speech; and sending the target virtual speech to the clients in the live broadcast room, so that the clients output the target virtual speech to the live broadcast room interface. Compared with the prior art, the virtual speech can enhance the anchor's live broadcast expressiveness, liven up the atmosphere of the live broadcast room, and improve users' viewing retention rate and viewing duration.

Description

Virtual speech method and device in live broadcast room and computer equipment
Technical Field
The embodiment of the application relates to the technical field of network live broadcast, in particular to a virtual speaking method and device in a live broadcast room and computer equipment.
Background
With the rapid development of Internet technology and streaming media technology, webcasting has become an increasingly popular form of entertainment. More and more users interact online with anchors inside live broadcast rooms.
However, some anchors lack live broadcast expressiveness, so the interactive atmosphere of the live broadcast room feels stiff. Across the various interaction scenarios in which a user gives a virtual gift, enters the live broadcast room, starts an interactive game, and so on, such anchors find it hard to liven up the room's atmosphere on their own and improve users' live interaction experience; this easily leads to user churn and makes it difficult to improve users' viewing retention rate and viewing duration.
Disclosure of Invention
The embodiments of the application provide a virtual speech method and apparatus for a live broadcast room, and a computer device, which can address the technical problems of insufficient anchor live broadcast expressiveness and poor user live interaction experience. The technical solution is as follows:
in a first aspect, an embodiment of the present application provides a virtual speaking method in a live broadcast room, including the steps of:
responding to a live broadcast interaction instruction, analyzing the live broadcast interaction instruction, and acquiring an interaction scene identifier, a user identifier and a user name corresponding to the user identifier;
acquiring a first virtual speech matched with the interactive scene corresponding to the interactive scene identifier; the first virtual speech is obtained by simulating a real speech in an interactive scene corresponding to the interactive scene identifier;
replacing the user name in the first virtual speech with the user name corresponding to the user identifier to obtain a target virtual speech;
and sending the target virtual speech to a client in a live broadcast room, so that the client in the live broadcast room outputs the target virtual speech to a live broadcast room interface.
In a second aspect, an embodiment of the present application provides a virtual speaking method in a live broadcast room, including the steps of:
the server responds to a live broadcast interaction instruction, analyzes the live broadcast interaction instruction, and acquires an interaction scene identifier, a user identifier and a user name corresponding to the user identifier;
the server acquires a first virtual speech matched with the interactive scene corresponding to the interactive scene identification; the first virtual speech is obtained by simulating a real speech in an interactive scene corresponding to the interactive scene identifier;
the server replaces the user name in the first virtual speech with the user name corresponding to the user identifier to obtain a target virtual speech;
the server sends the target virtual speech to a client in a live broadcast room;
and the client in the live broadcast room receives the target virtual speech and outputs the target virtual speech to a live broadcast room interface.
In a third aspect, an embodiment of the present application provides a virtual speech apparatus in a live broadcast room, including:
the first response unit is used for responding to a live broadcast interaction instruction, analyzing the live broadcast interaction instruction, and acquiring an interaction scene identifier, a user identifier and a user name corresponding to the user identifier;
a first obtaining unit, configured to obtain a first virtual speech matched with the interactive scene corresponding to the interactive scene identifier; the first virtual speech is obtained by simulating a real speech in an interactive scene corresponding to the interactive scene identifier;
a first replacing unit, configured to replace the user name in the first virtual speech with the user name corresponding to the user identifier to obtain a target virtual speech;
and the first output unit is used for sending the target virtual speech to a client in a live broadcast room, so that the client in the live broadcast room outputs the target virtual speech to a live broadcast room interface.
In a fourth aspect, the present application provides a computer device comprising a processor, a memory, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the method according to the first aspect or the second aspect when executing the computer program.
In a fifth aspect, the present application provides a computer-readable storage medium, which stores a computer program, and when the computer program is executed by a processor, the computer program implements the steps of the method according to the first aspect or the second aspect.
In the embodiments of the application, a live broadcast interaction instruction is parsed in response to the live broadcast interaction instruction, and an interaction scene identifier, a user identifier and the user name corresponding to the user identifier are obtained. Real speech in the interaction scene corresponding to the interaction scene identifier is then simulated to obtain a first virtual speech matched with that interaction scene, the user name in the first virtual speech is replaced with the user name corresponding to the user identifier to obtain a target virtual speech, and the target virtual speech is sent to the clients in the live broadcast room, so that the clients output the target virtual speech to the live broadcast room interface. In this way, users can feel the anchor's attention, the users' live interaction experience is improved, and the anchor's live broadcast expressiveness is enhanced, which helps prompt more live interaction behaviors, livens up the atmosphere of the live broadcast room, and improves users' viewing retention rate and viewing duration.
For a better understanding and implementation, the technical solutions of the present application are described in detail below with reference to the accompanying drawings.
Drawings
Fig. 1 is a schematic view of an application scenario of a virtual speaking method in a live broadcast room according to an embodiment of the present application;
fig. 2 is a schematic flowchart of a virtual speaking method in a live broadcast room according to a first embodiment of the present application;
fig. 3 is a schematic display diagram of a live broadcast room interface provided in an embodiment of the present application;
fig. 4 is a schematic display diagram of a target virtual comment in a live broadcast interface according to an embodiment of the present application;
fig. 5 is a schematic flowchart of S102 in a virtual speaking method in a live broadcast room according to a first embodiment of the present application;
fig. 6 is another schematic flowchart of S102 in a virtual speaking method in a live broadcast room according to the first embodiment of the present application;
fig. 7 is a flowchart illustrating a virtual speaking method in a live broadcast room according to a second embodiment of the present application;
fig. 8 is a schematic flowchart of S203 in a virtual speaking method in a live broadcast room according to a second embodiment of the present application;
fig. 9 is a flowchart illustrating a virtual speaking method in a live broadcast room according to a third embodiment of the present application;
fig. 10 is a schematic flowchart of S305 in a virtual speaking method in a live broadcast room according to a third embodiment of the present application;
fig. 11 is a flowchart illustrating a virtual speaking method in a live broadcast room according to a fourth embodiment of the present application;
fig. 12 is a schematic view of a display of a real utterance and a target virtual utterance in a live broadcast interface according to an embodiment of the present application;
fig. 13 is a flowchart illustrating a virtual speaking method in a live broadcast room according to a fifth embodiment of the present application;
fig. 14 is a flowchart illustrating a virtual speaking method in a live broadcast room according to a sixth embodiment of the present application;
fig. 15 is a schematic structural diagram of a virtual speaking device in a live broadcast room according to a seventh embodiment of the present application;
fig. 16 is a schematic structural diagram of a computer device according to an eighth embodiment of the present application.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present application, as detailed in the appended claims.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this application and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It is to be understood that although the terms first, second, third, etc. may be used herein to describe various information, such information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present application. The word "if" as used herein may be interpreted as "when", "upon", or "in response to determining", depending on the context.
As will be appreciated by those skilled in the art, the terms "client", "terminal", and "terminal device" as used herein include both devices that only receive wireless signals without transmit capability and devices whose receiving and transmitting hardware supports two-way communication over a two-way communication link. Such a device may include: a cellular or other communication device, such as a personal computer or tablet, with or without a multi-line display; a PCS (Personal Communications Service) device, which may combine voice, data processing, facsimile and/or data communication capabilities; a PDA (Personal Digital Assistant), which may include a radio-frequency receiver, a pager, Internet/intranet access, a web browser, a notepad, a calendar and/or a GPS (Global Positioning System) receiver; and a conventional laptop and/or palmtop computer or other device having and/or including a radio-frequency receiver. As used herein, a "client" or "terminal device" may be portable, transportable, installed in a vehicle (aeronautical, maritime and/or land-based), or situated and/or configured to operate locally and/or in a distributed fashion at any other location(s) on earth and/or in space. The "client" or "terminal device" used herein may also be a communication terminal, an Internet terminal, or a music/video playing terminal, for example a PDA, an MID (Mobile Internet Device) and/or a mobile phone with music/video playing functions, and may also be a smart TV, a set-top box, and the like.
The hardware referred to by the names "server", "client", "service node", etc. is essentially a computer device with the performance of a personal computer, and is a hardware device having necessary components disclosed by the von neumann principle, such as a central processing unit (including an arithmetic unit and a controller), a memory, an input device, an output device, etc., wherein a computer program is stored in the memory, and the central processing unit loads a program stored in an external memory into the internal memory to run, executes instructions in the program, and interacts with the input and output devices, thereby accomplishing specific functions.
It should be noted that the concept of "server" as referred to in this application can be extended to the case of a server cluster. According to network deployment principles understood by those skilled in the art, the servers should be divided logically; in physical space they may be independent of one another yet callable through interfaces, or may be integrated into one physical computer or one computer cluster. Those skilled in the art will appreciate this variation, which should not be read as restricting how the network deployment of the present application is implemented.
Referring to fig. 1, fig. 1 is a schematic view of an application scenario of a virtual speaking method in a live broadcast room according to an embodiment of the present application, where the application scenario includes an anchor client 101, a server 102, and a viewer client 103, and the anchor client 101 and the viewer client 103 interact with each other through the server 102.
The proposed clients of the embodiment of the present application include the anchor client 101 and the viewer client 103.
It is noted that there are many understandings of the concept of "client" in the prior art, such as: it may be understood as an application program installed in a computer device, or may be understood as a hardware device corresponding to a server.
In the embodiments of the present application, the term "client" refers to a hardware device corresponding to a server, and more specifically, refers to a computer device, such as: smart phones, smart interactive tablets, personal computers, and the like.
When the client is a mobile device such as a smart phone and an intelligent interactive tablet, a user can install a matched mobile application program on the client and can also access a Web application program on the client.
When the client is a non-mobile device such as a Personal Computer (PC), the user can install a matching PC application on the client, and similarly can access a Web application on the client.
The mobile application refers to an application program that can be installed in the mobile device, the PC application refers to an application program that can be installed in the non-mobile device, and the Web application refers to an application program that needs to be accessed through a browser.
Specifically, the Web application program may be divided into a mobile version and a PC version according to the difference of the client types, and the page layout modes and the available server support of the two versions may be different.
In the embodiment of the application, the types of live application programs provided to the user are divided into a mobile end live application program, a PC end live application program and a Web end live application program. The user can autonomously select a mode of participating in the live webcasting according to different types of the client adopted by the user.
The present application can divide the clients into a main broadcasting client 101 and a spectator client 103, depending on the identity of the user using the clients.
The anchor client 101 is a client that transmits a live video, and is generally a client used by an anchor (i.e., a live anchor user) in live streaming.
The viewer client 103 refers to the end that receives and views the live video, and is typically the client used by a viewer watching the webcast (i.e., a live viewer user).
The hardware at which the anchor client 101 and viewer client 103 are directed is essentially a computer device, and in particular, as shown in fig. 1, it may be a type of computer device such as a smart phone, smart interactive tablet, and personal computer. Both the anchor client 101 and the viewer client 103 may access the internet via known network access means to establish a data communication link with the server 102.
Server 102, acting as a business server, may be responsible for further connecting with related audio data servers, video streaming servers, and other servers providing related support, etc., to form a logically associated server cluster for serving related terminal devices, such as anchor client 101 and viewer client 103 shown in fig. 1.
In the embodiment of the present application, the anchor client 101 and the viewer client 103 may join the same live broadcast room (i.e., a live broadcast channel). The live broadcast room is a chat room implemented by means of Internet technology and generally has audio/video broadcast control functions. The anchor user broadcasts live in the live broadcast room through the anchor client 101, and viewers using the viewer client 103 can log in to the server 102 to enter the live broadcast room and watch the live broadcast.
In the live broadcast room, interaction between the anchor and the audience can be realized through known online interaction modes such as voice, video, characters and the like, generally, the anchor performs programs for audience users in the form of audio and video streams, and economic transaction behaviors can also be generated in the interaction process. Of course, the application form of the live broadcast room is not limited to online entertainment, and can also be popularized to other relevant scenes, such as a video conference scene, a product recommendation sale scene and any other scenes needing similar interaction.
Specifically, the viewer watches live broadcast as follows: a viewer may click on a live application installed on the viewer client 103 and choose to enter any one of the live rooms, triggering the viewer client 103 to load a live room interface for the viewer, the live room interface including a number of interactive components, for example: video windows, virtual gift boxes, and public screens, among others.
There are a variety of interaction scenarios within the live broadcast room, for example: a gift-giving interaction scenario in which a user gives a virtual gift, a gift-combo interaction scenario in which a user gives virtual gifts continuously, an entrance interaction scenario in which a user enters the live broadcast room, and the like. In these scenarios the anchor generally interacts with the users by speaking in the live broadcast room, so as to improve the interactive atmosphere of the live broadcast room and enhance the users' live interaction experience.
The manner of speaking is not limited; it may be, for example, speech in voice form or speech in text form.
However, some anchors lack live broadcast experience and their expressiveness is insufficient, so they often find it difficult to speak up on their own initiative, enhance the interactive experience with users, and liven up the interactive atmosphere of the live broadcast room. In addition, when there are too many users in the live broadcast room, the anchor can hardly attend to every user's interaction behavior, which also easily results in the anchor failing to give timely spoken feedback.
Based on the above, the embodiment of the application provides a virtual speaking method in a live broadcast room. Referring to fig. 2, fig. 2 is a schematic flowchart illustrating a virtual speaking method in a live broadcast room according to a first embodiment of the present application, where the method includes the following steps:
s101: and responding to the live broadcast interaction instruction, analyzing the live broadcast interaction instruction, and acquiring an interaction scene identifier, a user identifier and a user name corresponding to the user identifier.
S102: acquiring a first virtual speech matched with the interactive scene corresponding to the interactive scene identifier; and the first virtual speech is obtained by simulating a real speech in an interactive scene corresponding to the interactive scene identifier.
S103: and replacing the user name in the first virtual speech with the user name corresponding to the user identifier to obtain the target virtual speech.
S104: and sending the target virtual speech to a client in the live broadcast room, so that the client in the live broadcast room outputs the target virtual speech to a live broadcast room interface.
In this embodiment, the virtual speech method in the live broadcast room is described with the server as the main execution subject. Meanwhile, in order to illustrate each step of the method more clearly, descriptions from the perspective of the clients are also added to help understand the overall solution. The clients include the anchor client and the viewer client.
In step S101, the server responds to the live broadcast interaction instruction, analyzes the live broadcast interaction instruction, and obtains an interaction scene identifier, a user identifier, and a user name corresponding to the user identifier.
The live broadcast interaction instruction may be any one of a virtual gift-giving instruction, a user entrance instruction, an interactive game start instruction, and the like; such an instruction is generated by the client when triggered by a user's live interaction behavior.
The live broadcast interaction instruction at least comprises a live broadcast room identifier, a user identifier and an interaction scene identifier.
The live broadcast room identifier is the unique identifier corresponding to a live broadcast room (i.e., a channel) created by an anchor, and is used to indicate in which live broadcast room the live interaction behavior occurred.
The user identifier is the unique identifier corresponding to a user; the user may be an anchor or a viewer. The user identifier indicates which users the live interaction behavior in the live broadcast room relates to. For example: a virtual gift-giving behavior relates to the viewer who gives the virtual gift and the anchor who receives it, and a user entrance behavior relates to the anchor who created the live broadcast room and the viewer who enters it.
According to the user identifier, the user name corresponding to the user identifier can be determined. In an alternative embodiment, the user name is the user's nickname on the webcast platform.
The interaction scene identifier indicates which interaction scenario the live broadcast room is currently in; the interaction scenario corresponds to the live interaction behavior. For example: a virtual gift-giving behavior corresponds to a gift-giving interaction scenario, and a user entrance behavior corresponds to an entrance interaction scenario.
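Purely as an illustration of how step S101 might be realized on the server side (the JSON field names, the in-memory user-name store and the example values are assumptions of this sketch, not part of the application), a minimal Python sketch follows:

```python
import json

# Assumed in-memory user store mapping user identifier -> user name (nickname); hypothetical.
USER_NAMES = {"uid_1001": "LittleRain", "uid_2002": "AnchorC"}

def parse_live_interaction_instruction(raw: bytes) -> dict:
    """Parse a live broadcast interaction instruction (S101).

    The instruction is assumed to carry at least the live broadcast room
    identifier, the user identifier and the interaction scene identifier.
    """
    instruction = json.loads(raw)
    return {
        "room_id": instruction["room_id"],    # which live broadcast room the behavior occurred in
        "user_id": instruction["user_id"],    # which user triggered the behavior
        "scene_id": instruction["scene_id"],  # e.g. "gift", "enter_room", "gift_combo"
        "user_name": USER_NAMES.get(instruction["user_id"], ""),  # name matching the identifier
    }

# Example: a gift-giving behavior triggers the client to send such an instruction.
raw = json.dumps({"room_id": "r_88", "user_id": "uid_1001", "scene_id": "gift"}).encode()
print(parse_live_interaction_instruction(raw))
```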
Regarding step S102, the server obtains a first virtual speech matching the interactive scene corresponding to the interactive scene identifier.
The first virtual speech is a virtual speech in image-and-text form. In this embodiment, the image-and-text form may be an emoticon, a symbol, text, or any combination of the three.
Referring to fig. 3, fig. 3 is a schematic display diagram of a live broadcast room interface according to an embodiment of the present application. As can be seen, live components such as a video window 31, a virtual gift box 32, and a public screen 33 are displayed in the live room interface 3. Both the anchor and viewers can interact on the public screen 33 and output real speech in image-and-text form there.
Relative to the real speech output manually on the public screen, the first virtual speech refers to virtual speech in image-and-text form that is not output by a human.
The first virtual speech is obtained by simulating a real speech in an interactive scene corresponding to the interactive scene identifier. In order to obtain the first virtual utterance, a large number of real utterances in an interactive scene corresponding to the interactive scene identifier need to be acquired.
For example: in the gift-giving interaction scenario, the anchor will output real speech on the public screen, such as "Anchor A (anchor nickname) has received the gift B (viewer nickname) gave me, thank you". Then, by simulating the real speech in the gift-giving interaction scenario, a first virtual speech for that scenario can be obtained, for example "Anchor A (anchor nickname) thanks sister B (viewer nickname) for the gift given to me".
In steps S103 to S104, the server replaces the user name in the first virtual speech with the user name corresponding to the user identifier to obtain the target virtual speech, and sends the target virtual speech to the clients in the live broadcast room, so that the clients output the target virtual speech to the live broadcast room interface.
Specifically, the client in the live broadcast room outputs the target virtual speech to a public screen in the live broadcast room interface.
Referring to fig. 4, fig. 4 is a schematic display diagram of a target virtual speech in a live broadcast room interface according to an embodiment of the present application. It can be seen that a target virtual speech 34 is displayed on the public screen 33; the target virtual speech 34 reads "Anchor C has received the gift sister D gave me", where C is the nickname of the anchor receiving the virtual gift, D is the nickname of the viewer giving the virtual gift, and that viewer is female.
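As an illustration of steps S103 to S104 only (the "{user_name}" placeholder convention and the broadcast helper are assumptions of this sketch, not something the application prescribes), the name replacement and the push to the clients could look like this:

```python
def broadcast(room_id: str, message: dict) -> None:
    """Hypothetical stand-in for the platform's push channel to every client in the room."""
    print(f"[room {room_id}] {message}")

def build_target_virtual_speech(first_virtual_speech: str, user_name: str) -> str:
    """Replace the user-name placeholder in the first virtual speech (S103)."""
    return first_virtual_speech.replace("{user_name}", user_name)

def send_to_live_room(room_id: str, target_virtual_speech: str) -> None:
    """Send the target virtual speech to the clients in the live broadcast room (S104);
    each client then renders it on the public screen of the live broadcast room interface."""
    broadcast(room_id, {"type": "virtual_speech", "text": target_virtual_speech})

speech = build_target_virtual_speech("Anchor C has received the gift sister {user_name} gave me", "D")
send_to_live_room("r_88", speech)  # -> "Anchor C has received the gift sister D gave me"
```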
In the embodiments of the application, a live broadcast interaction instruction is parsed in response to the live broadcast interaction instruction, and an interaction scene identifier, a user identifier and the user name corresponding to the user identifier are obtained. Real speech in the interaction scene corresponding to the interaction scene identifier is then simulated to obtain a first virtual speech matched with that interaction scene, the user name in the first virtual speech is replaced with the user name corresponding to the user identifier to obtain a target virtual speech, and the target virtual speech is sent to the clients in the live broadcast room, so that the clients output the target virtual speech to the live broadcast room interface. In this way, users can feel the anchor's attention, the users' live interaction experience is improved, and the anchor's live broadcast expressiveness is enhanced, which helps prompt more live interaction behaviors, livens up the atmosphere of the live broadcast room, and improves users' viewing retention rate and viewing duration.
In an alternative embodiment, to further improve how realistically the virtual speech simulates real speech, referring to fig. 5, S102 includes steps S1021 to S1022, which are as follows:
s1021: and acquiring interactive keywords corresponding to the interactive scene identification.
S1022: inputting the interactive keywords corresponding to the interactive scene identifier into a pre-trained virtual speech generation network model, and acquiring a first virtual speech matched with the interaction scene corresponding to the interactive scene identifier; the first virtual speech at least comprises the interactive keywords or keywords with meanings similar to those of the interactive keywords; the training data of the pre-trained virtual speech generation network model at least comprises real speech in several interaction scenarios.
In step S1021, the server pre-stores the interactive keywords corresponding to different interactive scene identifiers.
In an alternative embodiment, the interactive keyword may be analyzed by the server from the real utterance in the interactive scenario corresponding to the different interactive scenario identifiers.
Specifically, the interactive keywords may be high-frequency words in the real speech of the interaction scenario corresponding to the interactive scene identifier.
For example: "give" and "gift" are high-frequency words in the gift-giving interaction scenario, so these high-frequency words can serve as the interactive keywords corresponding to the gift-giving interaction scene identifier.
In step S1022, the server inputs the interactive keyword corresponding to the interactive scene identifier to the pre-trained virtual speech generation network model, and obtains the first virtual speech matched with the interactive scene corresponding to the interactive scene identifier.
The first virtual speech at least comprises the interactive keywords or keywords with meanings similar to those of the interactive keywords.
Because the first virtual speech contains the interactive keywords, or keywords whose meanings are similar to them, the probability that viewers recognize it as virtual speech can be effectively reduced, achieving a better simulation of real speech.
For example: in the gift-giving interaction scenario the interactive keywords are "give" and "gift"; keywords with meanings similar to "give" include "send", "present", and the like, and keywords with meanings similar to "gift" include the specific names of the virtual gifts.
The pre-trained virtual utterance generation network model may be obtained by any one of existing neural network training methods, and is not limited herein, and training data of the pre-trained virtual utterance generation network at least includes real utterances in a plurality of interaction scenarios.
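A compact sketch of S1021 to S1022, assuming a keyword table keyed by interaction scene identifier and a generation model object exposing a generate(keywords) method (both the table contents and the model interface are assumptions of this sketch):

```python
# Assumed mapping from interaction scene identifier to pre-extracted high-frequency
# interactive keywords (S1021); the entries here are illustrative only.
SCENE_KEYWORDS = {
    "gift": ["give", "gift"],
    "enter_room": ["welcome", "enter"],
}

def generate_first_virtual_speech(scene_id: str, generator) -> str:
    """Feed the scene's interactive keywords to the pre-trained virtual speech
    generation network model and obtain a first virtual speech that contains the
    keywords or words with similar meanings (S1022).

    `generator` is an assumed model object with a generate(keywords) -> str method.
    """
    keywords = SCENE_KEYWORDS.get(scene_id, [])
    return generator.generate(keywords)
```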
In an alternative embodiment, the virtual speech generation network model and the virtual speech discrimination network model are combined into an adversarial neural network model (a generative adversarial network), and the two models are trained jointly. As shown in fig. 6, before S1022, the method includes the steps of:
s1023: the method comprises the steps of obtaining real speech under a plurality of interactive scenes and interactive keywords under the plurality of interactive scenes, inputting the interactive keywords under the plurality of interactive scenes into a virtual speech generation network model, and obtaining virtual speech under the plurality of interactive scenes.
S1024: iteratively training the virtual speech discrimination network model according to the real speech, the virtual speech, a preset first loss function and a preset first model optimization algorithm, and optimizing trainable parameters in the virtual speech discrimination network model until the value of the first loss function meets a preset first training termination condition to obtain the currently trained virtual speech discrimination network model.
S1025: and modifying the label of the virtual speech into true, inputting the virtual speech into the currently trained virtual speech discrimination network model, and acquiring the discrimination result of the virtual speech.
S1026: and if the identification result of the virtual speech meets a preset second training termination condition, obtaining a pre-trained virtual speech generation network model and a pre-trained virtual speech identification network model.
S1027: if the identification result of the virtual speech does not meet the preset second training termination condition, obtaining a value of a second loss function according to the identification result of the virtual speech, the label of the virtual speech and a preset second loss function, optimizing trainable parameters of the virtual speech generation network model according to the value of the second loss function and a preset second model optimization algorithm, and obtaining the currently trained virtual speech generation network model.
S1028: inputting interactive keywords in a plurality of interactive scenes into a currently trained virtual utterance generation network model, re-acquiring virtual utterances in the plurality of interactive scenes, and repeatedly executing the step of iteratively training the virtual utterance identification network model and the step of optimizing trainable parameters of the virtual utterance generation network model until the identification result of the first virtual utterance meets a preset second training termination condition, so as to obtain a pre-trained virtual utterance generation network model and a pre-trained virtual utterance identification network model.
In step S1023, the real speech in several kinds of interaction scenes and the interaction keywords in several kinds of interaction scenes are obtained, the interaction keywords in several kinds of interaction scenes are input to the virtual speech generation network model, and the virtual speech in several kinds of interaction scenes is obtained. The virtual speech generation network model is a virtual speech generation network model after random initialization.
With respect to step S1024, the label of the real speech is true and the label of the virtual speech is false. The real speech and the virtual speech are input into the virtual speech discrimination network model to obtain discrimination results for both, and the value of the first loss function is calculated from those discrimination results and the preset first loss function. If the value of the first loss function meets the preset first training termination condition, the currently trained virtual speech discrimination network model is obtained; if it does not, the trainable parameters of the virtual speech discrimination network model are optimized according to the value of the first loss function and the preset first model optimization algorithm, and the above steps are repeated until the value of the first loss function meets the preset first training termination condition, yielding the currently trained virtual speech discrimination network model.
In steps S1025 to S1027, the label of the virtual utterance is modified to true, and the virtual utterance is input to the currently trained virtual utterance discrimination network model, so as to obtain the discrimination result of the virtual utterance. And if the identification result of the virtual speech meets a preset second training termination condition, obtaining a pre-trained virtual speech generation network model and a pre-trained virtual speech identification network model. If the identification result of the virtual speech does not meet the preset second training termination condition, obtaining a value of a second loss function according to the identification result of the virtual speech, the label of the virtual speech and a preset second loss function, optimizing trainable parameters of the virtual speech generation network model according to the value of the second loss function and a preset second model optimization algorithm, and obtaining the currently trained virtual speech generation network model.
In the adversarial neural network model of this embodiment, when the probability that the virtual speech is judged to be true is in the vicinity of 0.5, both the virtual speech generation network model and the virtual speech discrimination network model have achieved a relatively good adversarial training effect. Therefore, the preset second training termination condition is an interval around 0.5; when the discrimination result of the virtual speech falls within that interval, it satisfies the preset second training termination condition.
If the discrimination result of the virtual speech tends toward 0, the probability that the virtual speech discrimination network model considers the virtual speech to be true is close to 0, which means the virtual speech generated by the generation network model is easily recognized by users and the generation effect of the virtual speech generation network model is poor. Because the label of the virtual speech has been modified to true, i.e., 1, the value of the second loss function obtained from the label of the virtual speech, the discrimination result of the virtual speech and the preset second loss function is large, and the trainable parameters of the virtual speech generation network model can then be substantially optimized based on that value and the preset second model optimization algorithm, yielding the currently trained virtual speech generation network model.
If the discrimination result of the virtual speech tends toward 1, the probability that the virtual speech discrimination network model considers the virtual speech to be true is close to 1, which means the discrimination effect of the virtual speech discrimination network model is poor: it judges false virtual speech to be true, so the virtual speech discrimination network model needs to be trained further.
Regarding step S1028, inputting the interaction keywords in the plurality of interaction scenarios to the currently trained virtual utterance generation network model, re-acquiring the virtual utterances in the plurality of interaction scenarios, and repeatedly executing the step of iteratively training the virtual utterance identification network model and the step of optimizing the trainable parameters of the virtual utterance generation network model until the identification result of the first virtual utterance satisfies the preset second training termination condition, to obtain the pre-trained virtual utterance generation network model and the pre-trained virtual utterance identification network model.
The first loss function, the second loss function, the first model optimization algorithm, and the second model optimization algorithm are not limited herein, and may be any one of the existing loss functions and neural network model optimization algorithms.
In this embodiment, the virtual speech generation network model and the virtual speech discrimination network model form an adversarial neural network model and are trained jointly, so that the generated first virtual speech is more credible and more easily taken by users to be real, human speech, which further improves the users' live interaction experience.
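The application leaves the loss functions and optimization algorithms open. A compact sketch of the joint adversarial training of S1023 to S1028, assuming PyTorch, binary cross-entropy for both loss functions, Adam for both optimization algorithms, and generator/discriminator modules that already consume and produce tensor encodings of the speech (all of these are assumptions of the sketch, not requirements of the application):

```python
import torch
from torch import nn

def train_adversarially(generator: nn.Module, discriminator: nn.Module,
                        real_speech: torch.Tensor, keywords: torch.Tensor,
                        rounds: int = 1000, band: float = 0.05) -> None:
    """Joint training of the generation and discrimination models (S1023-S1028).

    The discriminator is assumed to output one probability per sample, shape (N, 1).
    """
    bce = nn.BCELoss()                                    # stand-in for the first/second loss functions
    opt_d = torch.optim.Adam(discriminator.parameters())  # stand-in for the first model optimization algorithm
    opt_g = torch.optim.Adam(generator.parameters())      # stand-in for the second model optimization algorithm

    for _ in range(rounds):
        # S1023 / S1028: generate virtual speech from the interactive keywords.
        virtual_speech = generator(keywords)

        # S1024: train the discriminator with real speech labelled true (1)
        # and virtual speech labelled false (0).
        opt_d.zero_grad()
        loss_d = (bce(discriminator(real_speech), torch.ones(len(real_speech), 1))
                  + bce(discriminator(virtual_speech.detach()), torch.zeros(len(virtual_speech), 1)))
        loss_d.backward()
        opt_d.step()

        # S1025: re-score the virtual speech after relabelling it as true.
        score = discriminator(virtual_speech)

        # S1026: stop once the virtual speech is judged true with probability near 0.5.
        if abs(score.mean().item() - 0.5) < band:
            break

        # S1027: otherwise optimize the generator against the "true" label.
        opt_g.zero_grad()
        loss_g = bce(score, torch.ones(len(virtual_speech), 1))
        loss_g.backward()
        opt_g.step()
```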
In an alternative embodiment, the pre-trained virtual speech discrimination network model can be used to determine whether a user in the live broadcast room is using a speech plug-in (an automated speaking program).
Specifically, the server acquires a speech output in a live broadcast interface, inputs the speech output in the live broadcast interface into a pre-trained virtual speech discrimination network model, and obtains a speech discrimination result; and judging whether the user in the live broadcast room adopts a speech plug-in program or not according to the speech identification result.
The pre-trained virtual speech discrimination network model is obtained through joint training with the virtual speech generation network model. The result of the qualification of the utterance is the probability that the utterance is true.
Optionally, if the average probability that the speech output in the interface of the live broadcast room is true does not satisfy the preset first threshold, it is determined that the user in the live broadcast room adopts the speech plug-in program, and if the average probability that the speech output in the interface of the live broadcast room is true satisfies the preset first threshold, it is determined that the user in the live broadcast room does not adopt the speech plug-in program.
Optionally, the method may further include separately detecting a user who frequently speaks in the live broadcast room, that is, obtaining a speech output by the user in a live broadcast room interface, inputting the speech of the user to a pre-trained virtual speech discrimination network model to obtain a speech discrimination result of the user, determining that the user adopts a speech plug-in program if an average probability that the speech discrimination result of the user is true does not satisfy a preset first threshold, and determining that the user does not adopt the speech plug-in program if the average probability that the speech discrimination result of the user is true satisfies the preset first threshold.
In this embodiment, whether a user in the live broadcast room is using a speech plug-in program can be effectively identified based on the pre-trained virtual speech discrimination network model, which effectively reduces the possibility of users flooding the screen through plug-in programs and improves the interaction environment of the live broadcast room.
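A minimal sketch of the plug-in check described above; the 0.8 value stands in for the preset first threshold, which the application does not fix:

```python
def uses_speech_plugin(genuine_probs: list, threshold: float = 0.8) -> bool:
    """Decide whether the captured speech suggests a speech plug-in program.

    genuine_probs holds, for each captured utterance, the probability output by the
    pre-trained virtual speech discrimination network model that the utterance is real.
    If the average probability fails to reach the preset threshold, the speech looks
    machine-generated and a plug-in is suspected.
    """
    if not genuine_probs:
        return False
    average = sum(genuine_probs) / len(genuine_probs)
    return average < threshold

# Example: most utterances scored as unlikely to be real -> plug-in suspected.
print(uses_speech_plugin([0.31, 0.42, 0.28, 0.55]))  # True under the assumed threshold
```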
In an optional embodiment, the live interactive instruction is a virtual gift giving instruction, and the method further includes the steps of:
and if the virtual gift giving instruction corresponding to the same user identifier is continuously responded, the server acquires a first quantity of the continuously given target virtual gifts of the user corresponding to the user identifier, issues the continuous sending prompt information matched with the first quantity to the client corresponding to the user identifier, enables the client corresponding to the user identifier to receive the continuous sending prompt information, and outputs the continuous sending prompt information to a live broadcasting room interface.
In this embodiment, the server may obtain a first quantity of the continuously presented target virtual gifts in a gift swiping interaction scene, and further generate continuous delivery prompt information according to the first quantity of the continuously presented target virtual gifts.
The continuous-sending prompt information at least comprises a second quantity and prompt contents corresponding to the target process, and the second quantity is the quantity which needs to be given continuously for the target virtual gift when the target process is executed by the trigger server.
Through continuous sending of the prompt information, the user can be prompted about how many virtual gifts are continuously presented, and the target process can be triggered, so that the generation of live broadcast interaction behaviors can be effectively promoted, and the interaction atmosphere in a live broadcast room is enhanced.
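Purely for illustration, assuming the trigger thresholds of the target processes are configured as below (the concrete numbers and process names are not part of the application), the continuous-send prompt could be built as follows:

```python
# Assumed configuration: second quantity (total gifts to give continuously)
# needed to trigger each target process; values are illustrative only.
COMBO_TRIGGERS = [
    (10, "light up the combo banner"),
    (66, "unlock the room-wide animation"),
]

def build_combo_prompt(first_quantity: int):
    """Given the first quantity of target virtual gifts already given continuously,
    prompt how many more are needed to trigger the next target process."""
    for second_quantity, target_process in COMBO_TRIGGERS:
        if first_quantity < second_quantity:
            remaining = second_quantity - first_quantity
            return f"Give {remaining} more to {target_process}!"
    return None  # every configured target process has already been triggered

print(build_combo_prompt(7))  # -> "Give 3 more to light up the combo banner!"
```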
In an optional embodiment, the live interactive instruction is a virtual gift giving instruction, and the method further includes the steps of:
the server acquires the identity label corresponding to the user identification, generates identity opening prompt information if the identity label does not include a preset identity label, sends the identity opening prompt information to the client corresponding to the user identification, enables the client corresponding to the user identification to receive the identity opening prompt information, and outputs the identity opening prompt information to a direct broadcasting room interface.
The identity opening prompt message at least comprises an identity name corresponding to the preset identity label.
In the live broadcast room, some virtual gifts can only be given by users who hold the corresponding identity label; therefore, users who have not yet opened that identity can be prompted, through the identity opening prompt information, to open it so as to interact better with the anchor.
In an optional embodiment, the server may further detect the speech output in the live broadcast room interface. If the speech includes an address link, the server determines whether the address link poses a potential safety hazard; if so, the server prohibits users from triggering the address link to perform a page jump and moves the user who sent the speech out of the live broadcast room, thereby effectively safeguarding the account security of users participating in the webcast and improving their live interaction experience.
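A minimal sketch of this moderation step; the unsafe-link rule and the removal helper are placeholders for whatever safety service and room-management interface the platform actually provides:

```python
import re

URL_PATTERN = re.compile(r"https?://\S+")

def is_unsafe_link(url: str) -> bool:
    """Placeholder risk check; a real system would consult a link-reputation service."""
    return url.startswith("http://")  # illustrative rule only

def remove_from_room(room_id: str, user_id: str) -> None:
    """Placeholder for moving the offending user out of the live broadcast room."""
    print(f"user {user_id} removed from room {room_id}")

def moderate_speech(room_id: str, user_id: str, text: str) -> bool:
    """Return True if the speech may be shown with its links active; otherwise
    disable page jumps for the address link and remove the sender from the room."""
    for link in URL_PATTERN.findall(text):
        if is_unsafe_link(link):
            remove_from_room(room_id, user_id)
            return False
    return True
```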
Referring to fig. 7, fig. 7 is a flowchart illustrating a virtual speaking method in a live broadcast room according to a second embodiment of the present application, including the following steps:
s201: acquiring a real question speech in the live broadcast room and the user identifier corresponding to the user who sent the real question speech; if no real reply speech related to the real question speech is output in the live broadcast room within a preset reply time limit, generating a live broadcast interaction instruction according to the user identifier and the interaction scene identifier corresponding to the question scenario.
S202: and responding to the live broadcast interaction instruction, analyzing the live broadcast interaction instruction, and acquiring an interaction scene identifier, a user identifier and a user name corresponding to the user identifier.
S203: if the interactive scene corresponding to the interactive scene identification is a question scene, acquiring a real question utterance and a first virtual reply utterance matched with the real question utterance; and the first virtual reply speech is obtained by simulating a real reply speech about the real question speech in a question scene.
S204: and replacing the user name in the first virtual reply speech with the user name corresponding to the user identifier to obtain the target virtual speech.
S205: and sending the target virtual speech to a client in the live broadcast room, so that the client in the live broadcast room outputs the target virtual speech to a live broadcast room interface.
Steps S202 and S205 are the same as steps S101 and S104, and the related description may refer to the first embodiment, which is not repeated here.
In step S201, the server acquires the real question speech in the live broadcast room and the user identifier corresponding to the user who sent the real question speech; if a real reply speech related to the real question speech is not output in the live broadcast room within the preset reply time limit, a live broadcast interaction instruction is generated according to the user identifier and the interaction scene identifier corresponding to the question scene.
In this embodiment, the server performs sentence analysis on the utterances in the live broadcast room and determines whether an utterance is an interrogative sentence; if so, the utterance is determined to be a real question utterance, and the server then monitors whether a real reply utterance related to the real question utterance is output in the live broadcast room within the preset reply time limit.
And if the real reply speech related to the real question speech is not output in the live broadcast room within the preset reply time limit, generating a live broadcast interaction instruction according to the user identification and the interaction scene identification corresponding to the question scene.
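For illustration only, the following sketch shows one way the server could track unanswered questions and emit the corresponding live broadcast interaction instructions; the time limit, scene identifier and record layout are assumed values.

```python
import time

REPLY_TIME_LIMIT = 60           # seconds; the preset reply time limit is assumed
QUESTION_SCENE_ID = "question"  # assumed identifier for the question scene

def collect_timeout_instructions(pending_questions: list[dict],
                                 answered_ids: set,
                                 now: float | None = None) -> list[dict]:
    """pending_questions items look like {'question_id', 'user_id', 'asked_at'}."""
    now = now if now is not None else time.time()
    instructions = []
    for q in list(pending_questions):
        if q["question_id"] in answered_ids:
            pending_questions.remove(q)       # a real reply speech arrived in time
        elif now - q["asked_at"] > REPLY_TIME_LIMIT:
            pending_questions.remove(q)       # unanswered: hand over to the virtual reply flow
            instructions.append({"scene_id": QUESTION_SCENE_ID,
                                 "user_id": q["user_id"],
                                 "question_id": q["question_id"]})
    return instructions
```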
Regarding step S203, after the server responds to the live broadcast interaction instruction and obtains the interaction scene identifier, if the interaction scene corresponding to the interaction scene identifier is a question scene, the server obtains a real question utterance and a first virtual reply utterance matched with the real question utterance.
In an alternative embodiment, the first virtual reply utterance is derived by simulating a real reply utterance with respect to a real question utterance in a question scene.
In another alternative embodiment, the first virtual reply utterance may also be a real reply utterance that was historically sent in the live broadcast room in response to the real question utterance.
In this embodiment, the situation in which a user's question receives no response in the live broadcast room can be effectively avoided, which prevents the user's enthusiasm for interaction from declining and helps enhance user stickiness.
In an alternative embodiment, referring to fig. 8, S203 includes the steps of:
S2031: and acquiring the question keywords corresponding to the real question speech.
S2032: inputting question keywords corresponding to the real question-asking speech into a pre-trained virtual speech generation network model, and acquiring a first virtual reply speech matched with the real question-asking speech; the training data of the pre-trained virtual speech generation network model at least comprises real reply speech about a plurality of real question speech in a question scene.
In a question interaction scene, the question keywords are related to the real question speech, and the server can perform semantic analysis on the real question speech so as to obtain the question keywords corresponding to the real question speech.
For example: for the real question speech "what does the anchor like to do in daily life", the question keywords of the real question speech can be obtained as "hobby" or "interest" through semantic analysis.
The server inputs the question keywords corresponding to the real question-asking speech into the pre-trained virtual speech generation network model, and obtains a first virtual reply speech matched with the real question-asking speech.
The pre-trained virtual utterance generation network model may be obtained by any one of existing neural network training methods, and is not limited herein, and the training data of the pre-trained virtual utterance generation network at least includes real reply utterances about a plurality of real question utterances in a question scene.
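As an illustrative sketch only, the keyword extraction and generation steps of S2031-S2032 might be wired together as follows; the rule table and the generate() interface are placeholders standing in for the semantic analysis and the pre-trained virtual speech generation network model.

```python
# Placeholder keyword table standing in for the semantic analysis step; a real
# system would use an NLP keyword-extraction model instead of this lookup.
KEYWORD_RULES = {
    "like to do": "hobby",
    "favorite": "preference",
    "how old": "age",
}

def extract_question_keywords(question: str) -> list[str]:
    return [kw for phrase, kw in KEYWORD_RULES.items() if phrase in question.lower()]

def first_virtual_reply(question: str, generator_model) -> str:
    """`generator_model` is assumed to expose generate(keywords) -> str."""
    keywords = extract_question_keywords(question) or ["general"]
    return generator_model.generate(keywords)
```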
In an optional embodiment, the virtual speech generation network model and the virtual speech discrimination network model are combined into an adversarial neural network model, and the virtual speech generation network model and the virtual speech discrimination network model are jointly trained.
The specific training process is as follows:
and acquiring real reply speeches of the real question speeches, and acquiring the question keywords corresponding to the real question speeches according to the real question speeches. Inputting the question keywords corresponding to the real questioning speeches into the virtual speech generation network model, and acquiring the virtual reply speeches of the real questioning speeches.
And then, iteratively training a virtual utterance discrimination network model according to the real reply utterances of the real questioning utterances, the virtual reply utterances of the real questioning utterances, a preset first loss function and a preset first model optimization algorithm, and optimizing trainable parameters in the virtual utterance discrimination network model until the value of the first loss function meets a preset first training termination condition, to obtain the currently trained virtual utterance discrimination network model.
And modifying the label of the virtual reply speech into true, inputting the virtual reply speech into the currently trained virtual speech discrimination network model, and acquiring the discrimination result of the virtual reply speech.
And if the discrimination result of the virtual reply speech meets a preset second training termination condition, obtaining a pre-trained virtual speech generation network model and a pre-trained virtual speech discrimination network model.
If the discrimination result of the virtual reply utterance does not meet the preset second training termination condition, obtaining a value of a second loss function according to the discrimination result of the virtual reply utterance, the label of the virtual reply utterance and a preset second loss function, and optimizing trainable parameters of the virtual utterance generation network model according to the value of the second loss function and a preset second model optimization algorithm to obtain the currently trained virtual utterance generation network model.
Inputting the question keywords corresponding to the real questioning speeches into the currently trained virtual speech generation network model, re-acquiring the virtual reply speeches of the real questioning speeches, and repeatedly executing the steps of iteratively training the virtual speech discrimination network model and optimizing trainable parameters of the virtual speech generation network model until the discrimination result of the virtual reply speech meets a preset second training termination condition, so as to obtain the pre-trained virtual speech generation network model and the pre-trained virtual speech discrimination network model.
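The joint training loop described above can be condensed into the following sketch, written here with PyTorch as an assumed framework; `generator` and `discriminator` are placeholder modules operating on embedded replies (the discriminator is assumed to output a probability in [0, 1]), and the loss targets stand in for the first and second training termination conditions.

```python
import torch
import torch.nn.functional as F

def joint_train(generator, discriminator, keyword_batch, real_reply_batch,
                g_opt, d_opt, d_steps=5, max_rounds=100,
                d_target=0.3, g_target=0.69):
    """Condensed adversarial training loop; returns the pre-trained models."""
    for _ in range(max_rounds):
        # 1) iteratively train the discrimination model (first loss function):
        #    real replies are labelled true (1), virtual replies false (0)
        for _ in range(d_steps):
            fake_reply = generator(keyword_batch).detach()
            d_real = discriminator(real_reply_batch)
            d_fake = discriminator(fake_reply)
            d_loss = (F.binary_cross_entropy(d_real, torch.ones_like(d_real))
                      + F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake)))
            d_opt.zero_grad()
            d_loss.backward()
            d_opt.step()
            if d_loss.item() < d_target:      # first training termination condition
                break
        # 2) relabel the virtual replies as "true" and score them with the
        #    currently trained discrimination model (second loss function)
        fake_reply = generator(keyword_batch)
        d_fake = discriminator(fake_reply)
        g_loss = F.binary_cross_entropy(d_fake, torch.ones_like(d_fake))
        if g_loss.item() < g_target:          # second training termination condition
            break
        # 3) otherwise optimise the generation model and repeat from step 1
        g_opt.zero_grad()
        g_loss.backward()
        g_opt.step()
    return generator, discriminator
```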
In this embodiment, the virtual utterance generation network model and the virtual utterance discrimination network model form an adversarial neural network model and are jointly trained, so that the first virtual reply utterance has higher credibility and can more easily be regarded by users as a real, human reply, which further improves the users' live broadcast interaction experience.
Referring to fig. 9, fig. 9 is a flowchart illustrating a virtual speaking method in a live broadcast room according to a third embodiment of the present application, including the following steps:
S301: and responding to the live broadcast interaction instruction, analyzing the live broadcast interaction instruction, and acquiring an interaction scene identifier, a user identifier and a user name corresponding to the user identifier.
S302: acquiring a first virtual speech matched with the interactive scene corresponding to the interactive scene identifier; and the first virtual speech is obtained by simulating a real speech in an interactive scene corresponding to the interactive scene identifier.
S303: and replacing the user name in the first virtual speech with the user name corresponding to the user identifier to obtain the target virtual speech.
S304: and sending the target virtual speech to a client in the live broadcast room, so that the client in the live broadcast room outputs the target virtual speech to a live broadcast room interface.
S305: acquiring the activity of each live broadcast room, acquiring a second virtual speech if the activity does not meet a preset activity threshold, and sending the second virtual speech to the clients in the live broadcast room, so that the clients in the live broadcast room output the second virtual speech to the live broadcast room interface; wherein the second virtual speech is obtained by simulating a real speech related to the current hot topic.
Steps S301 to S304 are the same as steps S101 to S104, and the related description can be referred to the first embodiment, and will not be repeated here.
With respect to step S305, the server monitors the liveness of the various live rooms.
In an alternative embodiment, the liveness may be determined based on the number of users in the live room, the number of virtual gift presentation, and the number of utterances output by the live room interface.
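For illustration only, a weighted sum is one simple way to combine these three quantities into a liveness score; the weights and threshold below are assumptions.

```python
# Assumed weights and threshold; a production system would tune these.
W_USERS, W_GIFTS, W_SPEECHES = 0.5, 0.3, 0.2
ACTIVITY_THRESHOLD = 100.0

def room_liveness(user_count: int, gift_count: int, speech_count: int) -> float:
    """Combine the three signals described above into a single liveness score."""
    return W_USERS * user_count + W_GIFTS * gift_count + W_SPEECHES * speech_count

# A quiet room: 30 viewers, 5 gifts, 40 speeches -> score 24.5, below the threshold,
# so the server would go on to obtain and push a second virtual speech.
needs_second_virtual_speech = room_liveness(30, 5, 40) < ACTIVITY_THRESHOLD
```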
And if the activity of the live broadcast room does not meet the preset activity threshold, namely the activity of the live broadcast room is low, the server acquires the second virtual speech and sends the second virtual speech to the client in the live broadcast room, so that the client in the live broadcast room outputs the second virtual speech to the interface of the live broadcast room.
And the second virtual speech is obtained by simulating a real speech related to the current hot topic by the server.
The current hot topic can be obtained by analyzing speech output from each live broadcast interface, or can be a hot search topic in a network live broadcast platform. If the number of the current hot topics is multiple, one current hot topic can be randomly selected to generate a second virtual speech, and the current hot topic matched with the anchor attribute tag can be selected according to the anchor attribute tag to generate the second virtual speech.
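A minimal sketch of this topic selection, with illustrative field names, could look as follows:

```python
import random

def pick_hot_topic(hot_topics: list[dict], anchor_tag: str | None) -> dict | None:
    """Prefer a current hot topic whose category matches the anchor attribute tag,
    otherwise fall back to a random choice."""
    if not hot_topics:
        return None
    if anchor_tag:
        matched = [t for t in hot_topics if t.get("category") == anchor_tag]
        if matched:
            return random.choice(matched)
    return random.choice(hot_topics)

topic = pick_hot_topic(
    [{"title": "new movie release", "category": "movie"},
     {"title": "game tournament finals", "category": "game"}],
    anchor_tag="game")   # -> the game topic is selected for the second virtual speech
```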
By outputting the second virtual speech in the live broadcast room interface, the discussion atmosphere in the live broadcast room can be effectively livened up, stimulating users and the anchor to interact and improving the activity of the live broadcast room.
In an alternative embodiment, referring to fig. 10, the step S305 of obtaining the second virtual speech includes the steps of:
S3051: and acquiring topic keywords corresponding to the current hot topic.
S3052: inputting topic keywords into a pre-trained virtual speech generation network model, and acquiring a second virtual speech matched with the current hot topic; the second virtual speech at least comprises topic keywords or keywords similar to the meaning of the topic keywords; the training data of the pre-trained virtual utterance generation network model at least comprises real utterances related to the current hot topic.
The topic keywords are related to the content of the current hot topic, and the server can perform semantic analysis on the real speech related to the current hot topic, so that the topic keywords are obtained.
And then, the server inputs topic keywords into a pre-trained virtual speech generation network model, and acquires a second virtual speech matched with the current hot topic.
Wherein, the second virtual speech at least comprises topic keywords or keywords similar to the meaning of the topic keywords.
The pre-trained virtual utterance generation network model may be obtained by any existing neural network training method, and is not limited herein, and the training data of the pre-trained virtual utterance generation network model at least includes real utterances related to the current hot topic.
In an optional embodiment, the virtual speech generation network model and the virtual speech discrimination network model are combined into an adversarial neural network model, and the virtual speech generation network model and the virtual speech discrimination network model are jointly trained.
The specific training process is as follows:
and acquiring the real speech related to the plurality of topics, and acquiring topic keywords corresponding to the plurality of topics according to the real speech related to the plurality of topics. And inputting topic keywords corresponding to the topics into the virtual speech generation network model, and acquiring virtual speech matched with the topics.
And then, iteratively training the virtual speech discrimination network model according to the real speech related to the topics, the virtual speech matched with the topics, a preset first loss function and a preset first model optimization algorithm, and optimizing trainable parameters in the virtual speech discrimination network model until the value of the first loss function meets a preset first training termination condition to obtain the currently trained virtual speech discrimination network model.
And modifying the label of the virtual speech into true, inputting the virtual speech into a currently trained virtual speech discrimination network model, and acquiring the discrimination result of the virtual speech.
And if the discrimination result of the virtual speech meets a preset second training termination condition, obtaining a pre-trained virtual speech generation network model and a pre-trained virtual speech discrimination network model.
If the discrimination result of the virtual speech does not meet the preset second training termination condition, obtaining a value of a second loss function according to the discrimination result of the virtual speech, the label of the virtual speech and a preset second loss function, and optimizing trainable parameters of the virtual speech generation network model according to the value of the second loss function and a preset second model optimization algorithm to obtain the currently trained virtual speech generation network model.
Inputting topic keywords corresponding to a plurality of topics into the currently trained virtual speech generation network model, re-acquiring virtual speech matched with the plurality of topics, and repeatedly executing the steps of iteratively training the virtual speech discrimination network model and optimizing trainable parameters of the virtual speech generation network model until the discrimination result of the virtual speech meets a preset second training termination condition, so as to obtain the pre-trained virtual speech generation network model and the pre-trained virtual speech discrimination network model.
In this embodiment, the virtual utterance generation network model and the virtual utterance discrimination network model form an adversarial neural network model and are jointly trained, so that the generated second virtual utterance matched with the current hot topic has higher credibility and is more easily regarded by users as a genuine heated discussion in the live broadcast room, which further improves the users' live broadcast interaction experience.
In an optional embodiment, after the second virtual speech is output on the interface of the live broadcast room, the increased activity level is obtained, and if the increased activity level meets a preset activity level increase threshold, an attribute tag of the anchor is set according to a current hot topic discussed in the live broadcast room. Wherein the anchor attribute tag can be entertainment, culture, movie or game, etc.
By setting the attribute tag of the anchor, the anchor's potential can be discovered more easily, which helps the network live broadcast platform give the anchor certain guidance and improves the anchor's live broadcast performance.
Referring to fig. 11, fig. 11 is a flowchart illustrating a virtual speaking method in a live broadcast room according to a fourth embodiment of the present application, including the following steps:
S401: and responding to the live broadcast interaction instruction, analyzing the live broadcast interaction instruction, and acquiring an interaction scene identifier, a user identifier and a user name corresponding to the user identifier.
S402: acquiring a first virtual speech matched with the interactive scene corresponding to the interactive scene identifier; and the first virtual speech is obtained by simulating a real speech in an interactive scene corresponding to the interactive scene identifier.
S403: and replacing the user name in the first virtual speech with the user name corresponding to the user identifier to obtain the target virtual speech.
S404: sending the target virtual speech to a client in a live broadcast room, and enabling the target client in the live broadcast room to output the target virtual speech to a live broadcast room interface according to a preset style; the preset style is that the display background of the target virtual speech is different from the display background of the real speech, and the target user corresponding to the target client at least comprises the anchor.
Steps S401 to S403 are the same as steps S101 to S103, and the description thereof can be referred to the first embodiment, and will not be repeated here.
Regarding step S404, the server sends the target virtual speech to the clients in the live broadcast room, so that the target client in the live broadcast room outputs the target virtual speech to the live broadcast room interface according to a preset style.
The preset style is that the display background of the target virtual speech is different from the display background of the real speech, and the target user corresponding to the target client at least comprises the anchor.
In an optional embodiment, the target user corresponding to the target client may further include a manager.
Referring to fig. 12, fig. 12 is a schematic view illustrating a display of a real speech and a target virtual speech in a live broadcast interface according to an embodiment of the present application. In the target client, as shown in fig. 12, the display background of the target virtual utterance 121 is different from the display background of the real utterance 122. The display background of the target virtual utterance 121 in fig. 12 is gray.
By displaying the target virtual speech and the real speech in a distinguishing manner, the target user can easily tell which speeches are virtual speeches.
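For illustration only, the client-side rendering rule could be sketched as below; the role names and colors (grey as in fig. 12) are assumptions.

```python
VIRTUAL_BACKGROUND = "#d9d9d9"   # grey, as in the example of fig. 12
DEFAULT_BACKGROUND = "#ffffff"
TARGET_ROLES = {"anchor", "manager"}   # target users; managers are optional per the text

def speech_background(is_virtual: bool, viewer_role: str) -> str:
    """Only target users see target virtual speeches on a distinct background."""
    if is_virtual and viewer_role in TARGET_ROLES:
        return VIRTUAL_BACKGROUND
    return DEFAULT_BACKGROUND
```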
In an optional embodiment, the server receives evaluation data of the target virtual speech sent by the target client, and optimizes the pre-trained virtual speech generation network model and the pre-trained virtual speech discrimination network model according to the target virtual speech and the evaluation data of the target virtual speech.
The evaluation data is received by the target client through the evaluation control, and the evaluation control is displayed in the live broadcast room interface when the target user successfully triggers the target virtual speech. For example: after the target user long-presses the target virtual speech, the evaluation control is displayed beside the target virtual speech.
The evaluation data can reflect whether the target virtual speech well simulates the real speech, and the evaluation data of the target virtual speech is more beneficial to the optimization of the virtual speech generation network model and the virtual speech discrimination network model compared with the conventional training data.
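As a purely illustrative sketch, the evaluation data could be buffered as extra labelled samples for the next fine-tuning round; the label convention and helper names are assumptions.

```python
feedback_samples: list[tuple[str, float]] = []

def record_evaluation(target_virtual_speech: str, rated_convincing: bool) -> None:
    # label 1.0: the speech read like a real human speech; 0.0: obviously virtual
    feedback_samples.append((target_virtual_speech, 1.0 if rated_convincing else 0.0))

def next_finetune_batch(batch_size: int = 32) -> list[tuple[str, float]]:
    """Hand accumulated feedback to the next fine-tuning round of the two models."""
    batch = feedback_samples[:batch_size]
    del feedback_samples[:batch_size]
    return batch
```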
In an optional embodiment, the server may also receive evaluation data for a speech sent by a client other than the target client. If the speech is a target virtual speech, the evaluation data may likewise be used to optimize the pre-trained virtual speech generation network model and the pre-trained virtual speech discrimination network model; if the speech is a real speech, the speech level of the user who sent it may be adjusted, or that user's speech may be restricted.
Referring to fig. 13, fig. 13 is a flowchart illustrating a virtual speaking method in a live broadcast room according to a fifth embodiment of the present application, including the following steps:
S501: and responding to the live broadcast interaction instruction, analyzing the live broadcast interaction instruction, and acquiring an interaction scene identifier, a user identifier and a user name corresponding to the user identifier.
S502: acquiring a first virtual speech matched with the interactive scene corresponding to the interactive scene identifier; and the first virtual speech is obtained by simulating a real speech in an interactive scene corresponding to the interactive scene identifier.
S503: and replacing the user name in the first virtual speech with the user name corresponding to the user identifier to obtain the target virtual speech.
S504: and sending the target virtual speech to a client in the live broadcast room, so that the client in the live broadcast room outputs the target virtual speech to a live broadcast room interface.
S505: acquiring the number of users and the real number of speeches in each live broadcast room, sending a speech play starting instruction to audience client sides in the live broadcast rooms when the number of users and the real number of speeches meet preset speech play starting conditions, enabling the audience client sides in the live broadcast rooms to respond to the speech play starting instruction, respectively storing judgment records of speeches output in a live broadcast room interface by the users, and respectively sending the judgment records to a server when responding to a speech play ending instruction sent by the server; the judgment record comprises a user identification corresponding to the user and a judgment result of the user for speaking, the judgment result is received by the audience client through the judgment control, and the judgment control is displayed in a live broadcast interface when the user successfully triggers the speaking.
S506: receiving the judgment records, obtaining the speech play scores corresponding to the user identifications according to the judgment results of the users on each speech in the judgment records, and sending the speech play scores corresponding to all the user identifications to the client in the live broadcast room, so that the client in the live broadcast room displays the speech play scores corresponding to all the user identifications in an interface of the live broadcast room.
Steps S501 to S504 are the same as steps S101 to S104, and the description thereof can be referred to the first embodiment, and will not be repeated here.
In step S505, the server obtains the number of users and the number of real speeches in each live broadcast room, and sends a speech play starting instruction to the audience clients in a live broadcast room when the number of users and the number of real speeches satisfy a preset speech play starting condition. That is to say, the interactive speech play can be launched in live broadcast rooms with a larger number of users and more real speeches.
The speech play starting instruction at least comprises a live broadcast room identifier and a speech play identifier.
And then, the audience clients in the live broadcast room respond to the speech play starting instruction, respectively store the judgment records of the speech output in the live broadcast room interface by the user, and respectively send the judgment records to the server when responding to the speech play ending instruction sent by the server.
The judgment record comprises the user identifier corresponding to the user and the user's judgment results for the speeches; a judgment result is received by the audience client through the judgment control, and the judgment control is displayed in the live broadcast room interface when the user successfully triggers a speech. For example: after the user long-presses a speech, the judgment control can be displayed beside that speech.
The utterance playing ending instruction may be issued by the server when it is determined that the utterance playing duration reaches a preset utterance playing duration.
In an optional embodiment, after responding to the speech play opening instruction, the audience client in the live broadcast room further comprises a step of acquiring play description popup window data and displaying the play description popup window in a live broadcast room interface according to the play description popup window data. And the play description popup window displays a play description corresponding to the speech play identification.
In an optional embodiment, after responding to the speech play starting instruction, the audience client in the live broadcast room further comprises a step of acquiring judgment record submitting control data and displaying the judgment record submitting control in a live broadcast room interface according to the judgment record submitting control data. And the user can interact with the judgment record submitting control at any time in the speech playing method and submit the judgment record to the server.
It can be understood that, if the user submits the judgment record to the server for multiple times, the server may merge the judgment records corresponding to the same user identifier according to the user identifier in the judgment record.
In step S506, the server receives the judgment record, obtains the speech play scores corresponding to the user identifiers according to the judgment result of each speech by the user in the judgment record, and sends the speech play scores corresponding to all the user identifiers to the clients in the live broadcast room.
And the client in the live broadcast room displays the speech play scores corresponding to all the user identifications in the live broadcast room interface.
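For illustration only, the speech play score of step S506 could be computed by comparing each viewer's judgments against the server's ground-truth labels; the record layout below is an assumption.

```python
# Server-side ground truth for the speeches shown during the play (assumed layout).
truth = {"msg-1": "virtual", "msg-2": "real", "msg-3": "virtual"}

def speech_play_score(judgment_record: dict) -> int:
    """One point per speech whose nature (virtual vs. real) the viewer judged correctly."""
    return sum(1 for speech_id, verdict in judgment_record["judgments"].items()
               if truth.get(speech_id) == verdict)

record = {"user_id": "u42",
          "judgments": {"msg-1": "virtual", "msg-2": "virtual", "msg-3": "virtual"}}
print(speech_play_score(record))   # -> 2
```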
In an alternative embodiment, rewards may be issued to users with higher speech play scores.
It should be noted that the judgment record may also be used to further optimize the virtual utterance generation network model and the virtual utterance discrimination network model.
In this embodiment, the interactive speech play in the live broadcast room can enrich the interaction modes in the live broadcast room and improve the live broadcast interaction experience; it can also collect a large amount of training data that can be used to optimize the virtual speech generation network model and the virtual speech discrimination network model, further improving the simulation effect of the virtual speeches.
Referring to fig. 14, fig. 14 is a schematic flowchart illustrating a virtual speaking method in a live broadcast room according to a sixth embodiment of the present application, where the method includes:
S601: and the server responds to the live broadcast interaction instruction, analyzes the live broadcast interaction instruction, and acquires an interaction scene identifier, a user identifier and a user name corresponding to the user identifier.
S602: the server acquires a first virtual speech matched with the interactive scene corresponding to the interactive scene identification; and the first virtual speech is obtained by simulating a real speech in an interactive scene corresponding to the interactive scene identifier.
S603: and the server replaces the user name in the first virtual speech with the user name corresponding to the user identifier to obtain the target virtual speech.
S604: and the server sends the target virtual speech to the client in the live broadcast room.
S605: and the client in the live broadcast room receives the target virtual speech and outputs the target virtual speech to the live broadcast room interface.
In this embodiment, the virtual speaking method in the live broadcast room is described with both the server and the client as execution subjects. For the related description of the specific steps, reference may be made to the foregoing embodiments, which is not repeated here.
Please refer to fig. 15, which is a schematic structural diagram of a virtual speaking device in a live broadcast room according to a seventh embodiment of the present application. The apparatus may be implemented as all or part of a server in software, hardware, or a combination of both. The device 15 comprises:
a first response unit 151, configured to respond to a live broadcast interaction instruction, parse the live broadcast interaction instruction, and obtain an interaction scene identifier, a user identifier, and a user name corresponding to the user identifier;
a first obtaining unit 152, configured to obtain a first virtual speech matched with the interactive scene corresponding to the interactive scene identifier; the first virtual speech is obtained by simulating a real speech in an interactive scene corresponding to the interactive scene identifier;
a first replacing unit 153, configured to replace the user name in the first virtual speech with the user name corresponding to the user identifier to obtain a target virtual speech;
a first output unit 154, configured to send the target virtual speech to a client in a live broadcast room, so that the client in the live broadcast room outputs the target virtual speech to a live broadcast room interface.
In the embodiment of the present application, the virtual speaking device in the live broadcast room is applied to a server. It should be noted that, when the virtual speaking device in the live broadcast room executes the virtual speaking method in the live broadcast room, the above-mentioned division of the function modules is merely used as an example, and in practical applications, the above-mentioned function distribution may be completed by different function modules according to needs, that is, the internal structure of the device is divided into different function modules, so as to complete all or part of the above-mentioned functions. In addition, the virtual speech device in the live broadcast room and the virtual speech method in the live broadcast room provided by the above embodiments belong to the same concept, and details of implementation processes thereof are referred to in the method embodiments and are not described herein again.
Please refer to fig. 16, which is a schematic structural diagram of a computer device according to an eighth embodiment of the present application. As shown in fig. 16, the computer device 16 may include: a processor 160, a memory 161, and a computer program 162 stored in the memory 161 and executable on the processor 160, such as: a virtual speaking program in the live broadcast room; the steps in the first to sixth embodiments are implemented by the processor 160 executing the computer program 162.
The processor 160 may include one or more processing cores. The processor 160 is connected to various components within the computer device 16 using various interfaces and lines, and performs the various functions of the computer device 16 and processes data by running or executing instructions, programs, code sets, or instruction sets stored in the memory 161 and by calling data in the memory 161. Alternatively, the processor 160 may be implemented in at least one hardware form of Digital Signal Processing (DSP), Field-Programmable Gate Array (FPGA), or Programmable Logic Array (PLA). The processor 160 may integrate one or more of a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a modem, and the like. The CPU mainly handles the operating system, the user interface, application programs and the like; the GPU is responsible for rendering and drawing the content to be displayed on the touch display screen; and the modem is used to handle wireless communications. It is understood that the modem may not be integrated into the processor 160, and may instead be implemented by a separate chip.
The memory 161 may include a Random Access Memory (RAM) or a Read-Only Memory (ROM). Optionally, the memory 161 includes a non-transitory computer-readable medium. The memory 161 may be used to store instructions, programs, code sets, or instruction sets. The memory 161 may include a program storage area and a data storage area, wherein the program storage area may store instructions for implementing an operating system, instructions for at least one function (such as touch instructions), instructions for implementing the various method embodiments described above, and the like; and the data storage area may store the data referred to in the above method embodiments. The memory 161 may optionally be at least one storage device located remotely from the processor 160.
The embodiment of the present application further provides a computer storage medium, where the computer storage medium may store a plurality of instructions, where the instructions are suitable for being loaded by a processor and executing the method steps of the foregoing embodiment, and a specific execution process may refer to specific descriptions of the foregoing embodiment, which is not described herein again.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules, so as to perform all or part of the functions described above. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working processes of the units and modules in the system may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus/terminal device and method may be implemented in other ways. For example, the above-described apparatus/terminal device embodiments are merely illustrative; the division of modules or units is merely a logical function division, and there may be other division manners in actual implementation, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated modules/units, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer readable storage medium. Based on such understanding, all or part of the flow in the method according to the embodiments of the present invention may also be implemented by a computer program, which may be stored in a computer-readable storage medium and used by a processor to implement the steps of the above-described embodiments of the method. Wherein the computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form, etc.
The present invention is not limited to the above-described embodiments, and various modifications and variations of the present invention are intended to be included within the scope of the claims and the equivalent technology of the present invention if they do not depart from the spirit and scope of the present invention.

Claims (18)

1. A method of virtual speaking in a live broadcast room, the method comprising the steps of:
responding to a live broadcast interaction instruction, analyzing the live broadcast interaction instruction, and acquiring an interaction scene identifier, a user identifier and a user name corresponding to the user identifier;
acquiring a first virtual speech matched with the interactive scene corresponding to the interactive scene identifier; the first virtual speech is obtained by simulating a real speech in an interactive scene corresponding to the interactive scene identifier;
replacing the user name in the first virtual speech with the user name corresponding to the user identifier to obtain a target virtual speech;
and sending the target virtual speech to a client in a live broadcast room, so that the client in the live broadcast room outputs the target virtual speech to a live broadcast room interface.
2. The method for virtual speech in a live broadcast room according to claim 1, wherein the step of obtaining the first virtual speech matched with the interactive scene corresponding to the interactive scene identifier includes:
acquiring an interactive keyword corresponding to the interactive scene identifier;
inputting the interaction keywords corresponding to the interaction scene identification into a pre-trained virtual speech generation network model, and acquiring the first virtual speech matched with the interaction scene corresponding to the interaction scene identification; wherein the first virtual utterance includes at least the interactive keyword or a keyword similar to the semantic meaning of the interactive keyword; the training data of the pre-trained virtual speech generation network model at least comprises real speech under a plurality of interactive scenes.
3. The virtual speech method in a live broadcast room according to claim 2, wherein before the interactive keyword corresponding to the interactive scene identifier is input to a pre-trained virtual speech generation network model and the first virtual speech matched with the interactive scene corresponding to the interactive scene identifier is obtained, the method comprises the following steps:
acquiring the real speech under a plurality of interactive scenes and interactive keywords under a plurality of interactive scenes, inputting the interactive keywords under a plurality of interactive scenes into a virtual speech generation network model, and acquiring virtual speech under a plurality of interactive scenes;
iteratively training a virtual speech discrimination network model according to the real speech, the virtual speech, a preset first loss function and a preset first model optimization algorithm, and optimizing trainable parameters in the virtual speech discrimination network model until the value of the first loss function meets a preset first training termination condition to obtain a currently trained virtual speech discrimination network model;
modifying the label of the virtual speech into true, inputting the virtual speech into the currently trained virtual speech discrimination network model, and acquiring the discrimination result of the virtual speech;
if the identification result of the virtual speech meets a preset second training termination condition, obtaining the pre-trained virtual speech generation network model and the pre-trained virtual speech identification network model;
if the identification result of the virtual speech does not meet the preset second training termination condition, obtaining a value of a second loss function according to the identification result of the virtual speech, the label of the virtual speech and a preset second loss function, and optimizing trainable parameters of the virtual speech generation network model according to the value of the second loss function and a preset second model optimization algorithm to obtain a currently trained virtual speech generation network model;
inputting a plurality of interactive keywords in the interactive scene into the currently trained virtual utterance generation network model, re-acquiring a plurality of virtual utterances in the interactive scene, and repeatedly executing the step of iteratively training the virtual utterance identification network model and the step of optimizing trainable parameters of the virtual utterance generation network model until the identification result of the first virtual utterance meets the preset second training termination condition, so as to obtain the pre-trained virtual utterance generation network model and the pre-trained virtual utterance identification network model.
4. A virtual speaking method in a live broadcast room according to any one of claims 1-3, wherein said responding to the live broadcast interactive instruction is preceded by the steps of:
and acquiring a real question utterance in the live broadcast room and a user identifier corresponding to a user sending the real question utterance, if the real answer utterance related to the real question utterance is not output within a preset answer time limit, generating a live broadcast interaction instruction according to the user identifier and an interaction scene identifier corresponding to a question scene.
5. The method for virtual speech in a live broadcast room according to claim 4, wherein the step of obtaining the first virtual speech matched with the interactive scene corresponding to the interactive scene identifier includes:
if the interactive scene corresponding to the interactive scene identification is a question scene, acquiring the real question utterance and a first virtual reply utterance matched with the real question utterance; wherein the first virtual reply utterance is obtained by simulating a real reply utterance about the real question utterance in the question scene;
the replacing the user name in the first virtual speech with the user name corresponding to the user identifier to obtain the target virtual speech comprises the following steps:
and replacing the user name in the first virtual reply utterance with the user name corresponding to the user identifier to obtain the target virtual speech.
6. The virtual speaking method in the live broadcast room as claimed in claim 5, wherein the step of obtaining the real question speech and the first virtual reply speech matching with the real question speech includes the steps of:
acquiring question keywords corresponding to the real question and speech;
inputting the question keywords corresponding to the real question-asking speech into a pre-trained virtual speech generation network model, and acquiring a first virtual reply speech matched with the real question-asking speech; training data of the pre-trained virtual utterance generation network model at least comprise real reply utterances about a plurality of real question utterances in the question scene.
7. A method of virtual speaking in a live broadcast room according to any one of claims 1 to 3, wherein the live interactive command is a virtual gift giving command, the method further comprising the steps of:
if the virtual gift giving instruction corresponding to the same user identifier is continuously responded, a first quantity of continuously given target virtual gifts of the user corresponding to the user identifier is obtained, and sending continuous sending prompt information matched with the first quantity to a client corresponding to the user identifier, so that the client corresponding to the user identifier receives the continuous sending prompt information and outputs the continuous sending prompt information to the live broadcast room interface; the continuous sending prompt information at least comprises a second quantity and prompt contents corresponding to the target process, wherein the second quantity is the quantity which needs to give the target virtual gift continuously when the server is triggered to execute the target process.
8. A method of virtual speaking in a live broadcast room according to any one of claims 1 to 3, wherein the live interactive command is a virtual gift giving command, the method further comprising the steps of:
acquiring an identity tag corresponding to the user identifier, if the identity tag does not comprise a preset identity tag, generating identity opening prompt information, sending the identity opening prompt information to the client corresponding to the user identifier, enabling the client corresponding to the user identifier to receive the identity opening prompt information, and outputting the identity opening prompt information to the live broadcast room interface; and the identity opening prompt information at least comprises an identity name corresponding to the preset identity tag.
9. A method of virtual speech within a live broadcast room as claimed in any one of claims 1 to 3, characterized in that the method further comprises the steps of:
obtaining the activity of each live broadcast room, if the activity does not meet a preset activity threshold, obtaining a second virtual speech, and sending the second virtual speech to a client in the live broadcast room, so that the client in the live broadcast room outputs the second virtual speech to a live broadcast room interface; wherein the second virtual speech is obtained by simulating a real speech related to a current hot topic.
10. The virtual speaking method in the live broadcast room as claimed in claim 9, wherein the step of obtaining the second virtual speech comprises the following steps:
obtaining topic keywords corresponding to the current hot topic;
inputting the topic keywords into a pre-trained virtual speech generation network model, and acquiring a second virtual speech matched with the current hot topic; wherein the second virtual utterance includes at least the topic keyword or a keyword similar to the topic keyword in semantic meaning; training data of the pre-trained virtual utterance generation network model at least includes real utterances related to the current hot topic.
11. The method for virtual speech in a live broadcast room according to any one of claims 1 to 3, wherein the step of sending the target virtual speech to a client in the live broadcast room so that the client in the live broadcast room outputs the target virtual speech to a live broadcast room interface comprises:
sending the target virtual speech to a client in the live broadcast room, and enabling the target client in the live broadcast room to output the target virtual speech to an interface of the live broadcast room according to a preset style; the preset style is that the display background of the target virtual speech is different from the display background of the real speech, and the target user corresponding to the target client at least comprises an anchor.
12. A method of virtual speech within a live broadcast room as claimed in claim 11, characterized in that the method further comprises the steps of:
receiving evaluation data of the target virtual speech sent by the target client; the evaluation data is received by the target client through an evaluation control, and the evaluation control is displayed in the live broadcast interface when the target user successfully triggers the target virtual speech;
and optimizing a pre-trained virtual speech generation network model and a pre-trained virtual speech discrimination network model according to the target virtual speech and the evaluation data of the target virtual speech.
13. A method of virtual speech within a live broadcast room as claimed in any one of claims 1 to 3, characterized in that the method further comprises the steps of:
acquiring the number of users and the number of real speeches in each live broadcast room, sending a speech play starting instruction to audience client sides in the live broadcast rooms when the number of users and the number of real speeches meet preset speech play starting conditions, enabling the audience client sides in the live broadcast rooms to respond to the speech play starting instruction, respectively storing judgment records of the users on the speeches output in interfaces of the live broadcast rooms, and respectively sending the judgment records to the server when responding to a speech play ending instruction sent by the server; the judgment record comprises a user identifier corresponding to the user and a judgment result of the user on the speech, the judgment result is received by the audience client through a judgment control, and the judgment control is displayed in the live broadcast interface when the user successfully triggers the speech;
receiving the judgment record, obtaining a speech play score corresponding to the user identifier according to a judgment result of the user on each speech in the judgment record, and sending the speech play scores corresponding to all the user identifiers to a client in the live broadcast room, so that the client in the live broadcast room displays the speech play scores corresponding to all the user identifiers in an interface of the live broadcast room.
14. A method of virtual speech within a live broadcast room as claimed in any one of claims 1 to 3, characterized in that the method further comprises the steps of:
obtaining a speech output in the interface of the live broadcast room, and inputting the speech output in the interface of the live broadcast room into a pre-trained virtual speech discrimination network model to obtain a speech discrimination result; the pre-trained virtual speech discrimination network model is obtained by joint training with a virtual speech generation network model;
and judging whether the user in the live broadcast room adopts a talk plug-in program or not according to the identification result of the talk.
15. A method for virtual speech in a live broadcast room, comprising:
the server responds to a live broadcast interaction instruction, analyzes the live broadcast interaction instruction, and acquires an interaction scene identifier, a user identifier and a user name corresponding to the user identifier;
the server acquires a first virtual speech matched with the interactive scene corresponding to the interactive scene identification; the first virtual speech is obtained by simulating a real speech in an interactive scene corresponding to the interactive scene identifier;
the server replaces the user name in the first virtual speech with the user name corresponding to the user identifier to obtain a target virtual speech;
the server sends the target virtual speech to a client in a live broadcast room;
and the client in the live broadcast room receives the target virtual speech and outputs the target virtual speech to a live broadcast room interface.
16. A virtual speaking device in a live broadcast room, comprising:
the first response unit is used for responding to a live broadcast interaction instruction, analyzing the live broadcast interaction instruction, and acquiring an interaction scene identifier, a user identifier and a user name corresponding to the user identifier;
a first obtaining unit, configured to obtain a first virtual speech matched with the interactive scene corresponding to the interactive scene identifier; the first virtual speech is obtained by simulating a real speech in an interactive scene corresponding to the interactive scene identifier;
a first replacing unit, configured to replace the user name in the first virtual speech with the user name corresponding to the user identifier to obtain a target virtual speech;
and the first output unit is used for sending the target virtual speech to a client in a live broadcast room, so that the client in the live broadcast room outputs the target virtual speech to a live broadcast room interface.
17. A computer device, comprising: processor, memory and computer program stored in the memory and executable on the processor, characterized in that the steps of the method according to any of claims 1 to 14 or 15 are implemented when the processor executes the computer program.
18. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 14 or 15.
CN202111193802.5A 2021-10-13 2021-10-13 Virtual speaking method and device in live broadcasting room and computer equipment Active CN113938697B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111193802.5A CN113938697B (en) 2021-10-13 2021-10-13 Virtual speaking method and device in live broadcasting room and computer equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111193802.5A CN113938697B (en) 2021-10-13 2021-10-13 Virtual speaking method and device in live broadcasting room and computer equipment

Publications (2)

Publication Number Publication Date
CN113938697A true CN113938697A (en) 2022-01-14
CN113938697B CN113938697B (en) 2024-03-12

Family

ID=79278717

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111193802.5A Active CN113938697B (en) 2021-10-13 2021-10-13 Virtual speaking method and device in live broadcasting room and computer equipment

Country Status (1)

Country Link
CN (1) CN113938697B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114630144A (en) * 2022-03-03 2022-06-14 广州方硅信息技术有限公司 Audio replacement method, system and device in live broadcast room and computer equipment

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104967898A (en) * 2015-06-29 2015-10-07 天脉聚源(北京)科技有限公司 Method and device for displaying speech made by virtual spectators
JP2018055548A (en) * 2016-09-30 2018-04-05 株式会社Nextremer Interactive device, learning device, interactive method, learning method, and program
CN106612465A (en) * 2016-12-22 2017-05-03 广州华多网络科技有限公司 Live interaction method and device
CN110708607A (en) * 2016-12-22 2020-01-17 广州华多网络科技有限公司 Live broadcast interaction method and device, electronic equipment and storage medium
JP2018181018A (en) * 2017-04-14 2018-11-15 株式会社エルブズ Conversation providing device, conversation providing method, and program
CN107241614A (en) * 2017-07-25 2017-10-10 广州久邦世纪科技有限公司 A kind of live broadcast system
CN110099284A (en) * 2019-05-15 2019-08-06 广州华多网络科技有限公司 A kind of generation method fast made a speech, device and server

Also Published As

Publication number Publication date
CN113938697B (en) 2024-03-12

Similar Documents

Publication Publication Date Title
US11113080B2 (en) Context based adaptive virtual reality (VR) assistant in VR environments
US9621851B2 (en) Augmenting web conferences via text extracted from audio content
US11247134B2 (en) Message push method and apparatus, device, and storage medium
CN114025186A (en) Virtual voice interaction method and device in live broadcast room and computer equipment
KR20140094282A (en) Method and system for providing multi-user messenger service
WO2021196614A1 (en) Information interaction method, interaction apparatus, electronic device and storage medium
CN110609970B (en) User identity identification method and device, storage medium and electronic equipment
CN111294606B (en) Live broadcast processing method and device, live broadcast client and medium
CN109885277A (en) Human-computer interaction device, mthods, systems and devices
CN112423081B (en) Video data processing method, device and equipment and readable storage medium
CN113438492B (en) Method, system, computer device and storage medium for generating title in live broadcast
CN113938697B (en) Virtual speaking method and device in live broadcasting room and computer equipment
CN114449301B (en) Item sending method, item sending device, electronic equipment and computer-readable storage medium
CN114501103B (en) Live video-based interaction method, device, equipment and storage medium
US11318373B2 (en) Natural speech data generation systems and methods
US10943380B1 (en) Systems and methods for pushing content
US20210049161A1 (en) Systems and methods for pushing content
CN117376653A (en) Live interaction abstract generation method and device, storage medium and electronic equipment
CN113438491B (en) Live broadcast interaction method and device, server and storage medium
CN114827642B (en) Live broadcasting room approach method, device, computer equipment and readable storage medium
WO2024032111A9 (en) Data processing method and apparatus for online conference, and device, medium and product
CN114820034A (en) Black product user identification method and device, storage medium and computer equipment
US20220301250A1 (en) Avatar-based interaction service method and apparatus
CN114640863A (en) Method, system and device for displaying character information in live broadcast room and computer equipment
CN114630154A (en) Live broadcast audience importing method, system and device and computer equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant