CN111935555A - Live broadcast interaction method, device, system, equipment and storage medium - Google Patents


Info

Publication number
CN111935555A
CN111935555A (application CN202010843274.2A)
Authority
CN
China
Prior art keywords
audio
target
user
live broadcast
request
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010843274.2A
Other languages
Chinese (zh)
Other versions
CN111935555B (en)
Inventor
徐冬博
黄靖鸿
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202010843274.2A priority Critical patent/CN111935555B/en
Publication of CN111935555A publication Critical patent/CN111935555A/en
Application granted granted Critical
Publication of CN111935555B publication Critical patent/CN111935555B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40: Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47: End-user applications
    • H04N21/478: Supplemental services, e.g. displaying phone caller identification, shopping application
    • H04N21/4788: Supplemental services communicating with other users, e.g. chatting
    • H04N21/475: End-user interface for inputting end-user data, e.g. personal identification number [PIN], preference data
    • H04N21/4758: End-user interface for inputting end-user data for providing answers, e.g. voting

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Human Computer Interaction (AREA)
  • General Engineering & Computer Science (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

Embodiments of the present application disclose a live broadcast interaction method, apparatus, system, device, and storage medium. The method includes: receiving first audio uploaded by a first user in a target live broadcast room, the first audio corresponding to a first portion of a target audio; sending the first audio to each second user in the target live broadcast room; receiving an audio upload request initiated by a second user, the request indicating that the second user requests to upload second audio based on the first audio, the second audio corresponding to a second portion of the target audio; determining a target second user among the second users in the target live broadcast room based on the received audio upload requests; receiving the second audio uploaded by the target second user; and determining the target second user's score for the target audio according to the matching degree between the second audio and the standard audio corresponding to the second portion of the target audio. The method can enhance the sense of interaction among users in a webcast and improve the interactive experience.

Description

Live broadcast interaction method, device, system, equipment and storage medium
Technical Field
The present application relates to the field of internet technologies, and in particular, to a live broadcast interaction method, apparatus, system, device, and storage medium.
Background
With the rapid development of internet technology, webcasts, mainly live video and live voice broadcasts, have become part of people's daily work and life; the various forms of webcast bring people diversified information and fresh entertainment experiences.
However, in current webcasts the interaction between the anchor and viewing users is limited: in general, a viewing user can only interact with the anchor by sending bullet-screen comments, or can chat with the anchor over a voice connection after paying a certain amount of virtual rewards (such as gifts). In practice, such interaction is unsatisfying for both the anchor and the viewing users, which greatly reduces viewers' stickiness to the live broadcast room, shrinks the live audience, and may even hinder the development of webcasting.
In summary, how to improve the interactive experience among users in a webcast has become an urgent problem to be solved.
Disclosure of Invention
Embodiments of the present application provide a live broadcast interaction method, apparatus, system, device, and storage medium, which can enhance the sense of interaction among users in a webcast and improve the interactive experience between the anchor and viewing users.
In view of this, a first aspect of the present application provides a live broadcast interaction method, where the method includes:
receiving first audio uploaded by a first user in a target live broadcast room, wherein the first audio corresponds to a first part of the target audio;
sending the first audio to each second user in the target live broadcast room;
receiving an audio upload request initiated by the second user, the audio upload request indicating that the second user requests to upload second audio based on the first audio, the second audio corresponding to a second portion of the target audio;
determining a target second user in each second user in the target live broadcast room based on the audio uploading request;
receiving the second audio uploaded by the target second user based on the first audio;
and determining the score of the target second user for the target audio according to the matching degree between the second audio and the standard audio corresponding to the second part of the target audio.
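The six steps above can be sketched as a minimal server-side flow. Every class, method, and variable name below is an illustrative assumption, not drawn from the patent, and the matching-degree scoring of the last step is left abstract:

```python
class LiveInteractionServer:
    """Minimal sketch of the first-aspect flow; all names are invented
    for illustration, and the matching-degree scoring is left abstract."""

    def __init__(self, second_users):
        self.second_users = set(second_users)  # second users in the target room
        self.upload_requests = []              # (receive_time, second_user_id)
        self.first_audio = None

    def receive_first_audio(self, first_audio):
        # Steps 1-2: receive the first audio and fan it out to each second user.
        self.first_audio = first_audio
        return {user: first_audio for user in self.second_users}  # simulated send

    def receive_upload_request(self, receive_time, second_user_id):
        # Step 3: record an audio upload request initiated by a second user.
        self.upload_requests.append((receive_time, second_user_id))

    def determine_target_second_user(self):
        # Step 4: one plausible rule (the description later uses it) is that
        # the earliest-received request wins the right to upload second audio.
        return min(self.upload_requests)[1] if self.upload_requests else None
```

For instance, if requests arrive at t=10.2 from one anchor and t=10.5 from another, the earlier requester is selected as the target second user.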
A second aspect of the present application provides a live broadcast interaction apparatus, the apparatus including:
the first audio receiving module is used for receiving first audio uploaded by a first user in a target live broadcast room, wherein the first audio corresponds to a first part of the target audio;
the first audio sending module is used for sending the first audio to each second user in the target live broadcast room;
an audio upload request receiving module, configured to receive an audio upload request initiated by the second user, where the audio upload request indicates that the second user requests to upload second audio based on the first audio, the second audio corresponding to a second portion of the target audio;
the target second user determination module is used for determining a target second user in each second user in the target live broadcast room based on the audio uploading request;
a second audio receiving module, configured to receive the second audio uploaded by the target second user based on the first audio;
and the second audio scoring module is used for determining the score of the target second user for the target audio according to the matching degree between the second audio and the standard audio corresponding to the second part of the target audio.
A third aspect of the present application provides a live interactive system, the system comprising: the system comprises a first terminal facing a first user, a second terminal facing a second user and a server;
the first terminal is used for receiving a first audio input by the first user in a target live broadcast room and uploading the first audio to the server, wherein the first audio corresponds to a first part of the target audio;
the second terminal is configured to, after receiving the first audio sent by the server, respond to an audio upload operation triggered by the second user by generating an audio upload request and sending it to the server; the audio upload request indicates that the second user requests to upload second audio based on the first audio; the second audio corresponds to a second portion of the target audio;
the second terminal is further configured to receive the second audio input by the target second user based on the first audio and upload the second audio to the server when it is determined that the second user is the target second user;
the server is configured to execute the steps of the live broadcast interaction method according to the first aspect.
A fourth aspect of the present application provides an apparatus comprising a processor and a memory:
the memory is used for storing a computer program;
the processor is configured to perform the steps of the live interaction method according to the first aspect.
A fifth aspect of the present application provides a computer-readable storage medium for storing a computer program for executing the steps of the live interaction method of the first aspect.
A sixth aspect of the application provides a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the steps of the live interaction method according to the first aspect.
According to the technical scheme, the embodiment of the application has the following advantages:
Embodiments of the present application provide a live broadcast interaction method that creatively introduces a new form of live interaction: users in a live broadcast room interact by grabbing the microphone to continue a song. Specifically, a first user in a target live broadcast room uploads first audio, corresponding to a first portion of the target audio, to a server; the server then sends the first audio to each second user in the target live broadcast room. After receiving the first audio, a second user may initiate an audio upload request to the server, requesting to upload second audio corresponding to a second portion of the target audio based on the first audio. The server then determines a target second user among the second users in the target live broadcast room based on the received audio upload requests, and receives the second audio uploaded by the target second user based on the first audio. Further, the server may determine the target second user's score for the target audio according to the matching degree between the uploaded second audio and the standard audio corresponding to the second portion of the target audio. In this way, the first user and the second users in the target live broadcast room can interact live by grabbing the microphone and continuing the audio (for example, continuing a song); both sides fully participate in the interaction, which enhances their sense of interaction and improves the interactive experience.
Drawings
Fig. 1 is a schematic view of a working principle of a live broadcast interactive system provided in an embodiment of the present application;
fig. 2 is an interaction signaling diagram of a live broadcast interaction method according to an embodiment of the present application;
fig. 3 is a schematic view of a live broadcast interactive interface of a live broadcast APP provided in the embodiment of the present application;
FIG. 4 is a flowchart illustrating a process for determining a score of a user for a target audio according to an embodiment of the present application;
FIG. 5 is a schematic diagram illustrating a process for determining a score of lyrics according to an embodiment of the present application;
FIG. 6 is a schematic diagram of the working principle of the LSTM model provided in the embodiment of the present application;
fig. 7 is a schematic structural diagram of a first live broadcast interaction device according to an embodiment of the present disclosure;
fig. 8 is a schematic structural diagram of a second live broadcast interaction device according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of a third live broadcast interaction device according to an embodiment of the present application;
fig. 10 is a schematic structural diagram of a fourth live interactive apparatus according to an embodiment of the present application;
fig. 11 is a schematic structural diagram of a fifth live broadcast interaction device according to an embodiment of the present application;
fig. 12 is a schematic structural diagram of a server according to an embodiment of the present application.
Detailed Description
In order to make the technical solutions of the present application better understood, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The terms "first," "second," "third," "fourth," and the like in the description and in the claims of the present application and in the drawings described above, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
In view of the problems in the related art, such as the limited forms of user interaction in webcasts, poor interactive experience, and weak sense of interaction, an embodiment of the present application provides a live broadcast interaction method.
Specifically, in the live broadcast interaction method provided by the embodiment of the present application, a server receives first audio uploaded by a first user in a target live broadcast room, the first audio corresponding to a first portion of a target audio; the server then sends the received first audio to each second user in the target live broadcast room; after receiving the first audio, a second user may initiate an audio upload request to the server, requesting to upload second audio corresponding to the second portion of the target audio based on the first audio; the server then determines a target second user among the second users in the target live broadcast room based on the received audio upload requests, and receives the second audio uploaded by the target second user based on the first audio; further, the server determines the target second user's score for the target audio according to the matching degree between the second audio and the standard audio corresponding to the second portion of the target audio.
In the above live broadcast interaction method, the first user and the second users in the target live broadcast room can interact live by grabbing the microphone and continuing the audio (for example, continuing a song); in this process both sides fully participate, which greatly enhances their sense of interaction and improves their interactive experience.
It should be understood that the live broadcast interaction method provided by the embodiment of the present application may be applied to a background server of a live broadcast application, where the server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server for providing a live broadcast interaction service.
In order to facilitate understanding of the live broadcast interaction method provided by the embodiment of the present application, a live broadcast interaction system provided by the embodiment of the present application is introduced below in combination with an application scenario to which the live broadcast interaction method is applicable.
Referring to fig. 1, fig. 1 is a schematic structural diagram of a live interactive system provided in the embodiment of the present application. As shown in fig. 1, the live interactive system includes a first terminal 110, a second terminal 120 and a server 130, and the first terminal 110 and the second terminal 120 can communicate with the server 130 through a wired or wireless network.
It should be noted that a live broadcast application (APP) runs on both the first terminal 110 and the second terminal 120; a first user is logged into the live APP on the first terminal 110, and a second user is logged into the live APP on the second terminal 120. For example, the first user may be a viewing user in a target live broadcast room and the second user an online anchor in that room. In the live interactive system provided in the embodiment of the present application, multiple online anchors may be present in the target live broadcast room at the same time (in which case the system includes multiple second terminals 120), and these online anchors and the viewing user jointly carry out the audio-continuation interaction.
In the live interactive system provided in the embodiment of the present application, the first terminal 110 is configured to receive a first audio input by a first user in a target live broadcast room, and upload the first audio to the server 130, where the first audio corresponds to a first part of the target audio.
Taking the first user as a viewing user in the target live broadcast room and the target audio as a target song as an example: the viewing user may choose to play song continuation with an online anchor in the room. The server 130 sends the first portion of the target song, to be sung by the viewing user, to the first terminal 110. Following the prompts given by the first terminal 110 (such as a start-singing prompt, an end-singing prompt, and the lyrics of the first portion of the target song), the viewing user inputs first audio corresponding to the first portion of the target song; once the input is finished, the first terminal 110 sends the received first audio to the server 130.
Optionally, in order to further improve the live broadcast interaction experience of the first user and ensure that the live broadcast interaction experience better meets the personal requirements of the first user, the live broadcast interaction system provided in the embodiment of the present application may further provide a plurality of candidate audio lists for the first user, so that the first user may select an audio list from which the first user is based when performing live broadcast interaction with the second user in the target live broadcast room. I.e. the first terminal 110 is also adapted to:
responding to the triggered target activity participation operation, and displaying a plurality of candidate audio lists and virtual reward amounts corresponding to the candidate audio lists; generating an audio list selection request for representing a target audio list selected by the first user in the plurality of candidate audio lists and sending the audio list selection request to the server 130 in response to the triggered audio list selection operation; in response to the triggered virtual reward payment operation, a virtual reward payment request is generated and sent to the server 130, the virtual reward payment request is used for representing that the first user pays a target virtual reward, and the target virtual reward is matched with a virtual reward amount corresponding to the target audio list.
Still taking the first user as the viewing user in the target live broadcast room and the target audio as a target song: after detecting that the viewing user has triggered the operation of joining the song-continuation activity in the room (i.e., the target activity participation operation), the first terminal 110 displays the candidate song lists (i.e., candidate audio lists) supported by the target live broadcast room and the gift amount (i.e., virtual reward amount) corresponding to each. The viewing user then selects a target song list from the candidates according to actual needs (i.e., triggers an audio list selection operation); the songs in the target song list are the target songs on which the viewing user and an online anchor will base their song-continuation interaction. After detecting the selection, the first terminal 110 generates a song list selection request (i.e., an audio list selection request) and sends it to the server 130 to inform it of the chosen target song list.
In addition, the viewing user needs to pay the gift corresponding to the selected target song list (i.e., trigger a virtual reward payment operation). After detecting the gift payment operation, the first terminal 110 generates a gift payment request (i.e., a virtual reward payment request) and sends it to the server 130. On receiving the request, the server 130 deducts the gift amount corresponding to the target song list from the viewing user's virtual account and sends a feedback message to the first terminal 110 to inform the viewing user that live interaction based on the target song list may proceed.
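The song list selection and gift payment exchanges described above might be carried as request payloads like the following; every field name here is an assumption for illustration, not defined by the patent:

```python
# Illustrative payloads for the song list selection and gift payment requests;
# all field names are assumptions for illustration.

def make_list_selection_request(user_id, room_id, target_list_id):
    # Informs the server which target song list the viewing user selected.
    return {"type": "audio_list_selection", "user": user_id,
            "room": room_id, "list": target_list_id}

def make_reward_payment_request(user_id, target_list_id, amount):
    # Represents paying the target virtual reward matched to the chosen list.
    return {"type": "virtual_reward_payment", "user": user_id,
            "list": target_list_id, "amount": amount}

def settle_payment(virtual_balance, payment_request):
    # Server side: deduct the gift amount corresponding to the target list
    # from the viewing user's virtual account.
    if virtual_balance < payment_request["amount"]:
        raise ValueError("insufficient virtual balance")
    return virtual_balance - payment_request["amount"]
```

After a successful deduction, the server would send its feedback message so the viewing user can proceed with the interaction based on the chosen list.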
In the live broadcast interactive system provided in this embodiment, the second terminal 120 is configured to respond to an audio upload operation triggered by a second user after receiving a first audio sent by the server 130, generate an audio upload request, and send the audio upload request to the server 130, where the audio upload request is used to represent that the second user requests to upload a second audio based on the first audio, and the second audio corresponds to a second part of the target audio.
Accordingly, after receiving the audio upload request sent by the second terminal 120, the server 130 may select a target second user from the second users in the target live broadcast room based on the received audio upload request. Specifically, when the server 130 receives a plurality of audio upload requests initiated by a plurality of second users in the target live broadcast room, the server 130 may determine, according to the respective corresponding receiving times of the plurality of audio upload requests, a second user corresponding to the audio upload request with the earliest receiving time as a target second user qualified to upload the second audio, and further return an audio upload response message to the second terminal 120 used by the target second user to notify the target second user that the target second user is qualified to upload the second audio.
The second terminal 120 further receives a second audio input by the target second user based on the first audio when determining that the user (i.e., the second user) is the target second user, and uploads the second audio to the server 130.
Still taking the second user as an online anchor in the target live broadcast room and the target audio as a target song: when multiple online anchors are present in the room at the same time, the server 130, after receiving the first audio uploaded by the viewing user through the first terminal 110, forwards it to each online anchor in the room, that is, to the second terminal 120 of each online anchor. After receiving the first audio, a second terminal 120 plays it, and its online anchor may touch the sing-grabbing control on the interface (i.e., trigger an audio upload operation) within the grab window (e.g., within 15 s) according to the anchor's actual situation. After detecting the touch, the second terminal 120 generates a singing continuation request (i.e., an audio upload request) and sends it to the server 130.
If the server 130 receives singing continuation requests from multiple second terminals 120 within the grab window, it needs to determine which online anchor (i.e., the target second user) is qualified to continue singing according to the times at which the requests were received; for example, it may decide that the online anchor whose request was received earliest is qualified. The server 130 then generates a corresponding singing continuation response message (i.e., an audio upload response message) for each request it received: for the earliest request, a response indicating that the online anchor is qualified to continue singing; for the other requests, responses indicating that the corresponding anchors are not; and it feeds each response back to the corresponding second terminal 120.
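The earliest-request arbitration just described can be sketched as follows; the message shape and names are assumptions for illustration:

```python
# Sketch of the server's earliest-request arbitration; names are illustrative.

def arbitrate_pickup(requests):
    """requests: list of (receive_time, terminal_id) singing continuation
    requests received within the grab window. Returns one response message
    per terminal; only the earliest-received request is marked eligible."""
    if not requests:
        return {}
    winner = min(requests)[1]                 # earliest receive time wins
    return {terminal_id: {"eligible": terminal_id == winner}
            for _, terminal_id in requests}
```

Each second terminal would then read the `eligible` flag of its response to decide whether its anchor may input the second audio.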
After receiving the singing continuation response message returned by the server 130, the second terminal 120 determines from it whether to allow its online anchor to input the second audio. If the response indicates that the anchor is qualified, the terminal allows the anchor to continue singing the second portion of the target audio based on the first audio, receives the second audio input by the anchor, and sends it to the server 130; if the response indicates that the anchor is not qualified, the terminal prompts the anchor that the singing continuation was not won and the second audio cannot be input.
In the live broadcast interaction system provided in the embodiment of the present application, the server 130 is configured to execute the live broadcast interaction method provided in the embodiment of the present application. That is, the server 130 receives the first audio uploaded by the first user in the target live broadcast room, forwards it to the second users in the room, receives the audio upload requests initiated by second users, determines a target second user based on those requests, receives the second audio uploaded by the target second user based on the first audio, and determines the target second user's score for the target audio according to the matching degree between the second audio and the standard audio corresponding to the second portion of the target audio. The operations performed by the server 130 are described in detail in the method embodiments below.
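This excerpt does not specify how the matching degree is computed (the drawings hint at a lyric score and an LSTM model). As one hedged stand-in, a score could compare recognized lyric text against the standard lyrics of the second portion via normalized edit distance; everything below is an illustrative assumption, not the patent's method:

```python
def edit_distance(a, b):
    # Classic Levenshtein distance via dynamic programming.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def lyric_match_score(recognized, reference):
    # Score in [0, 100]; 100 means the recognized lyrics match the standard
    # lyrics of the second portion exactly.
    if not reference:
        return 0
    dist = edit_distance(recognized, reference)
    return max(0, round(100 * (1 - dist / max(len(recognized), len(reference)))))
```

A production system would more likely score pitch, rhythm, and recognized lyrics together; this sketch only illustrates the "higher matching degree, higher score" relationship.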
It should be understood that the structure of the live interactive system shown in fig. 1 is merely an example; in practical applications the system is not limited to that structure. For example, it may include multiple second terminals 120 or only one; likewise, the first terminal 110 and the second terminal 120 are not limited to the smartphones shown in fig. 1 and may also be terminal devices such as computers, tablet computers, or Personal Digital Assistants (PDAs). The structure of the live interactive system provided in the embodiment of the present application is not limited herein.
The live broadcast interaction method provided by the present application is described in detail below through a method embodiment.
In order to facilitate understanding of the live broadcast interaction method provided by the embodiment of the present application, the live broadcast interaction method provided by the embodiment of the present application is introduced in the form of interaction between the first terminal, the second terminal, and the server. Referring to fig. 2, fig. 2 is an interaction signaling diagram of a live broadcast interaction method provided in an embodiment of the present application, and as shown in fig. 2, the live broadcast interaction method includes the following steps:
step 201: the first terminal receives first audio input by a first user in a target live broadcast room, wherein the first audio corresponds to a first part of the target audio.
The embodiment of the application provides a live broadcast interaction mode for interconnecting audio in a live broadcast room, when live broadcast interaction is carried out based on the live broadcast interaction mode, a first user in a target live broadcast room can input first audio through a live broadcast APP running in a first terminal, and the first audio corresponds to a first part of the target audio.
The target live broadcast room can be a voice live broadcast room and also can be a video live broadcast room, and the form of the target live broadcast room is not limited at all.
The first user may be a watching user in a target live broadcast room, or an online anchor in the target live broadcast room, and the identity of the first user is not limited herein; typically, there is only one first user in a round of live interaction.
The target audio may be audio corresponding to any content, such as audio corresponding to a song, audio corresponding to an article, audio corresponding to a poem, and the like, and the content form corresponding to the target audio is not limited in any way herein.
Accordingly, the first portion of the target audio may be set according to the content form corresponding to the target audio and the actual service requirement. For example, if the target audio corresponds to a song, the first part of the target audio may correspond to the first four lines of lyrics of the climax part of the song; in this case, the first user needs to sing those four lines, and the first terminal records the sung audio as the first audio. Of course, in practical applications, other parts of the song may also be set as the first part of the target audio. For another example, if the target audio corresponds to a poem, the first part of the target audio may correspond to the first two sentences of the poem; in this case, the first user needs to recite the first two sentences, and the first terminal records the recited audio as the first audio. The first portion of the target audio is not limited in any way herein.
Illustratively, assume that the first user is a watching user in the target live broadcast room, the target audio is a target song, and the first part of the target audio corresponds to the first four lines of lyrics of the climax part of the target song. When the watching user participates in the song receiving interaction of the target live broadcast room, the watching user may sing the first part of the target song along with its accompaniment, guided by the prompt information displayed on the song receiving interaction interface of the live broadcast APP, such as the timing for starting singing, the timing for ending singing, the lyrics of the first part of the target song, and the current progress of the accompaniment of the target song. Meanwhile, the first terminal records the audio sung by the watching user as the first audio, and confirms that the watching user has completed the input of the first audio after detecting that the watching user has sung the fourth line of lyrics of the climax part of the target song, or after detecting that the timing for ending singing has been reached.
Optionally, in order to further improve the live broadcast interaction experience of the first user and ensure that the live broadcast interaction experience better meets the personal requirements of the first user, the live broadcast interaction method provided in the embodiment of the application may further provide a plurality of candidate audio lists for the first user, so that the first user may select an audio list according to which the first user performs live broadcast interaction with a second user in the target live broadcast room.
That is, the first terminal may respond to a target activity participation operation triggered by the first user, and display a plurality of candidate audio lists sent to the first terminal by the server and virtual reward amounts corresponding to the candidate audio lists respectively; furthermore, the first terminal can correspondingly generate an audio list selection request and a virtual reward payment request to be sent to the server in response to the audio list selection operation and the virtual reward payment operation triggered by the first user, wherein the audio list selection request is used for representing that the first user selects a target audio list from the candidate audio lists, and the virtual reward payment request is used for representing that the first user pays a target virtual reward, and the target virtual reward is matched with a virtual reward amount corresponding to the target audio list.
Each candidate audio list may include a plurality of selectable audios, and the plurality of selectable audios may all correspond to the same content format (e.g., all correspond to songs), or may respectively correspond to different content formats (e.g., respectively correspond to songs, poems, articles, etc.). For example, the server may provide the first user with candidate audio lists with three difficulty levels of bronze, silver, and king, where the three candidate audio lists with different difficulty levels respectively correspond to different virtual reward amounts.
For example, assuming that the first user is a viewing user in the target live broadcast room, the candidate audio lists are different candidate song lists, and after detecting that the viewing user triggers an operation of participating in a song receiving activity in the target live broadcast room (i.e., a target activity participation operation), the first terminal may display the candidate song lists supported by the target live broadcast room and the number of gifts (i.e., virtual award amounts) corresponding to the candidate song lists.
It should be understood that the candidate song lists supported by the target live broadcast room and the number of gifts corresponding to each of them may be sent to the first terminal by the server before the first terminal detects that the watching user triggers participation in the song receiving activity, for example, when it is detected that the watching user enters the target live broadcast room; or they may be fed back to the first terminal by the server in response to a song receiving request sent by the first terminal after the first terminal detects that the watching user triggers participation in the song receiving activity.
Furthermore, the watching user may select a target song list (that is, trigger the audio list selection operation) from the plurality of candidate song lists according to his or her actual requirement, where the songs in the target song list are the target songs according to which the watching user and the online anchor in the target live broadcast room perform song receiving interaction. After detecting that the watching user triggers the target song list selection operation, the first terminal correspondingly generates a song list selection request (that is, an audio list selection request) and sends it to the server, so as to inform the server that the watching user has selected the target song list. In addition, the watching user needs to pay a corresponding gift for the selected target song list (that is, trigger the virtual reward payment operation); after detecting that the watching user triggers the gift payment operation, the first terminal correspondingly generates a gift payment request (that is, a virtual reward payment request) and sends it to the server. After receiving the gift payment request, the server deducts the corresponding number of gifts (that is, the target virtual reward) from the virtual account corresponding to the watching user, namely the number of gifts corresponding to the target song list, and sends a feedback message to the first terminal to inform the watching user that live broadcast interaction may be performed based on the target song list.
It should be understood that, in practical applications, the first terminal may send an audio list selection request to the server first and then send a virtual reward payment request to the server in response to the audio list selection operation and the virtual reward payment operation that are sequentially triggered by the first user; the first terminal may also simultaneously transmit the audio list selection request and the virtual reward payment request to the server after detecting that the first user completes the audio list selection operation and the virtual reward payment operation, or may combine the audio list selection request and the virtual reward payment request into the same request to transmit to the server. The timing and manner of sending the audio list selection request and the virtual bonus payment request by the first terminal are not limited in any way.
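As noted above, the audio list selection request and the virtual reward payment request may be sent separately or merged into one request. A minimal sketch of the merged form follows; all field names are illustrative assumptions, not details taken from this application.

```python
# Illustrative sketch: bundling the audio list selection and the virtual
# reward payment into a single request body (all field names are assumptions).

def build_combined_request(user_id, target_list_id, reward_amount):
    """Combine list selection and reward payment into one request."""
    return {
        "type": "audio_list_select_and_pay",
        "user_id": user_id,
        "target_list_id": target_list_id,   # the candidate list the user chose
        "reward_amount": reward_amount,     # must match the list's posted amount
    }

req = build_combined_request("viewer_42", "list_silver", 30)
```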
Step 202: the method comprises the steps that a first terminal sends first audio input by a first user to a server; that is, the server receives first audio uploaded by a first user in a target live broadcast room.
After detecting that the first user completes the input of the first audio, the first terminal may send the first audio to the server through the network. For example, assuming that the first user is a watching user in the target live broadcast room and the first part of the target audio corresponds to the first four lines of lyrics of the climax part of the target song, the first terminal may confirm that the watching user has completed the input of the first audio when detecting that the watching user has sung the fourth line of lyrics of the climax part of the target song, or when detecting that the playing of the accompaniment corresponding to those four lines has been completed; the first terminal then sends the first audio input by the watching user to the server through the network.
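The two completion conditions just described (the last lyric line of the first part has been sung, or the accompaniment for the first part has finished playing) amount to a simple disjunction. A hedged sketch, with assumed parameter names:

```python
# Sketch of the first-terminal check for "input of the first audio completed":
# either all lyric lines of the first part were sung, or the accompaniment
# for the first part reached its end. Parameter names are assumptions.

def first_audio_complete(lines_sung, total_lines, accomp_pos_s, accomp_end_s):
    return lines_sung >= total_lines or accomp_pos_s >= accomp_end_s

done = first_audio_complete(lines_sung=4, total_lines=4,
                            accomp_pos_s=21.0, accomp_end_s=24.0)
```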
Step 203: the server sends the first audio to a second terminal, and the second terminal corresponds to each second user in the target live broadcast room; namely, the second terminal receives the first audio sent by the server.
After receiving the first audio uploaded by the first terminal, the server confirms second users in the same target live broadcast room with the first user, and then forwards the first audio to second terminals used by the second users in the target live broadcast room.
The second user may be an online anchor in the target live broadcast room, or may be a viewing user participating in live broadcast interaction in the target live broadcast room; the identity of the second user is not limited in this application. The identities of the second user and the first user may be the same or different: for example, the first user and the second user may both be watching users in the target live broadcast room; for another example, the first user may be a watching user participating in live broadcast interaction in the target live broadcast room, and the second user may be an online anchor in the target live broadcast room. In a round of live broadcast interaction, there may be one second user or there may be multiple second users. In addition, in practical applications, the online anchor and the watching users in the target live broadcast room may not be distinguished: the user uploading the first audio in the target live broadcast room is directly taken as the first user, and the other users in the target live broadcast room except the first user are taken as the second users. The first user and the second user are not limited in any way herein.
For example, assuming that the second user is an online anchor in the target live broadcast room and multiple online anchors exist in the target live broadcast room at the same time, the server needs to forward the first audio uploaded by the first user to each online anchor in the target live broadcast room at the same time.
Step 204: and the second terminal responds to the audio uploading operation triggered by the second user and generates an audio uploading request, wherein the audio uploading request is used for representing that the second user requests to upload second audio based on the first audio, and the second audio corresponds to a second part of the target audio.
Step 205: sending an audio uploading request to a server; that is, the server receives the audio upload request sent by the second terminal.
Step 206: and the server determines a target second user from the second users in the target live broadcast room based on the audio uploading request.
Step 207: and the server generates an audio uploading response message according to the determination result of the target second user.
Step 208: the server sends an audio uploading response message to the second terminal; that is, the second terminal receives the audio upload response message.
Since the steps 204 to 208 are strongly related, the overall implementation process of the steps 204 to 208 is described below.
In order to increase the interest of live broadcast interaction, under the condition that a plurality of second users exist in a target live broadcast room at the same time, the live broadcast interaction method provided by the embodiment of the application can be used for setting a link that the plurality of second users contend for the qualification of inputting the second audio. That is, after receiving a first audio sent by the server, the second terminal responds to an audio uploading operation triggered by the second user within a preset time period, generates an audio uploading request and sends the audio uploading request to the server, wherein the audio uploading request is used for representing that the second user requests to upload a second audio based on the first audio.
If the server receives a plurality of audio uploading requests within a preset time period, determining a second user corresponding to the audio uploading request with the earliest receiving time as a target second user according to the respective receiving time of the plurality of audio uploading requests; and based on the determination result of the target second user, respectively generating corresponding audio uploading response messages aiming at the plurality of received audio uploading requests, wherein the audio uploading response messages are used for representing whether the second user is qualified for uploading second audio (namely whether the second user is the target second user corresponding to the target audio), and correspondingly returning each generated audio uploading response message to each second terminal.
Specifically, the server may generate a corresponding audio upload response message for each received audio upload request, that is, for the audio upload request with the earliest reception time, an audio upload response message for indicating that the second user is qualified to upload the second audio may be generated, and for other audio upload requests, an audio upload response message for indicating that the second user is not qualified to upload the second audio may be generated. And further, the generated plurality of audio uploading response messages are correspondingly fed back to the second terminals.
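The server-side arbitration in steps 206 to 208 — picking the request with the earliest receiving time and answering every requester — can be sketched as follows. The request shape (`user_id`, `received_at`) is an illustrative assumption.

```python
# Sketch of the server-side arbitration: among the audio upload requests
# received within the preset window, the one with the earliest receiving
# time wins, and every requester gets a yes/no response message.

def pick_target_second_user(requests):
    if not requests:
        return None, {}
    earliest = min(requests, key=lambda r: r["received_at"])
    # Response per requester: True = qualified to upload the second audio.
    responses = {r["user_id"]: (r is earliest) for r in requests}
    return earliest["user_id"], responses

target, responses = pick_target_second_user([
    {"user_id": "anchor_a", "received_at": 12.40},
    {"user_id": "anchor_b", "received_at": 12.31},
    {"user_id": "anchor_c", "received_at": 12.55},
])
```

With these inputs, anchor_b sent the earliest request and is the only qualified uploader.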
After receiving the audio upload response message, the second terminal may determine whether the second user corresponding to the second terminal is the target second user according to the audio upload response message, that is, determine whether the second user corresponding to the second terminal is allowed to input the second audio. Specifically, if the received audio upload response message indicates that the second user is qualified for uploading the second audio, determining that the second user is the target second user, and prompting the target second user that the second user can upload the second audio; and if the received audio uploading response message indicates that the second user does not have the qualification of uploading the second audio, prompting the second user that the second user cannot upload the second audio and not opening a second audio uploading entrance for the second user.
Exemplarily, taking the second user as an online anchor in the target live broadcast room and the second part of the target audio corresponding to the fifth to eighth lines of lyrics of the climax part of the target song as an example, when multiple online anchors exist in the target live broadcast room at the same time, after receiving the first audio uploaded by the watching user in the target live broadcast room, the server forwards the first audio to the second terminal corresponding to each online anchor in the target live broadcast room; after detecting that the online anchor triggers the singing receiving operation (namely the audio uploading operation), the second terminal correspondingly generates a singing receiving request (namely an audio uploading request) and sends it to the server.
If the server receives singing receiving requests from a plurality of second terminals within the singing grabbing time, the server needs to determine the online anchor with the singing receiving qualification (namely the qualification of uploading the second audio) according to the receiving time of each singing receiving request; for example, the server may determine that the online anchor corresponding to the singing receiving request with the earliest receiving time has the singing receiving qualification. Furthermore, the server may generate a corresponding singing receiving response message (namely an audio uploading response message) for each received singing receiving request: for the singing receiving request with the earliest receiving time, generate a singing receiving response message indicating that the online anchor has the singing receiving qualification, and for the other received singing receiving requests, generate singing receiving response messages indicating that the online anchor does not have the singing receiving qualification; the generated singing receiving response messages are correspondingly fed back to the second terminals.
After receiving the singing receiving response message returned by the server, the second terminal may determine, according to the message, whether the online anchor is allowed to input the second audio: if the received singing receiving response message indicates that the online anchor has the singing receiving qualification, the online anchor is allowed to continue singing the second part of the target audio based on the first audio; if it indicates that the online anchor does not have the singing receiving qualification, the second terminal prompts the online anchor that the singing receiving qualification was not grabbed and the second audio cannot be input.
Step 209: the second terminal receives second audio input by the target second user based on the first audio.
After receiving the audio upload response message sent by the server, the second terminal may further receive a second audio input by the target second user based on the first audio if it is determined that the corresponding second user is the target second user, where the second audio should correspond to a second part of the target audio.
The second portion of the target audio may correspond to content in the target audio that is adjacent to the first portion of the target audio. For example, if the first part of the target audio corresponds to the first four lines of lyrics of the climax part of the target song, the second part of the target audio may correspond to the fifth to eighth lines of lyrics of the climax part; in this case, the target second user needs to sing the fifth to eighth lines, and the second terminal records the sung audio as the second audio. For another example, if the first portion of the target audio corresponds to the first two sentences of the target verse, the second portion of the target audio may correspond to the third and fourth sentences of the target verse; in this case, the target second user needs to recite the third and fourth sentences, and the second terminal records the recited audio as the second audio. The content corresponding to the second portion of the target audio is not limited in any way herein.
Illustratively, assume that the target second user is an online anchor in the target live broadcast room, the target audio is a target song, and the second part of the target audio corresponds to the fifth to eighth lines of lyrics of the climax part of the target song. After determining that the online anchor is qualified to continue singing based on the first audio, the online anchor may sing the second part of the target song along with its accompaniment, guided by the prompt information displayed on the song receiving interaction interface of the live broadcast APP, such as the timing for starting singing, the timing for ending singing, the lyrics of the second part of the target song, and the current progress of the accompaniment of the target song. Meanwhile, the second terminal records the audio sung by the online anchor as the second audio, and confirms that the online anchor has completed the input of the second audio after detecting that the online anchor has sung the eighth line of lyrics of the climax part of the target song, or after detecting that the timing for ending singing has been reached.
Step 210: the second terminal sends second audio uploaded by the second user based on the first audio to the server; that is, the server receives the second audio uploaded by the second user based on the first audio.
After detecting that the target second user completes the input of the second audio, the second terminal may send the second audio to the server through the network. For example, assuming that the target second user is an online anchor in the target live broadcast room and the second part of the target audio corresponds to the fifth to eighth lines of lyrics of the climax part of the target song, the second terminal may confirm that the online anchor has completed the input of the second audio when detecting that the online anchor has sung the eighth line of lyrics of the climax part of the target song, or when detecting that the playing of the accompaniment corresponding to the fifth to eighth lines has been completed; the second terminal then sends the second audio input by the online anchor to the server through the network.
Step 211: and the server determines the score of the target second user for the target audio according to the matching degree between the second audio and the standard audio corresponding to the second part of the target audio.
After receiving the second audio uploaded by the second terminal, the server may call the standard audio corresponding to the second part of the target audio, and then determine the score of the target second user for the target audio according to the matching degree between the received second audio and that standard audio. It should be understood that the higher the matching degree between the second audio and the standard audio corresponding to the second part of the target audio, the higher the score of the target second user for the target audio.
Taking the target audio as the target song and the second part of the target audio corresponding to the fifth to eighth lines of lyrics of the climax part of the target song as an example, the standard audio corresponding to the second part of the target audio may be the audio corresponding to those fifth to eighth lines cut from the original target audio (e.g., the audio of the original singer singing the target song). Taking the target audio as the target verse and the second part of the target audio corresponding to the third and fourth sentences of the target verse as an example, the standard audio corresponding to the second part of the target audio may be the audio corresponding to the third and fourth sentences intercepted from the original target audio (e.g., the audio of a professional reciter reciting the target verse). Of course, in a case where the target audio corresponds to other content, the standard audio corresponding to the second portion of the target audio may correspond to that other content; the application does not limit the standard audio corresponding to the second portion of the target audio.
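The application does not specify how the "matching degree" between the second audio and the standard audio is computed. One common assumption is cosine similarity between acoustic feature vectors (e.g., pitch contours) extracted from the two audios, mapped to a 0-100 score; the sketch below illustrates that assumption and is not the patented method itself.

```python
import math

# Assumed matching-degree computation: cosine similarity between feature
# vectors of the uploaded second audio and the standard audio, scaled to
# a 0-100 score (negative similarity clamped to 0).

def matching_score(second_features, standard_features):
    dot = sum(a * b for a, b in zip(second_features, standard_features))
    norm = (math.sqrt(sum(a * a for a in second_features))
            * math.sqrt(sum(b * b for b in standard_features)))
    if norm == 0:
        return 0.0
    return round(100 * max(0.0, dot / norm), 1)
```

Identical feature vectors score 100.0; orthogonal (completely mismatched) vectors score 0.0.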
Because a link of preemptively receiving the audio is set in the live broadcast interaction process, the server only receives the second audio uploaded by the target second user. At this time, the server determines the score of the target second user for the target audio according to the second audio, that is, according to the matching degree between the received second audio and the standard audio corresponding to the second part of the target audio; for the other second users in the target live broadcast room, the server may directly determine that their scores for the target audio are 0.
It should be noted that, in practical applications, in a round of live broadcast interaction, a first user and a second user in a target live broadcast room may perform audio receiving interaction based on N (N is an integer greater than 1) target audios, for example, in a case that a target audio list selected by the first user includes N target audios, the first user and the second user in the target live broadcast room may perform audio receiving interaction based on the N target audios. In this case, the above steps 201 to 211 need to be executed in a loop N times, each time the first user needs to input one first audio, each time the server needs to correspondingly select the target second user qualified to upload the second audio from the second users in the target live broadcast room; that is, for an ith (i is an integer greater than or equal to 1 and less than or equal to N) first audio (corresponding to a first portion of an ith target audio) uploaded by a first user, a server needs to determine a target second user corresponding to the ith target audio based on an audio upload request initiated by a second user in a target live broadcast room for the ith first audio. For each second user in the target live room, the server may determine its score for each target audio.
Because a link of preemptively receiving audio is set in the live broadcast interaction process, the server may determine the score of each second user in the target live broadcast room for the ith (i is an integer greater than or equal to 1 and less than or equal to N) target audio in the following manner: receiving the second audio uploaded, based on the ith first audio, by the target second user corresponding to the ith target audio; then, according to the matching degree between that second audio and the standard audio corresponding to the second part of the ith target audio, determining the score of the target second user corresponding to the ith target audio for the ith target audio, and determining that the scores, for the ith target audio, of the other second users in the target live broadcast room except the target second user corresponding to the ith target audio are 0.
Specifically, under the condition that a link of preemptively receiving audio is set in the live broadcast interaction process, the server needs to determine a target second user corresponding to an ith target audio according to an audio upload request initiated by each second user for the ith first audio uploaded by the first user in the target live broadcast room, where a specific determination manner is introduced above, and reference may be made to the content of the above related part in detail. In this case, the server can only receive the second audio uploaded by the target second user corresponding to the ith target audio, and therefore, for the target second user corresponding to the ith target audio, the server may determine the score of the target second user for the ith target audio according to the matching degree between the second audio uploaded by the target second user and the standard audio corresponding to the second part of the ith target audio; and for other second users in the target live broadcast room except the target second user, the server can directly determine that the score of the second user aiming at the ith target audio is 0.
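The per-round rule just described — only the target second user's upload is scored, and every other second user in the room receives 0 for that round — can be sketched as follows; the user identifiers are illustrative.

```python
# Per-round scoring rule under the preemptive ("grab the singing") mode:
# only the target second user receives the matching-degree score for the
# i-th target audio, all other second users in the room receive 0.

def round_scores(all_second_users, target_user, target_score):
    return {u: (target_score if u == target_user else 0)
            for u in all_second_users}

scores_i = round_scores(["anchor_a", "anchor_b", "anchor_c"], "anchor_b", 87)
```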
Optionally, the server may assign a virtual reward to the targeted second user based on the score of the targeted second user for the targeted audio.
In order to improve the participation enthusiasm of the second users in the target live broadcast room, after the server determines the score of the target second user for the target audio, the server may allocate a virtual reward to the target second user according to that score. For example, assuming that the target second user is an online anchor in the target live broadcast room, the server may allocate a gift to the online anchor according to the online anchor's score for the target audio.
In a possible implementation manner, the server may preset a correspondence between scores and virtual reward amounts, and after determining the score of the target second user for the target audio, allocate the virtual reward amount corresponding to that score to the target second user, that is, credit the virtual reward to the virtual account corresponding to the target second user.
In practical application, the server may also adopt other strategies to allocate the virtual rewards to the target second users in the target live broadcast room, and the method for allocating the virtual rewards to the target second users by the server is not limited in this application.
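The preset correspondence between score and virtual reward amount mentioned above could be as simple as a tiered lookup table; both the tier boundaries and the amounts below are assumptions for illustration only.

```python
# Assumed score-to-reward correspondence: a descending tier table of
# (minimum score, reward amount). The first tier the score reaches wins.

REWARD_TIERS = [(90, 100), (75, 60), (60, 30), (0, 10)]  # (min_score, reward)

def reward_for_score(score):
    for min_score, amount in REWARD_TIERS:
        if score >= min_score:
            return amount
    return 0  # below every tier (e.g. a negative sentinel score)
```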
It should be noted that, if a first user and a second user in a target live broadcast room in a round of live broadcast interaction perform audio receiving interaction based on N (N is an integer greater than 1) target audios, a server may determine, for each second user in the target live broadcast room, a total score corresponding to the second user according to scores of the second user for the N target audios; and further determining the second user with the highest total score as a winning second user, and distributing the virtual reward to the winning second user.
Specifically, if audio receiving interaction needs to be performed based on N target audios in a round of live broadcast interaction, the server needs to obtain, for each second user in the target live broadcast room, a score of the second user for the first target audio to a score of the second user for the nth target audio, and then calculate a sum of the N scores as a total score corresponding to the second user. Further, the second user with the highest total score is determined as the winning second user, and the virtual reward is distributed to the winning second user.
In practical application, the server may also determine that the second users with the top scores are the winning second users, and then allocate the corresponding virtual rewards to the winning second users according to the preset reward allocation rule. The manner in which the winning second user is determined, and the virtual award allocated to the winning second user, are not limited in any way herein.
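Summing each second user's scores over the N target audios and taking the highest total, as described in the preceding paragraphs, can be sketched as follows.

```python
# Sketch of determining the winning second user over N target audios:
# accumulate each user's per-round scores and pick the highest total.

def pick_winner(per_round_scores):
    """per_round_scores: one {user_id: score} dict per target audio."""
    totals = {}
    for round_result in per_round_scores:
        for user, score in round_result.items():
            totals[user] = totals.get(user, 0) + score
    return max(totals, key=totals.get), totals

winner, totals = pick_winner([
    {"anchor_a": 0, "anchor_b": 87},
    {"anchor_a": 92, "anchor_b": 0},
    {"anchor_a": 0, "anchor_b": 78},
])
```

Here anchor_b wins two of the three rounds and accumulates the highest total.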
Illustratively, it is assumed that the second users are online anchors in a target live broadcast room, the first user is a viewing user in the target live broadcast room, and the viewing user, when choosing to participate in a song receiving live broadcast interaction, selects a target song list from a plurality of candidate song lists and pays a corresponding gift for the target song list. Correspondingly, the server needs to determine the total score of each online anchor in the round of live broadcast interaction according to that anchor's score for each target song in the target song list; further, the server may determine the online anchor with the highest total score as the winning anchor of the round of live broadcast interaction and distribute the gifts paid by the viewing user to the winning anchor.
Furthermore, in order to improve the experience of the live interaction, the server may further determine, as the second audio to be synthesized, the second audio uploaded by the winning second user in the round of live interaction, and determine, as the first audio to be synthesized, the first audio on which the second audio to be synthesized is based; and further, generating an audio album according to the first audio to be synthesized and the second audio to be synthesized, and sending the audio album to the first user and the winning second user, namely to the first terminal used by the first user and the second terminal used by the winning second user, so that the first user and the winning second user can download the audio album.
For example, assuming that the first user is a viewing user in a target live broadcast room and the second users are online anchors in the target live broadcast room, after the server determines the winning anchor of a round of live broadcast interaction in the above manner, the server may take each second audio uploaded by the winning anchor in the round of live broadcast interaction as a second audio to be synthesized, take the first audio uploaded by the viewing user corresponding to that second audio as a first audio to be synthesized, and then synthesize each pair of first audio to be synthesized and second audio to be synthesized that have a corresponding relationship into one audio. If the winning anchor uploaded a plurality of second audios in the current round of live broadcast interaction, the server may synthesize a plurality of audios based on the plurality of second audios and the first audio corresponding to each of them. Finally, the server may compose an audio album from the synthesized audios and send the audio album to the viewing user and the winning anchor participating in the live broadcast interaction, so that the viewing user and the winning anchor can download the audio album.
In addition, if the number of the audio albums synthesized by the server for a certain online anchor reaches a preset number (such as 3 audio albums), the achievement of the online anchor can be shown on a medal display interface of the live APP, so that each online anchor in the live APP is encouraged to actively participate in the live interaction of receiving the audio.
In the live broadcast interaction method provided by the embodiment of the application, the first user and the second users in the target live broadcast room can perform live broadcast interaction by grabbing the microphone to receive audio (such as receiving songs). In this process, the first user and the second users in the target live broadcast room can fully participate in the live broadcast interaction, which greatly enhances their sense of interaction and improves their interactive experience. In addition, distributing virtual rewards to the second users according to their scores for the target audio can further improve the participation enthusiasm of the second users in the target live broadcast room and enhance the liveness of the network live broadcast.
In order to further understand the live broadcast interaction method provided by the embodiment of the application, it is assumed that the first user is a watching user in the target live broadcast room, the second user is an online anchor in the target live broadcast room, multiple online anchors exist in the target live broadcast room at the same time, and the watching user in the target live broadcast room and the online anchor perform live broadcast interaction in a song receiving manner. Based on this, a live broadcast interaction method provided by the embodiment of the present application is integrally and exemplarily described with reference to a live broadcast interaction interface schematic diagram of a live broadcast APP shown in fig. 3.
It should be noted that, in order to enhance the live broadcast interactive atmosphere, the atmosphere effect of the live broadcast interactive interface may be set as a stage lighting effect; for example, the live broadcast interactive interface may flash at a certain frequency. The live broadcast interactive interface here includes both the interface facing the viewing user and the interface facing the online anchors.
A selection control for song receiving may be added to the interactive gameplay panel of the voice live broadcast room; by clicking the selection control, a viewing user triggers participation in the song receiving interaction activity and becomes the initiator of the current round of live broadcast interaction. The server may provide the viewing user with candidate song lists at bronze, silver and king difficulty levels; the viewing user may select 8 target songs from candidate song lists of the same difficulty level to form the target song list, and pay a corresponding gift for the selected target song list. For example, the viewing user needs to pay 8 "airplane" gifts for a song list of the bronze difficulty level.
After the viewing user pays the corresponding gift for the selected target song list, the user may click a start control to enter the song receiving live broadcast interaction mode. After a 3 s countdown, the lyrics the viewing user is required to sing are displayed on the live broadcast interactive interface; the viewing user can sing directly after the prelude ends, and meanwhile the terminal used by the viewing user records the audio sung by the viewing user. Typically, the part the viewing user is required to sing is the climax of the target song.
The terminal used by the viewing user sends the audio sung by the viewing user to the server, and the server forwards the audio to each online anchor in the voice live broadcast room. After the viewing user finishes singing, a period of 15 s may be set as the singing grabbing stage for the online anchors in the voice live broadcast room; that is, after listening to the received audio, each online anchor may click a "grab singing" control in the live broadcast activity program according to his or her own situation, and the server determines the online anchor qualified to sing according to the order in which the online anchors click the "grab singing" control.
The online anchor who grabs the singing qualification has 15 s of song receiving time. During the anchor's song receiving, the live broadcast interactive interface may or may not display the lyrics to be sung. The online anchor may manually click to submit the sung audio within the 15 s; if no submission is clicked within 15 s, the terminal automatically submits the audio collected within the 15 s to the server. The server then scores the online anchor based on the matching degree between the submitted audio and the standard audio corresponding to that audio.
After a round of live broadcast interaction ends, the server may calculate the total score of each online anchor and display a score ranking list. The online anchor with the highest total score obtains all the gifts the viewing user paid when selecting the target song list, and a music album is automatically generated based on the songs sung in the round of live broadcast interaction by the viewing user and the online anchor with the highest total score; the viewing user and that online anchor can click to download the music album. An online anchor who has successfully obtained 3 music albums can also be shown on the music medal display wall in the live broadcast APP.
The following describes in detail an implementation manner of the step in the foregoing method embodiment in which the server determines the score of the target second user for the target audio according to the matching degree between the second audio and the standard audio corresponding to the second part of the target audio. Taking the target audio as a target song as an example, after receiving the second audio uploaded by the target second user, the server may score from two aspects, lyric matching degree and intonation matching degree, and then determine the score of the target second user for the target audio based on the two sub-scores.
Referring to fig. 4, fig. 4 is a schematic flowchart illustrating a process of determining a score of a target second user for a target audio by a server according to an embodiment of the present application. As shown in fig. 4, the implementation process includes the following steps:
step 401: and according to the second audio and the standard audio corresponding to the second part of the target audio, carrying out lyric matching degree identification, and determining the lyric score corresponding to the target second user.
The server may identify the lyrics corresponding to the second audio based on a voice recognition technique, determine a matching degree between the recognized lyrics and the standard lyrics corresponding to the second portion of the target audio, and determine a score of the lyrics corresponding to the target second user based on the matching degree, where it is understood that the higher the matching degree is, the higher the score of the lyrics corresponding to the target second user is.
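The acoustic side of recognition is beyond a short sketch, but once the lyrics have been recognized as text, the matching degree between recognized and standard lyrics can be illustrated with a simple string similarity. This is a simplified stand-in for the embodiment's model-based matching; the helper name and the scaling to 100 marks are assumptions:

```python
import difflib

def lyric_score(recognized: str, standard: str, full_marks: float = 100.0) -> float:
    """Score the lyric matching degree between recognized and standard lyrics.

    SequenceMatcher.ratio() returns a similarity in [0, 1]; scaling it to
    full marks is an assumed scoring convention. Higher matching degree
    yields a higher lyric score, as described in the embodiment.
    """
    ratio = difflib.SequenceMatcher(None, recognized, standard).ratio()
    return round(ratio * full_marks, 1)
```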
In specific implementation, the server may extract a target audio signal sent by the target second user from the second audio; then perform a feature extraction operation on the target audio signal to obtain target feature parameters; further, build a target acoustic model corresponding to the target audio signal according to the target feature parameters, and call a standard acoustic model corresponding to the second part of the target audio from an acoustic model library; and finally, determine the matching degree between the target acoustic model and the standard acoustic model through a Long short-term memory (LSTM) model, taking this matching degree as the lyric score corresponding to the target second user.
Speech recognition is pattern recognition based on speech feature parameters: the input second audio is classified by a machine learning model and matched to the best result according to a certain criterion. The process generally includes preprocessing, feature extraction, model construction and model matching, and the implementation process is shown in fig. 5. The voice signal (i.e., the second audio) is collected by a microphone, converted into a digital signal after sampling and Analog-to-Digital (A/D) conversion, and then subjected to pre-emphasis, framing, windowing, endpoint detection, filtering and other processing. For the preprocessed voice signal, the feature parameters that best express its characteristics are extracted according to a specific feature extraction method, and the feature parameters are arranged in time order to obtain the feature sequence of the voice signal. In the model construction process, a corresponding acoustic model can be constructed based on the feature sequence of the voice signal, and this acoustic model is pattern-matched with the standard acoustic model in the acoustic model library, so as to obtain the lyric score.
The preprocessing of the voice signal mainly comprises the following steps: 1) sampling and quantization, 2) pre-emphasis, framing and windowing, 3) speech signal analysis-frequency domain analysis. These three sections will be described separately below.
Sampling and quantization: the microphone converts the collected sound from a physical state into an analog electrical signal, and the continuous analog signal is then converted into a signal that is discrete in time but continuous in amplitude; this is sampling. In general, the sampling frequency on a Personal Computer (PC) is 16 kHz, and the sampling frequency on an embedded device is 8 kHz. To facilitate calculation, transmission and storage by the relevant devices, the sampled signal is further converted into discrete values represented in binary; this is A/D conversion. To ensure that the A/D conversion reaches sufficient accuracy, uniform quantization and Pulse Code Modulation (PCM) techniques are generally used, and 16-bit quantization is generally applied to the sampled signal.
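The uniform 16-bit quantization step can be sketched as follows; the scaling-by-32767 convention for normalized samples in [-1, 1] is one common choice, assumed here:

```python
def quantize_16bit(sample: float) -> int:
    """Uniformly quantize a normalized sample in [-1.0, 1.0] to a signed
    16-bit PCM value (scaling by 32767 is an assumed, common convention)."""
    clipped = max(-1.0, min(1.0, sample))  # clip out-of-range input
    return int(round(clipped * 32767))
```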
Pre-emphasis, framing and windowing: for the voice signals obtained after the above sampling and quantization processing, the voice signals can be further processed by adopting Voice Activity Detection (VAD) technology, and VAD is used for correctly distinguishing voice segments from non-voice segments in the presence of background noise, and is an extremely important preprocessing step in the processing scenes of voice signals such as automatic voice recognition, voice enhancement, speaker recognition and the like.
Because the high-frequency part of the voice signal rolls off at about -6 dB/octave above 800 Hz, pre-emphasis processing can improve the processing quality of the high-frequency part and make the spectrum flatter; normally, pre-emphasis can be realized by a first-order high-pass filter. In addition, before the speech signal is analyzed, it needs to be framed; for example, the length of each frame can be set to 20 ms, with an overlap of 10 ms between two adjacent frames.
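The pre-emphasis, framing and windowing steps described above can be sketched as follows. Frame and hop lengths are in samples; the pre-emphasis coefficient 0.97 and the Hamming window are common conventions assumed here, not values stated in this embodiment:

```python
import math

def pre_emphasis(signal, alpha=0.97):
    # First-order high-pass: y[n] = x[n] - alpha * x[n-1]
    return [signal[0]] + [signal[n] - alpha * signal[n - 1]
                          for n in range(1, len(signal))]

def frame_signal(signal, frame_len, hop_len):
    # e.g. 20 ms frames with 10 ms overlap  ->  hop_len = frame_len // 2
    return [signal[start:start + frame_len]
            for start in range(0, len(signal) - frame_len + 1, hop_len)]

def hamming_window(frame):
    # Taper each frame to reduce spectral leakage before frequency analysis
    n = len(frame)
    return [frame[i] * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
            for i in range(n)]
```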
Speech signal analysis - frequency domain analysis: in speech signal analysis, a commonly used frequency domain analysis method is the Fourier transform method combined with filter banks. When a wideband band-pass filter is used, the frequency resolution is low, similar to the result of windowing with a short window; when a narrowband band-pass filter is used, the frequency resolution is high, similar to the result of windowing with a long window. Generally, a set of filters is used to filter the speech input signal, separating components of different center frequencies from it, and subsequent analysis and feature extraction are then performed on these components.
In the feature extraction process, the feature parameters generally extracted may include, but are not limited to, the following: pitch period, formants, short-time average energy or amplitude, Linear Prediction Coefficient (LPC), Perceptual Linear Prediction coefficient (PLP), short-time average zero-crossing rate, Linear Prediction Cepstral Coefficient (LPCC), autocorrelation function, Mel-Frequency Cepstral Coefficient (MFCC), wavelet transform coefficient, Empirical Mode Decomposition coefficient (EMD), and Gammatone Filter Cepstral Coefficient (GFCC).
In the model construction process, the server may employ the Hidden Markov Model (HMM) technique to construct an acoustic model based on the extracted feature parameters. A Markov chain is a special case of the Markov random process whose state parameter and time parameter are both discrete. In practical applications, because the observed events and the states do not correspond one to one, the relationship between events and states can be described by a set of probability distributions; this yields the HMM. Through an HMM, the server can build a statistical model for the time sequence formed by the feature parameters extracted from a speech signal, in which two interrelated random processes jointly describe the statistical characteristics of the speech signal: one is a Markov chain with a finite number of states that simulates the change of the statistical characteristics of the speech signal and describes the transitions between states; the other describes the statistical relationship between the states and the observed values. An observer can only see the observed values, not the states themselves; the existence of the states can only be perceived through the second random process, so the state chain is "hidden", and the whole model is therefore called a "hidden" Markov model.
In the model matching process, the LSTM model is used for speech recognition. The LSTM is a special Recurrent Neural Network (RNN), mainly used to solve the problems of gradient vanishing and gradient explosion during long-sequence training. Fig. 6 is a schematic diagram of the working principle of the LSTM model: it controls the transmitted state through gating, retaining information that needs to be memorized for a long time and forgetting unimportant information. Compared with the general RNN, the LSTM performs better on long-sequence recognition.
Step 402: and performing intonation matching degree identification according to the second audio and the standard audio corresponding to the second part of the target audio, and determining the intonation score corresponding to the target second user.
In the intonation recognition processing, the server also needs to convert the second audio and the standard audio into digital signals, then performs pitch recognition based on the converted digital signals, performs intonation feature extraction of the second audio and the standard audio through Fast Fourier Transform (FFT) algorithm, and matches the extracted intonation features to determine the intonation score.
During specific implementation, the server may perform fast Fourier transform on the second audio and on the standard audio corresponding to the second part of the target audio, respectively, to obtain the frequency domain characteristics corresponding to each; then, according to these frequency domain characteristics, determine the time domain amplitudes corresponding to the second audio and the standard audio respectively; and then determine the intonation score corresponding to the target second user according to the difference between the time domain amplitudes corresponding to the second audio and the standard audio.
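A toy illustration of this comparison, using a naive DFT for brevity (a real implementation would use an FFT) and scoring by the normalized gap between the two magnitude spectra; the exact scoring formula is an assumption, not the embodiment's actual formula:

```python
import cmath

def dft(samples):
    # Naive O(N^2) DFT; a production system would use an FFT instead.
    n = len(samples)
    return [sum(samples[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n)) for k in range(n)]

def intonation_score(second_audio, standard_audio, full_marks=100.0):
    """Score by the gap between the two signals' magnitude spectra.

    The formula (1 - L1 gap / reference spectral energy, clipped at 0)
    is an illustrative assumption.
    """
    a = [abs(x) for x in dft(second_audio)]
    b = [abs(x) for x in dft(standard_audio)]
    gap = sum(abs(x - y) for x, y in zip(a, b))
    norm = sum(b) or 1.0
    return round(max(0.0, 1.0 - gap / norm) * full_marks, 1)
```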
The basic principles involved in this part of the process are described below:
the principle of frequency spectrum: according to the principle of fourier analysis, any sound can be decomposed into several or even infinite sine waves, which often contain numerous harmonic components that often vary from time to time, so that the composition of a sound is in fact very complex. In order to simplify the representation of the sound, its frequency components can be plotted, thus forming a frequency spectrum.
Fundamental frequency: it corresponds to the frequency of vocal cord vibration and represents the pitch of the sound; the higher the fundamental frequency, the faster the vocal cords vibrate and the sharper the sound. Generally speaking, in a clean sound spectrum (i.e., without hoarseness and without the frequencies of other sounds mixed in), the lowest distinct peak represents the fundamental frequency, and the peaks at integer multiples of the fundamental frequency are the harmonics generated by its resonance. The fundamental frequency can generally be represented by a frequency value f or a period value T.
The FFT algorithm is described below.
The FFT algorithm converts the time domain into the frequency domain; the FFT is in fact a fast algorithm for the Discrete Fourier Transform (DFT). In digital signal processing, the FFT algorithm is usually required to obtain the frequency domain characteristics of a signal. The purpose of the transform is to obtain the same signal represented in the frequency domain, where its characteristics can be analyzed more easily.
After FFT processing, a series of complex numbers are obtained, which are amplitude characteristics, not amplitudes, of the corresponding frequencies of the sound waveform. The server needs to further obtain the frequency and amplitude based on the amplitude characteristics.
Acquiring frequency: the frequency is only related to the sampling rate and the number of points of the fast Fourier transform. The first complex number obtained after the FFT corresponds to the 0 Hz frequency (i.e., no fluctuation, also called the DC component); each subsequent complex number corresponds to a frequency one spectral resolution step higher than the previous one, and the spectral resolution is calculated as follows:
Δf=Fs/N
wherein Fs is the sampling rate and N is the number of points in the FFT algorithm; usually, once Fs and N are determined, the frequency domain is determined.
Obtaining the amplitude: assuming that the peak value of the original sound signal is A, each point in the FFT result (except the first point, the DC component) has a modulus that is N/2 times A, while the first point, being the DC component, has a modulus N times the DC value. That is, to determine the real amplitude, the modulus of the first point (i = 0) and of the point at i = N/2 must be divided by N, and the modulus of the remaining points divided by N/2. This is because the time domain amplitude in the Fourier series already contains the 1/N term, while the Fourier transform does not carry that coefficient, so the time domain amplitude can be obtained by dividing by N/2 after the FFT.
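The spectral resolution formula and the amplitude recovery rules above can be sketched directly (the handling of the i = N/2 bin applies for even N, as assumed here):

```python
def spectral_resolution(fs: float, n: int) -> float:
    # Spectral resolution: delta_f = Fs / N
    return fs / n

def recover_amplitudes(fft_bins):
    """Convert raw FFT bin moduli into real time-domain amplitudes:
    the DC bin (i = 0) and, for even N, the i = N/2 bin are divided by N;
    every other bin is divided by N/2."""
    n = len(fft_bins)
    amps = []
    for i, c in enumerate(fft_bins):
        mod = abs(c)
        if i == 0 or (n % 2 == 0 and i == n // 2):
            amps.append(mod / n)
        else:
            amps.append(mod / (n / 2))
    return amps
```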
It should be understood that, in practical application, the server may perform step 401 first and then step 402, may perform step 402 first and then step 401, or may perform step 401 and step 402 at the same time; the present application does not limit the execution order of step 401 and step 402.
Step 403: and determining the score of the target second user for the target audio according to the lyric score and the intonation score corresponding to the target second user.
After the server calculates the score of the lyrics and the intonation score corresponding to the target second user, the sum of the score of the lyrics and the intonation score can be directly calculated to serve as the score of the target second user for the target audio, and the score of the target second user for the target audio can also be obtained by weighting the score of the lyrics and the intonation score according to actual requirements.
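The combination of the two sub-scores can be sketched as follows; the default equal weights are an assumption, and in the weighted case the weights would be set according to actual requirements:

```python
def overall_score(lyric_score, intonation_score, w_lyric=0.5, w_intonation=0.5):
    """Combine the lyric and intonation sub-scores into the score for the
    target audio. Equal default weights are an illustrative assumption."""
    return w_lyric * lyric_score + w_intonation * intonation_score
```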
The score of the target second user for the target audio is determined in the above manner, so that the accuracy of the determined score can be ensured, and a fair and reasonable evaluation result is provided for the second audio uploaded by the target second user.
Aiming at the live broadcast interaction method described above, the application also provides a corresponding live broadcast interaction device, so that the live broadcast interaction method is applied and realized in practice.
Referring to fig. 7, fig. 7 is a schematic structural diagram of a live broadcast interaction device 700 corresponding to the live broadcast interaction method shown in fig. 2, where the live broadcast interaction device 700 includes:
a first audio receiving module 701, configured to receive a first audio uploaded by a first user in a target live broadcast room, where the first audio corresponds to a first portion of a target audio;
a first audio sending module 702, configured to send the first audio to a second user in the target live broadcast room;
an audio upload request receiving module 703, configured to receive an audio upload request initiated by the second user, where the audio upload request is used to characterize that the second user request uploads a second audio based on the first audio, and the second audio corresponds to a second portion of the target audio;
a target second user determining module 704, configured to determine a target second user among the second users in the target live broadcast room based on the audio upload request;
a second audio receiving module 705, configured to receive the second audio uploaded by the target second user based on the first audio;
a second audio scoring module 706, configured to determine a score of the target second user for the target audio according to a matching degree between the second audio and a standard audio corresponding to the second portion of the target audio.
Optionally, on the basis of the live broadcast interaction apparatus shown in fig. 7, in a case that a plurality of audio upload requests initiated by a plurality of second users are received, the target second user determining module 704 is specifically configured to:
and determining a second user corresponding to the audio uploading request with the earliest receiving time as the target second user according to the receiving time corresponding to each of the audio uploading requests.
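A sketch of this earliest-request selection, assuming each audio upload request carries a server-side receive timestamp (the `(user_id, received_at)` shape is an assumption):

```python
# Sketch: select the target second user as the one whose audio upload
# request was received earliest by the server.

def pick_target_user(requests):
    """requests: iterable of (user_id, received_at) pairs, where
    received_at is a comparable timestamp."""
    return min(requests, key=lambda r: r[1])[0]
```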
Optionally, on the basis of the live interactive apparatus shown in fig. 7, referring to fig. 8, fig. 8 is a schematic structural diagram of another live interactive apparatus 800 provided in this embodiment of the present application. As shown in fig. 8, the apparatus further includes:
a candidate audio sending module 801, configured to send, to the first user, a plurality of candidate audio lists and virtual reward amounts corresponding to the candidate audio lists respectively;
a first request receiving module 802, configured to receive an audio list selection request and a virtual reward payment request initiated by the first user; the audio list selection request is used for characterizing a target audio list selected by the first user in the candidate audio lists; the virtual reward payment request is used for representing that the first user pays a target virtual reward, and the target virtual reward is matched with a virtual reward amount corresponding to the target audio list.
Optionally, on the basis of the live interactive apparatus shown in fig. 7, referring to fig. 9, fig. 9 is a schematic structural diagram of another live interactive apparatus 900 provided in this embodiment of the present application. As shown in fig. 9, the apparatus further includes:
a reward distribution module 901, configured to distribute a virtual reward to the target second user according to the score of the target second user for the target audio.
Optionally, on the basis of the live broadcast interaction device shown in fig. 7 or fig. 9, the first user uploads N first audios, where the N first audios correspond to the N target audios, respectively; n is an integer greater than 1;
the target second user determination module 704 is specifically configured to:
determining a target second user corresponding to the ith target audio based on the audio uploading request initiated by the second user for the ith first audio; the ith first audio corresponds to a first part of the ith target audio, and i is an integer which is greater than or equal to 1 and less than or equal to N;
the second audio receiving module 705 is specifically configured to:
receiving second audio uploaded by a target second user corresponding to the ith target audio based on the ith first audio;
the second audio scoring module 706 is specifically configured to:
according to the matching degree between the second audio uploaded by the target second user corresponding to the ith target audio and the standard audio corresponding to the second part of the ith target audio, determining the score of the target second user corresponding to the ith target audio for the ith target audio;
and determining that the score of the second user except the target second user corresponding to the ith target audio in the target live broadcast room is 0.
Optionally, on the basis of the live interactive apparatus shown in fig. 7 or fig. 9, the reward distribution module 901 is further configured to:
for each second user in the target live broadcast room, determining a total score corresponding to the second user according to the scores of the second user for the N target audios;
and determining the second user with the highest total score as a winning second user, and distributing a virtual reward to the winning second user.
Optionally, on the basis of the live interactive apparatus shown in fig. 7 or fig. 9, referring to fig. 10, fig. 10 is a schematic structural diagram of another live interactive apparatus 1000 provided in this embodiment of the present application. As shown in fig. 10, the apparatus further includes:
an audio album providing module 1001 configured to determine the second audio uploaded by the winning second user as a second audio to be synthesized; determining the first audio on which the second audio to be synthesized is based as first audio to be synthesized; generating an audio album according to the first audio to be synthesized and the second audio to be synthesized; and sending the audio album to the first user and the winning second user so that the first user and the winning second user can download and obtain the audio album.
Optionally, on the basis of the live interactive apparatus shown in fig. 7, referring to fig. 11, fig. 11 is a schematic structural diagram of another live interactive apparatus 1100 provided in this embodiment of the present application. As shown in fig. 11, the second audio scoring module 706 includes:
the lyric scoring module 1101 is configured to perform lyric matching degree identification according to the second audio and the standard audio corresponding to the second part of the target audio, and determine a lyric score corresponding to the target second user;
a intonation scoring module 1102, configured to perform intonation matching degree recognition according to the second audio and a standard audio corresponding to the second part of the target audio, and determine a intonation score corresponding to the target second user;
a score determining module 1103, configured to determine, according to the lyric score and the intonation score corresponding to the target second user, the score of the target second user for the target audio.
Optionally, on the basis of the live broadcast interaction apparatus shown in fig. 11, the lyric scoring module 1101 is specifically configured to:
extracting a target audio signal sent by the target second user from the second audio;
carrying out feature extraction operation on the target audio signal to obtain target feature parameters;
constructing a target acoustic model corresponding to the target audio signal according to the target characteristic parameters; and calling a standard acoustic model corresponding to the second part of the target audio from the acoustic model library;
and determining, by means of a long short-term memory (LSTM) model, the matching degree between the target acoustic model and the standard acoustic model as the lyric score.
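A highly simplified sketch of this lyric-matching step is shown below. It stands in for the LSTM-based acoustic-model comparison with a direct cosine similarity over log-magnitude spectral features; the feature choice and the 0–100 scaling are assumptions for illustration, not the patented method:

```python
import numpy as np

def frame_features(signal, frame_len=512, hop=256):
    """Split a mono signal into windowed frames and take log-magnitude FFT
    features, a simplified stand-in for the target characteristic parameters."""
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * np.hanning(frame_len)
        frames.append(np.log1p(np.abs(np.fft.rfft(frame))))
    return np.array(frames)

def lyric_match_score(target_signal, standard_signal):
    """Matching degree (0..100) between the target audio signal and the standard
    audio, via cosine similarity of their truncated feature sequences."""
    a = frame_features(target_signal)
    b = frame_features(standard_signal)
    n = min(len(a), len(b))
    a, b = a[:n].ravel(), b[:n].ravel()
    sim = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))
    return max(0.0, sim) * 100.0
```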
Optionally, on the basis of the live broadcast interaction apparatus shown in fig. 11, the intonation scoring module 1102 is specifically configured to:
respectively performing fast Fourier transform on the second audio and the standard audio corresponding to the second part of the target audio to obtain frequency domain characteristics corresponding to the second audio and the standard audio;
determining time domain amplitude values corresponding to the second audio and the standard audio according to frequency domain characteristics corresponding to the second audio and the standard audio respectively;
and determining the intonation score according to the difference value between the time domain amplitude values corresponding to the second audio and the standard audio respectively.
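The FFT-based comparison above can be sketched as follows; the way the amplitude is recovered and the 0–100 scaling constant are illustrative assumptions, not values given in the embodiment:

```python
import numpy as np

def intonation_score(second_audio, standard_audio):
    """Score intonation by transforming both signals to the frequency domain,
    recovering an amplitude value for each, and scoring from the difference."""
    n = min(len(second_audio), len(standard_audio))
    spec_second = np.abs(np.fft.rfft(second_audio[:n]))
    spec_standard = np.abs(np.fft.rfft(standard_audio[:n]))
    # time-domain amplitude recovered from the frequency-domain characteristics
    amp_second = spec_second.max() * 2.0 / n
    amp_standard = spec_standard.max() * 2.0 / n
    diff = abs(amp_second - amp_standard)
    return max(0.0, 100.0 - 100.0 * diff)  # illustrative scaling
```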
In the live broadcast interaction device provided by the embodiment of the application, the first user and the second users in the target live broadcast room can carry out live broadcast interaction by grabbing the microphone to continue an audio (for example, continuing a song). In this process, the first user and the second users in the target live broadcast room fully participate in the live broadcast interaction, which greatly enhances their sense of interaction and improves their interactive experience. In addition, virtual rewards are distributed to the second users according to their scores for the target audio, which can further improve the participation enthusiasm of the second users in the target live broadcast room and increase the activity level of the network live broadcast.
The embodiment of the present application further provides a device for live broadcast interaction. The device may specifically be a server, and the server provided in the embodiment of the present application will be described below from the perspective of its hardware implementation.
Referring to fig. 12, fig. 12 is a schematic structural diagram of a server 1200 according to an embodiment of the present disclosure. The server 1200 may vary widely in configuration or performance and may include one or more Central Processing Units (CPUs) 1222 (e.g., one or more processors) and memory 1232, one or more storage media 1230 (e.g., one or more mass storage devices) storing applications 1242 or data 1244. Memory 1232 and storage media 1230 can be, among other things, transient storage or persistent storage. The program stored in the storage medium 1230 may include one or more modules (not shown), each of which may include a series of instruction operations for the server. Still further, the central processor 1222 may be configured to communicate with the storage medium 1230, to execute a series of instruction operations in the storage medium 1230 on the server 1200.
The server 1200 may also include one or more power supplies 1226, one or more wired or wireless network interfaces 1250, one or more input-output interfaces 1258, and/or one or more operating systems 1241, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, etc.
The steps performed by the server in the above embodiment may be based on the server structure shown in fig. 12.
The CPU 1222 is configured to perform the following steps:
receiving first audio uploaded by a first user in a target live broadcast room, wherein the first audio corresponds to a first part of the target audio;
sending the first audio to each second user in the target live broadcast room;
receiving an audio upload request initiated by the second user, the audio upload request being used to characterize that the second user requests to upload a second audio based on the first audio, the second audio corresponding to a second portion of the target audio;
determining a target second user in each second user in the target live broadcast room based on the audio uploading request;
receiving the second audio uploaded by the target second user based on the first audio;
and determining the score of the target second user for the target audio according to the matching degree between the second audio and the standard audio corresponding to the second part of the target audio.
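The mic-grab selection among the received upload requests (which claim 2 resolves to the request with the earliest receiving time) can be sketched as follows; the request type and field names are illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass
class AudioUploadRequest:
    user_id: str        # the second user who initiated the request
    received_at: float  # server-side receiving time, in seconds

def pick_target_second_user(requests):
    """Return the second user whose audio upload request was received first,
    or None when no request has been received yet."""
    if not requests:
        return None
    return min(requests, key=lambda r: r.received_at).user_id
```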
Optionally, the CPU 1222 may also be configured to execute the steps of any implementation manner of the live interaction method provided in the embodiment of the present application.
An embodiment of the present application further provides a computer-readable storage medium, configured to store a computer program, where the computer program is configured to execute any implementation manner of a live broadcast interaction method described in the foregoing embodiments.
Embodiments of the present application also provide a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes any one implementation of a live interaction method described in the foregoing embodiments.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be substantially implemented or contributed to by the prior art, or all or part of the technical solution may be embodied in a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing computer programs.
It should be understood that in the present application, "at least one" means one or more, "a plurality" means two or more. "and/or" for describing an association relationship of associated objects, indicating that there may be three relationships, e.g., "a and/or B" may indicate: only A, only B and both A and B are present, wherein A and B may be singular or plural. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. "at least one of the following" or similar expressions refer to any combination of these items, including any combination of single item(s) or plural items. For example, at least one (one) of a, b, or c, may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", wherein a, b, c may be single or plural.
The above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (15)

1. A live interaction method, comprising:
receiving first audio uploaded by a first user in a target live broadcast room, wherein the first audio corresponds to a first part of the target audio;
sending the first audio to each second user in the target live broadcast room;
receiving an audio upload request initiated by the second user, the audio upload request being used to characterize that the second user requests to upload a second audio based on the first audio, the second audio corresponding to a second portion of the target audio;
determining a target second user in each second user in the target live broadcast room based on the audio uploading request;
receiving the second audio uploaded by the target second user based on the first audio;
and determining the score of the target second user for the target audio according to the matching degree between the second audio and the standard audio corresponding to the second part of the target audio.
2. The method according to claim 1, wherein in a case where a plurality of audio upload requests initiated by a plurality of second users are received, the determining, based on the audio upload requests, a target second user corresponding to the target audio among the second users in the target live broadcast room includes:
and determining a second user corresponding to the audio uploading request with the earliest receiving time as the target second user according to the receiving time corresponding to each of the audio uploading requests.
3. The method of claim 1, wherein prior to the receiving the first audio uploaded by the first user in the target live broadcast room, the method further comprises:
sending a plurality of candidate audio lists and virtual reward amounts corresponding to the candidate audio lists to the first user;
receiving an audio list selection request and a virtual reward payment request initiated by the first user; the audio list selection request is used for characterizing a target audio list selected by the first user in the candidate audio lists; the virtual reward payment request is used for representing that the first user pays a target virtual reward, and the target virtual reward is matched with a virtual reward amount corresponding to the target audio list.
4. The method of claim 1, further comprising:
and distributing a virtual reward for the target second user according to the score of the target second user for the target audio.
5. The method of claim 1, wherein the first user uploads N first audios, and the N first audios respectively correspond to N target audios, N being an integer greater than 1;
determining a target second user among the second users in the target live broadcast room based on the audio upload request, including:
determining a target second user corresponding to the ith target audio based on the audio uploading request initiated by the second user for the ith first audio; the ith first audio corresponds to a first part of the ith target audio, and i is an integer which is greater than or equal to 1 and less than or equal to N;
the receiving the second audio uploaded by the target second user based on the first audio comprises:
receiving second audio uploaded by a target second user corresponding to the ith target audio based on the ith first audio;
the determining the score of the target second user for the target audio according to the matching degree between the second audio and the standard audio corresponding to the second part of the target audio comprises:
according to the matching degree between the second audio uploaded by the target second user corresponding to the ith target audio and the standard audio corresponding to the second part of the ith target audio, determining the score of the target second user corresponding to the ith target audio for the ith target audio;
the method further comprises the following steps:
and determining that the scores of the second users in the target live broadcast room other than the target second user corresponding to the ith target audio are 0.
6. The method of claim 5, further comprising:
for each second user in the target live broadcast room, determining a total score corresponding to the second user according to the scores of the second user for the N target audios;
and determining the second user with the highest total score as a winning second user, and distributing a virtual reward to the winning second user.
7. The method of claim 6, further comprising:
determining the second audio uploaded by the winning second user as a second audio to be synthesized;
determining the first audio on which the second audio to be synthesized is based as first audio to be synthesized;
generating an audio album according to the first audio to be synthesized and the second audio to be synthesized;
and sending the audio album to the first user and the winning second user so that the first user and the winning second user can download and obtain the audio album.
8. The method of claim 1, wherein determining the score of the target second user for the target audio according to a degree of match between the second audio and standard audio corresponding to the second portion of the target audio comprises:
according to the second audio and the standard audio corresponding to the second part of the target audio, carrying out lyric matching degree identification, and determining the lyric score corresponding to the target second user;
performing intonation matching degree recognition according to the second audio and the standard audio corresponding to the second part of the target audio, and determining the intonation score corresponding to the target second user;
and determining the score of the target second user for the target audio according to the lyric score and the intonation score corresponding to the target second user.
9. The method of claim 8, wherein the performing lyric matching degree recognition according to the second audio and the standard audio corresponding to the second portion of the target audio to determine the lyric score corresponding to the target second user comprises:
extracting a target audio signal sent by the target second user from the second audio;
carrying out feature extraction operation on the target audio signal to obtain target feature parameters;
constructing a target acoustic model corresponding to the target audio signal according to the target characteristic parameters; and calling a standard acoustic model corresponding to the second part of the target audio from the acoustic model library;
and determining, by means of a long short-term memory (LSTM) model, the matching degree between the target acoustic model and the standard acoustic model as the lyric score.
10. The method according to claim 8, wherein the performing intonation matching degree recognition according to the second audio and the standard audio corresponding to the second part of the target audio, and determining the intonation score corresponding to the target second user comprises:
respectively performing fast Fourier transform on the second audio and the standard audio corresponding to the second part of the target audio to obtain frequency domain characteristics corresponding to the second audio and the standard audio;
determining time domain amplitude values corresponding to the second audio and the standard audio according to frequency domain characteristics corresponding to the second audio and the standard audio respectively;
and determining the intonation score according to the difference value between the time domain amplitude values corresponding to the second audio and the standard audio respectively.
11. A live interaction device, the device comprising:
the first audio receiving module is used for receiving first audio uploaded by a first user in a target live broadcast room, wherein the first audio corresponds to a first part of the target audio;
the first audio sending module is used for sending the first audio to each second user in the target live broadcast room;
an audio upload request receiving module, configured to receive an audio upload request initiated by the second user, where the audio upload request is used to characterize that the second user requests to upload a second audio based on the first audio, and the second audio corresponds to a second portion of the target audio;
the target second user determination module is used for determining a target second user in each second user in the target live broadcast room based on the audio uploading request;
a second audio receiving module, configured to receive the second audio uploaded by the target second user based on the first audio;
and the second audio scoring module is used for determining the score of the target second user for the target audio according to the matching degree between the second audio and the standard audio corresponding to the second part of the target audio.
12. A live interactive system, the system comprising: the system comprises a first terminal facing a first user, a second terminal facing a second user and a server;
the first terminal is used for receiving a first audio input by the first user in a target live broadcast room and uploading the first audio to the server, wherein the first audio corresponds to a first part of the target audio;
the second terminal is used for, after receiving the first audio sent by the server, responding to an audio upload operation triggered by the second user, generating an audio upload request and sending the audio upload request to the server; the audio upload request is used for representing that the second user requests to upload second audio based on the first audio; the second audio corresponds to a second portion of the target audio;
the second terminal is further configured to receive the second audio input by the target second user based on the first audio and upload the second audio to the server when it is determined that the second user is the target second user;
the server for executing the live interaction method of any one of claims 1-10.
13. The system of claim 12, wherein the first terminal is further configured to:
responding to the triggered target activity participation operation, and displaying a plurality of candidate audio lists and virtual reward amounts corresponding to the candidate audio lists;
responding to the triggered audio list selection operation, generating an audio list selection request and sending the audio list selection request to the server; the audio list selection request is used for characterizing a target audio list selected by the first user in the candidate audio lists;
responding to the triggered virtual reward payment operation, generating a virtual reward payment request and sending the virtual reward payment request to the server; the virtual reward payment request is used for representing that the first user pays a target virtual reward, and the target virtual reward is matched with a virtual reward amount corresponding to the target audio list.
14. An apparatus, comprising a processor and a memory;
the memory is used for storing a computer program;
the processor is configured to execute the live interaction method of any one of claims 1 to 10 according to the computer program.
15. A computer-readable storage medium for storing a computer program for executing the live interaction method of any one of claims 1 to 10.
CN202010843274.2A 2020-08-20 2020-08-20 Live broadcast interaction method, device, system, equipment and storage medium Active CN111935555B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010843274.2A CN111935555B (en) 2020-08-20 2020-08-20 Live broadcast interaction method, device, system, equipment and storage medium


Publications (2)

Publication Number Publication Date
CN111935555A true CN111935555A (en) 2020-11-13
CN111935555B CN111935555B (en) 2022-01-04

Family

ID=73304897


Country Status (1)

Country Link
CN (1) CN111935555B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113411327A (en) * 2021-06-17 2021-09-17 广州方硅信息技术有限公司 Audio adjustment method, system, device, equipment and medium for voice live broadcast
CN113453033A (en) * 2021-06-29 2021-09-28 广州方硅信息技术有限公司 Live broadcast room information transmission processing method and device, equipment and medium thereof
WO2022228415A1 (en) * 2021-04-27 2022-11-03 北京字节跳动网络技术有限公司 Rhythm interaction method and device
WO2024001362A1 (en) * 2022-06-30 2024-01-04 海信视像科技股份有限公司 Display device, bluetooth device, and data processing method

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160358094A1 (en) * 2015-06-02 2016-12-08 International Business Machines Corporation Utilizing Word Embeddings for Term Matching in Question Answering Systems
CN107454436A (en) * 2017-09-28 2017-12-08 广州酷狗计算机科技有限公司 Interactive approach, device, server and storage medium
CN107770235A (en) * 2016-08-23 2018-03-06 冯山泉 One kind bucket song service implementing method and system
CN107896334A (en) * 2017-11-30 2018-04-10 广州酷狗计算机科技有限公司 Carry out live method and apparatus



Also Published As

Publication number Publication date
CN111935555B (en) 2022-01-04


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant