CN112261435B - Social interaction method, device, system, equipment and storage medium - Google Patents

Social interaction method, device, system, equipment and storage medium

Info

Publication number
CN112261435B
CN112261435B (application CN202011231193.3A)
Authority
CN
China
Prior art keywords
audio
dubbing
user
target
paired
Prior art date
Legal status
Active
Application number
CN202011231193.3A
Other languages
Chinese (zh)
Other versions
CN112261435A (en)
Inventor
张艳军
武斌
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology (Shenzhen) Co., Ltd.
Priority to CN202011231193.3A
Publication of CN112261435A
Application granted
Publication of CN112261435B
Legal status: Active
Anticipated expiration


Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/232Content retrieval operation locally within server, e.g. reading video streams from disk arrays
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F16/63Querying
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/953Querying, e.g. by the use of web search engines
    • G06F16/9536Search customisation based on social or collaborative filtering
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/233Processing of audio elementary streams
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/236Assembling of a multiplex stream, e.g. transport stream, by combining a video stream with other content or additional data, e.g. inserting a URL [Uniform Resource Locator] into a video stream, multiplexing software data into a video stream; Remultiplexing of multiplex streams; Insertion of stuffing bits into the multiplex stream, e.g. to obtain a constant bit-rate; Assembling of a packetised elementary stream
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/239Interfacing the upstream path of the transmission network, e.g. prioritizing client content requests
    • H04N21/2393Interfacing the upstream path of the transmission network, e.g. prioritizing client content requests involving handling client requests
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/437Interfacing the upstream path of the transmission network, e.g. for transmitting client requests to a VOD server
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/439Processing of audio elementary streams
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47End-user applications
    • H04N21/472End-user interface for requesting content, additional data or services; End-user interface for interacting with content, e.g. for content reservation or setting reminders, for requesting event notification, for manipulating displayed content
    • H04N21/47202End-user interface for requesting content, additional data or services; End-user interface for interacting with content, e.g. for content reservation or setting reminders, for requesting event notification, for manipulating displayed content for requesting content on demand, e.g. video on demand

Abstract

The embodiments of this application disclose a social interaction method, apparatus, system, device, and storage medium. The method includes: retrieving a first audio from a to-be-paired audio library in response to a line-continuation operation triggered by a first user, where the to-be-paired audio library stores to-be-paired audio uploaded by second users on a target social platform and each to-be-paired audio corresponds to the first part of a dubbing lines segment; sending the first audio to the first user; receiving a second audio uploaded by the first user based on the first audio, the second audio corresponding to the second part of the target dubbing lines segment; forming a target dubbing combination from the first user and the second user who uploaded the first audio; and determining a target dubbing score for the target dubbing combination according to the first audio and the second audio. By providing a social interaction mode based on continuing lines, the method can bring users a fresh social interaction experience.

Description

Social interaction method, device, system, equipment and storage medium
Technical Field
The present application relates to the field of internet technologies, and in particular, to a social interaction method, apparatus, system, device, and storage medium.
Background
With the rapid development of internet technology, social network platforms of all kinds keep emerging, and users on these platforms can interact socially with other users through them.
At present, the social interaction modes offered by social network platforms are fairly uniform. Most platforms only support modes such as chat and posting, and for the recently popular live-streaming platforms the supported interaction is usually limited to sending bullet comments while watching a stream, giving virtual gifts to the anchor, and the like.
Because the social interaction modes offered by these platforms are largely the same, it is difficult to give users a fresh social interaction experience, which can even hurt user retention on a platform. How to enrich the social interaction modes of a social network platform and bring users a fresh experience has therefore become a pressing problem.
Disclosure of Invention
The embodiments of this application provide a social interaction method, apparatus, system, device, and storage medium, and introduce a social interaction mode based on continuing lines, which can bring users a fresh social interaction experience.
In view of the above, a first aspect of the present application provides a social interaction method, including:
retrieving a first audio from a to-be-paired audio library in response to a line-continuation operation triggered by a first user, where the to-be-paired audio library stores to-be-paired audio uploaded by second users on a target social platform, and the to-be-paired audio corresponds to a first part of a dubbing lines segment;
sending the first audio to the first user;
receiving a second audio uploaded by the first user based on the first audio, the second audio corresponding to a second part of the target dubbing lines segment;
combining the first user and the second user who uploaded the first audio into a target dubbing combination;
and determining a target dubbing score corresponding to the target dubbing combination according to the first audio and the second audio.
A second aspect of the present application provides a social interaction device, the device comprising:
an audio retrieval module, configured to retrieve a first audio from a to-be-paired audio library in response to a line-continuation operation triggered by a first user, where the to-be-paired audio library stores to-be-paired audio uploaded by second users on a target social platform, and the to-be-paired audio corresponds to a first part of a dubbing lines segment;
an audio sending module, configured to send the first audio to the first user;
an audio receiving module, configured to receive a second audio uploaded by the first user based on the first audio, the second audio corresponding to a second part of the target dubbing lines segment;
a user pairing module, configured to combine the first user and the second user who uploaded the first audio into a target dubbing combination;
and an audio scoring module, configured to determine a target dubbing score corresponding to the target dubbing combination according to the first audio and the second audio.
A third aspect of the present application provides a social interaction system, which includes a terminal device and a server;
the terminal device is configured to receive a first audio sent by the server when it detects that a user triggers a line-continuation operation, and to upload to the server a second audio input by the user based on the first audio;
the terminal device is further configured to receive a dubbing lines segment sent by the server when it detects that the user triggers a line-dubbing operation, and to upload to the server the audio input by the user for the first part of that dubbing lines segment; or, when it detects that the user triggers a custom line-dubbing operation, to upload to the server the audio input by the user for the first part of a custom lines segment;
the server is configured to execute the social interaction method according to the first aspect.
A fourth aspect of the present application provides an apparatus comprising a processor and a memory:
the memory is used for storing a computer program;
the processor is configured to perform the steps of the social interaction method according to the first aspect as described above, according to the computer program.
A fifth aspect of the present application provides a computer-readable storage medium for storing a computer program for performing the steps of the social interaction method of the first aspect.
A sixth aspect of the application provides a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the steps of the social interaction method according to the first aspect.
According to the technical scheme, the embodiment of the application has the following advantages:
The embodiments of this application provide a social interaction method that introduces a new form of social interaction: users on a social platform can interact by continuing each other's lines. Specifically, in response to a line-continuation operation triggered by a first user on the target social platform, the server retrieves a first audio from a to-be-paired audio library, which stores to-be-paired audio uploaded by second users on the target social platform, each corresponding to the first part of a dubbing lines segment. The server then sends the retrieved first audio to the first user and receives a second audio uploaded by the first user based on the first audio; when the first audio corresponds to the first part of a target dubbing lines segment, the second audio corresponds to its second part. The server then combines the first user and the second user who uploaded the first audio into a target dubbing combination and determines a target dubbing score for that combination according to the first audio and the second audio. In this way, the first user and the second user on the target social platform interact socially by continuing lines, which enriches the interaction modes the platform supports, makes interaction more engaging, gives users a fresh social interaction experience, and helps retain users on the target social platform.
Drawings
Fig. 1 is a schematic structural diagram of a social interaction system according to an embodiment of the present disclosure;
fig. 2 is a schematic flowchart of a social interaction method according to an embodiment of the present disclosure;
fig. 3 is a schematic flowchart of sensitive word detection processing provided in an embodiment of the present application;
fig. 4 is a schematic flowchart of extracting audio features according to an embodiment of the present application;
fig. 5 is a schematic flowchart of another social interaction method provided in the embodiment of the present application;
fig. 6 is a schematic interface diagram of a line-continuation operation provided in the embodiment of the present application;
fig. 7 is a schematic interface diagram of a line-dubbing operation provided in the embodiment of the present application;
fig. 8 is a schematic interface diagram of a dubbing voting plaza according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of a first social interaction device according to an embodiment of the present disclosure;
fig. 10 is a schematic structural diagram of a second social interaction device according to an embodiment of the present disclosure;
fig. 11 is a schematic structural diagram of a third social interaction device according to an embodiment of the present disclosure;
fig. 12 is a schematic structural diagram of a fourth social interaction device according to an embodiment of the present disclosure;
fig. 13 is a schematic structural diagram of a fifth social interaction device according to an embodiment of the present disclosure;
fig. 14 is a schematic structural diagram of a sixth social interaction device according to an embodiment of the present disclosure;
fig. 15 is a schematic structural diagram of a server according to an embodiment of the present application.
Detailed Description
In order to make the technical solutions of the present application better understood, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The terms "first," "second," "third," "fourth," and the like in the description and in the claims of the present application and in the drawings described above, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
To address the problems in the related art that the social interaction modes offered by social network platforms are limited, that it is hard to give users a fresh social interaction experience, and that user retention on such platforms is hard to guarantee, the embodiments of this application provide a social interaction method.
Specifically, in the social interaction method provided by the embodiments of this application, the server may retrieve a first audio from a to-be-paired audio library in response to a line-continuation operation triggered by a first user on the target social platform; the to-be-paired audio library stores to-be-paired audio uploaded by second users on the target social platform, and each to-be-paired audio corresponds to the first part of a dubbing lines segment. The server may then send the retrieved first audio to the first user and receive a second audio uploaded by the first user based on the first audio; when the first audio corresponds to the first part of a target dubbing lines segment, this second audio should correspond to the second part of that segment. The server may further combine the first user and the second user who uploaded the first audio into a target dubbing combination, and determine a target dubbing score corresponding to the target dubbing combination according to the first audio and the second audio.
In this method, the first user and the second user on the target social platform interact socially by continuing lines. Compared with the interaction modes offered by social network platforms in the related art (such as chatting, posting, sending bullet comments, or giving virtual gifts), interacting by continuing lines is more engaging, can give users on the target social platform a fresh social interaction experience, and in turn helps retain the platform's users.
It should be understood that the social interaction method provided by the embodiment of the present application may be applied to a backend server of a social application program, where the server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server for providing a social interaction service.
In order to facilitate understanding of the social interaction method provided in the embodiment of the present application, the social interaction system provided in the embodiment of the present application is introduced below in combination with an application scenario in which the social interaction method is applicable.
Referring to fig. 1, fig. 1 is a schematic structural diagram of the social interaction system provided in the embodiment of this application. As shown in fig. 1, the social interaction system includes a terminal device 110 and a server 120, which can communicate with each other over a network. A target application that supports access to the target social platform runs on the terminal device 110 and allows the user to trigger line-continuation and line-dubbing operations; the server 120 responds to the line-continuation and line-dubbing operations triggered by the user through the target application and accordingly executes the social interaction method provided in the embodiments of this application.
The working principle of the social interaction system shown in fig. 1 when the user triggers line-continuation and line-dubbing operations is described below.
When the terminal device 110 detects that the user triggers a line-continuation operation through the target application, it may send a line-continuation request to the server 120; for example, when the terminal device 110 detects that the user touches a line-continuation control in the target application interface, it may consider that the user has triggered a line-continuation operation and generate a line-continuation request to send to the server 120.
After receiving the line-continuation request, the server 120 may retrieve a first audio from the to-be-paired audio library; specifically, the server may randomly pick a to-be-paired audio from the library as the first audio, or retrieve a specific to-be-paired audio as the first audio. Note that the to-be-paired audio library stores to-be-paired audio uploaded by second users on the target social platform by triggering line-dubbing operations; a to-be-paired audio usually corresponds to the first part of a dubbing lines segment, and different to-be-paired audios may correspond to the same or different dubbing lines segments.
After the server 120 retrieves the first audio from the to-be-paired audio library, it feeds the first audio back to the terminal device 110 over the network. The terminal device 110 may then play the received first audio in response to a playback operation triggered by the user, receive a second audio input by the user based on the first audio in response to a recording operation triggered by the user, and send the second audio to the server 120. Note that if the first audio returned by the server 120 corresponds to the first part of a target dubbing lines segment, the second audio input by the user based on it should correspond to the second part of that segment; to make this easier, the terminal device 110 may display the content of the second part of the target dubbing lines segment while the user records, so that the user inputs audio matching that second part.
After receiving the second audio uploaded by the terminal device 110, the server 120 may confirm that the first audio has been continued, combine the second user who uploaded the first audio and the first user who uploaded the second audio into a target dubbing combination, and determine a target dubbing score for that combination according to how well the first audio and the second audio match.
When the terminal device 110 detects that a second user triggers a line-dubbing operation through the target application, it may send a line-dubbing request to the server 120; for example, when the terminal device 110 detects that the second user touches a line-dubbing control in the target application interface, it may consider that the second user has triggered a line-dubbing operation and generate a line-dubbing request to send to the server 120.
After receiving the line-dubbing request, the server 120 may retrieve a dubbing lines segment from a lines segment library that stores dubbing lines segments; specifically, the server may pick a dubbing lines segment at random or retrieve a specific one. The server 120 then sends the retrieved dubbing lines segment to the terminal device 110 over the network; the terminal device 110 displays it after receipt and receives the audio input by the second user for the first part of the dubbing lines segment. After detecting that the second user has finished recording, the terminal device 110 sends the received audio to the server 120 over the network, and the server 120 stores it in the to-be-paired audio library as a to-be-paired audio.
In addition, a second user can also define a custom lines segment and record audio for its first part. Specifically, when the terminal device 110 detects that the second user triggers a custom line-dubbing operation through the target application, it may directly receive the audio input by the second user for the first part of the custom lines segment and send it to the server 120 over the network, and the server 120 stores it in the to-be-paired audio library as a to-be-paired audio.
It should be understood that the structure of the social interaction system shown in fig. 1 is merely an example; in practice, the social interaction system provided in the embodiments of this application is not limited to the structure shown in fig. 1. For example, the terminal device 110 is not limited to the smartphone shown in fig. 1 and may also be a computer, a tablet computer, a Personal Digital Assistant (PDA), or the like. The structure of the social interaction system is not limited in this application. Moreover, the social interaction system provided in the embodiments of this application can be applied to various scenarios that support social networking, such as live-streaming and gaming scenarios.
The social interaction method provided by the present application is described in detail below by way of a method embodiment.
Referring to fig. 2, fig. 2 is a schematic flowchart of a social interaction method provided in the embodiment of the present application. For convenience of description, the following embodiments are described with a server as an execution subject. As shown in fig. 2, the social interaction method includes the following steps:
step 201: responding to a channel connection word operation triggered by a first user, and calling a first audio from an audio library to be paired; the audio library to be paired is used for storing audio to be paired uploaded by a second user on a target social contact platform, and the audio to be paired corresponds to a first part of the dubbing speech segment.
In practice, a first user can trigger a line-continuation operation through a target application on a terminal device that supports access to the target social platform. When the terminal device detects that the first user has triggered a line-continuation operation, it generates a line-continuation request and sends it to the server over the network. After receiving the line-continuation request, the server retrieves a to-be-paired audio from the to-be-paired audio library as the first audio, which is the audio on which the first user will base the line-continuation operation.
For example, the target application interface may carry a control for triggering the line-continuation operation; once the terminal device detects that the user has touched this control, it may consider that the user has triggered a line-continuation operation, generate a line-continuation request, and send it to the server. For instance, the target application may provide a "continue the lines" interactive-play entry control, and if the user is detected clicking it, the terminal device may regard the line-continuation operation as triggered. Moreover, for a target application dedicated to social interaction by continuing lines, the terminal device may treat the mere act of opening the application as triggering the line-continuation operation. Of course, other ways of triggering the line-continuation operation are possible in practice; this application places no limitation on how the operation is triggered.
After receiving the line-continuation request that the terminal device generated in response to the user-triggered line-continuation operation, the server may retrieve the first audio from the to-be-paired audio library accordingly. In one possible implementation, the server may randomly pick a to-be-paired audio from the library as the first audio. In another possible implementation, the server may retrieve a specific to-be-paired audio as the first audio; for example, if the user triggers a line-continuation operation for a specific dubbing lines segment, the terminal device sends a line-continuation request carrying the identifier of that segment, and after receiving it the server can retrieve, as the first audio, a to-be-paired audio corresponding to the first part of that segment. As another example, if the user triggers a line-continuation operation for a specific user on the target social platform, the terminal device sends a line-continuation request carrying the identifier of that user, and the server can retrieve a to-be-paired audio uploaded by that user as the first audio. This application places no limitation on how the server retrieves the first audio from the to-be-paired audio library.
Optionally, in practice, the method provided in the embodiments of this application can be applied to a live-streaming platform to support social interaction between anchors and viewers. Specifically, in response to a line-continuation operation triggered by the first user in a target live room, the server may assign a first retrieval weight to the to-be-paired audio uploaded by the anchor of the target live room and a second retrieval weight to the to-be-paired audio uploaded by other users in the library, where the first retrieval weight is greater than the second; it then retrieves the first audio from the library according to the retrieval weight of each to-be-paired audio.
Illustratively, while watching a target live room through a live-streaming application, the first user can click the "continue the lines" interactive-play entry control provided in that room; after detecting this, the terminal device generates a line-continuation request carrying the identifier of the target live room and sends it to the server. After receiving the request, the server determines the anchor of the target live room from that identifier, assigns a larger first retrieval weight to the to-be-paired audio the anchor has uploaded, and assigns a smaller second retrieval weight to the to-be-paired audio uploaded by all other users. It then retrieves a to-be-paired audio from the library as the first audio according to each audio's retrieval weight; the larger an audio's retrieval weight, the more likely it is to be retrieved as the first audio.
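The weighted retrieval described above can be realized as a simple weighted random draw. The sketch below is a minimal illustration assuming an in-memory candidate list; the weight values, field names, and function name are illustrative and not taken from the patent.

```python
import random

# Illustrative weights: the anchor's uploads are favored over other users' uploads.
FIRST_WEIGHT = 5.0   # retrieval weight for audio uploaded by the anchor of the live room
SECOND_WEIGHT = 1.0  # retrieval weight for audio uploaded by any other user

def pick_first_audio(to_be_paired_library, anchor_user_id):
    """Randomly pick one to-be-paired audio, favoring the anchor's uploads.

    `to_be_paired_library` is assumed to be a list of dicts such as
    {"audio_id": ..., "uploader_id": ..., "segment_id": ...}.
    """
    weights = [
        FIRST_WEIGHT if item["uploader_id"] == anchor_user_id else SECOND_WEIGHT
        for item in to_be_paired_library
    ]
    # The larger an item's weight, the more likely it is drawn as the first audio.
    return random.choices(to_be_paired_library, weights=weights, k=1)[0]
```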
Of course, in practice, after detecting that the first user triggered the line-continuation operation in the target live room, the server may also directly retrieve, from the to-be-paired audio library, a to-be-paired audio uploaded by the anchor of that room as the first audio and send it to the first user.
In this way, the line-continuation interaction mode is tightly combined with the live-streaming platform, enriching the social interaction modes the platform supports, helping anchors interact better with viewers, and potentially even increasing the rate at which viewers follow the anchor, send gifts, and so on.
In addition, to further strengthen the connection between anchors and viewers on a live-streaming platform, the method provided in the embodiments of this application can also apply a specific recommendation rule for to-be-paired audio, so that viewers only continue the to-be-paired audio uploaded by anchors, and anchors only continue the to-be-paired audio uploaded by viewers. That is, when the first user is a viewer on the target social platform, the server retrieves, as the first audio, a to-be-paired audio uploaded by an anchor on the platform; when the first user is an anchor on the target social platform, the server retrieves, as the first audio, a to-be-paired audio uploaded by a viewer on the platform.
Specifically, after detecting that the first user triggered a line-continuation operation through the live-streaming application, the terminal device may generate a line-continuation request carrying the first user's identity information, which indicates the first user's identity on the live-streaming platform, and send it to the server. After receiving the request, the server can determine from this identity information whether the first user is a viewer or an anchor on the platform: if the first user is a viewer, the server picks the first audio from the to-be-paired audio uploaded by anchors on the platform; if the first user is an anchor, the server picks the first audio from the to-be-paired audio uploaded by viewers on the platform.
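Under such a recommendation rule, the retrieval step simply filters the candidate pool by the requester's role before drawing. A minimal sketch follows, assuming each library entry records whether its uploader is an anchor; all names are illustrative.

```python
import random

def pick_first_audio_by_role(to_be_paired_library, requester_is_anchor):
    """Viewers receive audio uploaded by anchors; anchors receive audio uploaded by viewers."""
    candidates = [
        item for item in to_be_paired_library
        if item["uploader_is_anchor"] != requester_is_anchor
    ]
    return random.choice(candidates) if candidates else None
```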
It should be noted that the to-be-paired audio library above is essentially a database for storing to-be-paired audio, where a to-be-paired audio is audio uploaded by a user on the target social platform by triggering a line-dubbing operation and usually corresponds to the first part of some dubbing lines segment. For example, suppose a dubbing lines segment reads: "Once there was a sincere love placed before me, but I did not cherish it; only when I lost it did I regret it, and nothing in the world is more painful than that. If heaven gave me another chance, I would say three words to that girl: I love you. If a time limit had to be put on this love, I would want it to be ten thousand years." The first part of this dubbing lines segment is "Once there was a sincere love placed before me, but I did not cherish it; only when I lost it did I regret it, and nothing in the world is more painful than that." A user on the target social platform can then upload audio corresponding to this first part by triggering a line-dubbing operation.
The embodiment of the present application provides two implementation manners for obtaining the audio to be paired, and the two implementation manners are respectively introduced below.
In a first implementation, the server provides the user with a lines segment it stores, so that the user uploads to-be-paired audio based on the first part of that segment. That is, in response to a line-dubbing operation triggered by the second user, the server may retrieve a dubbing lines segment from the lines segment library, send it to the second user, receive the audio uploaded by the second user for the first part of that segment as a to-be-paired audio, and store it in the to-be-paired audio library.
Specifically, the second user may trigger a line-dubbing operation through the target application on the terminal device; for example, if the target application interface carries a "dub the lines" interactive-play entry control, the second user can trigger the line-dubbing operation by clicking it.
After detecting that the user has triggered the line-dubbing operation, the terminal device generates a line-dubbing request and sends it to the server. After receiving the request, the server retrieves a dubbing lines segment from the lines segment library; for example, it may pick one at random, or retrieve a specific one in response to the user's operation. For instance, if the user triggers a line-dubbing operation for a certain movie, the terminal device generates a line-dubbing request carrying the identifier of that movie, and the server may retrieve a classic scene from that movie in the lines segment library as the dubbing lines segment.
After retrieving the dubbing lines segment, the server can return it to the terminal device. After receiving it, the terminal device displays the dubbing lines segment; to make clear which content the second user should record, the terminal device may highlight the first part of the segment, for example by showing it in bold or enlarged text. The terminal device may then, in response to a recording operation triggered by the second user through the dubbing control on the target application interface, receive the audio the second user inputs for the first part of the segment and send it to the server over the network. After receiving the audio, the server may store it in the to-be-paired audio library as a to-be-paired audio.
It should be understood that, in practice, if the second user is not satisfied with the dubbing lines segment returned by the server, the second user can trigger a segment-replacement operation through the lines-replacement control on the target application interface; the terminal device responds by generating a lines-replacement request and sending it to the server, which then retrieves another dubbing lines segment from the lines segment library and feeds it back to the terminal device.
In a second implementation, the server may directly receive to-be-paired audio that the user uploads based on the first part of a custom lines segment. That is, in response to a custom line-dubbing operation triggered by the second user, the server may receive the audio uploaded by the second user for the first part of the custom lines segment as a to-be-paired audio and store it in the to-be-paired audio library.
Specifically, the target application may also allow the second user to upload audio for a custom lines segment. The second user can trigger a custom line-dubbing operation through the target application on the terminal device; for example, the interface may carry a custom line-dubbing control that the second user clicks to trigger the operation. After detecting this, the terminal device can respond to a recording operation triggered by the second user through the dubbing control and receive the audio the second user inputs for the first part of the custom lines segment. The terminal device then sends this audio to the server, which stores it in the to-be-paired audio library as a to-be-paired audio.
It should be understood that, when the second user triggers the custom line-dubbing operation, the target application may further allow the second user to input the custom lines segment itself, divide it into a first part and a second part, and upload it to the server. If the server later retrieves, as the first audio, a to-be-paired audio corresponding to the first part of this custom lines segment, it can also retrieve the custom lines segment and feed it back to the first user together with the first audio.
It should be understood that, in practical applications, the server may set a limit on the recording time for the audio to be paired, for example, the time for the audio to be paired recorded by the second user is limited to 6s to 30s, and the like, and the application does not set any limit on the recording time herein.
It should be understood that the two implementations above for uploading to-be-paired audio through a line-dubbing operation are merely examples; in practice, other forms of line-dubbing operation are possible, and this application places no limitation on the specific form of the line-dubbing operation.
Step 202: transmitting the first audio to the first user.
After retrieving the first audio from the to-be-paired audio library, the server can return it over the network to the terminal device that sent the line-continuation request. After receiving the first audio, the terminal device may play it automatically or play it in response to a playback operation triggered by the first user.
It should be understood that, to help the first user continue the lines based on the first audio, the server may, after retrieving the first audio, also retrieve the target dubbing lines segment corresponding to the first audio together with the way that segment is divided, and feed both back to the terminal device. Accordingly, after receiving the target dubbing lines segment, the terminal device can display it and, while playing the first audio, highlight the first-part content corresponding to the first audio.
It should be understood that, in practical applications, if the first user is not satisfied with the first audio fed back by the server, the first user may also trigger an operation of replacing the audio through an audio replacing control on the target application program interface, and the terminal device generates an audio replacing request accordingly in response to the operation and sends the audio replacing request to the server. After receiving the audio replacing request, the server can recall other audio to be paired in the audio library to be paired as the first audio and feed back the first audio to the terminal equipment.
It should be understood that, in practical applications, the first user may also trigger an audio collection operation for the first audio fed back by the server, that is, the first user may click an audio collection control on the target application program interface to trigger the collection operation for the first audio when the terminal device plays the first audio. After detecting that the user triggers the collection operation for the first audio, the terminal device may generate an audio collection request including an identifier of the first audio and send the audio collection request to the server. After receiving the audio collection request, the server may add the first audio to the audio collection of the first user, and the first user may subsequently retrieve the first audio directly from the audio collection of the first user.
Step 203: receiving a second audio uploaded by the first user based on the first audio; the second audio corresponds to a second part of the target dubbing lines segment.
After finishing playing the first audio, the terminal device may respond to a recording operation triggered by the first user, receive the second audio the first user records based on the first audio, and send it to the server over the network. It should be understood that, where the first audio corresponds to the first part of a target dubbing lines segment uploaded by a second user on the target social platform, the second audio recorded by the first user based on it should correspond to the second part of that segment.
For example, the target application interface may include a line-continuation control for triggering a continuation of the currently received audio; when the terminal device detects that the first user has clicked it, it may consider that the first user has triggered a line-continuation based on the first audio and jump to the second-audio recording interface. That interface also includes a recording control, and the terminal device can, in response to the first user touching it, receive the second audio the first user inputs based on the first audio.
It should be understood that, in practical applications, the server may set a limit on the recording time for the second audio, for example, set the recording time for the second audio between 6s and 30s, and the like, and this application does not set any limit on the recording time.
It should be noted that, in the method provided in the embodiments of this application, the server may treat all audio uploaded by users uniformly as to-be-processed audio; that is, both the audio uploaded through a line-continuation operation and the audio uploaded through a line-dubbing operation are regarded as to-be-processed audio, on which the server then performs sensitive-word detection and audio filtering.
Specifically, the server may perform voice separation processing on the audio to be processed to obtain a target audio; then, carrying out voice-to-word conversion processing on the target audio to obtain a target text; and further detecting whether preset sensitive words exist in the target text, and if so, returning an audio uploading failure message to the uploading user of the audio to be processed.
The implementation above is described below with reference to the flowchart in fig. 3. Depending on the kind of interference present in the to-be-processed audio, the voice separation can take three forms: when the interference is a noise signal, speech enhancement can be used; when the interference is other people's voices, multi-speaker separation can be used; and when the interference is reflections of the target voice, dereverberation can be used. By performing voice separation, the server separates the target audio corresponding to the target voice from the background noise in the to-be-processed audio, so that subsequent processing is based on the target audio and interference from background noise with that processing is reduced.
After obtaining the target audio, the server can convert it into a corresponding target text using existing speech-to-text technology, and then detect whether the target text contains any sensitive words preset by the target social platform. If it does, the server filters out the to-be-processed audio containing the sensitive words and returns an upload-failure prompt to the terminal device that uploaded it, so that the user is informed that the uploaded audio contains sensitive words and the upload has failed; if it does not, the to-be-processed audio is uploaded successfully and subsequent processing can continue.
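The filtering flow of fig. 3 can be outlined as: separate the target voice from the interference, transcribe it to text, then match the text against the platform's sensitive-word list. The sketch below only shows this control flow; `separate_voice` and `speech_to_text` are placeholder stubs standing in for whatever separation and speech-recognition services the platform actually uses, and the word list is an assumption, not part of the patent.

```python
SENSITIVE_WORDS = {"example_banned_word_1", "example_banned_word_2"}  # placeholder list

def separate_voice(raw_audio_bytes: bytes) -> bytes:
    """Placeholder for voice separation (enhancement, multi-speaker separation, or dereverberation)."""
    return raw_audio_bytes

def speech_to_text(target_audio: bytes) -> str:
    """Placeholder for the speech-to-text step; a real system would call an ASR service here."""
    return ""

def check_uploaded_audio(raw_audio_bytes: bytes):
    """Return (ok, message) for an uploaded to-be-processed audio."""
    target_audio = separate_voice(raw_audio_bytes)   # isolate the target voice from background interference
    target_text = speech_to_text(target_audio)       # transcribe the separated voice to text
    hits = [w for w in SENSITIVE_WORDS if w in target_text]
    if hits:
        # The audio is filtered out and the uploader is told the upload failed.
        return False, "audio upload failed: sensitive words detected"
    return True, "audio upload succeeded"
```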
The filtering of to-be-processed audio can be implemented through a keyword retrieval Application Program Interface (API), which provides a service for uploading long audio and asynchronously recognizing keywords (i.e., the sensitive words in this application). The keyword retrieval API uses HTTPS transport, retrieval is performed through POST requests, and the overall function is split into two interfaces: a voice upload interface and a callback interface.
The request parameters passed to the voice upload interface may include app_id (application identifier), format (audio format, such as PCM or WAV), callback_url (the user's callback URL), key_words (the keywords to recognize, i.e., the preset sensitive words), speech_url (the download address of the audio to recognize), and so on. The app_id, callback_url, format, and key_words are provided to the voice upload interface in advance by the target social platform's server, while the speech_url is passed to the voice upload interface each time sensitive-word recognition is performed on a to-be-processed audio.
The callback interface returns the result to the target social platform's server according to the callback_url provided to the voice upload interface. The callback request includes the id of the recognition task, the recognized audio clip, the list of recognized keywords (i.e., the sensitive-word categories), the confidence score of the recognition result, the time at which each recognized keyword appears in the audio clip, and so on.
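Based on the request parameters listed above, a call to the voice upload interface of such a keyword-retrieval API could look roughly like the sketch below. This is only an illustration: the endpoint URL is a placeholder, and the exact field encoding, authentication, and signing requirements depend on the actual service.

```python
import requests

UPLOAD_URL = "https://example.com/keyword-retrieval/upload"  # placeholder endpoint

def submit_keyword_retrieval_task(app_id, audio_format, callback_url, key_words, speech_url):
    """Submit a long-audio keyword (sensitive word) retrieval task over an HTTPS POST request."""
    payload = {
        "app_id": app_id,              # application identifier
        "format": audio_format,        # e.g. "PCM" or "WAV"
        "callback_url": callback_url,  # where the asynchronous result will be posted back
        "key_words": key_words,        # the preset sensitive words to look for
        "speech_url": speech_url,      # download address of the audio to be recognized
    }
    resp = requests.post(UPLOAD_URL, json=payload, timeout=10)
    resp.raise_for_status()
    return resp.json()  # typically contains the id of the recognition task
```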
Step 204: combining the first user and the second user who uploaded the first audio into a target dubbing combination.
After the server receives the second audio uploaded by the first user through the terminal device and confirms that it contains no sensitive words, it can confirm that the uploader of the second audio (i.e., the first user) has been successfully paired with the uploader of the first audio (i.e., the second user), and it then forms the first user and the second user into a target dubbing combination.
Step 205: and determining a target dubbing score corresponding to the target dubbing combination according to the first audio and the second audio.
Furthermore, the server may determine a target dubbing score corresponding to the target dubbing combination according to the degree of match between the first audio and the second audio it receives, and feed back the target dubbing score to the first user and the second user accordingly.
In one possible implementation manner, the server may determine the target dubbing score corresponding to the target dubbing combination according to a similarity between the audio feature of the first audio and the audio feature of the second audio. That is, the server may extract the audio feature of the first audio as the first audio feature, and extract the audio feature of the second audio as the second audio feature; then, the similarity between the first audio characteristic and the second audio characteristic is determined, and further, the target dubbing score is determined according to the similarity.
In specific implementation, the server may use cepstrum parameters extracted based on the Mel Frequency Cepstral Coefficients (MFCC) algorithm as the audio features. Mel frequency is derived from human auditory experiments: the frequency fmel of a sound as perceived by the human ear and the actual frequency f of the sound have the nonlinear relationship shown in the following formula:

fmel = 2595 × log10(1 + f / 700)

The MFCC algorithm then uses this nonlinear relationship to calculate the spectral characteristics of the audio; the specific flow of the MFCC algorithm is shown in fig. 4.
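As a concrete reference, the following Python sketch computes the mel frequency above and extracts MFCC features, assuming the third-party librosa library is available; it is only one possible realization of the flow in fig. 4:

    import numpy as np
    import librosa

    def mel_frequency(f_hz: float) -> float:
        # fmel = 2595 * log10(1 + f / 700)
        return 2595.0 * np.log10(1.0 + f_hz / 700.0)

    def extract_mfcc(audio_path: str, n_mfcc: int = 13) -> np.ndarray:
        y, sr = librosa.load(audio_path, sr=None)                  # load the waveform at its native sample rate
        return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)     # shape: (n_mfcc, n_frames)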
The server extracts audio features from the first audio using the MFCC algorithm and regards them as the first audio features; likewise, the server extracts audio features from the second audio using the MFCC algorithm and regards them as the second audio features. Of course, in practical application, the server may also perform audio feature extraction on the first audio and the second audio in other manners to obtain the corresponding first audio features and second audio features; the manner in which the server extracts the audio features from the first audio and the second audio is not limited in this application.
After the server extracts the first audio feature and the second audio feature, the server may calculate the similarity between the first audio feature and the second audio feature, and then map the similarity to a corresponding scoring scale, for example a value on a 100-point scale, to obtain the target dubbing score.
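A minimal sketch of this step, assuming the MFCC matrices from the previous sketch and using cosine similarity as the (otherwise unspecified) similarity measure, mapped to a 100-point scale:

    import numpy as np

    def target_dubbing_score(first_feat: np.ndarray, second_feat: np.ndarray) -> float:
        v1 = first_feat.mean(axis=1)                               # average over frames -> one vector per audio
        v2 = second_feat.mean(axis=1)
        cos = float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-9))
        return round((cos + 1.0) / 2.0 * 100.0, 1)                 # map cosine similarity [-1, 1] to a 0-100 score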
In another possible implementation manner, the server may further obtain the original (acoustic) audio of the target dubbing speech segment, and determine the target dubbing score corresponding to the target dubbing combination according to the similarities among the audio features of the first audio, the second audio, and the acoustic audio. That is, the server may obtain the acoustic audio corresponding to the target dubbing speech segment; then extract the audio features of the first audio as the first audio features, the audio features of the second audio as the second audio features, and the audio features of the acoustic audio as the third audio features; furthermore, the server may determine the similarity between the first audio features and the second audio features as the first similarity, the similarity between the first audio features and the third audio features as the second similarity, and the similarity between the second audio features and the third audio features as the third similarity; finally, the target dubbing score is determined according to the first similarity, the second similarity, and the third similarity.
In specific implementation, the server may use the MFCC algorithm described above to perform audio feature extraction processing on the first audio, the second audio, and the acoustic audio, respectively, so as to obtain corresponding first audio feature, second audio feature, and third audio feature.
After determining the first audio feature, the second audio feature, and the third audio feature, the server may calculate the similarity between the first audio feature and the second audio feature (i.e., the first similarity), the similarity between the first audio feature and the third audio feature (i.e., the second similarity), and the similarity between the second audio feature and the third audio feature (i.e., the third similarity). The target dubbing score is then calculated according to these three similarities; for example, the server may configure a corresponding weight for each of the first similarity, the second similarity, and the third similarity, perform weighted summation on the three similarities according to their respective weights, and map the result of the weighted summation to the corresponding target dubbing score.
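A sketch of this weighted-summation variant, assuming the three similarities have already been normalized to [0, 1] and that the weights shown are placeholder values configured by the server:

    def score_with_original(sim_12: float, sim_13: float, sim_23: float,
                            w_12: float = 0.5, w_13: float = 0.25, w_23: float = 0.25) -> float:
        weighted = w_12 * sim_12 + w_13 * sim_13 + w_23 * sim_23   # weighted summation of the three similarities
        return round(weighted * 100.0, 1)                          # map the result to a 100-point dubbing score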
It should be understood that, in practical applications, in addition to the above two implementation manners, the server may also determine the target dubbing score corresponding to the target dubbing combination according to the first audio and the second audio in other ways; the determination manner of the target dubbing score is not limited in this application.
In practical application, in order to further enhance the social interaction feeling between the first user and the second user, the server may further send a pairing success prompt message to the first user and the second user; the pairing success prompt message is used for prompting the first user that the first user and the second user form the target dubbing combination and prompting the second user that the second user and the first user form the target dubbing combination, and the pairing success prompt message can also include the target dubbing score corresponding to the target dubbing combination.
Optionally, in order to further enhance the interest of this channel word based social interaction mode, in the method provided in the embodiment of the present application, the server may further splice the first audio and the second audio to obtain a target dubbing audio; then, the target dubbing audio is published to a dubbing audio voting platform, so that users on the target social platform can trigger vote casting operations for the target dubbing audio; accordingly, the server may update the vote count corresponding to a dubbing audio on the dubbing audio voting platform in response to a vote casting operation triggered by a user on the target social platform for that dubbing audio.
In specific implementation, after receiving the second audio, the server may splice the first audio corresponding to the first portion and the second audio corresponding to the second portion according to the front-back order of the first portion and the second portion in the target dubbing speech segment, so as to obtain the target dubbing audio. The target dubbing audio is then published on the dubbing audio voting platform for display. The dubbing audio voting platform faces the users on the target social platform: users on the target social platform can audition the dubbing audio published on the dubbing audio voting platform and trigger a vote casting operation for the dubbing audio they like. Accordingly, the server may update the vote count corresponding to a dubbing audio on the dubbing audio voting platform in response to the vote casting operation triggered by a user on the target social platform; for example, if the server detects that a user on the target social platform triggers a vote casting operation for the target dubbing audio, the original vote count of the target dubbing audio is incremented by one.
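The splicing step itself can be done with any audio toolkit; the following sketch uses the third-party pydub library as one possible choice, with placeholder file names:

    from pydub import AudioSegment

    first_audio = AudioSegment.from_file("first_part.wav")         # audio for the first portion of the speech segment
    second_audio = AudioSegment.from_file("second_part.wav")       # audio for the second portion

    target_dubbing_audio = first_audio + second_audio              # concatenate in front-back order
    target_dubbing_audio.export("target_dubbing.wav", format="wav")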
It should be understood that, in practical applications, the server may set voting limits for the users on the target social platform, for example, limiting each user to casting only 3 votes per day; the specific voting limit conditions are not restricted in this application.
It should be noted that, in order to enable users on the target social platform to more conveniently listen to higher-quality dubbing audio on the dubbing audio voting platform, when the server publishes the target dubbing audio on the dubbing audio voting platform, a target ranking corresponding to the target dubbing score may be determined according to the dubbing score corresponding to each dubbing audio on the dubbing audio voting platform and the target dubbing score corresponding to the target dubbing audio, and the target dubbing audio is then published at the position corresponding to the target ranking on the dubbing audio voting platform.
Specifically, the server may sort, in descending order of dubbing score, the dubbing score corresponding to each dubbing audio on the dubbing audio voting platform together with the target dubbing score corresponding to the target dubbing audio to obtain a target sequence, and determine the position of the target dubbing score in the target sequence as the target ranking. Further, the position on the dubbing audio voting platform corresponding to the target ranking is determined, and the target dubbing audio is published at that position. In this way, dubbing audio with a higher dubbing score is published at a more prominent position on the dubbing audio voting platform, which makes it convenient for users on the target social platform to preferentially listen to higher-quality dubbing audio.
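A minimal sketch of determining the target ranking, assuming the platform keeps the dubbing scores of the already published dubbing audio in a list:

    def target_ranking(existing_scores: list[float], target_score: float) -> int:
        """Return the 0-based position at which the target dubbing audio is published."""
        target_sequence = sorted(existing_scores + [target_score], reverse=True)   # large-to-small order
        return target_sequence.index(target_score)

    print(target_ranking([95.0, 88.5, 76.0], 90.0))   # -> 1, i.e. the second slot on the voting platform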
After the preset voting period ends, the server can determine a winning dubbing combination according to the vote count corresponding to each dubbing audio on the dubbing audio voting platform, and assign virtual rewards to the users in the winning dubbing combination. For example, assuming the preset voting period is one month, after that month ends the server may count the votes obtained by each dubbing audio released to the dubbing audio voting platform during the month, and determine the dubbing combination with the largest vote count as the winning dubbing combination, or determine the dubbing combinations whose vote counts rank in the top several places as winning dubbing combinations; the users in the winning dubbing combinations are then assigned virtual rewards, e.g., virtual gifts, virtual currency, virtual medals, etc., on the target social platform.
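A sketch of picking winners after the voting period, assuming a mapping from each dubbing combination to its accumulated vote count (placeholder data shown):

    votes = {"combo_a": 312, "combo_b": 487, "combo_c": 205}       # dubbing combination -> vote count

    winner = max(votes, key=votes.get)                             # combination with the largest vote count
    top_two = sorted(votes, key=votes.get, reverse=True)[:2]       # or the combinations ranked in the top N

    print(winner, top_two)                                         # -> combo_b ['combo_b', 'combo_a']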
Based on the above social interaction method, the first user and the second user on the target social platform can realize social interaction through the channel word receiving mode. Compared with the social interaction modes provided by various social network platforms in the related technology (such as chatting, posting, sending bullet comments, giving virtual gifts, and the like), carrying out social interaction through channel words is more interesting and can bring a fresh social interaction experience to the users on the target social platform, which in turn helps retain the users of the target social platform.
In order to further understand the social interaction method provided in the embodiment of the present application, the following takes the application of the social interaction method to a live broadcast platform as an example and, with reference to exemplary interface diagrams, describes the implementation processes when a user triggers a channel word receiving operation and when a user triggers a channel word matching operation.
Fig. 5 is a flowchart illustrating a social interaction method according to an embodiment of the present disclosure. The left half of the flow in the flowchart corresponds to a channel word receiving operation triggered by a user: after the user triggers the channel word receiving operation, the terminal device may receive and play an audio corresponding to the first part of a certain dubbing speech segment provided by the server, then record the audio corresponding to the second part of the dubbing speech segment input by the user based on that audio, and upload the recorded audio to the server. The right half of the flow in the flowchart corresponds to a channel word matching operation triggered by a user: after the user triggers the channel word matching operation, the user can either select a speech segment provided by the server for the matching operation or choose to improvise freely; the terminal device records the audio corresponding to the first part of the speech segment and uploads it to the server to wait to be picked up by other users on the live broadcast platform. After an audio is picked up, the spliced audio is uploaded to the dubbing voting square, and the best dubbing combination is selected by counting the approval votes received by each dubbing audio in the dubbing voting square.
In practical application, when a user watches live content in a target live broadcast room through the live broadcast platform, the user can enter the play home page through the play entry "dubbing station words" provided by the target live broadcast room; the play home page may be a page supporting the user in triggering a station word receiving operation, as shown in (a) in fig. 6. When detecting that the user has entered the play home page by triggering the play entry "dubbing station words", the terminal device correspondingly sends a station word receiving request to the server, and receives a first audio randomly retrieved by the server from the audio library to be paired together with the target dubbing speech-line segment corresponding to the first audio. Furthermore, the terminal device can play the received first audio, display the target dubbing speech-line segment at the same time, and highlight the first part of the content corresponding to the first audio in the target dubbing speech-line segment.
After the terminal device plays the first audio, if the user does not want to perform the channel word receiving operation based on this first audio, the user can trigger the server to provide other audio to the terminal device by designating a channel word card. If the user wants to collect (save) the first audio, collection of the first audio can be triggered by touching a collection control on the page.
After the terminal device finishes playing the first audio, if the user wants to perform the channel word receiving operation based on the first audio, the user may touch the channel word receiving control in the page shown in (a) in fig. 6. The terminal device then switches to the page shown in (b) in fig. 6, in which the second part of the content in the target dubbing speech-line segment is highlighted and which includes an audio recording control. The user can input, by long-pressing the audio recording control, the second audio recorded based on the first audio and corresponding to the second part of the content in the target dubbing speech-line segment, and release the audio recording control after confirming that the input of the second audio is finished; accordingly, the terminal device sends the received second audio to the server.
After receiving the second audio, the server scores a target dubbing combination consisting of uploading users of the two pieces of audio according to the similarity between the audio characteristics of the second audio and the audio characteristics of the first audio to obtain a target dubbing score, and respectively feeds back the target dubbing score to the uploading users of the two pieces of audio. In addition, the server can correspondingly release the target dubbing audio formed by splicing the first audio and the second audio on the dubbing voting plaza according to the target dubbing score.
In practical application, after entering the play home page, if the user does not want to perform a channel word receiving operation but wants to perform a channel word matching operation, the user may trigger entry into a channel word matching page that supports the channel word matching operation by touching the channel word matching control in the page shown in (a) in fig. 6, as shown in (a) in fig. 7. When detecting the user's operation of entering the channel word matching page triggered through the channel word matching control, the terminal device correspondingly sends a channel word matching request to the server and receives a dubbing speech-line segment randomly retrieved by the server from the speech-line segment library. The terminal device may then display the dubbing speech-line segment and highlight the first part of the content in the dubbing speech-line segment. The channel word matching page includes an audio input control; the user can trigger input of the audio corresponding to the first part of the dubbing speech-line segment by long-pressing the audio input control, and release the audio input control after confirming that the audio input is finished. Accordingly, the terminal device transmits the received audio to the server, so that the server stores the audio as audio to be paired in the audio library to be paired.
In addition, the user can also trigger recording of the audio corresponding to the customized dubbing speech segment by touching the control "play freely" in the page shown in (a) in fig. 7, and upload the audio to the server, so that the server stores the audio as the audio to be paired in the audio library to be paired.
If the audio to be paired uploaded by a certain user is picked up by another user, the server may correspondingly feed back a pairing success prompt message to the user, as shown in (b) in fig. 7, to prompt the user that the audio he or she uploaded has been picked up and that the spliced audio has been published to the dubbing voting square, where the user may invite friends to vote for the spliced audio. In practical application, the audio to be paired uploaded by the user through the channel word matching operation can also be directly published to the dubbing voting square so that other users on the live broadcast platform can vote for it.
The dubbing voting square corresponds to the page shown in fig. 8. The page includes dubbing audio arranged in descending order of dubbing score; users on the live broadcast platform can trigger vote casting operations for the dubbing audio on the dubbing voting square, and after the preset voting period ends, the dubbing combination obtaining the largest vote count can receive the honor of best dubbing combination and other virtual rewards provided by the live broadcast platform.
For the social interaction method described above, the present application also provides a corresponding social interaction device, so that the social interaction method is applied and implemented in practice.
Referring to fig. 9, fig. 9 is a schematic structural diagram of a social interaction device 900 corresponding to the social interaction method shown in fig. 2. As shown in fig. 9, the social interaction apparatus 900 includes:
the audio calling module 901 is configured to, in response to a channel word joining operation triggered by a first user, call a first audio from an audio library to be paired; the audio library to be paired is used for storing audio to be paired uploaded by a second user on a target social platform, and the audio to be paired corresponds to a first part of the dubbing speech segment;
an audio sending module 902, configured to send the first audio to the first user;
an audio receiving module 903, configured to receive a second audio uploaded by the first user based on the first audio; the second audio corresponds to a second portion of the target dubbing speech segment;
a user pairing module 904 for combining the first user and the second user uploading the first audio into a target dubbing combination;
an audio scoring module 905, configured to determine, according to the first audio and the second audio, a target dubbing score corresponding to the target dubbing combination.
Optionally, on the basis of the social interaction device shown in fig. 9, referring to fig. 10, fig. 10 is a schematic structural diagram of another social interaction device 1000 provided in the embodiment of the present application. As shown in fig. 10, the apparatus further includes:
a speech extracting module 1001 configured to extract a dubbing speech segment from the speech segment library in response to a speech-matching operation triggered by the second user;
a speech sending module 1002, configured to send the dubbing speech segment to the second user;
the audio receiving module 903 is further configured to receive audio uploaded by the second user for the first part of the dubbing speech segment, where the audio is used as the audio to be paired.
Optionally, on the basis of the social interaction apparatus shown in fig. 9, the audio receiving module 903 is further configured to:
and responding to the self-defined speech-line matching operation triggered by the second user, and receiving the audio corresponding to the first part of the self-defined speech-line segment uploaded by the second user as the audio to be matched.
Optionally, on the basis of the social interaction device shown in fig. 9, referring to fig. 11, fig. 11 is a schematic structural diagram of another social interaction device 1100 provided in the embodiment of the present application. As shown in fig. 11, the apparatus further includes:
an audio splicing module 1101, configured to splice the first audio and the second audio to obtain a target dubbing audio;
an audio publishing module 1102, configured to publish the target dubbing audio to a dubbing audio voting platform, so that a user on the target social platform triggers a vote operation for the target dubbing audio;
a vote counting module 1103, configured to update a vote count corresponding to a dubbing audio on the dubbing audio voting platform in response to a vote casting operation triggered by a user on the target social platform for the dubbing audio on the dubbing audio voting platform.
Optionally, on the basis of the social interaction apparatus shown in fig. 11, the target dubbing audio corresponds to the target dubbing score, and the audio publishing module 1102 is specifically configured to:
determining target sequences corresponding to the target dubbing scores according to the dubbing scores corresponding to the dubbing audios on the dubbing audio voting platform and the target dubbing scores;
and issuing the target dubbing audio at a position corresponding to the target sequence on the dubbing audio voting platform.
Optionally, on the basis of the social interaction device shown in fig. 11, referring to fig. 12, fig. 12 is a schematic structural diagram of another social interaction device 1200 provided in the embodiment of the present application. As shown in fig. 12, the apparatus further includes:
a winning combination determining module 1201, configured to determine a winning dubbing combination according to the respective corresponding vote numbers of the dubbing audios on the dubbing audio voting platform after a preset voting period is ended;
a prize distribution module 1202 for distributing virtual prizes for users in the winning dubbing combination.
Optionally, on the basis of the social interaction device shown in fig. 9, referring to fig. 13, fig. 13 is a schematic structural diagram of another social interaction device 1300 provided in the embodiment of the present application. As shown in fig. 13, the apparatus further includes:
a pairing success prompting module 1301, configured to send a pairing success prompting message to the first user and the second user who uploads the first audio; the pairing success prompt message is used for prompting the first user and the second user uploading the first audio to form the target dubbing combination, and the pairing success prompt message comprises the target dubbing score.
Optionally, on the basis of the social interaction apparatus shown in fig. 9, the audio retrieving module 901 is specifically configured to:
responding to a channel connection word operation triggered by the first user based on a target live broadcast room, configuring a first calling weight for the audio to be paired uploaded by the main broadcast user of the target live broadcast room in the audio library to be paired, and configuring a second calling weight for the audio to be paired uploaded by other users except the main broadcast user in the audio library to be paired; the first calling weight is greater than the second calling weight;
and calling the first audio from the audio library to be paired based on the calling weight corresponding to each audio to be paired in the audio library to be paired.
Optionally, on the basis of the social interaction apparatus shown in fig. 9, the audio retrieving module 901 is specifically configured to:
when the first user is a live watching user on the target social platform, calling a to-be-paired audio uploaded by a main broadcasting user on the target social platform from the to-be-paired audio library to serve as the first audio;
and when the first user is the anchor user on the target social platform, calling the audio to be paired uploaded by the live watching user on the target social platform from the audio library to be paired as the first audio.
Optionally, on the basis of the social interaction apparatus shown in fig. 9, the received audio uploaded by the user is used as the audio to be processed, see fig. 14, and fig. 14 is a schematic structural diagram of another social interaction apparatus 1400 provided in the embodiment of the present application. As shown in fig. 14, the apparatus further includes:
a voice separation module 1401, configured to perform voice separation processing on the audio to be processed to obtain a target audio;
a text conversion module 1402, configured to perform voice-to-text conversion processing on the target audio to obtain a target text;
a sensitive word detection module 1403, configured to detect whether a preset sensitive word exists in the target text, and if so, return an audio upload failure prompt message to the user uploading the audio to be processed.
Optionally, on the basis of the social interaction apparatus shown in fig. 9, the audio scoring module 905 is specifically configured to:
extracting the audio features of the first audio as first audio features, and extracting the audio features of the second audio as second audio features;
determining a similarity between the first audio feature and the second audio feature;
and determining the target dubbing score according to the similarity.
Optionally, on the basis of the social interaction apparatus shown in fig. 9, the audio scoring module 905 is specifically configured to:
acquiring the acoustic audio corresponding to the target dubbing speech segment;
extracting the audio features of the first audio as first audio features, extracting the audio features of the second audio as second audio features, and extracting the audio features of the original audio as third audio features;
determining a similarity between the first audio feature and the second audio feature as a first similarity; determining a similarity between the first audio feature and the third audio feature as a second similarity; determining a similarity between the second audio feature and the third audio feature as a third similarity;
and determining the target dubbing score according to the first similarity, the second similarity and the third similarity.
Based on the social interaction device, the first user and the second user on the target social platform can realize social interaction through the channel word receiving mode. Compared with the social interaction modes provided by various social network platforms in the related technology (such as chatting, posting, sending bullet comments, presenting virtual gifts, and the like), carrying out social interaction through channel words is more interesting and can bring a fresh social interaction experience to the users on the target social platform, which in turn helps retain the users of the target social platform.
The embodiment of the present application further provides a device for social interaction, where the device may specifically be a server; the server provided in the embodiment of the present application will be described below from the perspective of hardware implementation.
Referring to fig. 15, fig. 15 is a schematic structural diagram of a server 1500 according to an embodiment of the present disclosure. The server 1500 may vary widely in configuration or performance and may include one or more Central Processing Units (CPUs) 1522 (e.g., one or more processors), memory 1532, and one or more storage media 1530 (e.g., one or more mass storage devices) storing applications 1542 or data 1544. The memory 1532 and the storage medium 1530 may be transient or persistent storage. The program stored on the storage medium 1530 may include one or more modules (not shown), each of which may include a series of instruction operations for the server. Still further, the central processing unit 1522 may be configured to communicate with the storage medium 1530 and execute, on the server 1500, the series of instruction operations in the storage medium 1530.
The server 1500 may also include one or more power supplies 1526, one or more wired or wireless network interfaces 1550, one or more input-output interfaces 1558, and/or one or more operating systems 1541, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, etc.
The steps performed by the server in the above embodiment may be based on the server structure shown in fig. 15.
The CPU 1522 is configured to execute the following steps:
responding to a channel connection word operation triggered by a first user, and calling a first audio from an audio library to be paired; the audio library to be paired is used for storing audio to be paired uploaded by a second user on a target social platform, and the audio to be paired corresponds to a first part of the dubbing speech segment;
sending the first audio to the first user;
receiving second audio uploaded by the first user based on the first audio; the second audio corresponds to a second portion of the target dubbing speech segment;
combining the first user and the second user uploading the first audio into a target dubbing combination;
and determining a target dubbing score corresponding to the target dubbing combination according to the first audio and the second audio.
Optionally, the CPU 1522 may also be configured to execute steps of any implementation manner of the social interaction method provided in the embodiment of the present application.
The embodiment of the present application further provides a computer-readable storage medium for storing a computer program, where the computer program is used to execute any one implementation manner of the social interaction method described in the foregoing embodiments.
Embodiments of the present application also provide a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to execute any one implementation of the social interaction method described in the foregoing embodiments.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application, in essence, or the part thereof contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product; the computer software product is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods according to the embodiments of the present application. The aforementioned storage medium includes various media capable of storing computer programs, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
It should be understood that in the present application, "at least one" means one or more, "a plurality" means two or more. "and/or" for describing an association relationship of associated objects, indicating that there may be three relationships, e.g., "a and/or B" may indicate: only A, only B and both A and B are present, wherein A and B may be singular or plural. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. "at least one of the following" or similar expressions refer to any combination of these items, including any combination of single item(s) or plural items. For example, at least one (one) of a, b, or c, may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", wherein a, b, c may be single or plural.
The above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (14)

1. A method of social interaction, the method comprising:
responding to a channel connection word operation triggered by a first user, and calling a first audio from an audio library to be paired; the audio library to be paired is used for storing audio to be paired uploaded by a second user on a target social platform, and the audio to be paired corresponds to a first part of the dubbing speech segment;
sending the first audio to the first user;
receiving second audio uploaded by the first user based on the first audio; the second audio corresponds to a second portion of the target dubbing speech segment;
combining the first user and the second user uploading the first audio into a target dubbing combination;
determining a target dubbing score corresponding to the target dubbing combination according to the similarity between the audio features of the first audio and the audio features of the second audio;
wherein, the determining a target dubbing score corresponding to the target dubbing combination according to the similarity between the audio features of the first audio and the audio features of the second audio comprises:
extracting the audio features of the first audio as first audio features, and extracting the audio features of the second audio as second audio features;
determining a similarity between the first audio feature and the second audio feature;
and determining the target dubbing score according to the similarity.
2. The method of claim 1, wherein the audio to be paired is obtained by:
responding to the dubbing channel word operation triggered by the second user, and calling a dubbing speech segment from a speech segment library;
sending the dubbing speech segment to the second user;
receiving audio uploaded by the second user aiming at the first part of the dubbing speech segment, and taking the audio as the audio to be paired;
or,
and responding to the self-defined speech-line matching operation triggered by the second user, and receiving the audio corresponding to the first part of the self-defined speech-line segment uploaded by the second user as the audio to be matched.
3. The method of claim 1, further comprising:
splicing the first audio and the second audio to obtain a target dubbing audio;
publishing the target dubbing audio to a dubbing audio voting platform so that a user on the target social platform triggers a vote operation for the target dubbing audio;
and updating the vote number corresponding to the dubbing audio on the dubbing audio voting platform in response to the vote casting operation triggered by the user on the target social platform aiming at the dubbing audio on the dubbing audio voting platform.
4. The method of claim 3, wherein the target dubbing audio corresponds to the target dubbing score, and wherein publishing the target dubbing audio to a dubbing audio voting platform comprises:
determining target sequences corresponding to the target dubbing scores according to the dubbing scores corresponding to the dubbing audios on the dubbing audio voting platform and the target dubbing scores;
and issuing the target dubbing audio at a position corresponding to the target sequence on the dubbing audio voting platform.
5. The method according to claim 3 or 4, characterized in that the method further comprises:
after the preset voting period is finished, determining a winning dubbing combination according to the respective voting numbers of the dubbing audios on the dubbing audio voting platform;
virtual rewards are allocated for users in the winning dubbing combination.
6. The method of claim 1, further comprising:
sending a pairing success prompt message to the first user and the second user uploading the first audio; the pairing success prompt message is used for prompting the first user and the second user uploading the first audio to form the target dubbing combination, and the pairing success prompt message comprises the target dubbing score.
7. The method of claim 1, wherein invoking the first audio from the audio library to be paired in response to the first user-triggered pick-up operation comprises:
responding to a channel connection word operation triggered by the first user based on a target live broadcast room, configuring a first calling weight for the audio to be paired uploaded by the main broadcast user of the target live broadcast room in the audio library to be paired, and configuring a second calling weight for the audio to be paired uploaded by other users except the main broadcast user in the audio library to be paired; the first calling weight is greater than the second calling weight;
and calling the first audio from the audio library to be paired based on the calling weight corresponding to each audio to be paired in the audio library to be paired.
8. The method of claim 1 or 7, wherein the retrieving the first audio from the audio library to be paired in response to the first user-triggered channel-word-connect operation comprises:
when the first user is a live watching user on the target social platform, calling a to-be-paired audio uploaded by a main broadcasting user on the target social platform from the to-be-paired audio library to serve as the first audio;
and when the first user is the anchor user on the target social platform, calling the audio to be paired uploaded by the live watching user on the target social platform from the audio library to be paired as the first audio.
9. The method of claim 1 or 2, wherein the received audio uploaded by the user is taken as the audio to be processed, the method further comprising:
carrying out voice separation processing on the audio to be processed to obtain a target audio;
performing voice-to-word conversion processing on the target audio to obtain a target text;
and detecting whether preset sensitive words exist in the target text, and if so, returning an audio uploading failure prompt message to the uploading user of the audio to be processed.
10. The method according to claim 1, wherein determining the target dubbing score corresponding to the target dubbing combination according to the similarity between the audio feature of the first audio and the audio feature of the second audio comprises:
acquiring the acoustic audio corresponding to the target dubbing speech segment;
extracting the audio features of the first audio as first audio features, extracting the audio features of the second audio as second audio features, and extracting the audio features of the acoustic audio as third audio features;
determining a similarity between the first audio feature and the second audio feature as a first similarity; determining a similarity between the first audio feature and the third audio feature as a second similarity; determining a similarity between the second audio feature and the third audio feature as a third similarity;
and determining the target dubbing score according to the first similarity, the second similarity and the third similarity.
11. A social interaction device, comprising:
the audio calling module is used for calling a first audio from an audio library to be paired in response to a channel word connecting operation triggered by a first user; the audio library to be paired is used for storing audio to be paired uploaded by a second user on a target social platform, and the audio to be paired corresponds to a first part of the dubbing speech segment;
an audio sending module, configured to send the first audio to the first user;
the audio receiving module is used for receiving second audio uploaded by the first user based on the first audio; the second audio corresponds to a second portion of the target dubbing speech segment;
a user pairing module for combining the first user and the second user uploading the first audio into a target dubbing combination;
an audio scoring module, configured to determine a target dubbing score corresponding to the target dubbing combination according to a similarity between an audio feature of the first audio and an audio feature of the second audio, where the determining a target dubbing score corresponding to the target dubbing combination according to a similarity between an audio feature of the first audio and an audio feature of the second audio includes: extracting the audio features of the first audio as first audio features, and extracting the audio features of the second audio as second audio features; determining a similarity between the first audio feature and the second audio feature; and determining the target dubbing score according to the similarity.
12. The social interaction system is characterized by comprising terminal equipment and a server;
the terminal device is used for receiving a first audio sent by the server when detecting that a user triggers a channel word receiving operation, and uploading a second audio input by the user based on the first audio to the server;
the terminal device is further configured to receive a dubbing speech-line segment sent by the server when detecting that the user triggers a dubbing speech-line operation, and upload audio input by the user for a first part of the dubbing speech-line segment to the server; or when detecting that the user triggers the self-defined speech matching operation, uploading audio input by the user aiming at the first part of the self-defined speech segment to the server;
the server for executing the social interaction method of any one of claims 1 to 10.
13. A social interaction device, comprising a processor and a memory;
the memory is used for storing a computer program;
the processor is configured to execute the social interaction method of any one of claims 1 to 10 according to the computer program.
14. A computer-readable storage medium for storing a computer program for executing the social interaction method of any one of claims 1 to 10.
CN202011231193.3A 2020-11-06 2020-11-06 Social interaction method, device, system, equipment and storage medium Active CN112261435B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011231193.3A CN112261435B (en) 2020-11-06 2020-11-06 Social interaction method, device, system, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011231193.3A CN112261435B (en) 2020-11-06 2020-11-06 Social interaction method, device, system, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112261435A CN112261435A (en) 2021-01-22
CN112261435B true CN112261435B (en) 2022-04-08

Family

ID=74266437

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011231193.3A Active CN112261435B (en) 2020-11-06 2020-11-06 Social interaction method, device, system, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112261435B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112954390B (en) * 2021-01-26 2023-05-09 北京有竹居网络技术有限公司 Video processing method, device, storage medium and equipment
CN115037975B (en) * 2021-02-24 2024-03-01 花瓣云科技有限公司 Method for dubbing video, related equipment and computer readable storage medium
CN113535116A (en) * 2021-08-05 2021-10-22 广州酷狗计算机科技有限公司 Audio file playing method and device, terminal and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110366032A (en) * 2019-08-09 2019-10-22 腾讯科技(深圳)有限公司 Video data handling procedure, device and video broadcasting method, device
CN110392273A (en) * 2019-07-16 2019-10-29 北京达佳互联信息技术有限公司 Method, apparatus, electronic equipment and the storage medium of audio-video processing
CN110650366A (en) * 2019-10-29 2020-01-03 成都超有爱科技有限公司 Interactive dubbing method and device, electronic equipment and readable storage medium
CN111462553A (en) * 2020-04-17 2020-07-28 杭州菲助科技有限公司 Language learning method and system based on video dubbing and sound correction training

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110219307A1 (en) * 2010-03-02 2011-09-08 Nokia Corporation Method and apparatus for providing media mixing based on user interactions
GB201415688D0 (en) * 2014-09-04 2014-10-22 Univ Edinburgh Method and Systems
CN105709416A (en) * 2016-03-14 2016-06-29 上海科睿展览展示工程科技有限公司 Personalized dubbing method and system for multi-user operating game
CN110149548B (en) * 2018-09-26 2022-06-21 腾讯科技(深圳)有限公司 Video dubbing method, electronic device and readable storage medium
CN109493888B (en) * 2018-10-26 2020-07-10 腾讯科技(武汉)有限公司 Cartoon dubbing method and device, computer-readable storage medium and electronic equipment
US11520079B2 (en) * 2019-02-01 2022-12-06 Vidubly Ltd Personalizing weather forecast

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110392273A (en) * 2019-07-16 2019-10-29 北京达佳互联信息技术有限公司 Method, apparatus, electronic equipment and the storage medium of audio-video processing
CN110366032A (en) * 2019-08-09 2019-10-22 腾讯科技(深圳)有限公司 Video data handling procedure, device and video broadcasting method, device
CN110650366A (en) * 2019-10-29 2020-01-03 成都超有爱科技有限公司 Interactive dubbing method and device, electronic equipment and readable storage medium
CN111462553A (en) * 2020-04-17 2020-07-28 杭州菲助科技有限公司 Language learning method and system based on video dubbing and sound correction training

Also Published As

Publication number Publication date
CN112261435A (en) 2021-01-22

Similar Documents

Publication Publication Date Title
CN112261435B (en) Social interaction method, device, system, equipment and storage medium
EP3721605B1 (en) Streaming radio with personalized content integration
EP3611895B1 (en) Method and device for user registration, and electronic device
CN105872830B (en) Interactive approach and device in direct broadcast band
CN104333782B (en) A kind of main broadcaster formulates the order method and system, relevant device of task
JP2019523918A (en) Implementation of voice assistant on device
US10298640B1 (en) Overlaying personalized content on streaming audio
US20210178262A1 (en) Intervention server and intervention program
CN110175245A (en) Multimedia recommendation method, device, equipment and storage medium
CN106464957A (en) Real-time digital assistant knowledge updates
JP2017509938A (en) INTERACTION METHOD BASED ON MULTIMEDIA PROGRAM AND TERMINAL DEVICE
US10864447B1 (en) Highlight presentation interface in a game spectating system
CN112653902B (en) Speaker recognition method and device and electronic equipment
CN109474562B (en) Display method and device of identifier, and response method and device of request
US10363488B1 (en) Determining highlights in a game spectating system
CN113127746B (en) Information pushing method based on user chat content analysis and related equipment thereof
CN109509472A (en) Method, apparatus and system based on voice platform identification background music
CN110602553B (en) Audio processing method, device, equipment and storage medium in media file playing
CN113824983B (en) Data matching method, device, equipment and computer readable storage medium
CN112422999B (en) Live content processing method and computer equipment
CN112104889B (en) Virtual article display method and device for network platform
CN114449301B (en) Item sending method, item sending device, electronic equipment and computer-readable storage medium
KR102444905B1 (en) Real-time public opinion feedback system and method of streaming broadcasting based on big data
CN114501103B (en) Live video-based interaction method, device, equipment and storage medium
US10965391B1 (en) Content streaming with bi-directional communication

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40037486

Country of ref document: HK

GR01 Patent grant