CN114221940B - Audio data processing method, system, device, equipment and storage medium - Google Patents


Info

Publication number
CN114221940B
Authority
CN
China
Prior art keywords
audio data
virtual object
mobile phone
cloud mobile
target audio
Prior art date
Legal status
Active
Application number
CN202111521010.6A
Other languages
Chinese (zh)
Other versions
CN114221940A (en)
Inventor
郭启行
贾磊
张洪彬
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202111521010.6A
Publication of CN114221940A
Application granted
Publication of CN114221940B
Legal status: Active (current)
Anticipated expiration


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L 13/04 Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L 13/047 Architecture of speech synthesisers
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/26 Speech to text systems
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 65/00 Network arrangements, protocols or services for supporting real-time applications in data packet communication
    • H04L 65/1066 Session management
    • H04L 65/1069 Session establishment or de-establishment
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M 3/00 Automatic or semi-automatic exchanges
    • H04M 3/42 Systems providing special services or facilities to subscribers
    • H04M 3/50 Centralised arrangements for answering calls; Centralised arrangements for recording messages for absent or busy subscribers
    • H04M 3/527 Centralised call answering arrangements not requiring operator intervention

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • General Business, Economics & Management (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Business, Economics & Management (AREA)
  • Information Transfer Between Computers (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The disclosure provides an audio data processing method, system, apparatus, device, and storage medium, relating to the field of artificial intelligence and, in particular, to technologies such as voice interaction and virtual digital persons. The specific implementation scheme is as follows: a cloud mobile phone client collects target audio data and sends the target audio data to a cloud mobile phone server; the cloud mobile phone server determines virtual object audio data and virtual object video data according to the target audio data and sends them to the cloud mobile phone client; the cloud mobile phone client then plays the virtual object audio data and the virtual object video data.

Description

Audio data processing method, system, device, equipment and storage medium
Technical Field
The present disclosure relates to the field of artificial intelligence, and in particular to technologies such as voice interaction and virtual digital persons.
Background
Traditional customer service systems are based on the PSTN (Public Switched Telephone Network). A user has to dial the call center from a mobile phone, a landline, or the like to reach the back-end human or intelligent customer service, and interaction is limited to the voice dimension. In recent years, virtual digital person technology has developed rapidly thanks to breakthroughs in deep learning algorithms, and many companies have upgraded traditional audio-only customer service to "digital staff" customer service based on virtual digital persons. "Digital staff" customer service based on a virtual digital person can provide users with interaction in both the audio and video dimensions.
Disclosure of Invention
The present disclosure provides an audio data processing method, system, apparatus, device, storage medium, and computer program product.
According to an aspect of the present disclosure, there is provided an audio data processing method, including: collecting target audio data and sending the target audio data to a cloud mobile phone server; and receiving virtual object audio data and virtual object video data from the cloud mobile phone server, and playing the virtual object audio data and the virtual object video data.
According to another aspect of the present disclosure, there is provided an audio data processing method, including: receiving target audio data from a cloud mobile phone client; determining virtual object audio data and virtual object video data according to the target audio data; and sending the virtual object audio data and the virtual object video data to the cloud mobile phone client.
According to another aspect of the present disclosure, there is provided an audio data processing apparatus, including: a receiving module configured to receive target audio data from a cloud mobile phone client; a processing module configured to determine virtual object audio data and virtual object video data according to the target audio data; and a sending module configured to send the virtual object audio data and the virtual object video data to the cloud mobile phone client.
According to another aspect of the present disclosure, there is provided an audio data processing apparatus, including: an audio acquisition module configured to collect target audio data; a communication module configured to send the target audio data to a cloud mobile phone server and to receive virtual object audio data and virtual object video data from the cloud mobile phone server; an audio playing module configured to play the virtual object audio data; and a display module configured to play the virtual object video data.
Another aspect of the present disclosure provides an electronic device, comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the methods shown in the embodiments of the present disclosure.
According to another aspect of the disclosed embodiments, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the methods shown in the disclosed embodiments.
According to another aspect of the disclosed embodiments, there is provided a computer program product comprising a computer program/instruction, characterized in that the computer program/instruction, when executed by a processor, implements the steps of the method shown in the disclosed embodiments.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a schematic diagram of a system architecture of an audio data processing method, apparatus, electronic device, and storage medium according to an embodiment of the present disclosure;
FIG. 2 schematically illustrates a flow chart of an audio data processing method according to an embodiment of the present disclosure;
FIG. 3 schematically illustrates a flow chart of an audio data processing method according to another embodiment of the present disclosure;
FIG. 4 schematically illustrates a flow chart of a method of determining virtual human audio data and virtual human video data according to another embodiment of the present disclosure;
FIG. 5 schematically illustrates a block diagram of an audio data processing device according to an embodiment of the present disclosure;
FIG. 6 schematically illustrates a block diagram of an audio data processing device according to another embodiment of the present disclosure; and
FIG. 7 schematically illustrates a block diagram of an example electronic device that may be used to implement embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The system architecture of the audio data processing method, apparatus, electronic device, and storage medium provided by the present disclosure will be described below in conjunction with fig. 1.
Fig. 1 is a system architecture diagram of an audio data processing method, apparatus, electronic device, and storage medium according to an embodiment of the present disclosure. It should be noted that fig. 1 is only an example of a system architecture to which embodiments of the present disclosure may be applied to assist those skilled in the art in understanding the technical content of the present disclosure, but does not mean that embodiments of the present disclosure may not be used in other devices, systems, environments, or scenarios.
As shown in fig. 1, the system architecture 100 includes a cloud handset client 110 and a cloud handset server 120.
Cloud mobile client 110 may include an audio acquisition module, a communication module, an audio playback module, and a display module. The audio acquisition module may be used to acquire audio data, and may include a microphone, for example. The communication module may be used for network communication with other electronic devices, and may include, for example, a network card, a modem, a wireless communication transceiver, and the like. The audio playing module may be used to play audio data, and may include a speaker, for example. The display module may be used to play video data and may include, for example, a display or the like.
According to embodiments of the present disclosure, cloud mobile phone client 110 may include, for example, a smart phone, a tablet, a laptop portable computer, a desktop computer, and the like.
The cloud handset server 120 may be a server that provides cloud handset services for the cloud handset client 110. The server may be a cloud server, also known as a cloud computing server or cloud host, which is a host product in the cloud computing service system and overcomes the drawbacks of high management difficulty and weak service scalability found in traditional physical hosts and VPS (Virtual Private Server) services. The server may also be a server of a distributed system or a server combined with a blockchain.
The cloud handset server 120 may deploy one or more virtual handset instances. Each virtual handset instance corresponds to one cloud handset client; for example, a virtual handset instance corresponding to cloud handset client 110 may be deployed in cloud handset server 120. Various application programs, such as a communication application 121 and a virtual object application 122, may be configured in each virtual handset instance. The communication application 121 may be used to handle data transmission with the cloud mobile phone client, for example, receiving target audio data from the cloud mobile phone client and sending the audio and video data of the virtual object to the cloud mobile phone client. The virtual object application 122 may be used to process audio data and generate a corresponding virtual object. The virtual handset instance may include a microphone module, a speaker module, and a display module corresponding to the microphone, speaker, and display of the cloud handset client 110. The microphone module, the speaker module, and the display module may be invoked through application programming interfaces (APIs) corresponding to the microphone module, the speaker module, and the display module, respectively.
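For illustration only, the per-client deployment described above can be sketched in Python as a registry that keeps one virtual handset instance per cloud mobile phone client, each configured with a communication application and a virtual object application. The class names, the registry, and the application descriptions below are assumptions made for this sketch and are not part of the patent or of any real cloud mobile phone API.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class VirtualPhoneInstance:
    client_id: str
    apps: Dict[str, str] = field(default_factory=lambda: {
        "communication_app": "handles data exchange with the cloud phone client",
        "virtual_object_app": "turns target audio into virtual object audio/video",
    })


class CloudPhoneServerRegistry:
    """Keeps one virtual handset instance per cloud mobile phone client."""

    def __init__(self) -> None:
        self._instances: Dict[str, VirtualPhoneInstance] = {}

    def instance_for(self, client_id: str) -> VirtualPhoneInstance:
        # Create the instance on first use; reuse it afterwards.
        return self._instances.setdefault(client_id, VirtualPhoneInstance(client_id))


if __name__ == "__main__":
    registry = CloudPhoneServerRegistry()
    instance = registry.instance_for("client-110")
    print(instance.client_id, list(instance.apps))
```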
A communication connection 130 may be established between the cloud mobile client 110 and the cloud mobile server 120, and data may be transmitted between them through the communication connection 130 for interaction. The communication connection 130 may include, for example, a real-time audio and video communication (Real-Time Communications, RTC) connection. Through the real-time audio and video communication connection, full-duplex interaction between the cloud mobile phone client 110 and the cloud mobile phone server 120 can be achieved, improving the interaction experience.
According to embodiments of the present disclosure, the cloud handset client 110 may collect target audio data through a microphone and send the target audio data to the cloud handset server 120. The cloud handset server 120 may receive target audio data from the cloud handset client 110. The virtual object audio data and virtual object video data may then be determined from the target audio data and sent to the cloud handset client 110. The cloud mobile phone client 110 may receive the virtual object audio data and the virtual object video data from the cloud mobile phone server 120, play the virtual object audio data through a speaker, and play the virtual object video data through a display.
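To make this data flow concrete, the following minimal Python sketch walks through the same exchange, assuming the real-time communication connection can be abstracted as a pair of in-memory queues. The class names, the queue-based transport, and the placeholder virtual object generation are assumptions for illustration, not the patented implementation.

```python
import queue
from typing import Tuple


class RtcChannel:
    """Stands in for the full-duplex real-time audio/video connection 130."""

    def __init__(self) -> None:
        self.uplink: "queue.Queue[bytes]" = queue.Queue()                  # client -> server
        self.downlink: "queue.Queue[Tuple[bytes, bytes]]" = queue.Queue()  # server -> client


class CloudPhoneClient:
    def collect_target_audio(self) -> bytes:
        # Placeholder for microphone capture on the client.
        return b"\x00" * 320

    def play(self, audio: bytes, video: bytes) -> None:
        # Placeholder for speaker and display playback on the client.
        print(f"playing {len(audio)} audio bytes and {len(video)} video bytes")


class CloudPhoneServer:
    def generate_virtual_object(self, target_audio: bytes) -> Tuple[bytes, bytes]:
        # Placeholder for the virtual object application on the server.
        return b"virtual-object-audio", b"virtual-object-video"

    def handle(self, channel: RtcChannel) -> None:
        target_audio = channel.uplink.get()                        # receive target audio
        audio, video = self.generate_virtual_object(target_audio)  # determine A/V data
        channel.downlink.put((audio, video))                       # send them back


if __name__ == "__main__":
    channel, client, server = RtcChannel(), CloudPhoneClient(), CloudPhoneServer()
    channel.uplink.put(client.collect_target_audio())  # client collects and sends audio
    server.handle(channel)                             # server produces virtual object A/V
    client.play(*channel.downlink.get())               # client receives and plays it
```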
According to embodiments of the present disclosure, the virtual object audio data may be used to generate the sound of a virtual object, and the virtual object video data may be used to generate the appearance and actions of the virtual object. The virtual object may include, for example, a virtual digital person, i.e., a virtual character with a digitized appearance generated using technologies such as computer vision and speech synthesis.
According to embodiments of the present disclosure, the target audio data may include, for example, a user's voice data, the virtual object audio data may include, for example, interactive speech responding to the voice data, and the virtual object video data may include, for example, an interactive avatar or interactive actions responding to the voice data. By playing the virtual object audio data and the virtual object video data, the virtual object can be presented to the user to interact with the user.
The audio data processing system according to embodiments of the present disclosure takes advantage of the low cost of cloud mobile phone technology, which facilitates large-scale adoption.
In the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, and disclosure of users' personal information comply with the relevant laws and regulations and do not violate public order and good customs.
Fig. 2 schematically shows a flowchart of an audio data processing method according to an embodiment of the present disclosure.
As shown in fig. 2, the audio data processing method 200 may include the cloud mobile phone client collecting target audio data and transmitting the target audio data to the cloud mobile phone server in operation S210.
Then, in operation S220, the cloud handset server receives target audio data from the cloud handset client.
In operation S230, the cloud mobile phone server determines virtual object audio data and virtual object video data according to the target audio data.
In operation S240, the cloud mobile phone server sends the virtual object audio data and the virtual object video data to the cloud mobile phone client.
In operation S250, the cloud mobile client receives virtual object audio data and virtual object video data from the cloud mobile server, and plays the virtual object audio data and the virtual object video data.
According to embodiments of the present disclosure, the target audio data may include, for example, voice data. The cloud mobile phone client may collect the original voice data through a microphone and then send the original voice data to the cloud mobile phone server. The cloud mobile phone server can decode the original voice data to obtain the target audio data.
According to embodiments of the present disclosure, virtual object audio data may be used to generate sound for a virtual object, and virtual object video data may be used to generate appearance and actions for the virtual object. Wherein the virtual object may comprise, for example, a virtual digital person.
According to embodiments of the present disclosure, because the cloud mobile phone server performs the determination of the virtual object audio data and the virtual object video data to generate the virtual object, computing resources of the cloud mobile phone client are saved compared with generating the virtual object at the cloud mobile phone client. In addition, one cloud mobile phone server can deploy multiple virtual mobile phone instances and thus generate virtual objects for multiple cloud mobile phone clients, which is less costly than generating virtual objects with a dedicated cloud server.
In addition, the audio data processing method according to embodiments of the present disclosure only requires configuring the cloud mobile phone server, without modifying the cloud mobile phone client, so it is highly scalable and can support various cloud mobile phone clients. Moreover, when a technology update is needed, only the cloud mobile phone server needs to be updated rather than every cloud mobile phone client, which makes updating simple and convenient.
According to another embodiment of the present disclosure, the cloud mobile phone server may be configured with a communication application and a virtual object application. The communication application may be used to process data transmission with the cloud mobile phone client, for example, to receive target audio data from the cloud mobile phone client and to send audio and video data of the virtual object to the cloud mobile phone client. The virtual object application may be used to process the target audio data and generate a virtual object. Based on this, fig. 3 schematically shows a flowchart of an audio data processing method according to another embodiment of the present disclosure.
As shown in fig. 3, the audio data processing method 300 may include operations S310 to S370. The method may be performed, for example, by the cloud handset server shown above.
In operation S310, target audio data from a cloud handset client is received using a communication application.
Then, in operation S320, the target audio data is input to the microphone input interface of the cloud handset server using the communication application.
In operation S330, the virtual object application is utilized to obtain the target audio data through the microphone output interface of the cloud mobile phone server.
In operation S340, virtual object audio data and virtual object video data are determined from the target audio data using the virtual object application.
In operation S350, the virtual object application is utilized to input the virtual object audio data to the speaker input interface of the cloud mobile phone server, and to input the virtual object video data to the display input interface of the cloud mobile phone server.
In operation S360, virtual object audio data corresponding to the target audio data is acquired through the speaker output interface of the cloud mobile phone server by using the communication application, and virtual object video data corresponding to the target audio data is acquired through the display output interface of the cloud mobile phone server.
In operation S370, the virtual object audio data and the virtual object video data are transmitted to the cloud handset client using the communication application.
According to embodiments of the present disclosure, the microphone input interface may be used to feed input audio data into the microphone module, and the microphone output interface may be used to obtain the audio data fed into the microphone module. The display input interface may be used to feed input video data to the display module for display, and the display output interface may be used to obtain the video data displayed by the display module.
According to embodiments of the present disclosure, after the virtual object audio data is acquired from the speaker output interface, the communication application may further perform audio encoding on it and then send the encoded audio data to the cloud mobile phone client.
According to embodiments of the present disclosure, the communication application may run in the background of the cloud mobile phone server, and the virtual object application may run in the foreground of the cloud mobile phone server.
According to embodiments of the present disclosure, a virtual object application originally running on a cloud mobile phone client can be deployed to the cloud mobile phone server without any modification. The communication application and the virtual object application can be bridged through the application programming interfaces of the cloud mobile phone server to transfer data such as the target audio data, the virtual object audio data, and the virtual object video data, keeping the implementation cost low.
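As a rough illustration of this bridging, the following Python sketch models the virtual microphone, speaker, and display as simple buffers whose write and read methods stand in for the corresponding input and output interfaces, and drives operations S310 to S370 across them. Every class and function name here is a hypothetical stand-in; no real cloud mobile phone, Android, or RTC API is invoked.

```python
import queue
from typing import Tuple


class VirtualDevice:
    """A virtual microphone, speaker, or display inside the phone instance."""

    def __init__(self) -> None:
        self._buffer: "queue.Queue[bytes]" = queue.Queue()

    def write(self, data: bytes) -> None:
        # Plays the role of the device's "input interface".
        self._buffer.put(data)

    def read(self) -> bytes:
        # Plays the role of the device's "output interface".
        return self._buffer.get()


class CloudPhoneInstance:
    def __init__(self) -> None:
        self.microphone = VirtualDevice()
        self.speaker = VirtualDevice()
        self.display = VirtualDevice()


def communication_app_uplink(instance: CloudPhoneInstance, target_audio: bytes) -> None:
    # S310/S320: receive target audio from the client and feed it to the virtual microphone.
    instance.microphone.write(target_audio)


def virtual_object_app(instance: CloudPhoneInstance) -> None:
    # S330-S350: read the audio from the virtual microphone, generate virtual object
    # audio and video, and write them to the virtual speaker and display.
    target_audio = instance.microphone.read()
    instance.speaker.write(b"virtual-object-audio for: " + target_audio[:8])
    instance.display.write(b"virtual-object-video frames")


def communication_app_downlink(instance: CloudPhoneInstance) -> Tuple[bytes, bytes]:
    # S360/S370: collect the virtual object audio and video to send back to the client.
    return instance.speaker.read(), instance.display.read()


if __name__ == "__main__":
    inst = CloudPhoneInstance()
    communication_app_uplink(inst, b"user speech frame")
    virtual_object_app(inst)
    print(communication_app_downlink(inst))
```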
According to another embodiment of the present disclosure, a real-time audio and video communication connection may be pre-established between the cloud mobile phone client and the cloud mobile phone server. Based on the above, when the cloud mobile phone client sends the target audio data to the cloud mobile phone server, the cloud mobile phone client can send the target audio data to the cloud mobile phone server through real-time audio and video communication connection between the cloud mobile phone client and the cloud mobile phone server. Correspondingly, the cloud mobile phone server can receive target audio data from the cloud mobile phone client through real-time audio and video communication connection with the cloud mobile phone client.
In addition, when the cloud mobile phone server sends the virtual object audio data and the virtual object video data to the cloud mobile phone client, the cloud mobile phone server can send the virtual object audio data and the virtual object video data to the cloud mobile phone client through real-time audio and video communication connection between the cloud mobile phone server and the cloud mobile phone client. Correspondingly, the cloud mobile phone client can receive virtual object audio data and virtual object video data from the cloud mobile phone server through real-time audio and video communication connection with the cloud mobile phone server.
According to the embodiment of the disclosure, through real-time audio and video communication connection, full duplex interaction between the cloud mobile phone client and the cloud mobile phone server can be realized, and interaction experience is improved.
According to another embodiment of the present disclosure, when determining the virtual human audio data and the virtual human video data, a voice interaction system outside the cloud mobile phone server may, for example, perform analysis and other processing on the target audio data, so as to reduce the computing load on the cloud mobile phone server. Based on this, fig. 4 schematically shows a flowchart of a method of determining virtual human audio data and virtual human video data according to another embodiment of the present disclosure.
As shown in fig. 4, the method 400 of determining virtual human audio data and virtual human video data may include the cloud mobile phone server transmitting target audio data to the voice interaction system in operation S410.
In operation S420, the voice interaction system receives the target audio data from the cloud mobile phone server.
In operation S430, the voice interaction system performs voice recognition on the target audio data to obtain a recognition result.
In operation S440, the voice interaction system generates response data according to the recognition result.
In operation S450, the voice interaction system sends the response data to the cloud mobile phone server.
In operation S460, the cloud handset server receives response data for the target audio data from the voice interaction system.
In operation S470, the cloud mobile phone server determines virtual human audio data and virtual human video data according to the response data.
According to embodiments of the present disclosure, the voice interaction system may recognize the speech contained in the target audio data using automatic speech recognition (ASR) technology, obtaining text containing the speech information as the recognition result. A dialog system may then be used to determine the response text corresponding to the recognition result. The response text may then be converted into speech data, i.e., the response data, using text-to-speech (TTS) technology, and the response data is sent to the cloud mobile phone server. The cloud mobile phone server can generate the corresponding virtual human audio data and virtual human video data according to the response data.
According to embodiments of the present disclosure, using the voice interaction system to perform speech recognition on the target audio data and generate the response data reduces the computation on the cloud mobile phone server and can improve interaction quality.
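The pipeline described above, ASR followed by a dialog system followed by TTS, can be sketched as three composed stages. In the sketch below the stage functions are dummies standing in for real ASR and TTS engines and a real dialog system, which the voice interaction system would provide.

```python
def asr(target_audio: bytes) -> str:
    # S430: speech recognition; a real engine would return the recognized text.
    return "what are your business hours"


def dialog(recognized_text: str) -> str:
    # S440 (first half): determine the response text for the recognition result.
    return "We are open from 9 am to 6 pm."


def tts(response_text: str) -> bytes:
    # S440 (second half): synthesize the response text into speech data.
    return response_text.encode("utf-8")  # stand-in for synthesized audio samples


def voice_interaction_system(target_audio: bytes) -> bytes:
    """Produces the response data sent back to the cloud phone server (S450)."""
    return tts(dialog(asr(target_audio)))


if __name__ == "__main__":
    response_data = voice_interaction_system(b"pcm frames from the cloud phone server")
    # The cloud phone server (S460/S470) would then derive the virtual human audio
    # and video data from this response data.
    print(len(response_data), "bytes of response data")
```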
Fig. 5 schematically shows a block diagram of an audio data processing device according to an embodiment of the present disclosure.
As shown in fig. 5, the audio data processing device 500 includes a receiving module 510, a processing module 520, and a transmitting module 530.
The receiving module 510 is configured to receive target audio data from a cloud mobile phone client.
The processing module 520 is configured to determine virtual object audio data and virtual object video data according to the target audio data.
The sending module 530 is configured to send the virtual object audio data and the virtual object video data to the cloud mobile phone client.
According to embodiments of the present disclosure, the cloud mobile phone server may be configured with a communication application. The processing module is further configured to use the communication application to receive the target audio data from the cloud mobile phone client and then input the target audio data to the microphone input interface of the cloud mobile phone server.
According to an embodiment of the present disclosure, receiving target audio data from a cloud handset client may include, for example: receiving the target audio data from the cloud mobile phone client through a real-time audio and video communication connection with the cloud mobile phone client.
According to embodiments of the present disclosure, the cloud handset server may be configured with a virtual object application. The processing module may also be configured to use the virtual object application to acquire the target audio data through the microphone output interface of the cloud mobile phone server, determine the virtual object audio data and the virtual object video data according to the target audio data, and input the virtual object audio data to the speaker input interface of the cloud mobile phone server and the virtual object video data to the display input interface of the cloud mobile phone server.
According to embodiments of the present disclosure, the processing module may also be configured to use the communication application to acquire the virtual object audio data corresponding to the target audio data through the speaker output interface of the cloud mobile phone server and the virtual object video data corresponding to the target audio data through the display output interface of the cloud mobile phone server, and then send the virtual object audio data and the virtual object video data to the cloud mobile phone client.
According to an embodiment of the present disclosure, transmitting the virtual object audio data and the virtual object video data to the cloud handset client may include, for example: transmitting the virtual object audio data and the virtual object video data to the cloud mobile phone client through a real-time audio and video communication connection with the cloud mobile phone client.
Fig. 6 schematically illustrates a block diagram of an audio data processing device according to another embodiment of the present disclosure.
As shown in fig. 6, the audio data processing device 600 includes an audio acquisition module 610, a communication module 620, an audio playing module 630, and a display module 640.
The audio collection module 610 is configured to collect target audio data.
The communication module 620 is configured to send the target audio data to the cloud mobile phone server, and receive the virtual object audio data and the virtual object video data from the cloud mobile phone server.
The audio playing module 630 is configured to play the virtual object audio data.
The display module 640 is configured to play the virtual object video data.
According to embodiments of the present disclosure, the present disclosure further provides an electronic device, a readable storage medium, and a computer program product.
Fig. 7 schematically illustrates a block diagram of an example electronic device 700 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 7, the apparatus 700 includes a computing unit 701 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 702 or a computer program loaded from a storage unit 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data required for the operation of the device 700 may also be stored. The computing unit 701, the ROM 702, and the RAM 703 are connected to each other through a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.
Various components in device 700 are connected to I/O interface 705, including: an input unit 706 such as a keyboard, a mouse, etc.; an output unit 707 such as various types of displays, speakers, and the like; a storage unit 708 such as a magnetic disk, an optical disk, or the like; and a communication unit 709 such as a network card, modem, wireless communication transceiver, etc. The communication unit 709 allows the device 700 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The computing unit 701 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 701 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 701 performs the respective methods and processes described above, for example, an audio data processing method. For example, in some embodiments, the audio data processing method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 708. In some embodiments, part or all of the computer program may be loaded and/or installed onto device 700 via ROM 702 and/or communication unit 709. When a computer program is loaded into the RAM 703 and executed by the computing unit 701, one or more steps of the audio data processing method described above may be performed. Alternatively, in other embodiments, the computing unit 701 may be configured to perform the audio data processing method by any other suitable means (e.g. by means of firmware).
Various implementations of the systems and techniques described above can be implemented in digital electronic circuitry, integrated circuit systems, Field-Programmable Gate Arrays (FPGAs), Application-Specific Integrated Circuits (ASICs), Application-Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor, that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel or sequentially or in a different order, provided that the desired results of the technical solutions of the present disclosure are achieved, and are not limited herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (12)

1. An audio data processing method is applied to a cloud mobile phone server, the cloud mobile phone server is configured with a communication application and a virtual object application, the communication application is used for processing data transmission with a cloud mobile phone client, and the virtual object application is used for processing target audio data, and the method comprises the following steps:
the following operations are performed with the communication application:
receiving the target audio data from the cloud mobile phone client; and
inputting the target audio data into a microphone input interface of the cloud mobile phone server;
the following operations are performed with the virtual object application:
acquiring the target audio data through a microphone output interface of the cloud mobile phone server;
determining virtual object audio data and virtual object video data according to the target audio data; and
inputting the virtual object audio data into a loudspeaker input interface of the cloud mobile phone server, and inputting the virtual object video data into a display input interface of the cloud mobile phone server;
the following operations are performed with the communication application:
the virtual object audio data corresponding to the target audio data are obtained through a loudspeaker output interface of the cloud mobile phone server, and the virtual object video data corresponding to the target audio data are obtained through a display output interface of the cloud mobile phone server; and
and sending the obtained virtual object audio data and the virtual object video data to the cloud mobile phone client.
2. The method of claim 1, wherein the receiving target audio data from the cloud handset client comprises:
and receiving target audio data from the cloud mobile phone client through real-time audio and video communication connection with the cloud mobile phone client.
3. The method of claim 1, wherein the sending the virtual object audio data and the virtual object video data to the cloud handset client comprises:
and sending the virtual object audio data and the virtual object video data to the cloud mobile phone client through real-time audio and video communication connection with the cloud mobile phone client.
4. An audio data processing method is applied to a cloud mobile phone client, and the method comprises the following steps:
collecting target audio data and sending the target audio data to a communication application in a cloud mobile phone server so that the communication application can input the target audio data into a microphone input interface of the cloud mobile phone server, wherein the communication application is used for processing data transmission with a cloud mobile phone client; and
receiving virtual object audio data of the communication application from the cloud mobile phone server and virtual object video data of the communication application from the cloud mobile phone server, and playing the virtual object audio data and the virtual object video data, wherein the virtual object audio data and the virtual object video data are determined by a virtual object application in the cloud mobile phone server according to the target audio data acquired by a microphone output interface of the cloud mobile phone server, the virtual object audio data are input by the virtual object application into a speaker input interface and acquired by the communication application from a speaker output interface, and the virtual object video data are input by the virtual object application into a display input interface and acquired by the communication application from a display output interface, and the virtual object application is used for processing the target audio data.
5. The method of claim 4, wherein the sending the target audio data to the cloud mobile phone server comprises:
and sending the target audio data to the cloud mobile phone server through real-time audio and video communication connection with the cloud mobile phone server.
6. The method of claim 4, wherein the receiving virtual object audio data and virtual object video data from the cloud handset server comprises:
and receiving virtual object audio data and virtual object video data from the cloud mobile phone server through real-time audio and video communication connection with the cloud mobile phone server.
7. An audio data processing device, applied to a cloud mobile phone server, the cloud mobile phone server is configured with a communication application and a virtual object application, the communication application is used for processing data transmission with a cloud mobile phone client, and the virtual object application is used for processing target audio data, the device comprises:
the receiving module is used for receiving the target audio data from the cloud mobile phone client;
the processing module is used for determining virtual object audio data and virtual object video data according to the target audio data; and
the sending module is used for sending the virtual object audio data and the virtual object video data to the cloud mobile phone client;
wherein the processing module is further configured to perform the following operations with the communication application:
receiving target audio data from the cloud mobile phone client; and
inputting the target audio data into a microphone input interface of the cloud mobile phone server;
the processing module is further configured to perform the following operations with the virtual object application:
acquiring the target audio data through a microphone output interface of the cloud mobile phone server;
determining the virtual object audio data and the virtual object video data according to the target audio data; and
inputting the virtual object audio data into a loudspeaker input interface of the cloud mobile phone server, and inputting the virtual object video data into a display input interface of the cloud mobile phone server;
the processing module is further configured to perform the following operations with the communication application:
the virtual object audio data corresponding to the target audio data are obtained through a loudspeaker output interface of the cloud mobile phone server, and the virtual object video data corresponding to the target audio data are obtained through a display output interface of the cloud mobile phone server; and
and sending the virtual object audio data and the virtual object video data to the cloud mobile phone client.
8. The apparatus of claim 7, wherein the receiving target audio data from the cloud handset client comprises:
and receiving target audio data from the cloud mobile phone client through real-time audio and video communication connection with the cloud mobile phone client.
9. The apparatus of claim 7, wherein the sending the virtual object audio data and the virtual object video data to the cloud handset client comprises:
and sending the virtual object audio data and the virtual object video data to the cloud mobile phone client through real-time audio and video communication connection with the cloud mobile phone client.
10. An audio data processing device applied to a cloud mobile phone client, the device comprising:
the audio acquisition module is used for acquiring target audio data;
a communication module configured to send the target audio data to a communication application in a cloud mobile phone server, so that the communication application inputs the target audio data to a microphone input interface of the cloud mobile phone server, where the communication application is configured to process data transmission with a cloud mobile phone client, and receive virtual object audio data of the communication application from the cloud mobile phone server and virtual object video data of the communication application from the cloud mobile phone server, where the virtual object audio data and the virtual object video data are determined by a virtual object application in the cloud mobile phone server according to the target audio data acquired by a microphone output interface of the cloud mobile phone server, the virtual object audio data is acquired by the virtual object application input speaker input interface and by the communication application from a speaker output interface, and the virtual object video data is acquired by the virtual object application input display input interface and by the communication application from a display output interface, and the virtual object application is configured to process the target audio data;
the audio playing module is used for playing the virtual object audio data; and
and the display module is used for playing the virtual object video data.
11. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-6.
12. A non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of any one of claims 1-6.
CN202111521010.6A 2021-12-13 2021-12-13 Audio data processing method, system, device, equipment and storage medium Active CN114221940B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111521010.6A CN114221940B (en) 2021-12-13 2021-12-13 Audio data processing method, system, device, equipment and storage medium


Publications (2)

Publication Number Publication Date
CN114221940A CN114221940A (en) 2022-03-22
CN114221940B (en) 2023-12-29

Family

ID=80701433

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111521010.6A Active CN114221940B (en) 2021-12-13 2021-12-13 Audio data processing method, system, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114221940B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110298906A (en) * 2019-06-28 2019-10-01 北京百度网讯科技有限公司 Method and apparatus for generating information
CN112100352A (en) * 2020-09-14 2020-12-18 北京百度网讯科技有限公司 Method, device, client and storage medium for interacting with virtual object
CN113392201A (en) * 2021-06-18 2021-09-14 中国工商银行股份有限公司 Information interaction method, information interaction device, electronic equipment, medium and program product
CN113422839A (en) * 2021-07-12 2021-09-21 中国电信股份有限公司 Cloud mobile phone system based on cloud computing, control method, medium and electronic device
CN113438355A (en) * 2021-07-21 2021-09-24 北京百度网讯科技有限公司 Communication method, device, equipment and storage medium based on cloud mobile phone
CN113691602A (en) * 2021-08-17 2021-11-23 北京百度网讯科技有限公司 Service processing method, system, device, equipment and medium based on cloud mobile phone


Also Published As

Publication number Publication date
CN114221940A (en) 2022-03-22

Similar Documents

Publication Publication Date Title
CN107731231B (en) Method for supporting multi-cloud-end voice service and storage device
CN114221940B (en) Audio data processing method, system, device, equipment and storage medium
CN114333912B (en) Voice activation detection method, device, electronic equipment and storage medium
CN114374703B (en) Cloud mobile phone information acquisition method, device, equipment and storage medium
CN114363704B (en) Video playing method, device, equipment and storage medium
CN113905040B (en) File transmission method, device, system, equipment and storage medium
CN115312042A (en) Method, apparatus, device and storage medium for processing audio
CN114333017A (en) Dynamic pickup method and device, electronic equipment and storage medium
CN113724398A (en) Augmented reality method, apparatus, device and storage medium
CN112382281A (en) Voice recognition method and device, electronic equipment and readable storage medium
CN113674755B (en) Voice processing method, device, electronic equipment and medium
CN111724805A (en) Method and apparatus for processing information
CN113760431B (en) Application control method and device, electronic equipment and readable storage medium
CN114286343B (en) Multi-way outbound system, risk identification method, equipment, medium and product
CN113643696B (en) Voice processing method, device, equipment, storage medium and program
CN114267358B (en) Audio processing method, device, equipment and storage medium
CN114884584B (en) Data transmission method, related device and computer program product
CN114205715B (en) Audio data processing method and device, electronic equipment and storage medium
CN110138991B (en) Echo cancellation method and device
CN113839854B (en) Message forwarding method, device, equipment, storage medium and program product
CN113593619B (en) Method, apparatus, device and medium for recording audio
CN118075559A (en) Live broadcasting room voice driving method and device, electronic equipment and storage medium
CN115440257A (en) Vehicle audio processing method and device, vehicle, electronic equipment and storage medium
CN115273872A (en) Voice conversion method and device, electronic equipment and storage medium
CN116013342A (en) Data processing method and device for audio and video call, electronic equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant