CN113380248A - Voice control method, device, equipment and storage medium - Google Patents

Voice control method, device, equipment and storage medium

Info

Publication number
CN113380248A
CN113380248A (application CN202110653278.9A; granted as CN113380248B)
Authority
CN
China
Prior art keywords
audio
voice control
target application
instruction
control instruction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110653278.9A
Other languages
Chinese (zh)
Other versions
CN113380248B (en)
Inventor
Wang Jianye (王建业)
Chang Le (常乐)
Chen Xiaoliang (陈孝良)
Current Assignee
Beijing SoundAI Technology Co Ltd
Original Assignee
Beijing SoundAI Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing SoundAI Technology Co Ltd filed Critical Beijing SoundAI Technology Co Ltd
Priority claimed from CN202110653278.9A
Publication of CN113380248A
Application granted
Publication of CN113380248B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/26 Speech to text systems
    • G10L17/00 Speaker identification or verification techniques
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L2015/223 Execution procedure of a spoken command
    • G10L2021/02082 Noise filtering the noise being echo, reverberation of the speech

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Telephonic Communication Services (AREA)
  • Telephone Function (AREA)

Abstract

The application provides a voice control method, a voice control device, voice control equipment and a storage medium, and belongs to the technical field of computers. The method comprises the following steps: collecting, through the target application, a second audio while a first audio is played; performing echo cancellation on the second audio; performing voice recognition on the echo-cancelled second audio to obtain a voice control instruction; and controlling the target application according to the voice control instruction. The scheme provides a new target application that supports voice control even while playing audio, making voice control more flexible. In addition, since the second audio collected while the first audio is played may include an echo of the first audio, echo cancellation is performed on the second audio. This avoids interference from the echo of the first audio, ensuring the accuracy of voice recognition on the second audio and therefore the accuracy of voice control.

Description

Voice control method, device, equipment and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method, an apparatus, a device, and a storage medium for voice control.
Background
In daily life, a user often needs to control a target application to perform some operation. Traditionally, the user controls the target application by manual triggering; with the development of natural language processing technology, however, the user can now control the target application to perform operations by voice alone, without manual triggering. Nevertheless, the voice control functions that target applications can currently realize are limited.
Disclosure of Invention
The embodiment of the application provides a voice control method, a voice control device, voice control equipment and a storage medium, which can enhance the voice control function of a target application, so that the target application supports the voice control function under the condition of playing audio. The technical scheme is as follows:
in one aspect, a method for controlling voice is provided, the method comprising:
through the target application, a second audio is collected while the first audio is played;
performing echo cancellation on the second audio;
performing voice recognition on the second audio after the echo cancellation to obtain a voice control instruction;
and controlling the target application according to the voice control instruction.
In a possible implementation manner, the performing speech recognition on the echo-cancelled second audio to obtain a speech control instruction includes:
adjusting the volume of the echo-cancelled second audio to be within a target volume range;
and performing voice recognition on the volume-adjusted second audio to obtain the voice control instruction.
In a possible implementation manner, the performing speech recognition on the echo-cancelled second audio to obtain a speech control instruction includes:
extracting voiceprint information of the second audio frequency after echo cancellation;
comparing the voiceprint information with voiceprint information in a voiceprint library, wherein the voiceprint library is used for storing the voiceprint information with the voice control authority of the target application;
and on the premise that the voiceprint library comprises the voiceprint information of the second audio, performing voice recognition on the second audio after the echo is eliminated to obtain the voice control instruction.
In a possible implementation manner, the performing speech recognition on the echo-cancelled second audio to obtain a speech control instruction includes:
performing voice recognition on the second audio after the echo cancellation to obtain a text corresponding to the second audio;
and extracting the voice control instruction from the text.
In a possible implementation manner, the controlling the target application according to the voice control instruction includes:
responding to the fact that the voice control instruction is a bullet screen issuing instruction, and extracting a target text located behind the bullet screen issuing instruction from the text;
and issuing a bullet screen in an audio playing interface corresponding to the first audio, wherein the bullet screen comprises the target text.
In a possible implementation manner, after the barrage is published in the audio playing interface corresponding to the first audio, the method further includes:
and displaying the audio playing interface comprising the bullet screen.
In a possible implementation manner, the controlling the target application according to the voice control instruction includes:
and responding to the voice control instruction as an audio switching instruction, and controlling the target application to switch the first audio.
In a possible implementation manner, the controlling the target application according to the voice control instruction includes:
and controlling the target application to stop playing the first audio in response to the voice control instruction being a play stop instruction.
In a possible implementation manner, the controlling the target application according to the voice control instruction includes:
and in response to the fact that the voice control instruction is an audio sharing instruction, controlling the target application to generate a sharing link of the first audio, and issuing the sharing link to a target page indicated by the audio sharing instruction.
In a possible implementation manner, the controlling the target application according to the voice control instruction includes:
responding to the voice control instruction as a chorus instruction, controlling the target application to play the first audio from the beginning, and collecting a third audio;
and in response to the completion of the acquisition of the third audio, synthesizing the first audio and the third audio to obtain chorus audio.
In a possible implementation manner, the synthesizing the first audio and the third audio to obtain a chorus audio includes:
in the case that the first audio comprises a human voice and background audio, removing the human voice from the first audio;
and synthesizing the resulting background audio of the first audio with the third audio to obtain the chorus audio.
In a possible implementation manner, the controlling the target application according to the voice control instruction includes:
in response to the voice control instruction being an audio collection instruction, controlling the target application to add the first audio to an audio set indicated by the audio collection instruction.
In a possible implementation manner, the controlling the target application according to the voice control instruction includes:
and responding to the voice control instruction as an audio downloading instruction, and controlling the target application to download the first audio.
In one possible implementation manner, the capturing, by the target application, the second audio while the first audio is played includes:
transmitting, by the target application, the first audio to a connected headset based on a first protocol, the headset to play the first audio;
receiving the second audio transmitted by the earphone based on a second protocol, wherein the second audio is collected by the earphone while the first audio is played.
In another aspect, a voice control apparatus is provided, the apparatus comprising:
an audio capture module configured to capture, by a target application, a second audio while playing a first audio;
an echo cancellation module configured to perform echo cancellation on the second audio;
the voice recognition module is configured to perform voice recognition on the second audio after the echo cancellation to obtain a voice control instruction;
and the application control module is configured to control the target application according to the voice control instruction.
In one possible implementation manner, the voice recognition module is configured to adjust the volume of the second audio after echo cancellation to be within a target volume range; and carrying out voice recognition on the second audio after the volume adjustment to obtain the voice control instruction.
In one possible implementation, the speech recognition module is configured to extract voiceprint information of the second audio after echo cancellation;
comparing the voiceprint information with voiceprint information in a voiceprint library, wherein the voiceprint library is used for storing the voiceprint information with the voice control authority of the target application; and on the premise that the voiceprint library comprises the voiceprint information of the second audio, performing voice recognition on the second audio after the echo is eliminated to obtain the voice control instruction.
In a possible implementation manner, the speech recognition module is configured to perform speech recognition on the second audio after echo cancellation to obtain a text corresponding to the second audio; and extracting the voice control instruction from the text.
In a possible implementation manner, the application control module is configured to, in response to that the voice control instruction is a bullet screen issuing instruction, extract a target text located after the bullet screen issuing instruction from the text; and issuing a bullet screen in an audio playing interface corresponding to the first audio, wherein the bullet screen comprises the target text.
In one possible implementation, the apparatus further includes:
an interface display module configured to display the audio playback interface including the bullet screen.
In a possible implementation manner, the application control module is configured to control the target application to switch the first audio in response to the voice control instruction being an audio switching instruction.
In a possible implementation manner, the application control module is configured to control the target application to stop playing the first audio in response to the voice control instruction being a stop playing instruction.
In a possible implementation manner, the application control module is configured to control the target application to generate a sharing link of the first audio in response to that the voice control instruction is an audio sharing instruction, and issue the sharing link to a target page indicated by the audio sharing instruction.
In a possible implementation manner, the application control module is configured to control the target application to play the first audio from the beginning and acquire a third audio in response to the voice control instruction being a chorus instruction; and in response to the completion of the acquisition of the third audio, synthesizing the first audio and the third audio to obtain chorus audio.
In one possible implementation, the application control module is configured to remove the human voice from the first audio if the first audio includes a human voice and background audio, and to synthesize the resulting background audio of the first audio with the third audio to obtain the chorus audio.
In one possible implementation manner, the application control module is configured to control the target application to add the first audio to an audio set indicated by an audio collection instruction in response to the voice control instruction being the audio collection instruction.
In one possible implementation manner, the application control module is configured to control the target application to download the first audio in response to the voice control instruction being an audio download instruction.
In one possible implementation, the audio capture module is configured to transmit, by the target application, the first audio to a connected headset based on a first protocol, the headset being configured to play the first audio; receiving the second audio transmitted by the earphone based on a second protocol, wherein the second audio is collected by the earphone while the first audio is played.
In another aspect, an electronic device is provided, which includes a processor and a memory, where at least one program code is stored in the memory, and the program code is loaded by the processor and executed to implement the operations executed in the voice control method in any one of the above possible implementations.
In another aspect, a computer-readable storage medium is provided, in which at least one program code is stored, and the program code is loaded and executed by a processor to implement the operations performed in the voice control method in any one of the above possible implementation manners.
In another aspect, a computer program product is provided, which includes at least one program code, and the program code is loaded and executed by a processor to implement the operations performed in the voice control method in any of the above possible implementations.
The beneficial effects brought by the technical scheme provided by the embodiment of the application at least comprise:
the embodiment of the application provides a new target application, and the target application can also support a voice control function while playing audio, so that the voice control is more flexible. In addition, the echo of the first audio frequency is considered to be possibly included in the second audio frequency collected when the first audio frequency is played, so that the echo of the second audio frequency is eliminated, the interference of the echo of the first audio frequency can be avoided, the accuracy of voice recognition of the second audio frequency is ensured, and the accuracy of voice control is ensured.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings based on these drawings without creative effort.
FIG. 1 is a schematic illustration of an implementation environment provided by an embodiment of the present application;
fig. 2 is a flowchart of a voice control method provided in an embodiment of the present application;
FIG. 3 is a flow chart of a voice control method provided by an embodiment of the present application;
fig. 4 is a block diagram of a voice control apparatus according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of a terminal according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of a server according to an embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
The terms "first," "second," "third," "fourth," and the like as used herein may describe various concepts, but these concepts are not limited by these terms unless otherwise specified. These terms are only used to distinguish one concept from another. For example, without departing from the scope of the present application, a first audio may be referred to as a second audio, and similarly, a second audio may be referred to as a first audio.
As used herein, "at least one" includes one, two, or more than two; "a plurality" includes two or more than two; "each" refers to every one of the corresponding plurality; and "any" refers to any one of the plurality. For example, if a plurality of voiceprint information items includes 3 items, "each" refers to every one of the 3 items, and "any" refers to any one of the 3 items, which may be the first, the second, or the third.
Fig. 1 is a schematic diagram of an implementation environment provided by an embodiment of the present application. Referring to fig. 1, the implementation environment includes a terminal 101 and a server 102. The terminal 101 and the server 102 are connected via a wireless or wired network. Optionally, the terminal 101 is a computer, a mobile phone, a tablet computer, a smart watch, a smart speaker, a smart home, or other terminals. Optionally, the server 102 is a background server or a cloud server providing services such as cloud computing and cloud storage.
Optionally, the terminal 101 has installed thereon a target application served by the server 102, and the terminal 101 can implement functions such as data transmission, message interaction, and the like through the target application. Optionally, the target application is a target application in an operating system of the terminal 101, or a target application provided by a third party. The target application has an audio playing function and a voice control function, and optionally, of course, the target application can also have other functions, for example, a video playing function, a game function, a live broadcast function, a chat function, and the like, which is not limited in this embodiment of the application. Optionally, the target application is a music application, a video application, a live application, a chat application, and the like, which is not limited in this embodiment of the application.
In this embodiment of the application, the terminal 101 is configured to capture a second audio while playing the first audio through the target application, and upload the second audio to the server 102. The server 102 is configured to perform echo cancellation on the second audio, perform voice recognition on the second audio after echo cancellation to obtain a voice control instruction, and send the voice control instruction to the terminal 101, where the terminal 101 is configured to control the target application according to the voice control instruction.
It should be noted that the embodiment of the present application is described by taking an example in which the implementation environment includes only the terminal 101 and the server 102, and in other embodiments, the implementation environment includes only the terminal 101. Voice control of the target application is achieved by the terminal 101.
The voice control method provided by the application can be applied to scenarios in which a target application is controlled by voice. For example, when the terminal plays music through the target application, if the user wants to play the next piece of music, the user only needs to say "play next," without manually triggering the audio switching control, and the target application is controlled to play the next piece of music. For another example, when the terminal plays a video (the video includes audio and video pictures) through the target application, if the user wants to pause the video, the user only needs to say "pause the video playing," without manually triggering the stop playing control, and the target application is controlled to stop playing the video.
Fig. 2 is a flowchart of a voice control method according to an embodiment of the present application. Referring to fig. 2, the execution subject is an electronic device, and the method includes:
201. the second audio is captured while the first audio is played by the target application.
202. Echo cancellation is performed on the second audio.
203. And performing voice recognition on the second audio after the echo cancellation to obtain a voice control instruction.
204. And controlling the target application according to the voice control instruction.
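Steps 201-204 can be simulated end to end as a toy sketch. The embodiment does not prescribe concrete algorithms, so the echo canceller, recognizer, and control dispatcher below are hypothetical stand-ins, and all signal values are invented for illustration:

```python
# Toy simulation of steps 201-204. None of these functions are the patent's
# actual method; they only illustrate the order and role of each step.

def cancel_echo(captured, playback_reference, leakage=0.5):
    """Step 202: subtract the known playback signal, scaled by an estimated
    leakage factor, from the captured signal (simplest form of echo cancellation)."""
    return [c - leakage * p for c, p in zip(captured, playback_reference)]

def recognize(samples):
    """Step 203: stand-in recognizer; a real system would run ASR here."""
    energy = sum(abs(s) for s in samples)
    return "play next" if energy > 1.0 else ""

def control(app, instruction):
    """Step 204: dispatch the recognized voice control instruction."""
    if instruction == "play next":
        app["current_track"] += 1  # audio switching instruction

# Step 201: the captured second audio is the user's voice plus an echo of
# the first audio that the target application is currently playing.
playback = [0.4, -0.4, 0.4, -0.4]   # first audio (playback reference)
command = [0.8, 0.8, 0.8, 0.8]      # user's voice (second audio proper)
captured = [c + 0.5 * p for c, p in zip(command, playback)]

app = {"current_track": 0}
cleaned = cancel_echo(captured, playback)  # step 202
instruction = recognize(cleaned)           # step 203
control(app, instruction)                  # step 204
```

With the playback reference subtracted, the recognizer sees only the user's command; skipping step 202 would leave the echo of the first audio in the signal, which is exactly the interference the method is designed to avoid.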
The embodiment of the application provides a new target application, and the target application can also support a voice control function while playing audio, so that the voice control is more flexible. In addition, the echo of the first audio frequency is considered to be possibly included in the second audio frequency collected when the first audio frequency is played, so that the echo of the second audio frequency is eliminated, the interference of the echo of the first audio frequency can be avoided, the accuracy of voice recognition of the second audio frequency is ensured, and the accuracy of voice control is ensured.
In a possible implementation manner, performing speech recognition on the echo-cancelled second audio to obtain a speech control instruction includes:
adjusting the volume of the echo-cancelled second audio to be within a target volume range;
and performing voice recognition on the volume-adjusted second audio to obtain a voice control instruction.
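The volume adjustment above can be sketched as simple peak normalization. The bounds of the target volume range below are illustrative assumptions; the embodiment only names a "target volume range":

```python
def normalize_volume(samples, target_low=0.5, target_high=0.8):
    """Scale the echo-cancelled second audio so its peak amplitude falls
    within [target_low, target_high] before speech recognition.
    The range bounds are hypothetical values, not taken from the patent."""
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return list(samples)       # silence: nothing to scale
    if peak < target_low:
        return [s * (target_low / peak) for s in samples]   # boost quiet audio
    if peak > target_high:
        return [s * (target_high / peak) for s in samples]  # attenuate loud audio
    return list(samples)           # already within the target range
```

Normalizing before recognition keeps quiet commands from being missed and loud ones from clipping the recognizer's input.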
In a possible implementation manner, performing speech recognition on the echo-cancelled second audio to obtain a speech control instruction includes:
extracting voiceprint information of the echo-cancelled second audio;
comparing the voiceprint information with voiceprint information in a voiceprint library, wherein the voiceprint library is used for storing the voiceprint information with the voice control authority of the target application;
and on the premise that the voiceprint library comprises the voiceprint information of the second audio, performing voice recognition on the second audio after the echo is eliminated to obtain a voice control instruction.
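One common way to implement such a voiceprint gate is to compare speaker embeddings by cosine similarity. The embedding representation and the 0.8 threshold below are assumptions for illustration; the embodiment does not specify how voiceprints are compared:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two voiceprint embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def has_voice_control_permission(embedding, voiceprint_library, threshold=0.8):
    """Gate speech recognition: proceed only if the extracted voiceprint
    matches some enrolled voiceprint that has voice control permission for
    the target application (threshold is a hypothetical value)."""
    return any(cosine_similarity(embedding, enrolled) >= threshold
               for enrolled in voiceprint_library)
```

Only when the gate passes is the echo-cancelled second audio forwarded to the recognizer, so unauthorized speakers cannot control the target application.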
In a possible implementation manner, performing speech recognition on the echo-cancelled second audio to obtain a speech control instruction includes:
performing voice recognition on the second audio after the echo cancellation to obtain a text corresponding to the second audio;
and extracting the voice control instruction from the text.
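Extracting the voice control instruction from the recognized text can be as simple as matching a keyword table. The phrases and instruction names below are hypothetical examples, not an exhaustive list from the embodiment:

```python
# Hypothetical command table mapping spoken phrases to instruction types.
COMMAND_KEYWORDS = {
    "next song": "audio switching instruction",
    "stop playing": "stop playing instruction",
    "share this song": "audio sharing instruction",
    "download this song": "audio download instruction",
}

def extract_voice_control_instruction(text):
    """Scan the recognized text for a known command phrase and return the
    corresponding voice control instruction, or None if no command appears."""
    for phrase, instruction in COMMAND_KEYWORDS.items():
        if phrase in text:
            return instruction
    return None
```

A production system would likely use intent classification rather than substring matching, but the table makes the text-to-instruction step concrete.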
In one possible implementation manner, controlling the target application according to the voice control instruction includes:
responding to the voice control instruction as a bullet screen issuing instruction, and extracting a target text positioned behind the bullet screen issuing instruction from the text;
and issuing a bullet screen in an audio playing interface corresponding to the first audio, wherein the bullet screen comprises a target text.
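For the bullet screen case, the target text is whatever follows the publishing command in the recognized text. The trigger phrase below is an assumed example, not wording fixed by the embodiment:

```python
def extract_barrage_text(text, trigger="send barrage"):
    """Return the target text located after the bullet screen publishing
    command in the recognized text, or None if the command is absent.
    The trigger phrase is a hypothetical stand-in."""
    index = text.find(trigger)
    if index == -1:
        return None
    return text[index + len(trigger):].strip()
```

The returned target text is then published as the bullet screen content in the audio playing interface of the first audio.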
In a possible implementation manner, after the barrage is published in the audio playing interface corresponding to the first audio, the method further includes:
and displaying an audio playing interface comprising the bullet screen.
In one possible implementation manner, controlling the target application according to the voice control instruction includes:
and responding to the voice control instruction as an audio switching instruction, and controlling the target application to switch the first audio.
In one possible implementation manner, controlling the target application according to the voice control instruction includes:
and in response to the voice control instruction being a play stopping instruction, controlling the target application to stop playing the first audio.
In one possible implementation manner, controlling the target application according to the voice control instruction includes:
and responding to the voice control instruction as an audio sharing instruction, controlling the target application to generate a sharing link of the first audio, and issuing the sharing link to a target page indicated by the audio sharing instruction.
In one possible implementation manner, controlling the target application according to the voice control instruction includes:
responding to the fact that the voice control instruction is a chorus instruction, controlling the target application to play a first audio from the beginning and collecting a third audio;
and in response to the completion of the acquisition of the third audio, synthesizing the first audio and the third audio to obtain chorus audio.
In one possible implementation, synthesizing the first audio and the third audio to obtain a chorus audio includes:
in the case that the first audio comprises a human voice and background audio, removing the human voice from the first audio;
and synthesizing the resulting background audio of the first audio with the third audio to obtain the chorus audio.
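A well-known way to remove the human voice from stereo audio is center-channel cancellation: vocals are usually mixed identically into both channels, so subtracting the channels suppresses them. The embodiment does not prescribe this particular technique, so the sketch below is purely illustrative:

```python
def remove_center_vocals(left, right):
    """Suppress the human voice by subtracting the stereo channels; any
    center-panned (identical) vocal content cancels, leaving background audio.
    One classic trick among many, not the patent's specified method."""
    return [l - r for l, r in zip(left, right)]

def mix(background, third_audio, gain=0.5):
    """Synthesize the chorus audio by summing the background audio of the
    first audio with the collected third audio (equal gains, illustrative)."""
    return [gain * b + gain * t for b, t in zip(background, third_audio)]

vocals = [0.5, -0.5, 0.5]                              # identical in both channels
left = [v + b for v, b in zip(vocals, [0.2, 0.3, 0.1])]
right = [v + b for v, b in zip(vocals, [0.0, 0.1, 0.3])]
background = remove_center_vocals(left, right)         # vocals cancel out
chorus = mix(background, [0.4, 0.4, 0.4])              # add the user's voice
```

Center-channel cancellation degrades when vocals carry stereo reverb; real systems often use source-separation models instead, but the flow (remove voice, then mix) matches the implementation above.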
In one possible implementation manner, controlling the target application according to the voice control instruction includes:
in response to the voice control instruction being an audio collection instruction, the control target application adds the first audio to an audio set indicated by the audio collection instruction.
In one possible implementation manner, controlling the target application according to the voice control instruction includes:
and responding to the voice control instruction as an audio downloading instruction, and controlling the target application to download the first audio.
In one possible implementation, capturing, by a target application, a second audio while playing a first audio includes:
transmitting, by the target application, a first audio to the connected headphones based on the first protocol, the headphones being for playing the first audio;
and receiving second audio transmitted by the earphone based on a second protocol, wherein the second audio is collected by the earphone while the first audio is played.
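Real A2DP and SPP access is platform-specific, so the two links are simulated below with in-process queues. The class and its behavior are purely illustrative of the downstream/upstream split the embodiment describes:

```python
from queue import Queue

class SimulatedHeadset:
    """Toy stand-in for a Bluetooth headset. The 'a2dp_link' queue models the
    first protocol (terminal -> headset playback) and the 'spp_link' queue
    models the second protocol (headset -> terminal captured audio). Real
    Bluetooth profile APIs are platform-specific and not shown here."""

    def __init__(self):
        self.a2dp_link = Queue()  # first protocol: one-way, high fidelity
        self.spp_link = Queue()   # second protocol: serial back-channel

    def play_and_capture(self, user_voice):
        """Play the first audio received over the A2DP link while uploading
        the audio captured by the headset microphone over the SPP link."""
        first_audio = self.a2dp_link.get()
        self.spp_link.put(user_voice)
        return first_audio

headset = SimulatedHeadset()
headset.a2dp_link.put([0.1, 0.2, 0.3])         # terminal transmits first audio
played = headset.play_and_capture([0.9, 0.8])  # headset plays and captures
second_audio = headset.spp_link.get()          # terminal receives second audio
```

The split matters because A2DP is unidirectional: the terminal needs the separate SPP back-channel to receive the second audio while playback continues.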
Fig. 3 is a flowchart of a voice control method according to an embodiment of the present application. Referring to fig. 3, the method includes:
301. the terminal collects a second audio while playing the first audio through the target application.
The first audio is any audio. For example, the first audio is audio in a local audio library, where the audio in the local audio library includes audio recorded by the user and audio downloaded from the Internet. Alternatively, the first audio is audio that can be played online without being downloaded. In addition, the first audio may also be the audio track of a video.
The second audio is the audio currently input by the user. Optionally, the terminal has a voice input module, e.g. a microphone, thereon, through which the terminal captures the second audio.
Optionally, the terminal displays an audio playing interface corresponding to the first audio while playing the first audio. The audio playing interface includes a plurality of controls through which the user can control the target application to perform various operations. For example, an audio switching control is used to switch the currently playing first audio to another song in the playlist; a pause control is used to pause the currently playing first audio; a download control is used to download the currently playing first audio; a like control is used to like the currently playing first audio; and a comment control is used to comment on the first audio. Optionally, the audio playing interface further includes at least one bullet screen, published by a terminal playing the first audio. Optionally, the audio playing interface further includes text corresponding to the first audio. For example, if the first audio is a song, the text is the lyrics; if the first audio is a recording, the text is the recorded content; and if the first audio is the background audio of a video, the text is the subtitles of the video. Of course, the audio playing interface can also include other controls or information, which is not limited in this embodiment of the application.
In a possible implementation manner, a speaker is configured on the terminal. Accordingly, the terminal collecting the second audio while playing the first audio through the target application includes: the terminal calls the speaker to play the first audio through the target application, and collects the second audio while playing the first audio. This enhances the voice control function: the target application can be voice-controlled even while the terminal plays the first audio out loud.
In a possible implementation manner, the terminal collecting the second audio while playing the first audio through the target application includes: the terminal transmits the first audio to a connected headset based on a first protocol through the target application, and the headset plays the first audio; and the terminal receives the second audio transmitted by the headset based on a second protocol, where the second audio is collected by the headset while it plays the first audio.
Optionally, the headset is a Bluetooth headset, such as a TWS (True Wireless Stereo) headset, an infrared headset, or another type of headset.
Optionally, the first protocol is A2DP (Advanced Audio Distribution Profile), a one-way high-fidelity Bluetooth audio transmission protocol. A2DP enables 48 kHz high-fidelity stereo audio transmission, guaranteeing audio quality, but supports audio transmission in only one direction.
Optionally, the second protocol is SPP (Serial Port Profile), which defines how to set up a virtual serial port and how to connect two Bluetooth devices. SPP uses RFCOMM (a serial port emulation protocol) to provide serial communication emulation, offering a wireless replacement for RS-232 serial communication; it can transmit audio data at 16 kHz, preserving audio quality.
In one possible implementation, the terminal transmits the first audio to the connected headset through the target application based on the first protocol, including: the terminal establishes a first communication link between the terminal and the earphone based on a first protocol; and processing the first audio based on the first protocol through the target application to obtain a data packet corresponding to the first audio, and transmitting the data packet to the earphone based on the first communication link.
Optionally, the processing, by the terminal, the first audio based on the first protocol to obtain a data packet corresponding to the first audio includes: the terminal encodes the first audio based on the first protocol to obtain a data packet corresponding to the first audio. Correspondingly, after receiving the data packet based on the first communication link, the headset decodes the data packet based on the first protocol to obtain a first audio, and then plays the first audio.
In one possible implementation manner, the terminal receiving the second audio transmitted by the earphone based on the second protocol includes: the terminal establishes a second communication link between the terminal and the earphone based on a second protocol; and receiving the data packet transmitted by the earphone based on the second communication link, and transmitting the data packet to the target application.
After the headset collects the second audio, it processes the second audio based on the second protocol to obtain a corresponding data packet, then transmits the data packet to the terminal over the second communication link, and the terminal passes the data packet to the target application. Optionally, the headset processing the second audio based on the second protocol to obtain the corresponding data packet includes: the headset encodes the second audio based on the second protocol to obtain the data packet. Correspondingly, after receiving the data packet over the second communication link, the terminal decodes it based on the second protocol to recover the second audio. Optionally, the terminal decodes the data packet through the target application to obtain the second audio.
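The encode, transmit, and decode round trip described above can be sketched as follows. This is a minimal illustration with a toy length-prefixed framing; the real A2DP and SPP profiles use their own codecs (e.g. SBC) and RFCOMM framing, so the packet format here is purely an assumption for illustration.

```python
import struct

def encode_packets(pcm: bytes, payload_size: int = 512) -> list:
    """Split a raw audio byte stream into length-prefixed packets
    (toy stand-in for the protocol-specific encoding step)."""
    packets = []
    for i in range(0, len(pcm), payload_size):
        chunk = pcm[i:i + payload_size]
        packets.append(struct.pack(">H", len(chunk)) + chunk)
    return packets

def decode_packets(packets: list) -> bytes:
    """Reassemble the original audio stream from received packets."""
    out = bytearray()
    for pkt in packets:
        (length,) = struct.unpack(">H", pkt[:2])
        out += pkt[2:2 + length]
    return bytes(out)

# Round trip: what the receiver decodes equals what the sender encoded.
audio = bytes(range(256)) * 5
assert decode_packets(encode_packets(audio)) == audio
```

The same framing works in either direction, which matches the symmetric encode-on-send / decode-on-receive description above.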
It should be noted that Bluetooth audio transmission protocols also include HSP (Headset Profile) and HFP (Hands-Free Profile). Although HSP and HFP can transmit bidirectional audio, their audio sampling rate is only 8 kHz and they support only single-channel voice transmission, so the sound quality is poor. In the embodiment of this application, the terminal transmits the first audio to the headset based on the A2DP protocol and receives the second audio transmitted by the headset based on the SPP protocol; both A2DP and SPP support audio transmission at a high sampling rate, so high sound quality can be maintained for the audio transmitted in both directions between the terminal and the headset. This guarantees the quality of the played first audio on the one hand, and on the other hand improves the accuracy of speech recognition, enabling accurate voice control of the target application.
Another point to note is that the terminal receives the second audio transmitted by the headset based on the second protocol while still transmitting the first audio to the headset based on the first protocol. That is, at the same moment, playing the first audio over the A2DP protocol and collecting the second audio over the SPP protocol are both supported.
302. The terminal performs echo cancellation on the second audio through the target application.
An echo arises when the sound played by the terminal is picked up by the terminal itself, so that the collected sound includes the sound the terminal is currently playing. In the embodiment of this application, since the second audio is collected while the first audio is being played, the second audio may include an echo of the first audio. Echo cancellation means cancelling the echo of the first audio included in the collected second audio, so that the second audio contains only the user's voice.
Optionally, the terminal performs echo cancellation on the second audio based on any echo cancellation algorithm. Optionally, the echo cancellation algorithm is based on adaptive filtering, for example, the Least Mean Square (LMS) algorithm or the Normalized Least Mean Square (NLMS) algorithm, which is not limited in the embodiments of this application.
In the embodiment of this application, performing echo cancellation on the second audio avoids interference from the echo of the first audio and improves the quality of the second audio, so that the user's voice in the second audio is clearer. When speech recognition is then performed on the second audio, its accuracy improves; that is, the accuracy of the obtained voice control instruction is guaranteed and the robustness of voice control is enhanced.
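As a concrete illustration of the adaptive-filtering approach mentioned above, the following is a minimal NLMS sketch (NumPy assumed available). The filter estimates the echo path from the played far-end signal and subtracts the predicted echo from the collected signal; a production echo canceller would additionally handle double-talk detection and operate on frames, so this is only a sketch of the core update rule.

```python
import numpy as np

def nlms_echo_cancel(far, mic, taps=32, mu=0.5, eps=1e-6):
    """Normalized LMS: adaptively model the echo path of the far-end
    (played) signal and subtract the predicted echo from the mic signal."""
    w = np.zeros(taps)
    out = np.zeros_like(mic)
    for n in range(len(mic)):
        x = far[max(0, n - taps + 1):n + 1][::-1]   # most recent samples first
        x = np.pad(x, (0, taps - len(x)))
        predicted_echo = w @ x
        e = mic[n] - predicted_echo                 # residual: near-end estimate
        out[n] = e
        w += mu * e * x / (x @ x + eps)             # normalized weight update
    return out

# Simulated check: the mic signal contains only a delayed, attenuated
# echo of `far`, so after convergence the residual should be near zero.
rng = np.random.default_rng(0)
far = rng.standard_normal(4000)
echo = 0.6 * np.concatenate([np.zeros(3), far[:-3]])  # 3-sample echo path
residual = nlms_echo_cancel(far, echo)
```

In the scenario of step 302, `far` corresponds to the played first audio and `mic` to the collected second audio; whatever remains in `residual` is the user's voice.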
303. The terminal performs speech recognition on the echo-cancelled second audio through the target application to obtain the text corresponding to the second audio.
Speech recognition is a technique for converting speech into text. The text obtained by performing speech recognition on the second audio carries the content of the second audio; that is, the second audio and the text express the same semantics. Converting the second audio into text first makes subsequent text-based processing convenient.
In a possible implementation manner, after performing echo cancellation on the second audio, the terminal further adjusts its volume. Correspondingly, the terminal performing speech recognition on the echo-cancelled second audio through the target application to obtain the voice control instruction includes: the terminal adjusts the volume of the echo-cancelled second audio to be within a target volume range through the target application; and performs speech recognition on the volume-adjusted second audio to obtain the voice control instruction. Optionally, the target volume range is set according to actual conditions. Optionally, the terminal adjusts the volume of the second audio using Automatic Gain Control (AGC).
Here, performing speech recognition on the volume-adjusted second audio to obtain the voice control instruction includes: performing speech recognition on the volume-adjusted second audio to obtain the text corresponding to the second audio, and extracting the voice control instruction from the text through the target application.
In the embodiment of this application, the volume of the second audio is adjusted before speech recognition is performed on it, so that the user's voice can be clearly recognized even when the user speaks quietly, and so that clipping, which would make recognition inaccurate, is avoided when the user speaks loudly.
Clipping occurs when the amplitude of an audio signal is too large and exceeds the range of the audio capture device: the amplitude is truncated at the peaks, distorting the signal.
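A minimal sketch of the volume-adjustment step (hypothetical function name; a real AGC adapts the gain frame by frame rather than scaling the whole utterance at once):

```python
import numpy as np

def adjust_volume(audio: np.ndarray, target_peak: float = 0.5) -> np.ndarray:
    """Scale the signal so its peak sits at a target level: quiet input is
    boosted so speech is recognizable, loud input is attenuated so it no
    longer exceeds the [-1.0, 1.0] range where clipping would occur."""
    peak = float(np.max(np.abs(audio)))
    if peak < 1e-9:                 # silence: nothing to scale
        return audio.copy()
    return audio * (target_peak / peak)

t = np.linspace(0, 1, 8000)
quiet = 0.01 * np.sin(2 * np.pi * 200 * t)   # too quiet to recognize reliably
loud = 3.0 * np.sin(2 * np.pi * 200 * t)     # would clip at the +/-1.0 limit
```

Both signals end up with the same peak level, which is the property the target volume range above is meant to enforce.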
In a possible implementation manner, after performing echo cancellation on the second audio, the terminal also performs noise suppression on the echo-cancelled second audio, and then performs speech recognition on the noise-suppressed second audio to obtain the voice control instruction. Optionally, the terminal applies any noise suppression algorithm to remove or reduce noise in the second audio, so that the user's voice is clearer and the accuracy of speech recognition improves.
In a possible implementation manner, the terminal performing speech recognition on the echo-cancelled second audio through the target application to obtain the voice control instruction includes: the terminal extracts voiceprint information from the echo-cancelled second audio through the target application; the terminal compares this voiceprint information with the voiceprint information in a voiceprint library, which stores voiceprint information that has voice control permission for the target application; and, on the premise that the voiceprint library includes the voiceprint information of the second audio, the terminal performs speech recognition on the echo-cancelled second audio to obtain the voice control instruction. Here, performing speech recognition on the echo-cancelled second audio to obtain the voice control instruction means performing speech recognition on the echo-cancelled second audio to obtain the text corresponding to the second audio, and extracting the voice control instruction from that text.
Like a fingerprint, a voiceprint differs from person to person, is independent of accent and language, and can be used for identity recognition. The voiceprint information of the second audio reflects the voiceprint characteristics of the user, and thus the user's identity.
Optionally, a voiceprint library is pre-stored in the terminal, and it includes at least one piece of voiceprint information with voice control permission for the target application. Optionally, the voiceprint information in the library is entered by the user in advance. For example, the terminal collects the user's voiceprint information through the target application in advance and stores it in the voiceprint library; later, when the user controls the target application by voice, the terminal checks whether the voiceprint information of the currently input audio is in the voiceprint library. If it is, the terminal determines that the user has control permission for the target application and continues to respond to the user's audio to control it. Optionally, the user can also enter the voiceprint information of others, such as the user's parents, into the voiceprint library so that they can voice-control the target application as well.
In the embodiment of this application, the voiceprint library stores the voiceprint information that has voice control permission for the target application. After the user's audio is collected, the voiceprint library is used to determine whether the user has voice control permission, and voice control proceeds only on the premise that the user does. On the one hand, this avoids false recognition, that is, recognizing someone else's voice as the user's and thereby falsely controlling the target application; on the other hand, it improves the security of the target application.
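The permission check can be sketched as below, assuming each voiceprint is represented by a fixed-length embedding vector and compared by cosine similarity; the embedding extractor itself (e.g. a speaker-verification model) is outside the scope of this sketch, and the threshold value is illustrative.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def has_voice_control_permission(embedding, voiceprint_library, threshold=0.8):
    """Return True only if the speaker embedding matches an enrolled
    voiceprint, i.e. the speaker has voice control permission."""
    return any(cosine_similarity(embedding, v) >= threshold
               for v in voiceprint_library)

# Enrolled voiceprints (toy 3-dimensional embeddings for illustration).
library = [np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])]
```

Only when this gate passes does the terminal go on to run speech recognition, matching the "on the premise that" condition above.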
304. The terminal extracts the voice control instruction from the text through the target application.
Optionally, the terminal performs word segmentation on the text through the target application to obtain at least one word included in the text, determines the similarity between each word and each voice control instruction in an instruction library, and takes the voice control instruction in the library with the highest similarity to the words in the text as the voice control instruction extracted from the text. For example, if the text corresponding to the second audio is "play the next song", word segmentation yields "play", "next", and "song"; by computing similarities, the terminal determines that the voice control instruction "play next" in the instruction library has the highest similarity to the words in the text, and therefore takes "play next" as the voice control instruction extracted from the text.
Optionally, the terminal can instead directly determine the similarity between each voice control instruction in the instruction library and the whole text, without word segmentation, and take the voice control instruction with the highest similarity as the one extracted from the text. This method is simple and efficient. Of course, the terminal can also extract the voice control instruction from the text in other ways, which is not limited in the embodiments of this application.
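A minimal sketch of the segmentation-free variant: compare the recognized text directly against each instruction in a hypothetical instruction library and keep the most similar one. Here `difflib.SequenceMatcher` stands in for whatever similarity measure the terminal actually uses.

```python
from difflib import SequenceMatcher

INSTRUCTION_LIBRARY = ["play next", "stop playing", "download audio",
                       "issue bullet screen", "share audio"]

def extract_instruction(text, library=INSTRUCTION_LIBRARY, min_ratio=0.4):
    """Return the library instruction most similar to the recognized text,
    or None when nothing is similar enough."""
    best, best_score = None, 0.0
    for cmd in library:
        score = SequenceMatcher(None, text.lower(), cmd).ratio()
        if score > best_score:
            best, best_score = cmd, score
    return best if best_score >= min_ratio else None
```

For example, `extract_instruction("play the next song")` returns `"play next"`, matching the worked example above. The `min_ratio` floor keeps unrelated utterances from being forced onto the nearest instruction.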
Optionally, the terminal uploads the collected second audio to a server; the server processes the second audio to obtain the voice control instruction and delivers it to the terminal.
305. The terminal controls the target application according to the voice control instruction.
The voice control instructions are of various types, and different types of voice control instructions are used for controlling the target application to execute different operations.
In one possible implementation, the method includes: the terminal, in response to the voice control instruction being a bullet screen issuing instruction, extracts, through the target application, the target text located after the bullet screen issuing instruction from the text; and issues a bullet screen containing the target text in the audio playing interface corresponding to the first audio. Optionally, the terminal issues the bullet screen as follows: the terminal sends the first audio and the bullet screen to the server; the server sends them to every terminal currently playing the first audio; and each of those terminals displays the bullet screen in its current audio playing interface.
Optionally, the bullet screen issuing instruction is "issue bullet screen". Accordingly, when the user wants to publish a bullet screen praising the currently played first audio, the user only needs to say, for example, "issue bullet screen: this song is amazing". The terminal extracts the voice control instruction "issue bullet screen" and the target text "this song is amazing" located after it, and then issues a bullet screen containing "this song is amazing" in the audio playing interface corresponding to the first audio.
In the embodiment of this application, the bullet screen issuing instruction lets the user publish a bullet screen about the currently playing first audio by voice alone, without any manual operation, which improves the efficiency of publishing bullet screens, strengthens the voice control function, and improves the user experience.
In a possible implementation manner, after the terminal issues the bullet screen in the audio playing interface corresponding to the first audio, the method further includes: the terminal displays an audio playing interface comprising the bullet screen, so that a user can check the currently published bullet screen in the audio playing interface in real time.
For example, the terminal displays an audio playing interface of the first audio while playing the first audio, at this time, the user inputs a barrage issuing instruction through voice, and the terminal issues a barrage in the current audio playing interface.
In a possible implementation manner, the terminal controlling the target application according to the voice control instruction includes: in response to the voice control instruction being an audio sharing instruction, the terminal controls the target application to generate a sharing link for the first audio and publishes the sharing link to the target page indicated by the audio sharing instruction. Optionally, the target page is an audio sharing page of the target application, or a page of another application. For example, if the target application is an audio playing application and the other application is a chat application, the user can, by voice, control the target application to publish the sharing link of the first audio to a page of the chat application. Optionally, the sharing link of the first audio includes information such as the name and author of the first audio; optionally, it further includes an image corresponding to the first audio, such as the album art or the author's portrait, which is not limited in the embodiments of this application.
In the embodiment of the application, the audio sharing instruction is set, so that the user can control the target application to share the audio only through voice under the condition that manual operation is not needed, the voice control function of the target application is enhanced, and the operation efficiency and the user experience are improved.
In a possible implementation manner, the terminal controlling the target application according to the voice control instruction includes: in response to the voice control instruction being a chorus instruction, the terminal controls the target application to play the first audio from the beginning and to collect a third audio; and, in response to collection of the third audio being completed, synthesizes the first audio and the third audio to obtain the chorus audio. Optionally, the terminal can also store the chorus audio, and can also play it. Optionally, when the first audio includes both a human voice and background audio, the terminal first removes the human voice from the first audio and then synthesizes the background audio of the first audio with the third audio to obtain the chorus audio. Optionally, the terminal determines that collection of the third audio is complete in response to the first audio finishing playing, or in response to an instruction from the user, which is not limited in the embodiments of this application. It should be noted that the terminal controls the target application to play the first audio from the beginning and to collect the third audio continuously until collection is complete.
In the embodiment of the application, through the setting of the chorus instruction, the user can control the target application to realize the chorus function only through voice under the condition of not needing manual operation, the voice control function of the target application is enhanced, and the user experience is improved.
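The final synthesis step can be sketched as follows, under the simplifying assumption that the human voice has already been separated out, so the background (accompaniment) track is available directly; real vocal removal requires source separation, which is outside this sketch.

```python
import numpy as np

def make_chorus(background: np.ndarray, user_voice: np.ndarray) -> np.ndarray:
    """Mix the accompaniment of the first audio with the user's recorded
    third audio: trim to the shorter track, then average the two signals
    so the mix stays inside the original amplitude range."""
    n = min(len(background), len(user_voice))
    return 0.5 * (background[:n] + user_voice[:n])

background = np.array([0.2, 0.4, -0.2, 0.0])   # accompaniment samples
user_voice = np.array([0.6, 0.0, 0.2])          # collected third audio
chorus = make_chorus(background, user_voice)    # mixed chorus audio
```

Averaging is the simplest mixing rule; a real implementation would more likely apply per-track gains and time-align the user's recording with the accompaniment.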
In a possible implementation manner, the terminal controlling the target application according to the voice control instruction includes: in response to the voice control instruction being a live room jump instruction, the terminal controls the target application to jump to the live interface corresponding to a target live room and switches the currently played first audio to the audio of the target live room, where the target live room is any live room, for example, one whose live content is related to the first audio. In the embodiment of this application, the live room jump instruction lets the user enter a live room by voice alone, without any manual operation, which strengthens the voice control function of the target application and improves the user experience.
In a possible implementation manner, the terminal controlling the target application according to the voice control instruction includes: in response to the voice control instruction being an audio switching instruction, the terminal controls the target application to switch away from the first audio. Optionally, the first audio is any audio in a playlist. That is, the target application maintains a playlist containing a plurality of audios, the terminal plays each audio in order according to the playlist, and the first audio is the currently played one.
For example, if the currently played first audio is one song in the playlist, the terminal switches the currently played first audio to another song in the playlist in response to the voice control instruction being the audio switching instruction.
Optionally, the audio switching instruction includes a forward switching instruction, a backward switching instruction, and a random switching instruction. In response to the voice control instruction being a forward switching instruction, the terminal switches the currently played first audio to the audio before it in the playlist. In response to the voice control instruction being a backward switching instruction, the terminal switches the currently played first audio to the audio after it in the playlist. In response to the voice control instruction being a random switching instruction, the terminal randomly switches the currently played first audio to another audio in the playlist.
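The three switching modes can be sketched with a small playlist structure (hypothetical class and method names, following the convention above that "forward" means the previous list entry and "backward" the next):

```python
import random

class Playlist:
    """Playlist whose current track can be switched forward, backward,
    or to a random different track."""
    def __init__(self, tracks):
        self.tracks = list(tracks)
        self.index = 0                      # currently playing audio

    def current(self):
        return self.tracks[self.index]

    def switch_forward(self):               # previous audio in the list
        self.index = (self.index - 1) % len(self.tracks)
        return self.current()

    def switch_backward(self):              # next audio in the list
        self.index = (self.index + 1) % len(self.tracks)
        return self.current()

    def switch_random(self):                # any *other* audio in the list
        others = [i for i in range(len(self.tracks)) if i != self.index]
        self.index = random.choice(others)
        return self.current()
```

Excluding the current index in `switch_random` guarantees the random switch really moves to another audio, as the text requires.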
In a possible implementation manner, the terminal controlling the target application according to the voice control instruction includes: in response to the voice control instruction being a stop playing instruction, the terminal controls the target application to stop playing the first audio; or, in response to the voice control instruction being a loop playing instruction, the terminal controls the target application to play the currently played first audio in a loop.
In a possible implementation manner, the terminal controlling the target application according to the voice control instruction includes: in response to the voice control instruction being an audio download instruction, the terminal controls the target application to download the first audio. Then, even without a network, the terminal can fetch the first audio locally and play it.
In a possible implementation manner, the terminal controlling the target application according to the voice control instruction includes: in response to the voice control instruction being an audio collection instruction, the terminal controls the target application to add the first audio to the audio set indicated by the instruction. The audio set contains at least one audio, each added by the user, so the user can quickly find collected audio through the set.
In the embodiment of this application, by providing voice control instructions such as the audio switching instruction, the stop playing instruction, and the loop playing instruction, the user can perform various operations on the target application by voice alone, without any manual operation, which strengthens the voice control function of the target application.
In fact, in the embodiment of this application, the types of voice control instructions can be set according to actual needs; that is, the target application can be controlled to execute any operation through a voice control instruction. For example, an application closing instruction can be set so that the target application is closed by voice. This is not limited in the embodiments of this application.
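Since each instruction type maps to one operation, the control step in 305 is naturally a dispatch table. The handlers below are placeholders standing in for the operations described above, not part of the patent's method:

```python
def handle_instruction(instruction: str, context: dict) -> str:
    """Dispatch a recognized voice control instruction to the operation
    it names; unknown instructions are reported rather than executed."""
    handlers = {
        "play next": lambda ctx: "switching from " + ctx["current"],
        "stop playing": lambda ctx: "stopping " + ctx["current"],
        "download audio": lambda ctx: "downloading " + ctx["current"],
        "close application": lambda ctx: "closing target application",
    }
    handler = handlers.get(instruction)
    return handler(context) if handler else "unknown instruction"
```

Adding a new instruction type, such as the application closing instruction mentioned above, is then just one more entry in the table.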
The embodiment of this application provides a new target application that supports voice control while playing audio, making voice control more flexible. In addition, considering that the second audio collected while the first audio is playing may include an echo of the first audio, echo cancellation is performed on the second audio; this avoids interference from the echo of the first audio, ensures the accuracy of speech recognition on the second audio, and thereby ensures the accuracy of voice control.
All the above optional technical solutions may be combined arbitrarily to form optional embodiments of the present application, and are not described herein again.
Fig. 4 is a block diagram of a voice control apparatus according to an embodiment of the present application. Referring to fig. 4, the apparatus includes:
an audio capture module 401 configured to capture, by the target application, a second audio while playing the first audio;
an echo cancellation module 402 configured to perform echo cancellation on the second audio;
a voice recognition module 403, configured to perform voice recognition on the echo-cancelled second audio, so as to obtain a voice control instruction;
and an application control module 404 configured to control the target application according to the voice control instruction.
In one possible implementation, the voice recognition module 403 is configured to adjust the volume of the echo-cancelled second audio to be within a target volume range; and carrying out voice recognition on the second audio frequency after the volume adjustment to obtain a voice control instruction.
In one possible implementation, the speech recognition module 403 is configured to extract voiceprint information of the echo-cancelled second audio;
comparing the voiceprint information with voiceprint information in a voiceprint library, wherein the voiceprint library is used for storing the voiceprint information with the voice control authority of the target application; and on the premise that the voiceprint library comprises the voiceprint information of the second audio, performing voice recognition on the second audio after the echo is eliminated to obtain a voice control instruction.
In a possible implementation manner, the speech recognition module 403 is configured to perform speech recognition on the echo-cancelled second audio to obtain a text corresponding to the second audio; and extracting the voice control instruction from the text.
In one possible implementation manner, the application control module 404 is configured to, in response to the voice control instruction being a bullet screen issuing instruction, extract a target text located after the bullet screen issuing instruction from the text; and issuing a bullet screen in an audio playing interface corresponding to the first audio, wherein the bullet screen comprises a target text.
In one possible implementation, the apparatus further includes:
and the interface display module is configured to display an audio playing interface comprising a bullet screen.
In one possible implementation, the application control module 404 is configured to control the target application to switch the first audio in response to the voice control instruction being an audio switching instruction.
In one possible implementation, the application control module 404 is configured to control the target application to stop playing the first audio in response to the voice control instruction being a stop playing instruction.
In one possible implementation manner, the application control module 404 is configured to, in response to that the voice control instruction is an audio sharing instruction, control the target application to generate a sharing link of the first audio, and issue the sharing link to a target page indicated by the audio sharing instruction.
In one possible implementation, the application control module 404 is configured to, in response to the voice control instruction being a chorus instruction, control the target application to play the first audio from the beginning and acquire a third audio; and in response to the completion of the acquisition of the third audio, synthesizing the first audio and the third audio to obtain chorus audio.
In one possible implementation, the application control module 404 is configured to remove the human voice in the first audio in a case that the first audio includes the human voice and background audio; and synthesizing the background audio frequency in the obtained first audio frequency and the third audio frequency to obtain the chorus audio frequency.
In one possible implementation, the application control module 404 is configured to, in response to the voice control instruction being an audio collection instruction, control the target application to add the first audio to an audio set indicated by the audio collection instruction.
In one possible implementation, the application control module 404 is configured to control the target application to download the first audio in response to the voice control instruction being an audio download instruction.
In one possible implementation, the audio capture module 401 is configured to transmit, by the target application, the first audio to the connected headphones based on the first protocol, the headphones being used to play the first audio; and receiving second audio transmitted by the earphone based on a second protocol, wherein the second audio is collected by the earphone while the first audio is played.
The embodiment of the present application provides a new target application that supports a voice control function while playing audio, which makes voice control more flexible. In addition, because the second audio collected while the first audio is playing may include an echo of the first audio, performing echo cancellation on the second audio avoids interference from that echo, which ensures the accuracy of speech recognition on the second audio and thus the accuracy of voice control.
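The embodiment leaves the echo-cancellation algorithm unspecified. A common choice is a normalized least-mean-squares (NLMS) adaptive filter that uses the played first audio as the reference signal; a minimal single-channel sketch follows (the function name and parameters are assumptions — a production canceller would also handle delay estimation and double-talk detection):

```python
import numpy as np

def nlms_echo_cancel(mic, ref, taps=64, mu=0.5, eps=1e-8):
    """Suppress the echo of `ref` (the played first audio) in `mic`
    (the captured second audio) with an NLMS adaptive filter.
    Returns the residual, i.e. an estimate of the near-end speech."""
    w = np.zeros(taps)
    padded = np.concatenate([np.zeros(taps - 1), ref])
    out = np.empty(len(mic), dtype=float)
    for n in range(len(mic)):
        x = padded[n:n + taps][::-1]           # most recent reference samples
        echo_est = w @ x                       # estimated echo component
        e = mic[n] - echo_est                  # residual after echo removal
        w += (mu / (eps + x @ x)) * e * x      # normalized LMS weight update
        out[n] = e
    return out
```

After this step the residual is volume-normalized and passed to the speech recognizer, matching the pipeline described above.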
It should be noted that the division into the functional modules described above is only used for illustration when the voice control apparatus provided in the above embodiment performs voice control. In practical applications, the above functions may be allocated to different functional modules as needed; that is, the internal structure of the electronic device may be divided into different functional modules to complete all or part of the functions described above. In addition, the voice control apparatus and the voice control method provided by the above embodiments belong to the same concept; their specific implementation processes are detailed in the method embodiments and are not described again here.
The embodiment of the present application further provides an electronic device, where the electronic device includes a processor and a memory, where the memory stores at least one program code, and the at least one program code is loaded and executed by the processor, so as to implement the operations executed in the voice control method of the foregoing embodiment.
Optionally, the electronic device is provided as a terminal. Fig. 5 shows a block diagram of a terminal 500 according to an exemplary embodiment of the present application. The terminal 500 may be a smart phone, a tablet computer, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a notebook computer, or a desktop computer. The terminal 500 may also be referred to by other names, such as user equipment, portable terminal, laptop terminal, or desktop terminal.
The terminal 500 includes: a processor 501 and a memory 502.
The processor 501 may include one or more processing cores, such as a 4-core processor or an 8-core processor. The processor 501 may be implemented in at least one hardware form of a DSP (Digital Signal Processor), an FPGA (Field-Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 501 may also include a main processor and a coprocessor, where the main processor is a processor for processing data in an awake state, also called a CPU (Central Processing Unit), and the coprocessor is a low-power processor for processing data in a standby state. In some embodiments, the processor 501 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content to be displayed on the display screen. In some embodiments, the processor 501 may also include an AI (Artificial Intelligence) processor for processing computational operations related to machine learning.
Memory 502 may include one or more computer-readable storage media, which may be non-transitory. Memory 502 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 502 is used to store at least one program code for execution by processor 501 to implement the voice control methods provided by method embodiments herein.
In some embodiments, the terminal 500 may further optionally include: a peripheral interface 503 and at least one peripheral. The processor 501, memory 502 and peripheral interface 503 may be connected by a bus or signal lines. Each peripheral may be connected to the peripheral interface 503 by a bus, signal line, or circuit board. Specifically, the peripheral device includes: at least one of radio frequency circuitry 504, display screen 505, camera assembly 506, audio circuitry 507, positioning assembly 508, and power supply 509.
The peripheral interface 503 may be used to connect at least one peripheral related to I/O (Input/Output) to the processor 501 and the memory 502. In some embodiments, the processor 501, memory 502, and peripheral interface 503 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 501, the memory 502, and the peripheral interface 503 may be implemented on a separate chip or circuit board, which is not limited in this embodiment.
The radio frequency circuit 504 is used for receiving and transmitting RF (Radio Frequency) signals, also called electromagnetic signals. The radio frequency circuit 504 communicates with communication networks and other communication devices via electromagnetic signals, converting an electrical signal into an electromagnetic signal for transmission, or converting a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 504 includes an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so forth. The radio frequency circuit 504 may communicate with other terminals via at least one wireless communication protocol. The wireless communication protocols include, but are not limited to: metropolitan area networks, mobile communication networks of various generations (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 504 may further include NFC (Near Field Communication) related circuits, which is not limited in this application.
The display screen 505 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display screen 505 is a touch display screen, the display screen 505 also has the ability to capture touch signals on or over its surface. The touch signal may be input to the processor 501 as a control signal for processing. At this point, the display screen 505 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, there may be one display screen 505, providing the front panel of the terminal 500; in other embodiments, there may be at least two display screens 505, respectively disposed on different surfaces of the terminal 500 or in a folded design; in still other embodiments, the display screen 505 may be a flexible display disposed on a curved or folded surface of the terminal 500. The display screen 505 can even be arranged in an irregular, non-rectangular shape, i.e., a shaped screen. The display screen 505 may be an LCD (Liquid Crystal Display), an OLED (Organic Light-Emitting Diode) display, or the like.
The camera assembly 506 is used to capture images or video. Optionally, the camera assembly 506 includes a front camera and a rear camera. Generally, the front camera is disposed on the front panel of the terminal, and the rear camera is disposed on the back of the terminal. In some embodiments, there are at least two rear cameras, each of which is any one of a main camera, a depth-of-field camera, a wide-angle camera, and a telephoto camera, so that the main camera and the depth-of-field camera can be fused to realize a background blurring function, and the main camera and the wide-angle camera can be fused to realize panoramic shooting, VR (Virtual Reality) shooting, or other fusion shooting functions. In some embodiments, the camera assembly 506 may also include a flash. The flash may be a monochrome-temperature flash or a dual-color-temperature flash. A dual-color-temperature flash is a combination of a warm-light flash and a cold-light flash, and can be used for light compensation at different color temperatures.
The audio circuit 507 may include a microphone and a speaker. The microphone is used to collect sound waves from the user and the environment, convert the sound waves into electrical signals, and input them to the processor 501 for processing, or to the radio frequency circuit 504 to realize voice communication. For stereo sound collection or noise reduction, multiple microphones may be provided at different parts of the terminal 500. The microphone may also be an array microphone or an omnidirectional pickup microphone. The speaker is used to convert electrical signals from the processor 501 or the radio frequency circuit 504 into sound waves. The speaker may be a conventional membrane speaker or a piezoelectric ceramic speaker. A piezoelectric ceramic speaker can convert an electrical signal into sound waves audible to humans, or into sound waves inaudible to humans for purposes such as distance measurement. In some embodiments, the audio circuit 507 may also include a headphone jack.
The positioning component 508 is used to determine the current geographic location of the terminal 500 for navigation or LBS (Location Based Service). The positioning component 508 may be a positioning component based on the GPS (Global Positioning System) of the United States, the BeiDou system of China, the GLONASS system of Russia, or the Galileo system of the European Union.
The power supply 509 is used to supply power to the various components in the terminal 500. The power supply 509 may use alternating current, direct current, a disposable battery, or a rechargeable battery. When the power supply 509 includes a rechargeable battery, the rechargeable battery may support wired or wireless charging, and may also support fast-charge technology.
In some embodiments, terminal 500 also includes one or more sensors 510. The one or more sensors 510 include, but are not limited to: acceleration sensor 511, gyro sensor 512, pressure sensor 513, fingerprint sensor 514, optical sensor 515, and proximity sensor 516.
The acceleration sensor 511 may detect the magnitude of acceleration on three coordinate axes of the coordinate system established with the terminal 500. For example, the acceleration sensor 511 may be used to detect components of the gravitational acceleration in three coordinate axes. The processor 501 may control the display screen 505 to display the user interface in a landscape view or a portrait view according to the gravitational acceleration signal collected by the acceleration sensor 511. The acceleration sensor 511 may also be used for acquisition of motion data of a game or a user.
The gyro sensor 512 may detect a body direction and a rotation angle of the terminal 500, and the gyro sensor 512 may cooperate with the acceleration sensor 511 to acquire a 3D motion of the user on the terminal 500. The processor 501 may implement the following functions according to the data collected by the gyro sensor 512: motion sensing (such as changing the UI according to a user's tilting operation), image stabilization at the time of photographing, game control, and inertial navigation.
The pressure sensor 513 may be disposed on a side frame of the terminal 500 and/or underneath the display screen 505. When the pressure sensor 513 is disposed on the side frame of the terminal 500, a user's holding signal of the terminal 500 may be detected, and the processor 501 performs left-right hand recognition or shortcut operation according to the holding signal collected by the pressure sensor 513. When the pressure sensor 513 is disposed at the lower layer of the display screen 505, the processor 501 controls the operability control on the UI interface according to the pressure operation of the user on the display screen 505. The operability control comprises at least one of a button control, a scroll bar control, an icon control and a menu control.
The fingerprint sensor 514 is used for collecting a fingerprint of the user, and the processor 501 identifies the identity of the user according to the fingerprint collected by the fingerprint sensor 514, or the fingerprint sensor 514 identifies the identity of the user according to the collected fingerprint. Upon recognizing that the user's identity is a trusted identity, the processor 501 authorizes the user to perform relevant sensitive operations including unlocking the screen, viewing encrypted information, downloading software, paying, and changing settings, etc. The fingerprint sensor 514 may be provided on the front, back, or side of the terminal 500. When a physical button or a vendor Logo is provided on the terminal 500, the fingerprint sensor 514 may be integrated with the physical button or the vendor Logo.
The optical sensor 515 is used to collect the ambient light intensity. In one embodiment, the processor 501 may control the display brightness of the display screen 505 based on the ambient light intensity collected by the optical sensor 515: when the ambient light intensity is high, the display brightness of the display screen 505 is increased; when the ambient light intensity is low, the display brightness of the display screen 505 is reduced. In another embodiment, the processor 501 may also dynamically adjust the shooting parameters of the camera assembly 506 based on the ambient light intensity collected by the optical sensor 515.
The proximity sensor 516, also called a distance sensor, is provided on the front panel of the terminal 500 and is used to collect the distance between the user and the front surface of the terminal 500. In one embodiment, when the proximity sensor 516 detects that the distance between the user and the front surface of the terminal 500 gradually decreases, the processor 501 controls the display screen 505 to switch from the screen-on state to the screen-off state; when the proximity sensor 516 detects that the distance gradually increases, the processor 501 controls the display screen 505 to switch from the screen-off state to the screen-on state.
Those skilled in the art will appreciate that the configuration shown in fig. 5 is not intended to be limiting of terminal 500 and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components may be used.
Optionally, the electronic device is provided as a server. Fig. 6 is a schematic structural diagram of a server according to an embodiment of the present application. The server 600 may vary considerably in configuration and performance, and may include one or more processors (CPUs) 601 and one or more memories 602, where at least one program code is stored in the memory 602 and is loaded and executed by the processor 601 to implement the voice control method provided by the foregoing method embodiments. Of course, the server may also have components such as a wired or wireless network interface, a keyboard, and an input/output interface for input and output, and may include other components for implementing device functions, which are not described here again.
The embodiment of the present application further provides a computer-readable storage medium, where at least one program code is stored in the computer-readable storage medium, and the at least one program code is loaded and executed by a processor to implement the operations executed in the voice control method of the foregoing embodiment.
The embodiment of the present application further provides a computer program, where the computer program includes at least one program code, and the at least one program code is loaded and executed by a processor to implement the operations executed in the voice control method of the foregoing embodiment.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The above description is only exemplary of the present application and should not be taken as limiting, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (17)

1. A method for voice control, the method comprising:
collecting, by a target application, a second audio while playing a first audio;
performing echo cancellation on the second audio;
performing voice recognition on the second audio after the echo cancellation to obtain a voice control instruction;
and controlling the target application according to the voice control instruction.
2. The method of claim 1, wherein the performing speech recognition on the echo-cancelled second audio to obtain a voice control instruction comprises:
adjusting the volume of the second audio after the echo cancellation to be within a target volume range;
and carrying out voice recognition on the second audio after the volume adjustment to obtain the voice control instruction.
3. The method of claim 1, wherein the performing speech recognition on the echo-cancelled second audio to obtain a voice control instruction comprises:
extracting voiceprint information of the second audio after the echo cancellation;
comparing the voiceprint information with voiceprint information in a voiceprint library, wherein the voiceprint library is used for storing the voiceprint information with the voice control authority of the target application;
and on the premise that the voiceprint library comprises the voiceprint information of the second audio, performing voice recognition on the second audio after the echo cancellation to obtain the voice control instruction.
4. The method of claim 1, wherein the performing speech recognition on the echo-cancelled second audio to obtain a voice control instruction comprises:
performing voice recognition on the second audio after the echo cancellation to obtain a text corresponding to the second audio;
and extracting the voice control instruction from the text.
5. The method of claim 4, wherein the controlling the target application according to the voice control instruction comprises:
in response to the voice control instruction being a bullet screen issuing instruction, extracting a target text located after the bullet screen issuing instruction from the text;
and issuing a bullet screen in an audio playing interface corresponding to the first audio, wherein the bullet screen comprises the target text.
6. The method of claim 5, wherein after the bullet-screen is published in the audio playback interface corresponding to the first audio, the method further comprises:
and displaying the audio playing interface comprising the bullet screen.
7. The method of claim 1, wherein the controlling the target application according to the voice control instruction comprises:
in response to the voice control instruction being an audio switching instruction, controlling the target application to switch the first audio.
8. The method of claim 1, wherein the controlling the target application according to the voice control instruction comprises:
in response to the voice control instruction being a play stop instruction, controlling the target application to stop playing the first audio.
9. The method of claim 1, wherein the controlling the target application according to the voice control instruction comprises:
in response to the voice control instruction being an audio sharing instruction, controlling the target application to generate a sharing link of the first audio, and publishing the sharing link to a target page indicated by the audio sharing instruction.
10. The method of claim 1, wherein the controlling the target application according to the voice control instruction comprises:
in response to the voice control instruction being a chorus instruction, controlling the target application to play the first audio from the beginning, and collecting a third audio;
and in response to the completion of the acquisition of the third audio, synthesizing the first audio and the third audio to obtain chorus audio.
11. The method of claim 10, wherein the synthesizing the first audio and the third audio to obtain chorus audio comprises:
in the case that the first audio comprises human voice and background audio, removing the human voice in the first audio;
and synthesizing the background audio obtained from the first audio with the third audio to obtain the chorus audio.
12. The method of claim 1, wherein the controlling the target application according to the voice control instruction comprises:
in response to the voice control instruction being an audio collection instruction, controlling the target application to add the first audio to an audio set indicated by the audio collection instruction.
13. The method of claim 1, wherein the controlling the target application according to the voice control instruction comprises:
in response to the voice control instruction being an audio downloading instruction, controlling the target application to download the first audio.
14. The method of claim 1, wherein capturing, by the target application, the second audio while playing the first audio comprises:
transmitting, by the target application, the first audio to a connected headset based on a first protocol, the headset to play the first audio;
and receiving the second audio transmitted by the headset based on a second protocol, wherein the second audio is collected by the headset while the first audio is played.
15. A voice control apparatus, characterized in that the apparatus comprises:
an audio capture module configured to capture, by a target application, a second audio while playing a first audio;
an echo cancellation module configured to perform echo cancellation on the second audio;
the voice recognition module is configured to perform voice recognition on the second audio after the echo cancellation to obtain a voice control instruction;
and the application control module is configured to control the target application according to the voice control instruction.
16. An electronic device, comprising a processor and a memory, wherein at least one program code is stored in the memory, and wherein the program code is loaded and executed by the processor to perform the operations performed by the voice control method according to any one of claims 1 to 14.
17. A computer-readable storage medium, in which at least one program code is stored, the program code being loaded and executed by a processor to implement the operations performed by the voice control method according to any one of claims 1 to 14.
CN202110653278.9A 2021-06-11 2021-06-11 Voice control method, device, equipment and storage medium Active CN113380248B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110653278.9A CN113380248B (en) 2021-06-11 2021-06-11 Voice control method, device, equipment and storage medium


Publications (2)

Publication Number Publication Date
CN113380248A true CN113380248A (en) 2021-09-10
CN113380248B CN113380248B (en) 2024-06-25

Family

ID=77573881

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110653278.9A Active CN113380248B (en) 2021-06-11 2021-06-11 Voice control method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113380248B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103474055A (en) * 2012-08-06 2013-12-25 苏州沃通信息科技有限公司 Mobile phone KTV solution
CN106910500A (en) * 2016-12-23 2017-06-30 北京第九实验室科技有限公司 The method and apparatus of Voice command is carried out to the equipment with microphone array
CN107705778A (en) * 2017-08-23 2018-02-16 腾讯音乐娱乐(深圳)有限公司 Audio-frequency processing method, device, storage medium and terminal
CN108269560A (en) * 2017-01-04 2018-07-10 北京酷我科技有限公司 A kind of speech synthesizing method and system
CN109302576A (en) * 2018-09-05 2019-02-01 视联动力信息技术股份有限公司 Meeting treating method and apparatus
CN109817238A (en) * 2019-03-14 2019-05-28 百度在线网络技术(北京)有限公司 Audio signal sample device, acoustic signal processing method and device
CN109979443A (en) * 2017-12-27 2019-07-05 深圳市优必选科技有限公司 A kind of rights management control method and device for robot
US10593328B1 (en) * 2016-12-27 2020-03-17 Amazon Technologies, Inc. Voice control of remote device
CN112489653A (en) * 2020-11-16 2021-03-12 北京小米松果电子有限公司 Speech recognition method, device and storage medium


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
WANG Wanting; HUA Hongsheng: "Design of a Portable Command Communication Terminal and Its Control Method", Wireless Internet Technology, no. 18, 25 September 2017 (2017-09-25) *
WANG Jianzheng, DING Jianying: "The Emergence and Development of Bluetooth Wireless Technology", Journal of Shandong Institute of Education, no. 01, 15 March 2003 (2003-03-15) *

Also Published As

Publication number Publication date
CN113380248B (en) 2024-06-25


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant