CN114822525A - Voice control method and electronic equipment - Google Patents

Voice control method and electronic equipment

Info

Publication number
CN114822525A
CN114822525A (application number CN202110130831.0A)
Authority
CN
China
Prior art keywords
electronic equipment
voice
electronic device
recording data
electronic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110130831.0A
Other languages
Chinese (zh)
Inventor
王晓博
许嘉璐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Priority to CN202110130831.0A
Priority to PCT/CN2021/142083 (published as WO2022161077A1)
Published as CN114822525A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/223 Execution procedure of a spoken command
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M 1/00 Substation equipment, e.g. for use by subscribers
    • H04M 1/72 Mobile telephones; Cordless telephones, i.e. devices for establishing wireless links to base stations without route selection
    • H04M 1/724 User interfaces specially adapted for cordless or mobile telephones
    • H04M 1/72403 User interfaces with means for local support of applications that increase the functionality
    • H04M 1/7243 User interfaces with interactive means for internal management of messages
    • H04M 1/72433 User interfaces with interactive means for internal management of messages for voice messaging, e.g. dictaphones
    • H04M 2250/00 Details of telephonic subscriber devices
    • H04M 2250/74 Details of telephonic subscriber devices with voice recognition means
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/41 Structure of client; Structure of client peripherals
    • H04N 21/422 Input-only peripherals, i.e. input devices connected to specially adapted client devices, e.g. global positioning system [GPS]
    • H04N 21/42203 Input-only peripherals: sound input device, e.g. microphone
    • H04N 21/42204 User interfaces specially adapted for controlling a client device through a remote control device; Remote control devices therefor

Abstract

This application provides a voice control method and an electronic device. The voice control method is applied to a voice control system that includes at least a first electronic device and a second electronic device, both having a voice control function, and the method includes the following steps: the first electronic device and the second electronic device each receive a first voice instruction input by a user, and the first electronic device responds to the first voice instruction; the second electronic device records and stores the recording data, the recording being used to capture a second voice instruction input by the user; the second electronic device sends its recording data to the first electronic device; and the first electronic device responds to the second voice instruction according to the recording data of the first electronic device and/or the recording data of the second electronic device, where the recording data of the first electronic device includes recording data of the second voice instruction input by the user and recorded by the first electronic device. Embodiments of this application can resolve misrecognition in voice control in multi-device scenarios and improve the accuracy of voice control.

Description

Voice control method and electronic equipment
Technical Field
The present application relates to computer technologies, and in particular, to a voice control method and an electronic device.
Background
A voice assistant is a terminal application (APP) based on speech and semantic algorithms; it provides services such as interactive dialogue, information query, and device control by receiving and recognizing voice signals from users. With the continued development of deep-learning theory and the maturing of intelligent voice hardware, voice assistant applications have become a standard software feature of terminal devices such as smartphones, tablet computers, smart televisions, and smart speakers.
With the widespread use of terminal devices equipped with voice assistants, many users now own multiple terminal devices of the same or different types. In a scenario where a user uses multiple terminal devices concurrently, or where a voice interaction occurs within the effective working range of multiple terminal devices, the device with the clearest pickup (that is, the terminal device closest to the user) is selected, through signal detection and negotiation among the devices, as the sound-pickup entry for the voice assistant application, which can improve the recognition accuracy of the voice assistant application. For example, a user's living room contains three devices, a speaker, a television, and a mobile phone, each with a voice assistant application installed and all sharing the wake-up word "small E, small E". When the user says the wake-up word, the voice assistant applications on the speaker, the television, and the phone select one of the three devices as the answering device by detecting the audio energy information of the wake-up word. Because the speaker is closest to the user, the three devices negotiate, based on the audio energy of the wake-up word, to select the speaker as the answering device. The speaker wakes its voice assistant application, while the other devices do not respond to the wake-up word, that is, they do not wake their respective voice assistant applications. As a result, after the user continues speaking, only the speaker recognizes and responds to the user's voice signal. For example, after the user says "Play song 112222", the speaker recognizes and responds to the voice signal, for instance by outputting the voice signal "Song 112222 will be played for you".
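The energy-based negotiation described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the device names, energy values, and the `select_answering_device` helper are all hypothetical.

```python
# Hypothetical sketch of answering-device negotiation: each device measures
# the audio energy of the wake word it picked up, and the device with the
# highest energy (assumed closest to the user) becomes the answering device.

def select_answering_device(energy_reports: dict) -> str:
    """energy_reports maps device name -> wake-word audio energy (e.g. dBFS)."""
    return max(energy_reports, key=energy_reports.get)

# The speaker is closest to the user, so it reports the highest energy.
reports = {"speaker": -12.0, "television": -25.5, "phone": -31.0}
answering = select_answering_device(reports)
print(answering)  # -> speaker
```

In practice the devices would exchange these scores over the local network and each apply the same rule, so they agree on the answering device without a central coordinator.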
In this multi-device voice control process, the answering device recognizes and responds to the user's voice signal. However, owing to the diversity and complexity of usage scenarios, this approach suffers from misrecognition: the answering device may fail to accurately recognize the voice signal the user inputs after the wake-up word.
Disclosure of Invention
This application provides a voice control method and an electronic device to resolve misrecognition in voice control in multi-device scenarios and improve the accuracy of voice control.
In a first aspect, an embodiment of this application provides a voice control method. The method may be applied to a voice control system that includes at least a first electronic device and a second electronic device, both having a voice control function, and may include: the first electronic device and the second electronic device each receive a first voice instruction input by a user, and the first electronic device responds to the first voice instruction; the second electronic device records and stores the recording data, the recording being used to capture a second voice instruction input by the user; the second electronic device sends its recording data to the first electronic device; and the first electronic device responds to the second voice instruction according to the recording data of the first electronic device and/or the recording data of the second electronic device, where the recording data of the first electronic device includes recording data of the second voice instruction input by the user and recorded by the first electronic device.
That is, the second electronic device may record the second voice instruction input by the user and store it, and after the first electronic device has been determined as the answering device, the second electronic device may send its recording data to the first electronic device so that the first electronic device can respond to the second voice instruction.

In this implementation, the first electronic device, as the answering device, responds to the first voice instruction; the first and second electronic devices both record the second voice instruction and store the recording data; the second electronic device sends its recording data to the first electronic device; and the first electronic device responds to the second voice instruction according to the recording data of the first electronic device and/or the recording data of the second electronic device. Because the non-answering device also records the voice instruction input by the user, and the answering device performs speech enhancement (SE), automatic speech recognition (ASR), and other processing based on its own recording data and/or that of the non-answering device, the inter-device communication delay incurred while the answering device is being selected is effectively eliminated, which resolves the frame-loss problem that this delay causes for voice control in multi-device scenarios. By responding to the second voice instruction using recording data picked up cooperatively by multiple devices, the answering device also mitigates the impact that the audio quality of the picked-up voice instruction has on ASR recognition accuracy, improving the accuracy of voice control.
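The cooperative flow of the first aspect can be sketched end to end as follows. The device roles, the `Device` class, and the audio "frames" are illustrative placeholders, not part of the patent; the key point shown is that every device starts recording on the wake word, before the answering device is negotiated, so no frames of the next command are lost to communication delay.

```python
# Hypothetical sketch of the cooperative-pickup flow of the first aspect.

class Device:
    def __init__(self, name):
        self.name = name
        self.recording = False
        self.buffer = []  # locally stored recording data

    def on_wake_word(self):
        # Start recording immediately, before roles are decided.
        self.recording = True

    def record(self, frame):
        if self.recording:
            self.buffer.append(frame)

first, second = Device("television"), Device("speaker")
for d in (first, second):
    d.on_wake_word()

# Both devices capture the user's second voice command in parallel.
for frame in ("play", "song", "112222"):
    first.record(frame)
    second.record(frame)

# After negotiation selects `first` as the answering device, the second
# device sends its stored recording data across:
received = list(second.buffer)
# The first device can now respond using its own buffer and/or `received`.
print(received)  # -> ['play', 'song', '112222']
```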
In one possible design, the method may further include: the first electronic device sends a sound-pickup instruction to the second electronic device, the sound-pickup instruction instructing the second electronic device to return its recording data.
In one possible design, the recording by the second electronic device may include: the second electronic device records when, or after, it receives the first voice instruction input by the user.

In this implementation, the second electronic device records when or after it receives the first voice instruction input by the user; that is, it starts recording before the answering device is determined, so it can capture the second voice instruction input by the user. This effectively eliminates the inter-device communication delay incurred while the answering device is being selected and resolves the frame-loss problem that this delay causes for voice control in multi-device scenarios.
In one possible design, the method may further include: when or after the first electronic device receives the first voice instruction input by the user, the first electronic device records, the recording being used to capture a second voice instruction input by the user.
In one possible design, the first voice command is used to wake up a voice control function of the first electronic device and/or the second electronic device.
For ease of understanding, the first voice command may be a voice command of step 401 of the embodiment shown in fig. 3 described below.
In one possible design, the method may further include: the first electronic device and the second electronic device each determine, according to the audio quality information of the received first voice instruction, that the first electronic device is the answering device of the voice control system.
In one possible design, after the first electronic device responds to the first voice command and before the second voice command input by the user is recorded, the method may further include: during the recording by the first and second electronic devices, if the first electronic device does not detect a second voice command input by the user within a preset time period, the first electronic device deletes the stored recording data and continues recording; the first electronic device sends a multi-turn conversation pause instruction to the second electronic device, the instruction indicating that the multi-turn conversation is temporarily stopped; and the second electronic device deletes its stored recording data and continues recording.
For ease of understanding, the first voice command may be the voice command before step 701 of the embodiment shown in fig. 6 described below, and the second voice command may be the voice command of step 703 of that embodiment.
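The pause-and-reset behaviour of this design can be sketched as follows. This is a minimal illustration under assumed names and timings (`PRESET_TIMEOUT`, `maybe_reset` are hypothetical), not the patent's implementation:

```python
# Hypothetical sketch of the multi-turn pause behaviour: if no second
# voice command is detected within a preset window, the stored recording
# data is deleted while recording itself continues, so stale audio is
# never fed into recognition.

PRESET_TIMEOUT = 5.0  # assumed: seconds of silence before the buffer resets

def maybe_reset(buffer: list, silence_seconds: float) -> list:
    """Delete saved recording data once the silence window elapses,
    but keep recording (return a fresh, empty buffer)."""
    if silence_seconds >= PRESET_TIMEOUT:
        return []  # saved data deleted; recording continues
    return buffer

buf = ["frame1", "frame2"]
print(maybe_reset(buf, 6.2))  # -> []
print(maybe_reset(buf, 1.0))  # -> ['frame1', 'frame2']
```

In the two-device design, the answering device would apply this rule locally and also send the pause instruction so the other device resets its buffer at the same time.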
In one possible design, the method may further include: the first electronic device receives, from the second electronic device, the audio quality information of the second electronic device's recording data.
This implementation can speed up the decision on the optimal sound-pickup device, thereby improving the response speed of voice control.
In one possible design, the first electronic device responding to the second voice command according to the recording data of the first electronic device and/or the recording data of the second electronic device may include: the first electronic device determines the optimal sound-pickup device in the voice control system according to the audio quality information of its own recording data and the audio quality information of the recording data of the second electronic device. When the optimal sound-pickup device is the first electronic device, the first electronic device responds to the second voice command according to its own recording data, or according to its own recording data together with that of the second electronic device. When the optimal sound-pickup device is the second electronic device, the first electronic device responds to the second voice command according to the recording data of the second electronic device, or according to the recording data of the second electronic device together with its own. The audio quality information indicates the audio quality of the recording data.
In this implementation, responding to the second voice command using the recording data of the optimal sound-pickup device reduces the impact of noise on voice control accuracy.
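The selection described in this design can be sketched as below. The quality scores (e.g. an estimated SNR, higher is better) and the `pick_response_data` helper are assumptions for illustration, not the patent's algorithm:

```python
# Hypothetical sketch: choose the optimal sound-pickup device from the
# audio-quality information, then decide which recording data the
# answering (first) device responds with.

def pick_response_data(first_quality: float, second_quality: float,
                       first_data: str, second_data: str) -> str:
    if first_quality >= second_quality:
        # First device is the optimal pickup: respond with its own data
        # (the design also allows combining it with the second device's).
        return first_data
    # Second device is the optimal pickup: respond with its data.
    return second_data

data = pick_response_data(6.0, 14.0, "first-recording", "second-recording")
print(data)  # -> second-recording
```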
In one possible design, the first electronic device responding to the second voice command according to the recording data of the first electronic device and/or the recording data of the second electronic device may include: the first electronic device responds to the second voice command according to the audio content information of the recording data of the first electronic device and/or the audio content information of the recording data of the second electronic device, where the audio content information represents the audio content of the recording data.
For example, when the recording data of the first electronic device contains more audio content than that of the second electronic device, the first electronic device responds to the second voice command according to the audio content information of its own recording data; when it contains less, the first electronic device responds according to the audio content information of the recording data of the second electronic device. As another example, when the audio content information of the two recordings partially overlaps, the first electronic device may splice the audio content information of the recording data of the first electronic device with that of the second electronic device and respond to the second voice command according to the spliced audio content information.
In this implementation, responding to the second voice command using recording data picked up cooperatively by multiple devices avoids frame loss and improves the accuracy of voice control.
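The splicing case can be illustrated with a small sketch. Modelling the audio content as word sequences is a simplification, and the `splice` helper and its longest-overlap rule are assumptions for illustration, not the patent's algorithm:

```python
# Hypothetical sketch of splicing partially overlapping recordings, e.g.
# when one device missed the start of the command and the other the end:
# join the two sequences on their longest shared head/tail overlap.

def splice(first_words: list, second_words: list) -> list:
    for k in range(min(len(first_words), len(second_words)), 0, -1):
        if first_words[-k:] == second_words[:k]:
            return first_words + second_words[k:]
    return first_words + second_words  # no overlap found

a = ["play", "song", "1122"]        # first device missed the tail
b = ["song", "1122", "22"]          # second device missed the head
print(splice(a, b))  # -> ['play', 'song', '1122', '22']
```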
In a second aspect, an embodiment of this application provides a voice control method applied to a first electronic device of a voice control system, where the voice control system may further include at least a second electronic device, and the method may include: the first electronic device receives a first voice instruction input by a user and responds to the first voice instruction; the first electronic device receives the recording data of the second electronic device sent by the second electronic device, where the recording data of the second electronic device includes recording data of a second voice instruction input by the user and recorded by the second electronic device; and the first electronic device responds to the second voice instruction according to the recording data of the first electronic device and/or the recording data of the second electronic device, where the recording data of the first electronic device includes recording data of the second voice instruction input by the user and recorded by the first electronic device.
In one possible design, the method may further include: the first electronic device sends a sound-pickup instruction to the second electronic device, the sound-pickup instruction instructing the second electronic device to return its recording data.
In one possible design, the method may further include: when or after the first electronic equipment receives a first voice command input by a user, the first electronic equipment records a second voice command input by the user.
In one possible design, the first voice command is used to wake up a voice control function of the first electronic device and/or the second electronic device.
In one possible design, the method may further include: the first electronic device determines that it is the answering device of the voice control system according to the audio quality information of the first voice instruction as received by the first electronic device and the audio quality information of the first voice instruction as received by the second electronic device.
In one possible design, after the first electronic device responds to the first voice command and before the second voice command input by the user is recorded, the method may further include: during its recording, if the first electronic device does not detect a second voice command input by the user within a preset time period, the first electronic device deletes the stored recording data and continues recording; the first electronic device sends a multi-turn conversation pause instruction to the second electronic device, the instruction indicating that the multi-turn conversation is temporarily stopped; and the second electronic device deletes its stored recording data and continues recording.
In one possible design, the method may further include: the first electronic device receives, from the second electronic device, the audio quality information of the second electronic device's recording data.
In one possible design, the first electronic device responding to the second voice command according to the recording data of the first electronic device and/or the recording data of the second electronic device may include: the first electronic device determines the optimal sound-pickup device in the voice control system according to the audio quality information of its own recording data and the audio quality information of the recording data of the second electronic device. When the optimal sound-pickup device is the first electronic device, the first electronic device responds to the second voice command according to its own recording data. When the optimal sound-pickup device is the second electronic device, the first electronic device responds to the second voice command according to the recording data of the second electronic device, or according to the recording data of the second electronic device together with its own. The audio quality information indicates the audio quality of the recording data.
In one possible design, the first electronic device responding to the second voice command according to the recording data of the first electronic device and/or the recording data of the second electronic device may include: the first electronic device responds to the second voice command according to the audio content information of the recording data of the first electronic device and/or the audio content information of the recording data of the second electronic device, where the audio content information represents the audio content of the recording data.
In a third aspect, an embodiment of this application provides a voice control method applied to a second electronic device of a voice control system, where the voice control system may further include at least a first electronic device, and the method may include: the second electronic device records and stores the recording data, the recording being used to capture a second voice instruction input by the user; and the second electronic device sends its recording data to the first electronic device, where the recording data of the second electronic device includes recording data of the second voice instruction input by the user and recorded by the second electronic device, and is used by the first electronic device to respond to the second voice instruction after responding to the first voice instruction.
In one possible design, the method may further include: the second electronic device receives a sound-pickup instruction sent by the first electronic device, the sound-pickup instruction instructing the second electronic device to return its recording data.
In one possible design, the recording by the second electronic device may include: the second electronic device records when, or after, it receives the first voice instruction input by the user.
In one possible design, the method may further include: the second electronic device determines that the first electronic device is the answering device of the voice control system according to the audio quality information of the first voice instruction as received by the second electronic device and the audio quality information of the first voice instruction as received by the first electronic device.
In one possible design, after the first electronic device responds to the first voice instruction, the method may further include: during its recording, the second electronic device receives a multi-turn conversation pause instruction sent by the first electronic device, the instruction indicating that the multi-turn conversation is temporarily stopped; and the second electronic device deletes the stored recording data and continues recording.
In one possible design, the method may further include: the second electronic device sends the audio quality information of its recording data to the first electronic device.
In a fourth aspect, an embodiment of this application provides a voice control apparatus having the function of implementing the second aspect or any possible design of the second aspect. The function may be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the above function, for example, a transceiver unit or module and a processing unit or module.

In a fifth aspect, an embodiment of this application provides a voice control apparatus having the function of implementing the third aspect or any possible design of the third aspect. The function may be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the above function, for example, a transceiver unit or module and a processing unit or module.
In a sixth aspect, an embodiment of this application provides an electronic device that may include one or more processors and one or more memories, where the one or more memories are configured to store one or more programs, and the one or more processors are configured to execute the one or more programs to implement the method of the second aspect or any possible design of the second aspect.

In a seventh aspect, an embodiment of this application provides an electronic device that may include one or more processors and one or more memories, where the one or more memories are configured to store one or more programs, and the one or more processors are configured to execute the one or more programs to implement the method of the third aspect or any possible design of the third aspect.
In an eighth aspect, an embodiment of this application provides a computer-readable storage medium comprising a computer program that, when executed on a computer, causes the computer to perform the method of the second aspect or any possible design of the second aspect.

In a ninth aspect, an embodiment of this application provides a computer-readable storage medium comprising a computer program that, when executed on a computer, causes the computer to perform the method of the third aspect or any possible design of the third aspect.
In a tenth aspect, an embodiment of the present application provides a chip, which includes a processor and a memory, where the memory is used to store a computer program, and the processor is used to call and execute the computer program stored in the memory to perform the method according to any one of the possible designs of the second aspect or the second aspect.
In an eleventh aspect, an embodiment of the present application provides a chip, which includes a processor and a memory, where the memory is configured to store a computer program, and the processor is configured to call and execute the computer program stored in the memory to perform the method according to any one of the possible designs of the third aspect or the third aspect.
In a twelfth aspect, embodiments of the present application provide a computer program product, which when run on a computer, causes the computer to perform the method according to the second aspect or any one of the possible designs of the second aspect.
In a thirteenth aspect, embodiments of the present application provide a computer program product, which when run on a computer, causes the computer to perform the method according to the third aspect or any one of the possible designs of the third aspect.
In a fourteenth aspect, an embodiment of the present application provides a voice control system, where the voice control system includes at least a first electronic device and a second electronic device that have a voice control function. The first electronic device is adapted to perform the method according to the second aspect or any one of the possible designs of the second aspect. The second electronic device is adapted to perform the method according to the third aspect or any one of the possible designs of the third aspect.
According to the voice control method and the electronic device provided in this application, in a multi-device scenario the multiple devices each start recording directly, without cross-device communication, which solves the frame-loss problem of voice control in a multi-device scenario and improves the accuracy of voice control. Furthermore, the voice instruction input by the user is responded to on the basis of recording data picked up cooperatively by the multiple devices, which effectively resolves the impact that the audio quality of the voice instruction picked up by a single electronic device has on ASR recognition accuracy, further improving the accuracy of voice control.
Drawings
FIG. 1 is a schematic diagram of a speech control system according to an embodiment of the present application;
fig. 2 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present disclosure;
fig. 3 is a schematic flowchart of a voice control method according to an embodiment of the present application;
fig. 4 is a schematic view of a scenario of a multi-device voice control provided in an embodiment of the present application;
fig. 5 is a schematic view of another scenario of multi-device voice control provided in an embodiment of the present application;
fig. 6 is a schematic flowchart of another speech control method according to an embodiment of the present application;
fig. 7 is a schematic view of another scenario of multi-device voice control provided in an embodiment of the present application;
fig. 8 is a schematic structural diagram of a voice control apparatus according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of a voice control apparatus according to an embodiment of the present application;
fig. 10 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The terms "first," "second," and the like, as used in the embodiments of the present application, are for descriptive purposes only and are not to be construed as indicating or implying relative importance or order. Furthermore, the terms "comprises" and "comprising," as well as any variations thereof, are intended to cover a non-exclusive inclusion, such as a list of steps or elements. A method, system, article, or apparatus is not necessarily limited to the steps or elements explicitly listed, but may include other steps or elements not explicitly listed or inherent to such a method, system, article, or apparatus.
It should be understood that in the present application, "at least one" means one or more, "a plurality" means two or more. "and/or" for describing an association relationship of associated objects, indicating that there may be three relationships, e.g., "a and/or B" may indicate: only A, only B and both A and B are present, wherein A and B may be singular or plural. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. "at least one of the following" or similar expressions refer to any combination of these items, including any combination of single item(s) or plural items. For example, at least one (one) of a, b, or c, may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", wherein a, b, c may be single or plural.
The voice assistant: an application program built on artificial intelligence that helps a user complete operations such as information query, device control, and text input through instant question-and-answer voice interaction, by means of speech and semantic recognition algorithms. A voice assistant typically works as a staged cascade, providing its service functions through basic workflow stages such as voice wake-up, voice front-end processing, automatic speech recognition (ASR), natural language understanding (NLU), dialog management (DM), natural language generation (NLG), and text-to-speech (TTS). The voice front-end processing may include, but is not limited to, speech enhancement (SE). The ASR may take the speech signal after SE noise reduction as input and output a textual transcription of the user's speech signal. ASR is the basis on which a voice assistant application accurately performs the subsequent recognition and processing tasks: the audio quality of the user speech signal input into the ASR directly determines the accuracy of the ASR recognition result. The voice control method of the embodiments of the present application guarantees the accuracy and reliability of the user speech signal input into the ASR, thereby improving the accuracy of the ASR recognition result and allowing the subsequent recognition and processing tasks to be completed accurately.
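The staged cascade described above may be sketched, purely for illustration, as a chain of processing stages. The function names and placeholder bodies below are assumptions of this sketch, not part of the disclosed implementation:

```python
# Illustrative sketch of the staged voice-assistant cascade (SE -> ASR -> NLU -> DM).
# All bodies are placeholders standing in for the real algorithms.

def speech_enhancement(pcm_frames):
    """Voice front-end processing (SE): denoise the raw microphone frames."""
    return [f for f in pcm_frames if f is not None]  # placeholder "denoising"

def asr(enhanced_frames):
    """Automatic speech recognition: audio frames -> text."""
    return " ".join(enhanced_frames)  # placeholder: frames stand in for words

def nlu(text):
    """Natural language understanding: text -> intent and slots."""
    verb, _, rest = text.partition(" ")
    return {"intent": verb, "slots": {"object": rest}}

def dialog_manager(result):
    """Dialog management: decide the response action for the recognized intent."""
    return f"execute:{result['intent']}"

def pipeline(pcm_frames):
    """Cascade the stages in the order given by the workflow above."""
    return dialog_manager(nlu(asr(speech_enhancement(pcm_frames))))

print(pipeline(["play", "song"]))  # prints "execute:play"
```

Because the stages are cascaded, a degraded input to the earliest stage (poor audio quality into ASR) propagates into every later stage, which is the dependency this application targets.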
Voice wake-up: in a screen-locked state or while the voice assistant is dormant, the electronic device receives and detects a specific user voice signal (i.e., a wake-up word) and activates or starts the voice assistant, so that the voice assistant enters a state of waiting for voice signal input.
Echo cancellation (AEC): a voice front-end processing technology that cancels, by destructive interference, the echo produced when audio played by the loudspeaker travels through the air back into the microphone's receiving path. It can effectively mitigate the noise interference caused by room reflections of sound waves and by audio that the loudspeaker itself is playing.
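As an illustrative sketch of the echo-cancellation idea, a minimal single-tap LMS adaptive canceller is shown below. This specific algorithm is an assumption made for illustration; the application itself does not specify which AEC algorithm is used:

```python
# Minimal single-tap LMS echo canceller sketch. The microphone signal is
# modeled as an attenuated copy of the loudspeaker output (the echo); the
# adaptive weight w learns the echo path gain so the echo can be subtracted.

def lms_aec(far_end, mic, mu=0.5):
    """Adapt one filter tap w so that w * far_end tracks the echo in mic."""
    w = 0.0
    out = []
    for x, d in zip(far_end, mic):
        e = d - w * x          # error = microphone sample minus estimated echo
        w += mu * e * x        # LMS weight update
        out.append(e)          # echo-cancelled output sample
    return out, w

far = [1.0, -1.0, 1.0, -1.0] * 50   # loudspeaker reference signal
mic = [0.3 * x for x in far]        # pure echo with path gain 0.3, no near-end speech
out, w = lms_aec(far, mic)
print(round(w, 3))                  # converges to the true echo path gain: 0.3
```

Once w has converged, the residual output `out` is near zero, i.e., the loudspeaker's contribution has been removed from the microphone path.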
In a typical multi-device voice control process, a plurality of electronic devices select an answering device through mutual communication and negotiation, and the answering device recognizes and responds to the user's voice signal. Misrecognition in this processing mode has two causes: audio quality and time delay. Regarding audio quality, owing to the diversity and complexity of usage scenarios, the user voice command picked up and processed by an electronic device is inevitably subject to various types of external and internal noise. This noise interference may degrade the audio quality of the user voice command picked up by the electronic device. For example, the external noise may be the noise of an air-conditioner fan near the device or an unrelated human voice, and the internal noise may be audio or video played by the electronic device itself. In addition, the distance and orientation between the electronic device and the user, the placement posture of the electronic device, the performance of the microphone module, and the like also affect the audio quality of the picked-up voice command. When that audio quality is poor, misrecognition may result. Regarding time delay, while the plurality of electronic devices negotiate to select the answering device, both the communication delay generated by cross-device communication and the delay generated by selecting the answering device cause frame loss, which in turn causes misrecognition.
For example, because of the time delay, the user may speak the voice signal "play song 112222" while the answering device recognizes only the voice signal "2222"; that is, it never receives and recognizes the voice signal "play song 11", preventing the answering device from accurately recognizing and responding to the user's voice command.
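The frame-loss effect described above can be shown with a minimal sketch, where each list element stands in for one audio frame (the frame contents are assumptions for illustration):

```python
# Illustrative sketch of frame loss: if the answering device only begins
# recording after a negotiation delay, the leading frames of the user's
# command are never captured.

def record(frames, start_delay_frames):
    """Return the frames actually captured when recording starts late."""
    return frames[start_delay_frames:]

command = ["play", "song", "one", "one", "two", "two", "two", "two"]

# Negotiation plus cross-device communication costs, say, two frames:
print(record(command, 2))   # leading frames "play", "song" are lost

# Recording immediately on wake-up, with no negotiation, captures everything:
print(record(command, 0))
```

This is why the embodiments below have every device start recording directly, without waiting for a cross-device negotiation to finish.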
The voice control method of the embodiments of the present application addresses the misrecognition of voice instructions in multi-device voice control from two directions: improving audio quality and reducing time delay. Because the plurality of electronic devices start recording directly, without cross-device communication, the delay that would be introduced by waking up multiple devices and transmitting data over communication links is eliminated, its influence on ASR recognition accuracy disappears, the frame-loss problem of voice control in a multi-device scenario is solved, and the accuracy of voice control is improved. Furthermore, one or more electronic devices are selected from the plurality of electronic devices as the optimal sound receiving device(s), whose recording data has better audio quality than that of the other electronic devices, and the voice instruction input by the user is responded to based on the recording data of the optimal sound receiving device. Through this multi-device cooperative sound pickup, the influence of the audio quality of the picked-up voice instruction on ASR recognition accuracy is resolved, and the accuracy of voice control is improved.
The voice control method of the embodiments of the present application can be applied to a multi-device scenario. The multi-device scenario may include a scenario in which a user concurrently uses multiple electronic devices, or a scenario in which the user's voice interaction occurs within the effective working range of multiple electronic devices, each of which has a voice control function. The voice control function may be provided by a voice assistant. In the multi-device scenario, after the user speaks the wake-up word and a voice instruction, the method of the embodiments guarantees the accuracy and reliability of the voice instruction input into the ASR, thereby improving the accuracy of the ASR recognition result, so that the subsequent recognition and processing tasks, and the response to the voice instruction, are completed accurately. This makes the electronic devices more intelligent, enables efficient and accurate interaction between the electronic devices and the user, and improves the user experience.
The voice instruction in the embodiment of the application refers to an instruction input to the electronic device by a user in a voice form. The voice instruction is used for enabling the electronic equipment to provide service functions of interactive dialogue, information inquiry, equipment control and the like for the user. For example, the voice instruction may be a voice signal input by a user through a microphone of the electronic device.
In some embodiments, a voice assistant may be installed in an electronic device to enable the electronic device to implement voice control functions. The voice assistant is typically in a dormant state, and the user may wake it up by voice before using the voice control function of the electronic device. The voice signal that wakes up the voice assistant may be referred to as a wake-up word (or wake-up voice). The wake-up word may be pre-registered in the electronic device. For example, the wake-up word may be "small E"; it may equally be any other word or sentence, and may be flexibly set as required.
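Purely as an illustration of wake-up word matching: a real voice assistant detects the wake-up word acoustically over audio frames, so matching transcribed text, as below, is an assumption of this sketch:

```python
# Trivial wake-word matcher sketch. The constant and the string-matching
# approach are illustrative assumptions only.

WAKE_WORD = "small E"   # example wake-up word from the description above

def is_wake_word(utterance, wake_word=WAKE_WORD):
    """Case-insensitive exact match of a transcribed utterance against the
    registered wake-up word."""
    return utterance.strip().lower() == wake_word.lower()

print(is_wake_word("Small E"))      # True: the voice assistant is activated
print(is_wake_word("hello world"))  # False: the assistant stays dormant
```

Registering a different wake-up word amounts to changing the value compared against, which is why the word "may be flexibly set as required".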
In addition, the voice assistant may be an embedded application in the electronic device (i.e., a system application of the electronic device), or may be a downloadable application. An embedded application is an application program provided as part of an implementation of an electronic device, such as a cell phone. The downloadable application is an application that may provide its own internet protocol multimedia subsystem (IMS) connection. The downloadable application may be pre-installed in the electronic device or may be a third party application downloaded by a user and installed in the electronic device.
Embodiments of the present application will be described in detail below with reference to the accompanying drawings.
Fig. 1 is a schematic diagram of a voice control system according to an embodiment of the present application. The voice control system may include a plurality of electronic devices that satisfy one or more of the following conditions: they are connected to the same wireless access point (such as a Wi-Fi access point), are logged in to the same account, are set by the user into the same group, or the user's voice interaction occurs within their effective working range.
Therein, as an example, the voice control system may include three electronic devices, for example, a first electronic device 201, a second electronic device 202, and a third electronic device 203. The first electronic device 201, the second electronic device 202 and the third electronic device 203 are all provided with voice control functions, such as being equipped with voice assistants.
In some embodiments, the wake-up words for the first electronic device 201, the second electronic device 202, and the third electronic device 203 to wake up the voice assistant may be the same, such as "small E".
For example, the electronic devices described in the embodiments of the present application, such as the first electronic device 201, the second electronic device 202, and the third electronic device 203 described above, may be a mobile phone, a tablet computer, a desktop computer, a laptop computer, a handheld computer, a notebook computer, an ultra-mobile personal computer (UMPC), a netbook, a cellular phone, a personal digital assistant (PDA), an augmented reality (AR)/virtual reality (VR) device, a media player, a television, a smart speaker, a smart watch, a smart headset, and the like. The embodiments of the present application do not particularly limit the specific form of the electronic device. For the specific structure of the electronic device, reference may be made to the description of the embodiment corresponding to fig. 2.
In addition, in some embodiments, the first electronic device 201, the second electronic device 202, and the third electronic device 203 may be the same type of electronic device, for example, the first electronic device 201, the second electronic device 202, and the third electronic device 203 are all mobile phones. In other embodiments, the first electronic device 201, the second electronic device 202, and the third electronic device 203 may be different types of electronic devices, for example, the first electronic device 201 is a mobile phone, the second electronic device 202 is a smart speaker, and the third electronic device 203 is a television (as shown in fig. 1).
In the embodiments of the present application, the first electronic device 201, the second electronic device 202, and the third electronic device 203 start recording directly, without cross-device communication among them, which solves the frame-loss problem of voice control in a multi-device scenario and improves the accuracy of voice control.
The first electronic device 201, the second electronic device 202, and the third electronic device 203 can each record without being called by another device (e.g., a central device), realizing a decentralized recording manner. Decentralized recording needs no procedure for selecting one device as the calling device, effectively eliminates the delay generated by inter-device communication, and improves the accuracy of subsequent voice control.
Then, based on one or more dimensions of the first electronic device 201, the second electronic device 202, and the third electronic device 203, such as their device information and recording data, one or more of them are selected as the optimal sound receiving device(s), and the voice instruction input by the user is responded to based on the recording data of the optimal sound receiving device. Through this multi-device cooperative sound pickup, the embodiments of the present application resolve the influence of the audio quality of the voice instruction picked up by an electronic device on ASR recognition accuracy.
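The selection of an optimal sound receiving device from recording data can be sketched as follows. Scoring by mean signal energy is an assumed stand-in for the device-information and audio-quality dimensions described above, chosen only to make the sketch runnable:

```python
# Illustrative selection of an "optimal sound receiving device": score each
# device's recording by mean signal energy as a crude audio-quality proxy.
# The scoring criterion and the sample data are assumptions for illustration.

def mean_energy(samples):
    """Mean squared amplitude of a recording's PCM samples."""
    return sum(s * s for s in samples) / len(samples)

def pick_best_device(recordings):
    """recordings: dict mapping device name -> list of PCM samples.
    Returns the device whose recording scores highest."""
    return max(recordings, key=lambda dev: mean_energy(recordings[dev]))

recordings = {
    "phone":   [0.9, -0.8, 0.7, -0.9],    # close to the user: strong pickup
    "speaker": [0.2, -0.1, 0.2, -0.2],    # farther away: weaker pickup
    "tv":      [0.05, -0.04, 0.05, 0.0],  # weakest pickup
}
print(pick_best_device(recordings))       # prints "phone"
```

A real implementation would combine several dimensions (signal-to-noise ratio, microphone performance, device posture) rather than raw energy alone, but the selection structure is the same: score every device's recording, then respond using the best-scoring device's data.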
In some embodiments, the voice control system may also include a server 204. The server 204 can provide intelligent voice services.
Please refer to fig. 2, which is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure.
As shown in fig. 2, the electronic device may include a processor 110, an external memory interface 120, an internal memory 121, a Universal Serial Bus (USB) interface 130, a charging management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone interface 170D, a sensor module 180, a button 190, a motor 191, an indicator 192, a camera 193, a display screen 194, a Subscriber Identification Module (SIM) card interface 195, and the like. The sensor module 180 may include a pressure sensor 180A, a gyroscope sensor 180B, an air pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity light sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor 180L, a bone conduction sensor 180M, and the like.
It is to be understood that the illustrated structure of the present embodiment does not constitute a specific limitation to the electronic device. In other embodiments, an electronic device may include more or fewer components than shown, or some components may be combined, some components may be split, or a different arrangement of components. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
Processor 110 may include one or more processing units, such as: the processor 110 may include an Application Processor (AP), a modem processor, a Graphics Processor (GPU), an Image Signal Processor (ISP), a controller, a memory, a video codec, a Digital Signal Processor (DSP), a baseband processor, and/or a neural-Network Processing Unit (NPU), etc. The different processing units may be separate devices or may be integrated into one or more processors.
The controller may be a neural center and a command center of the electronic device. The controller can generate an operation control signal according to the instruction operation code and the timing signal to complete the control of instruction fetching and instruction execution.
A memory may also be provided in processor 110 for storing instructions and data. In some embodiments, the memory in the processor 110 is a cache memory. The memory may hold instructions or data that have just been used or recycled by the processor 110. If the processor 110 needs to reuse the instruction or data, it can be called directly from the memory. Avoiding repeated accesses reduces the latency of the processor 110, thereby increasing the efficiency of the system.
In some embodiments, processor 110 may include one or more interfaces. The interface may include an integrated circuit (I2C) interface, an integrated circuit built-in audio (I2S) interface, a Pulse Code Modulation (PCM) interface, a universal asynchronous receiver/transmitter (UART) interface, a Mobile Industry Processor Interface (MIPI), a general-purpose input/output (GPIO) interface, a Subscriber Identity Module (SIM) interface, and/or a Universal Serial Bus (USB) interface, etc.
The charging management module 140 is configured to receive charging input from a charger. The charger may be a wireless charger or a wired charger. In some wired charging embodiments, the charging management module 140 may receive charging input from a wired charger via the USB interface 130. In some wireless charging embodiments, the charging management module 140 may receive a wireless charging input through a wireless charging coil of the electronic device. The charging management module 140 may also supply power to the electronic device through the power management module 141 while charging the battery 142.
The power management module 141 is used to connect the battery 142, the charging management module 140 and the processor 110. The power management module 141 receives input from the battery 142 and/or the charge management module 140 and provides power to the processor 110, the internal memory 121, the external memory, the display 194, the camera 193, the wireless communication module 160, and the like. The power management module 141 may also be used to monitor parameters such as battery capacity, battery cycle count, battery state of health (leakage, impedance), etc. In some other embodiments, the power management module 141 may also be disposed in the processor 110. In other embodiments, the power management module 141 and the charging management module 140 may be disposed in the same device.
The wireless communication function of the electronic device may be implemented by the antenna 1, the antenna 2, the mobile communication module 150, the wireless communication module 160, the modem processor, the baseband processor, and the like.
The antennas 1 and 2 are used for transmitting and receiving electromagnetic wave signals. Each antenna in an electronic device may be used to cover a single or multiple communication bands. Different antennas can also be multiplexed to improve the utilization of the antennas. For example: the antenna 1 may be multiplexed as a diversity antenna of a wireless local area network. In other embodiments, the antenna may be used in conjunction with a tuning switch.
The mobile communication module 150 may provide a solution including 2G/3G/4G/5G wireless communication applied to the electronic device. The mobile communication module 150 may include at least one filter, a switch, a power amplifier, a Low Noise Amplifier (LNA), and the like. The mobile communication module 150 may receive the electromagnetic wave from the antenna 1, filter, amplify, etc. the received electromagnetic wave, and transmit the electromagnetic wave to the modem processor for demodulation. The mobile communication module 150 may also amplify the signal modulated by the modem processor, and convert the signal into electromagnetic wave through the antenna 1 to radiate the electromagnetic wave. In some embodiments, at least some of the functional modules of the mobile communication module 150 may be disposed in the processor 110. In some embodiments, at least some of the functional modules of the mobile communication module 150 may be disposed in the same device as at least some of the modules of the processor 110.
The wireless communication module 160 may provide solutions for wireless communication applied to electronic devices, including wireless local area networks (WLANs) (such as wireless fidelity (Wi-Fi) networks), Bluetooth (BT), global navigation satellite systems (GNSS), frequency modulation (FM), near field communication (NFC), infrared (IR), and the like. The wireless communication module 160 may be one or more devices integrating at least one communication processing module. The wireless communication module 160 receives electromagnetic waves via the antenna 2, performs frequency modulation and filtering processing on the electromagnetic wave signals, and transmits the processed signals to the processor 110. The wireless communication module 160 may also receive a signal to be transmitted from the processor 110, perform frequency modulation and amplification on the signal, and convert the signal into electromagnetic waves through the antenna 2 for radiation. For example, in some embodiments of the present application, the wireless communication module 160 may interact with other electronic devices, such as sending energy information of a detected voice signal to the other electronic devices after detecting a voice signal matching the wake-up word. The electronic device of the embodiments of the present application may communicate with other electronic devices through the mobile communication module 150 and/or the wireless communication module 160; for example, the first electronic device 201 sends a call pickup instruction or the like to the second electronic device 202 through the mobile communication module 150 and/or the wireless communication module 160.
In some embodiments, antenna 1 of the electronic device is coupled to the mobile communication module 150 and antenna 2 is coupled to the wireless communication module 160 so that the electronic device can communicate with the network and other devices through wireless communication techniques. The wireless communication technology may include global system for mobile communications (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), time-division code division multiple access (TD-SCDMA), long term evolution (long term evolution, LTE), BT, GNSS, WLAN, NFC, FM, and/or IR technologies, etc. The GNSS may include a Global Positioning System (GPS), a global navigation satellite system (GLONASS), a beidou satellite navigation system (BDS), a quasi-zenith satellite system (QZSS), and/or a Satellite Based Augmentation System (SBAS).
The electronic device implements the display function through the GPU, the display screen 194, and the application processor, etc. The GPU is a microprocessor for image processing, and is connected to the display screen 194 and an application processor. The GPU is used to perform mathematical and geometric calculations for graphics rendering. The processor 110 may include one or more GPUs that execute program instructions to generate or alter display information.
The display screen 194 is used to display images, video, and the like. The display screen 194 includes a display panel. The display panel may be a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a Mini-LED, a Micro-LED, a Micro-OLED, a quantum dot light-emitting diode (QLED), or the like. In some embodiments, the electronic device may include 1 or N display screens 194, N being a positive integer greater than 1.
The electronic device may implement a shooting function through the ISP, the camera 193, the video codec, the GPU, the display screen 194, the application processor, and the like.
The ISP is used to process the data fed back by the camera 193. For example, when a photo is taken, the shutter is opened, light is transmitted to the camera photosensitive element through the lens, the optical signal is converted into an electrical signal, and the camera photosensitive element transmits the electrical signal to the ISP for processing and converting into an image visible to naked eyes. The ISP can also carry out algorithm optimization on the noise, brightness and skin color of the image. The ISP can also optimize parameters such as exposure, color temperature and the like of a shooting scene. In some embodiments, the ISP may be provided in camera 193.
The camera 193 is used to capture still images or video. The object generates an optical image through the lens and projects the optical image to the photosensitive element. The photosensitive element may be a Charge Coupled Device (CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor. The light sensing element converts the optical signal into an electrical signal, which is then passed to the ISP where it is converted into a digital image signal. And the ISP outputs the digital image signal to the DSP for processing. The DSP converts the digital image signal into image signal in standard RGB, YUV and other formats. In some embodiments, the electronic device may include 1 or N cameras 193, N being a positive integer greater than 1.
The digital signal processor is used for processing digital signals; in addition to digital image signals, it can process other digital signals. For example, when the electronic device selects a frequency point, the digital signal processor is used to perform a Fourier transform and the like on the frequency point energy.
Video codecs are used to compress or decompress digital video. The electronic device may support one or more video codecs. In this way, the electronic device can play or record video in a variety of encoding formats, such as: moving Picture Experts Group (MPEG) 1, MPEG2, MPEG3, MPEG4, and the like.
The NPU is a neural-network (NN) computing processor that processes input information quickly by using a biological neural network structure, for example, by using a transfer mode between neurons of a human brain, and can also learn by itself continuously. The NPU can realize applications such as intelligent cognition of electronic equipment, for example: image recognition, face recognition, speech recognition, text understanding, and the like.
The external memory interface 120 may be used to connect an external memory card, such as a Micro SD card, to extend the storage capability of the electronic device. The external memory card communicates with the processor 110 through the external memory interface 120 to implement a data storage function. For example, files such as music, video, etc. are saved in an external memory card.
The internal memory 121 may be used to store computer-executable program code, which includes instructions. The processor 110 executes various functional applications of the electronic device and data processing by executing instructions stored in the internal memory 121. The internal memory 121 may include a program storage area and a data storage area. The storage program area may store an operating system, an application program (such as a sound playing function, an image playing function, etc.) required by at least one function, and the like. The data storage area can store data (such as audio data, phone book and the like) created in the using process of the electronic device. In addition, the internal memory 121 may include a high-speed random access memory, and may further include a nonvolatile memory, such as at least one magnetic disk storage device, a flash memory device, a universal flash memory (UFS), and the like.
The electronic device may implement audio functions via the audio module 170, the speaker 170A, the receiver 170B, the microphone 170C, the headphone interface 170D, and the application processor. Such as music playing, recording, etc.
The audio module 170 is used to convert digital audio information into an analog audio signal output and also to convert an analog audio input into a digital audio signal. The audio module 170 may also be used to encode and decode audio signals. In some embodiments, the audio module 170 may be disposed in the processor 110, or some functional modules of the audio module 170 may be disposed in the processor 110.
The speaker 170A, also called a "horn", is used to convert the audio electrical signal into an acoustic signal. The electronic apparatus can listen to music through the speaker 170A or listen to a handsfree call.
The receiver 170B, also called "earpiece", is used to convert the electrical audio signal into an acoustic signal. When the electronic device answers a call or voice information, it can answer the voice by placing the receiver 170B close to the ear of the person.
The microphone 170C, also referred to as a "mike" or "mic", is used to convert sound signals into electrical signals. When making a call, sending a voice message, or triggering the electronic device through the voice assistant to perform some operation, the user can speak with his or her mouth close to the microphone 170C, inputting the voice signal into the microphone 170C. The electronic device may be provided with at least one microphone 170C. In other embodiments, the electronic device may be provided with two microphones 170C to achieve a noise reduction function in addition to collecting sound signals. In still other embodiments, the electronic device may be provided with three, four, or more microphones 170C to collect sound signals, reduce noise, identify sound sources, perform directional recording, and the like. For example, the electronic device of the embodiments of the present application may receive a voice instruction input by the user through the microphone 170C.
The headphone interface 170D is used to connect a wired headphone. The headphone interface 170D may be the USB interface 130, a 3.5 mm open mobile terminal platform (OMTP) standard interface, or a Cellular Telecommunications Industry Association of the USA (CTIA) standard interface.
The pressure sensor 180A is used to sense a pressure signal and convert the pressure signal into an electrical signal. In some embodiments, the pressure sensor 180A may be disposed on the display screen 194. There are many types of pressure sensors 180A, such as a resistive pressure sensor, an inductive pressure sensor, and a capacitive pressure sensor. The capacitive pressure sensor may include at least two parallel plates made of an electrically conductive material. When a force acts on the pressure sensor 180A, the capacitance between the electrodes changes, and the electronic device determines the strength of the pressure from the change in capacitance. When a touch operation is applied to the display screen 194, the electronic device detects the intensity of the touch operation through the pressure sensor 180A, and may also calculate the position of the touch from the detection signal of the pressure sensor 180A. In some embodiments, touch operations that are applied to the same touch position but have different touch operation intensities may correspond to different operation instructions. For example, when a touch operation with a touch operation intensity smaller than a first pressure threshold acts on the short message application icon, an instruction for viewing the short message is executed; when a touch operation with a touch operation intensity greater than or equal to the first pressure threshold acts on the short message application icon, an instruction for creating a new short message is executed.
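The intensity-dependent dispatch described above can be sketched as follows (a hypothetical illustration; the threshold value and instruction names are not from this application):

```python
def dispatch_touch(pressure: float, first_pressure_threshold: float) -> str:
    """Map a touch on the short-message app icon to an instruction
    based on touch intensity (names are illustrative)."""
    if pressure < first_pressure_threshold:
        return "view_sms"      # light press: execute "view short message"
    return "compose_sms"       # firm press: execute "create new short message"
```

A light press (below the threshold) opens the message, while a press at or above the threshold starts composing a new one.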
The gyroscope sensor 180B may be used to determine the motion posture of the electronic device. In some embodiments, the angular velocities of the electronic device about three axes (i.e., the x, y, and z axes) may be determined by the gyroscope sensor 180B. The gyroscope sensor 180B may be used for photographing anti-shake. Illustratively, when the shutter is pressed, the gyroscope sensor 180B detects the shake angle of the electronic device, calculates the distance that the lens module needs to compensate for according to the shake angle, and allows the lens to counteract the shake of the electronic device through reverse movement, thereby achieving anti-shake. The gyroscope sensor 180B may also be used for navigation and motion-sensing gaming scenarios.
The air pressure sensor 180C is used to measure air pressure. In some embodiments, the electronic device calculates altitude, aiding in positioning and navigation, from barometric pressure values measured by barometric pressure sensor 180C.
The magnetic sensor 180D includes a Hall sensor. The electronic device may detect the opening and closing of a flip holster using the magnetic sensor 180D. In some embodiments, when the electronic device is a flip phone, the electronic device may detect the opening and closing of the flip cover according to the magnetic sensor 180D, and then set features such as automatic unlocking upon opening according to the detected opening or closing state of the holster or the flip cover.
The acceleration sensor 180E can detect the magnitude of acceleration of the electronic device in various directions (typically along three axes), and can detect the magnitude and direction of gravity when the electronic device is at rest. It can also be used to recognize the posture of the electronic device, and is applied to landscape/portrait screen switching, pedometers, and other applications.
The distance sensor 180F is used to measure distance. The electronic device may measure distance by infrared or laser. In some embodiments, in a photographing scenario, the electronic device may use the distance sensor 180F to measure distance to achieve fast focusing.
The proximity light sensor 180G may include, for example, a light emitting diode (LED) and a light detector, such as a photodiode. The light emitting diode may be an infrared light emitting diode. The electronic device emits infrared light outward through the light emitting diode and uses the photodiode to detect infrared light reflected from nearby objects. When sufficient reflected light is detected, the electronic device can determine that there is an object near it; when insufficient reflected light is detected, it can determine that there is no object nearby. Using the proximity light sensor 180G, the electronic device can detect that the user is holding it close to the ear during a call, and then automatically turn off the screen to save power. The proximity light sensor 180G may also be used in holster mode and pocket mode to automatically unlock and lock the screen.
The ambient light sensor 180L is used to sense the ambient light level. The electronic device may adaptively adjust the brightness of the display screen 194 based on the perceived ambient light level. The ambient light sensor 180L may also be used to automatically adjust the white balance when taking a picture. The ambient light sensor 180L may also cooperate with the proximity light sensor 180G to detect whether the electronic device is in a pocket to prevent accidental touches.
The fingerprint sensor 180H is used to collect a fingerprint. The electronic equipment can utilize the collected fingerprint characteristics to realize fingerprint unlocking, access to an application lock, fingerprint photographing, fingerprint incoming call answering and the like.
The temperature sensor 180J is used to detect temperature. In some embodiments, the electronic device implements a temperature processing strategy using the temperature detected by the temperature sensor 180J. For example, when the temperature reported by the temperature sensor 180J exceeds a threshold, the electronic device throttles the performance of a processor located near the temperature sensor 180J, so as to reduce power consumption and implement thermal protection. In other embodiments, when the temperature is below another threshold, the electronic device heats the battery 142 to avoid abnormal shutdown caused by low temperature. In still other embodiments, when the temperature is below a further threshold, the electronic device boosts the output voltage of the battery 142 to avoid abnormal shutdown caused by low temperature.
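The multi-threshold temperature processing strategy above can be sketched as follows (threshold values and action names are hypothetical illustrations, not taken from this application):

```python
def thermal_action(temp_c: float,
                   high: float = 45.0,
                   low: float = 0.0,
                   very_low: float = -10.0) -> str:
    """Return a thermal-management action for a reported temperature.
    Thresholds are illustrative assumptions."""
    if temp_c > high:
        return "throttle_cpu"            # reduce nearby processor performance
    if temp_c < very_low:
        return "boost_battery_voltage"   # boost battery 142 output voltage
    if temp_c < low:
        return "heat_battery"            # heat battery 142
    return "normal"
```

The checks are ordered so that the most extreme conditions (overheating, deep cold) take precedence over the milder low-temperature case.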
The touch sensor 180K is also referred to as a "touch panel". The touch sensor 180K may be disposed on the display screen 194, and the touch sensor 180K and the display screen 194 form a touch screen, which is also called a "touch screen". The touch sensor 180K is used to detect a touch operation applied thereto or nearby. The touch sensor can communicate the detected touch operation to the application processor to determine the touch event type. Visual output associated with the touch operation may be provided through the display screen 194. In other embodiments, the touch sensor 180K may be disposed on a surface of the electronic device at a different position than the display screen 194.
The bone conduction sensor 180M may acquire a vibration signal. In some embodiments, the bone conduction sensor 180M may acquire the vibration signal of a bone near the human vocal cords. The bone conduction sensor 180M may also contact the human pulse to receive a blood pressure pulsation signal. In some embodiments, the bone conduction sensor 180M may also be disposed in a headset, integrated into a bone conduction headset. The audio module 170 may parse out a voice signal based on the bone vibration signal acquired by the bone conduction sensor 180M, so as to implement a voice function. The application processor may parse out heart rate information based on the blood pressure pulsation signal acquired by the bone conduction sensor 180M, so as to implement a heart rate detection function.
The keys 190 include a power key, a volume key, and the like. The keys 190 may be mechanical keys or touch keys. The electronic device may receive a key input and generate a key signal input related to user settings and function control of the electronic device.
The motor 191 may generate a vibration cue. The motor 191 may be used for incoming call vibration cues, as well as for touch vibration feedback. For example, touch operations applied to different applications (e.g., photographing, audio playing, etc.) may correspond to different vibration feedback effects. The motor 191 may also respond to different vibration feedback effects for touch operations applied to different areas of the display screen 194. Different application scenes (such as time reminding, receiving information, alarm clock, game and the like) can also correspond to different vibration feedback effects. The touch vibration feedback effect may also support customization.
Indicator 192 may be an indicator light that may be used to indicate a state of charge, a change in charge, or a message, missed call, notification, etc.
The SIM card interface 195 is used to connect a SIM card. The SIM card can be attached to or detached from the electronic device by being inserted into or pulled out of the SIM card interface 195. The electronic device can support 1 or N SIM card interfaces, where N is a positive integer greater than 1. The SIM card interface 195 may support a Nano SIM card, a Micro SIM card, a standard SIM card, etc. Multiple cards may be inserted into the same SIM card interface 195 at the same time; the types of the cards may be the same or different. The SIM card interface 195 may also be compatible with different types of SIM cards, as well as with external memory cards. The electronic device implements functions such as calls and data communication through interaction between the SIM card and the network. In some embodiments, the electronic device employs an eSIM, i.e., an embedded SIM card. The eSIM card can be embedded in the electronic device and cannot be separated from it.
The methods in the following embodiments may be implemented in an electronic device having the above hardware structure.
In the embodiments of this application, in a multi-device scenario, each device starts recording directly without first performing cross-device communication among the multiple devices, which solves the frame-loss problem of voice control in multi-device scenarios and improves the accuracy of voice control.
Then, one or more electronic devices are selected from the plurality of electronic devices as the optimal sound-receiving device based on one or more dimensions such as the device information and the recording data of the plurality of electronic devices, and the voice instruction input by the user is responded to based on the recording data of the optimal sound-receiving device. Through this selection, the electronic device that satisfies at least one of clearest pickup (closest to the user), lowest noise interference (farthest from the noise source), or best SE (speech enhancement) processing effect (best microphone noise reduction performance, or support for acoustic echo cancellation, AEC) is selected as the pickup entry to be called by the voice assistant, which can effectively mitigate the impact of the audio quality of the picked-up voice instruction on ASR (automatic speech recognition) accuracy. The device information may include, but is not limited to, static attribute information or dynamic attribute information of the electronic device. The static attribute information may include, but is not limited to, the device model, the system version, capability information of the microphone, and the like. The dynamic attribute information may include, but is not limited to, power information of the electronic device, headphone state information, microphone state information, speaker state information, audio quality information of the recording data, and the like. The speaker state information may be used to indicate whether a speaker of the electronic device is occupied. The audio quality information is used to indicate whether the audio quality of the recording data is good or bad, and may specifically include one or more of sound intensity information, noise intensity information, signal-to-noise ratio information, and the like.
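As a non-limiting sketch of the multi-dimension selection described above, the following toy scoring function combines a signal-to-noise figure with AEC capability; the field names, the weight, and the assumption that these two dimensions suffice are all illustrative, not part of this application:

```python
def pick_best_receiver(devices):
    """devices: list of dicts with hypothetical fields
    'name', 'snr_db' (higher is better), 'supports_aec' (bool).
    Returns the name of the device with the best combined score."""
    def score(d):
        # AEC support is worth an assumed 5 dB bonus in this sketch
        return d["snr_db"] + (5.0 if d.get("supports_aec") else 0.0)
    return max(devices, key=score)["name"]
```

In a real system the score would weigh many of the static and dynamic attributes listed above; this sketch only shows the shape of the decision.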
Fig. 3 is a flowchart illustrating a voice control method according to an embodiment of the present application. The embodiment takes three electronic devices shown in fig. 1, namely, a sound box 201, a television 202 and a mobile phone 203 as an example for illustration. As shown in fig. 3, the method of this embodiment may include:
step 401, the sound box 201, the television 202 and the mobile phone 203 respectively receive a first voice instruction input by a user.
The first voice instruction is used to wake up the voice assistant of the electronic device. For example, the first voice instruction may be the wake-up word "Small E". In this embodiment, the first voice instruction is used to wake up the respective voice assistants of the sound box 201, the television 202, and the mobile phone 203.
For the electronic equipment provided with the voice assistant, under the condition that the electronic equipment does not have other software and hardware and uses the microphone to collect voice signals, the electronic equipment can monitor whether the voice signals are input by a user in real time through the microphone. In general, when a user wants to use a voice control function of an electronic apparatus, the user may generate sound within a sound pickup range of the electronic apparatus to input the generated sound to a microphone. At this time, if the electronic device does not have other software and hardware using the microphone to collect the voice signal, the electronic device may monitor the corresponding voice signal through the microphone, such as the first voice command.
For example, as shown in FIG. 4, the user may speak the wake-up word "Small E, Small E" when the user wants to use the voice control function. If the user's sounding position is within the pickup range of each of the sound box 201, the television 202, and the mobile phone 203, and no other software or hardware is using their microphones to collect voice signals, the sound box 201, the television 202, and the mobile phone 203 can each detect, through their respective microphones, the first voice instruction corresponding to the wake-up word.
Step 402, responding to the first voice instruction, the sound box 201, the television 202 and the mobile phone 203 respectively awaken their respective voice assistants, and start recording.
When the electronic device detects the first voice instruction, the electronic device wakes up the voice assistant in response to the first voice instruction. In an example, after the electronic device receives the first voice instruction, it may verify the first voice instruction to determine whether the received first voice instruction is a wake-up word registered in the electronic device. If the verification passes, the received first voice instruction is a wake-up word, and the voice assistant is woken up. If the verification fails, the received first voice instruction is not a wake-up word, and the electronic device may not wake up the voice assistant, i.e., the voice assistant remains in the sleep state.
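The verification flow above may be sketched as follows (the registry contents, text-based matching, and function names are illustrative assumptions; real wake-word verification operates on acoustic features, not transcribed text):

```python
REGISTERED_WAKE_WORDS = {"small e"}   # illustrative registry on the device

def on_voice_input(text: str, assistant_awake: bool) -> bool:
    """Return the new awake state of the voice assistant.
    Wake it only if the input matches a registered wake-up word;
    otherwise leave its current state unchanged (sleep stays sleep)."""
    if text.strip().lower() in REGISTERED_WAKE_WORDS:
        return True                 # verification passed: wake up
    return assistant_awake          # verification failed: state unchanged
```

A non-matching input neither wakes a sleeping assistant nor puts an awake one back to sleep.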
In this embodiment, when the sound box 201, the television 202, and the mobile phone 203 each detect the first voice instruction, they respectively wake up their voice assistants and start recording. After starting to record, the sound box 201, the television 202, and the mobile phone 203 can each detect, through their respective microphones, whether the user inputs other voice instructions, and when another voice instruction input by the user is detected, generate recording data and store it on the device itself.
For example, as shown in FIG. 3 and FIG. 4, after the sound box 201, the television 202, and the mobile phone 203 start recording, they each receive a second voice instruction input by the user. For example, the second voice instruction spoken by the user is "play song 112222". The sound box 201, the television 202, and the mobile phone 203 each record the second voice instruction to generate their own recording data, whose content is "play song 112222".
It should be noted that, in an implementation, a piece of recording data may be generated for every 0.5 s of recording. The value 0.5 s may also be replaced by other values, for example, 0.6 s or 1 s, which are not exhaustively enumerated in the embodiments of this application. When saving recording data, the previous recording data may be overwritten with the new recording data; alternatively, both the previous and the new recording data may be saved, without overwriting. The embodiments of this application take saving both the previous and the new recording data as an example.
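The chunked, non-overwriting option above can be modeled minimally as follows (a simplified sketch; in a real device each chunk would be 0.5 s of PCM frames from the microphone, and the class name is hypothetical):

```python
class ChunkedRecorder:
    """Accumulate recording data in fixed-length chunks (e.g. 0.5 s)
    without overwriting earlier chunks."""
    def __init__(self, chunk_seconds: float = 0.5):
        self.chunk_seconds = chunk_seconds
        self.chunks = []                       # previous data is kept

    def on_chunk(self, samples):
        self.chunks.append(list(samples))      # save new data alongside old

    def all_audio(self):
        """Flatten all saved chunks into one sample sequence."""
        return [s for chunk in self.chunks for s in chunk]
```

The overwriting variant would instead keep only the most recent chunk.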
In some embodiments, the electronic device may further determine, according to the recording data, audio quality information corresponding to the recording data. In other words, the electronic device also performs quality evaluation on the recorded data of itself. The audio quality information may include one or more of sound intensity information, noise intensity information, signal-to-noise ratio information, and the like, as described above.
Taking the three electronic devices of this embodiment as an example, the sound box 201, the television 202, and the mobile phone 203 may respectively perform quality evaluation on their respective recording data, and determine audio quality information corresponding to their respective recording data.
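As an illustrative sketch of such quality evaluation, a toy signal-to-noise estimate over sample lists might look like this (real SE pipelines estimate noise from the signal itself; here a noise segment is given explicitly for simplicity, and the function name is an assumption):

```python
import math

def snr_db(signal, noise):
    """Estimate signal-to-noise ratio in dB from two sample lists,
    using mean power of each segment. A toy quality metric."""
    p_sig = sum(s * s for s in signal) / len(signal)
    p_noise = sum(n * n for n in noise) / len(noise) or 1e-12  # avoid div by 0
    return 10.0 * math.log10(p_sig / p_noise)
```

A device could report this value (together with sound and noise intensity) as the audio quality information for its recording data.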
Step 403, the sound box 201, the television 202 and the mobile phone 203 respectively execute response device selection, determine a response device, and the response device plays response voice corresponding to the first voice instruction.
The execution order of steps 402 and 403 is not limited by their step numbers; other execution orders are possible. For example, answering device selection may be performed at the same time as recording is started.
The answering device of this embodiment is used to play the answering voice corresponding to the voice instruction input by the user. For example, the answering device plays the answering voice corresponding to the first voice instruction, i.e., the wake-up answering voice, such as "I'm here". Other electronic devices that are not the answering device wake up their voice assistants but do not play the answering voice corresponding to the voice instruction input by the user.
The electronic devices can perform answering device selection and determine the answering device based on the audio quality information corresponding to the first voice instruction. In an implementation, each electronic device performs quality evaluation on the first voice instruction it received, determines the corresponding audio quality information, and broadcasts that audio quality information together with its device information. Each electronic device also receives the audio quality information and device information broadcast by the other electronic devices, and then selects one electronic device as the answering device according to the audio quality information and device information of all the electronic devices, for example, the electronic device with the best audio quality.
With reference to the example in step 402, when the sound box 201 detects the first voice instruction, the sound box 201 may further perform quality evaluation on the first voice instruction, determine audio quality information corresponding to the first voice instruction received by the sound box 201, and broadcast the audio quality information corresponding to the first voice instruction received by the sound box 201 and the device information of the sound box 201. In a similar processing manner, when the television 202 detects the first voice instruction, the television 202 may further perform quality evaluation on the first voice instruction, determine audio quality information corresponding to the first voice instruction received by the television 202, and broadcast the audio quality information corresponding to the first voice instruction received by the television 202 and the device information of the television 202. When the mobile phone 203 detects the first voice instruction, the mobile phone 203 may further perform quality evaluation on the first voice instruction, determine audio quality information corresponding to the first voice instruction received by the mobile phone 203, and broadcast the audio quality information corresponding to the first voice instruction received by the mobile phone 203 and the device information of the mobile phone 203. Thus, the sound box 201 can receive the audio quality information and the device information corresponding to the first voice command of the television 202 and the mobile phone 203, and the sound box 201 selects one electronic device as a response device among the sound box 201, the television 202 and the mobile phone 203 according to the audio quality information and the device information corresponding to the first voice command of the sound box 201, the television 202 and the mobile phone 203. 
Similarly, the television 202 may receive audio quality information and device information corresponding to the first voice instruction of the sound box 201 and the mobile phone 203, and the television 202 selects one electronic device as a response device among the sound box 201, the television 202, and the mobile phone 203 according to the audio quality information and the device information corresponding to the first voice instruction of the sound box 201, the television 202, and the mobile phone 203. The mobile phone 203 can receive the audio quality information and the device information corresponding to the first voice instruction of the sound box 201 and the television 202, and the mobile phone 203 selects one electronic device as a response device from the sound box 201, the television 202 and the mobile phone 203 according to the audio quality information and the device information corresponding to the first voice instruction of the sound box 201, the television 202 and the mobile phone 203. Here, the sound box 201, the television 202, and the cellular phone 203 are exemplified to determine that the sound box 201 is the answering device.
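Because every device applies the same deterministic rule to the same broadcast data, all three independently converge on the same answering device. A minimal sketch (the field names, the use of priority, and the tie-breaking order are assumptions):

```python
def elect_answering_device(broadcasts):
    """broadcasts: {device_name: {'snr_db': float, 'priority': int}},
    the pooled audio quality and device info seen by every device.
    Deterministic, so each device computes the same winner."""
    return max(
        broadcasts,
        key=lambda d: (broadcasts[d]["snr_db"],   # best audio quality first
                       broadcasts[d]["priority"], # then device priority
                       d),                        # name as a final tie-breaker
    )
```

Running this on each of the sound box 201, television 202, and mobile phone 203 yields the same result without any central coordinator.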
For example, as shown in FIG. 4, the sound box 201, as the answering device, plays the wake-up answering voice, such as "I'm here", while the television 202 and the mobile phone 203 do not. Nevertheless, the voice assistants of the television 202 and the mobile phone 203 are in the awake state and can record as described in step 402 above.
It should be noted that, in the process of performing answering device selection, other information may also be combined to select the answering device, such as the priority of each electronic device. In addition, the specific implementation of answering device selection may also adopt other manners, and the embodiments of this application are not limited to the manner described above. For example, the answering device used the last time by the user, or an answering device set by the user, may be adopted as the answering device of this embodiment.
Step 404, the sound box 201 calls a sound pickup instruction to the television 202 and the mobile phone 203 respectively, and the sound pickup instruction is used for indicating that the recording data is returned.
After step 403, the sound box 201 starts to perform the distributed sound-receiving task. The answering device may invoke a pickup instruction to each of the other, non-answering devices; the pickup instruction is used to instruct the non-answering devices to return their recording data to the answering device.
In connection with the above example, the voice assistant of the sound box 201 may invoke an interface between the voice assistant of the television 202 and the voice assistant of the sound box 201 to deliver a pickup instruction to the television 202. Likewise, the voice assistant of the sound box 201 may invoke an interface between the voice assistant of the mobile phone 203 and the voice assistant of the sound box 201 to deliver a pickup instruction to the mobile phone 203. The pickup instruction may carry identification information of the answering device, which may be the media access control (MAC) address of the answering device. For example, the pickup instruction may carry identification information of the sound box 201 to instruct the television 202 to return the recording data to the sound box 201.
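A pickup instruction carrying the answering device's MAC address might be serialized as follows (the JSON layout, field names, and function names are purely assumptions for illustration; the patent does not specify a wire format):

```python
import json

def make_pickup_instruction(answering_mac: str) -> str:
    """Serialize a pickup instruction that tells a non-answering
    device where to return its recording data."""
    return json.dumps({"type": "PICKUP", "return_to": answering_mac})

def handle_instruction(msg: str):
    """On a non-answering device: extract the answering device's
    address from a received pickup instruction."""
    data = json.loads(msg)
    if data["type"] == "PICKUP":
        return data["return_to"]   # destination for the recording data
```

The television 202 and mobile phone 203 would use the extracted address to route their recording data back to the sound box 201.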
Step 405, the television 202 and the mobile phone 203 respectively send recording data to the sound box 201.
The answering device receives the recording data sent by the other non-answering devices. After sending their respective recording data, the other non-answering devices can continue recording and send new recording data to the answering device.
In connection with the above example of steps, the television 202 sends the sound recording data of the television 202 to the sound box 201. The mobile phone 203 transmits the recording data of the mobile phone 203 to the sound box 201. The recorded data may include the second voice command. For example, the content of the recorded data is "play song 112222".
In an implementation manner, the sound box 201 performs quality evaluation on the received recording data of the television 202, and determines audio quality information corresponding to the recording data of the television 202. The sound box 201 evaluates the quality of the received recording data of the mobile phone 203, and determines audio quality information corresponding to the recording data of the mobile phone 203.
In another implementation manner, the sound box 201 may further receive audio quality information corresponding to the recording data of the television 202 sent by the television 202. The sound box 201 can also receive audio quality information corresponding to the recording data of the mobile phone 203 sent by the mobile phone 203.
Step 406, the sound box 201 determines the optimal sound-receiving device among the sound box 201, the television 202, and the mobile phone 203 according to the audio quality information, and plays the answering voice corresponding to the second voice instruction according to the recording data of the optimal sound-receiving device.
The answering device selects the optimal sound-receiving device from the electronic devices according to the audio quality information corresponding to the recording data of the electronic devices (including the answering device and the other non-answering devices), and performs processing such as SE (speech enhancement) and ASR (automatic speech recognition) using the recording data of the optimal sound-receiving device, so as to correctly recognize the voice instruction input by the user and thereby accurately respond to it. Accurately responding to the voice instruction input by the user includes playing the answering voice corresponding to that voice instruction. In some embodiments, it may also include the answering device or another non-answering device being triggered to execute the event corresponding to the voice instruction, such as playing a song, playing a video, or making a phone call.
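Step 406's choice of which recording to feed into SE/ASR can be sketched as follows (a simplified model keyed on a hypothetical signal-to-noise field; real selection would use the fuller audio quality information described above):

```python
def select_recording_for_asr(recordings):
    """recordings: {device_name: {'snr_db': float, 'audio': list}},
    covering the answering device and all non-answering devices.
    Returns the optimal device's name and its recording data."""
    best = max(recordings, key=lambda d: recordings[d]["snr_db"])
    return best, recordings[best]["audio"]   # feed this audio to SE/ASR
```

The returned audio is then processed locally, or forwarded to the server 204 as described below.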
It should be noted that, in some embodiments, the sound box 201 may also send the recording data of the optimal sound-receiving device to the server 204 shown in FIG. 1, and the server 204 performs processing such as SE and ASR using that recording data, so as to correctly recognize the voice instruction input by the user and accurately respond to it.
For example, as shown in FIG. 4 and FIG. 5, although the user is closest to the mobile phone 203, the user is using the blower 205, and the noise generated by the blower 205 affects the sound-receiving quality of the mobile phone 203. The sound box 201 of this embodiment therefore determines, according to the audio quality information of the recording data of the sound box 201, the television 202, and the mobile phone 203, that the optimal sound-receiving device among the three is the sound box 201. For example, as shown in FIG. 5, the sound box 201 may play the answering voice "OK, song 112222 will be played for you here". The multimedia assets of song 112222 may be provided by the server 204 or the mobile phone 203.
Optionally, in another implementation, the sound box 201 may also play the answering voice corresponding to the second voice instruction according to both its own recording data and the recording data of the optimal sound-receiving device. For example, the sound box 201 may splice its own recording data with the recording data of the optimal sound-receiving device, and play the answering voice corresponding to the second voice instruction based on the spliced recording data.
Optionally, after step 406, steps 404 to 406 may be executed again to process the new recorded data in a similar manner to correctly recognize the new voice command input by the user, so as to accurately respond to the new voice command input by the user.
Optionally, in some embodiments, the voice control method according to the embodiment of the present application may further process the new recording data through the following steps.
Step 407, the sound box 201 sends a recording stop instruction to the television 202 and the mobile phone 203 respectively.
The answering device sends a stop-recording instruction to the other non-answering devices. The stop-recording instruction is used to instruct them to stop recording and to discard the recording data.
Step 408, the television 202 and the mobile phone 203 stop recording respectively, and discard the recording data.
Other non-answering devices stop recording based on the stop recording instruction to reduce power consumption.
For example, the sound box 201 sends a stop-recording instruction to the television 202 and the mobile phone 203, respectively. The television 202 and the mobile phone 203 each stop recording and discard their recording data, for example, the recording data corresponding to the second voice instruction. Thereafter, a new voice instruction input by the user is received by the sound box 201. For example, the third voice instruction spoken by the user is "change a song". The sound box 201 records the third voice instruction to generate recording data whose content is "change a song". The sound box 201 performs processing such as SE and ASR using the recording data to correctly recognize the voice instruction input by the user and then accurately respond to it. For example, the sound box 201 may play the answering voice "OK, switching songs for you" and play the switched song.
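The behavior of a non-answering device on receiving the stop-recording instruction (steps 407 and 408) can be modeled minimally as follows (class and method names are illustrative):

```python
class NonAnsweringDevice:
    """Minimal model of a non-answering device that stops recording
    and discards its recording data on instruction, to reduce
    power consumption."""
    def __init__(self):
        self.recording = True
        self.recording_data = ["chunk_1", "chunk_2"]   # placeholder chunks

    def on_stop_recording(self):
        self.recording = False        # stop the microphone capture
        self.recording_data.clear()   # discard saved recording data
```

After this point only the answering device keeps recording and processing new voice instructions.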
It should be noted that this embodiment is illustrated with the answering device and the optimal sound-receiving device both being the sound box 201. The answering device and the optimal sound-receiving device may be the same device or different devices; for example, the answering device may be the sound box 201 while the optimal sound-receiving device is the television 202. The embodiment of the present application is not limited by the above examples. When the answering device and the optimal sound-receiving device are different devices, the answering device may call the recording data of the optimal sound-receiving device.
In some embodiments, when the voice command received by the answering device is used to turn off the voice assistant, the answering device may stop calling the recorded data of other non-answering devices, then stop its distributed reception task, and discard the recorded data.
According to the embodiment of the present application, when a plurality of electronic devices each receive the first voice command input by the user, each of them wakes up its own voice assistant and starts recording; the first voice command is used for waking up the voice assistants of the electronic devices. After the plurality of electronic devices negotiate to determine the answering device, the answering device may determine the optimal sound-receiving device according to the recording data of each electronic device, and play the response voice corresponding to the second voice instruction according to the recording data of the optimal sound-receiving device. Unlike a mode in which recording starts only after a central device issues a call, in this embodiment each electronic device starts recording directly after being woken up, without depending on a call from a central device, thereby realizing a decentralized cooperative sound-reception mode. Because recording starts before the answering device is determined, and the recorded data is used for processing such as SE and ASR, the communication delay between devices is effectively masked, which solves the problem of frame loss in voice control caused by delay in a multi-device scenario.
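The decentralized flow above can be sketched as follows. This is a rough illustration under assumed names: each device starts recording the moment it is woken, before any negotiation, and the answering device is then chosen by comparing wake-word audio quality. The quality scores and the max-quality negotiation rule are assumptions for the example.

```python
# Sketch of the decentralized cooperative sound-reception mode: record
# first, negotiate the answering device afterwards, so no audio frames
# are lost to inter-device communication latency.

class Device:
    def __init__(self, name: str, wake_quality: float):
        self.name = name
        self.wake_quality = wake_quality  # audio quality of the wake word
        self.recording = False

    def on_wake_word(self):
        self.recording = True  # start recording immediately, no central call

def negotiate_answering_device(devices):
    # Devices compare wake-word quality; the best-quality device answers.
    return max(devices, key=lambda d: d.wake_quality)

devices = [Device("sound box 201", 0.9),
           Device("television 202", 0.6),
           Device("mobile phone 203", 0.4)]
for d in devices:
    d.on_wake_word()
answering = negotiate_answering_device(devices)
assert all(d.recording for d in devices)
assert answering.name == "sound box 201"
```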
By using the recording data of the optimal sound-receiving device for processing such as SE and ASR, the voice instruction input by the user can be correctly recognized and accurately responded to, improving the accuracy of voice control.
By combining the wake-up and sound-reception processes, audio recording starts in advance and each electronic device can evaluate the quality of its own recording data. This speeds up audio evaluation on each device, shortens the time required for the subsequent decision on the optimal sound-receiving device, streamlines the processing flow of the voice control method, and increases the response speed of voice control.
It should be noted that the embodiment of fig. 3 is illustrated with the voice assistant being woken up by a wake-up word and then starting to record, but the embodiment of the present application is not limited thereto. Recording on the electronic device may also be triggered in other manners, without a wake-up process, while still improving the accuracy of voice control based on cooperative sound reception by multiple devices. For example, the electronic device may start recording when it detects a human voice, or when it detects the voice of a specific user; these manners are not exhaustively illustrated in the embodiments of the present application. A specific implementation of the voice control method in which recording is triggered without a wake-up process is similar to the embodiment shown in fig. 3: after recording starts, the answering device calls a pickup instruction, the non-answering devices return recording data, and the answering device determines the optimal sound-receiving device according to the recording data of each electronic device and plays the response voice corresponding to the second voice instruction according to the recording data of the optimal sound-receiving device. For the principle and technical effects, reference may be made to the explanation of the above embodiments.
Fig. 6 is a flowchart illustrating another voice control method according to an embodiment of the present application. This embodiment is illustrated with the three electronic devices shown in fig. 1, namely the sound box 201, the television 202 and the mobile phone 203, with the sound box 201 as the answering device. This embodiment relates to a non-first turn after the electronic devices wake up, for example the second, third, or fourth turn of a multi-turn dialog with the voice assistant. As shown in fig. 6, the method of this embodiment may include:
Step 701, the sound box 201 calls a multi-turn dialog pause instruction to the television 202 and the mobile phone 203 respectively, where the multi-turn dialog pause instruction indicates that the multi-turn dialog is temporarily paused.
When the answering device does not detect a new voice command input by the user within a preset time period, there is a time interval between the voice commands input by the user. The answering device detects this interval and triggers a multi-turn dialog pause operation. The answering device may call a multi-turn dialog pause instruction to each of the other non-answering devices; the instruction indicates that the multi-turn dialog is temporarily paused.
For example, the voice assistant of the sound box 201 may invoke an interface between the voice assistant of the television 202 and the voice assistant of the sound box 201 to transmit the multi-turn dialog pause instruction to the television 202, and may invoke an interface between the voice assistant of the mobile phone 203 and the voice assistant of the sound box 201 to transmit the multi-turn dialog pause instruction to the mobile phone 203. The sound box 201 deletes its previously saved recording data and keeps recording.
Step 702, the television 202 and the mobile phone 203 delete the respective saved recording data and respectively keep recording.
The television 202 and the mobile phone 203 delete the recording data saved before the multi-turn dialog pause instruction was called, and keep recording.
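The behavior in steps 701 and 702 can be sketched as follows. The class and method names are illustrative assumptions: on a multi-turn dialog pause, a device drops the recording data buffered so far but keeps the capture pipeline running, so the next voice command is captured from its first frame.

```python
# Hypothetical sketch of a device handling the multi-turn dialog pause
# instruction: discard stale recording data, but do NOT stop recording
# (in contrast to the stop-recording instruction of step 407).

class RecordingDevice:
    def __init__(self):
        self.recording = True
        self.buffer = bytearray()  # saved recording data

    def on_multi_turn_pause(self):
        self.buffer.clear()  # delete the saved recording data
        # self.recording stays True — recording continues

tv = RecordingDevice()
tv.buffer.extend(b"stale-frames")
tv.on_multi_turn_pause()
assert tv.recording and len(tv.buffer) == 0
```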
Step 703, the sound box 201, the television 202 and the mobile phone 203 respectively receive a fourth voice instruction input by the user, and respectively record the fourth voice instruction to generate respective recording data.
In some embodiments, the sound box 201, the television 202, and the mobile phone 203 may further perform quality evaluation on the received recording data, and determine audio quality information corresponding to the received recording data.
For example, as shown in fig. 7, the fourth voice command spoken by the user may be "play movie 333333". The sound box 201, the television 202, and the mobile phone 203 each record the fourth voice command to generate their respective recording data, whose content is "play movie 333333".
Step 704, the sound box 201 calls a sound pickup instruction to the television 202 and the mobile phone 203 respectively, and the sound pickup instruction is used for indicating that recording data are returned.
After the above step 703, the sound box 201 starts to re-execute the distributed sound-reception task. The answering device may call a pickup instruction to each of the other non-answering devices; the pickup instruction instructs the non-answering devices to return their recording data to the answering device.
Step 705, the television 202 and the mobile phone 203 respectively send recording data to the sound box 201.
Continuing the above example, the television 202 sends its recording data to the sound box 201, and the mobile phone 203 sends its recording data to the sound box 201. For example, the content of the recording data is "play movie 333333".
Step 706, the sound box 201 determines the optimal sound-receiving device among the sound box 201, the television 202 and the mobile phone 203 according to the audio quality information, and responds to the fourth voice instruction according to the recording data of the optimal sound-receiving device.
The answering device selects the optimal sound-receiving device from among the electronic devices (including the answering device and the other non-answering devices) according to the audio quality information corresponding to each device's recording data, and performs processing such as SE (speech enhancement) and ASR (automatic speech recognition) on the recording data of the optimal sound-receiving device, so as to correctly recognize the voice instruction input by the user and then respond to it accurately. Accurately responding to the voice command input by the user includes playing the response voice corresponding to that command. In some embodiments, it may also include the answering device or another non-answering device triggering execution of an event corresponding to the voice command, such as playing a song, playing a video, or making a phone call.
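The selection in step 706 can be sketched as below. The tuple layout and the idea of a single scalar quality score (for example an SNR-like value carried in the audio quality information) are assumptions for the example; the patent does not specify how quality is scored.

```python
# Illustrative sketch: the answering device picks the recording whose
# audio quality information indicates the best quality, then that
# recording (not the others) is fed into SE and ASR.

def pick_optimal_recording(recordings):
    """recordings: list of (device_name, quality_score, pcm_data) tuples."""
    return max(recordings, key=lambda r: r[1])

recordings = [
    ("sound box 201", 0.82, b"pcm-a"),
    ("television 202", 0.47, b"pcm-b"),
    ("mobile phone 203", 0.31, b"pcm-c"),
]
best_device, best_quality, best_pcm = pick_optimal_recording(recordings)
assert best_device == "sound box 201"
```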
For example, as shown in fig. 7, the sound box 201 of this embodiment determines, according to the audio quality information of the recording data of the sound box 201, the television 202, and the mobile phone 203, that the optimal sound-receiving device among the three is the sound box 201. The sound box 201 may play the response voice "movie 333333 will be played on the television", and the television 202 starts playing movie 333333.
Thereafter, if the user triggers the multi-turn dialog pause again, the above steps 701 to 706 may be re-executed. In this process, the optimal sound-receiving device may change. For example, continuing the example shown in fig. 7, after the television starts playing the movie, the fifth voice command spoken by the user may be "sound dot". The sound box 201, the television 202 and the mobile phone 203 each record the fifth voice instruction to generate their respective recording data, whose content is "sound dot". Then, through the process involved in the above steps, the optimal sound-receiving device among the sound box 201, the television 202 and the mobile phone 203 is determined to be the television 202, and the sound box 201 may respond to the fifth voice command based on the recording data of the television 202. In this way, when the user's environment changes, a different device can be selected for sound reception according to the recording effect. For example, after the television 202 starts playing a movie, strong self-noise (such as the sound of the movie itself) appears in the user's home; at this point the sound received by the voice assistant of the sound box 201 may be mixed with sentences played by the television, so using the recording data of the sound box 201 could cause ASR recognition errors.
It should be noted that the embodiments shown in fig. 3 and fig. 6 are illustrated with the answering device selecting the optimal sound-receiving device according to the audio quality information and answering the second voice command according to that device's recording data; other processing manners are also possible. For example, the answering device may answer the second voice command directly according to the received recording data, or according to the received recording data together with its own recording data. A specific implementation of the latter may be that the answering device splices the audio content information of the received recording data with the audio content information of its own recording data, and responds to the second voice instruction based on the spliced audio content information. For example, the user speaks the voice signal "play song 112222" but the answering device recognizes only "2222", so the audio content information of the answering device's recording data represents the voice signal "2222", while the audio content information of the recording data received from another device represents the voice signal "play song 112". The answering device may splice the two to obtain spliced audio content information representing the voice signal "play song 112222".
Fig. 8 is a schematic structural diagram of a voice control apparatus according to an embodiment of the present application. As shown in fig. 8, the apparatus may be applied to an electronic device (such as the first electronic device 201 described above) of a voice control system, the voice control system may further include at least a second electronic device (such as the second electronic device 202 or a third electronic device 203), and the apparatus may include: a transceiver module 81 and a processing module 82. For example, the transceiver module 81 may specifically be the mobile communication module 150 and/or the wireless communication module 160 in the embodiment shown in fig. 2. The processing module 82 may be the processor 110 of the embodiment shown in fig. 2.
The transceiver module 81 is configured to receive a first voice command input by a user, and the processing module 82 is configured to respond to the first voice command. The transceiver module 81 is further configured to receive the recording data of the second electronic device sent by the second electronic device, where the recording data of the second electronic device includes recording data of a second voice instruction input by the user, by the second electronic device. The processing module 82 is further configured to respond to the second voice command according to the recording data of the first electronic device and/or the recording data of the second electronic device, where the recording data of the first electronic device includes recording data of the second voice command input by the user recorded by the first electronic device.
In some embodiments, the transceiver module 81 is further configured to invoke a sound pickup instruction to the second electronic device, where the sound pickup instruction is used for the second electronic device to return the sound recording data of the second electronic device.
In some embodiments, the processing module 82 is further configured to record a second voice command input by the user when or after the first electronic device receives the first voice command input by the user.
In some embodiments, the first voice instruction is for waking up a voice control function of the first electronic device and/or the second electronic device.
In some embodiments, the processing module 82 is further configured to determine that the first electronic device is a response device of the voice control system according to the audio quality information of the first voice instruction received by the first electronic device and the audio quality information of the first voice instruction received by the second electronic device.
In some embodiments, the processing module 82 is further configured to delete the saved recording data and continue recording during the recording process of the first electronic device, after the first electronic device answers the first voice command and before recording the second voice command input by the user, for a preset time period during which the second voice command input by the user is not detected. The transceiver module 81 is further configured to invoke a multi-turn dialog pause instruction to the second electronic device, where the multi-turn dialog pause instruction is used to instruct a multi-turn dialog to be temporarily stopped.
In some embodiments, the transceiver module 81 is further configured to receive audio quality information of the audio recording data of the second electronic device transmitted by the second electronic device.
In some embodiments, the processing module 82 is configured to determine an optimal sound receiving device from the voice control system based on the audio quality information of the sound recording data of the first electronic device and the audio quality information of the sound recording data of the second electronic device. And when the optimal radio equipment is the first electronic equipment, responding to the second voice instruction according to the recording data of the first electronic equipment. And when the optimal radio equipment is second electronic equipment, responding to the second voice instruction according to the recording data of the second electronic equipment or according to the recording data of the second electronic equipment and the recording data of the first electronic equipment. Wherein the audio quality information is used to indicate the audio quality of the sound recording data.
In some embodiments, the processing module 82 is configured to respond to the second voice command according to the audio content information of the sound recording data of the first electronic device and/or the audio content information of the sound recording data of the second electronic device. Wherein the audio content information is used to represent the audio content of the sound recording data.
The voice control apparatus of the embodiment of the present application may be configured to execute the step of the answering device (e.g., the sound box 201) in the embodiment of the method, and the technical principle and the technical effect of the voice control apparatus may refer to the explanation of the embodiment of the method, which is not described herein again.
Fig. 9 is a schematic structural diagram of a voice control apparatus according to an embodiment of the present application. As shown in fig. 9, the apparatus may be applied to an electronic device (such as the second electronic device 202 or the third electronic device 203) of a voice control system, the voice control system may further include at least a first electronic device (such as the first electronic device 201), and the apparatus may include: a transceiver module 91 and a processing module 92. For example, the transceiver module 91 may specifically be the mobile communication module 150 and/or the wireless communication module 160 in the embodiment shown in fig. 2. The processing module 92 may be the processor 110 of the embodiment shown in fig. 2.
The processing module 92 is configured to record and save recording data, the recording being used to record a second voice instruction input by the user. The transceiver module 91 is configured to send the recording data of the second electronic device to the first electronic device, where the recording data of the second electronic device includes recording data, recorded by the second electronic device, of the second voice instruction input by the user; this recording data is used by the first electronic device to respond to the second voice instruction after responding to the first voice instruction.
In a possible design, the transceiver module 91 is further configured to receive a pickup instruction invoked by the first electronic device, where the pickup instruction is used for the second electronic device to return recording data of the second electronic device.
In one possible design, the processing module 92 is configured to record the sound when or after the second electronic device receives the first voice command input by the user.
In a possible design, the processing module 92 is further configured to determine that the first electronic device is a response device of the voice control system according to the audio quality information of the first voice instruction received by the second electronic device and the audio quality information of the first voice instruction received by the first electronic device.
In a possible design, the processing module 92 is further configured to receive, through the transceiver module 91, a multi-turn dialog pause instruction invoked by the first electronic device during recording by the second electronic device after the first electronic device responds to the first voice instruction, where the multi-turn dialog pause instruction indicates that the multi-turn dialog is temporarily paused. The processing module 92 is further configured to delete the saved recording data and continue recording.
In one possible design, the transceiver module 91 is further configured to send audio quality information of the sound recording data of the second electronic device to the first electronic device.
The voice control apparatus in the embodiment of the present application may be configured to execute the steps of any non-answering device (such as the television 202 or the mobile phone 203) in the embodiment of the method, and the technical principle and the technical effect of the voice control apparatus may refer to the explanation of the embodiment of the method, which is not described herein again.
Other embodiments of the present application further provide an electronic device, configured to perform the method of the electronic device in each of the above method embodiments. As shown in fig. 10, the electronic device may include: a microphone 1001, one or more processors 1002; one or more memories 1003; the various devices described above may be connected by one or more communication buses 1005. Wherein the memory 1003 stores one or more computer programs 1004, the one or more processors 1002 are configured to execute the one or more computer programs 1004, and the one or more computer programs 1004 include instructions that can be configured to perform the steps performed by any of the electronic devices in the above method embodiments. The electronic device may be any form of electronic device described above, such as a smartphone, a smartwatch, or the like.
Of course, the electronic device shown in fig. 10 may further include other devices such as a display screen, which is not limited in this embodiment. When it includes other devices, it may be specifically the electronic apparatus shown in fig. 2.
The electronic device of the embodiment of the present application may be configured to execute the steps of the electronic device in any one of the above method embodiments, and for technical principles and technical effects, reference may be made to the explanation of the above method embodiments, which is not described herein again.
Further embodiments of the present application also provide a computer storage medium, which may include computer instructions, and when the computer instructions are executed on an electronic device, the electronic device may be caused to perform the steps performed by the electronic device in the above method embodiments.
Further embodiments of the present application also provide a computer program product, which when run on a computer causes the computer to perform the steps performed by the electronic device in the above-mentioned method embodiments.
An embodiment of the present application further provides a voice control system, where the voice control system may at least include: the first electronic device may adopt the structure of the embodiment shown in fig. 8 or fig. 10, and the second electronic device may adopt the structure of the embodiment shown in fig. 9 or fig. 10, and accordingly, the technical solution of any one of the above method embodiments may be executed, which implements similar principles and technical effects, and is not described herein again.
Through the description of the foregoing embodiments, it will be clear to those skilled in the art that, for convenience and simplicity of description, only the division of the functional modules is illustrated, and in practical applications, the above function distribution may be completed by different functional modules as needed, that is, the internal structure of the apparatus may be divided into different functional modules to complete all or part of the above described functions.
In the several embodiments provided in the embodiments of the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described device embodiments are merely illustrative, and for example, the division of the modules or units is only one logical functional division, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another device, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may be one physical unit or a plurality of physical units, that is, may be located in one place, or may be distributed in a plurality of different places. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
The processor mentioned in the above embodiments may be an integrated circuit chip having signal processing capability. In implementation, the steps of the above method embodiments may be performed by integrated logic circuits of hardware in the processor or by instructions in the form of software. The processor may be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or discrete hardware components. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The steps of the methods disclosed in the embodiments of the present application may be directly implemented by a hardware processor, or implemented by a combination of hardware and software modules in the processor. The software module may be located in a storage medium well known in the art, such as RAM, flash memory, ROM, PROM, EPROM, or registers. The storage medium is located in a memory, and the processor reads the information in the memory and completes the steps of the above methods in combination with its hardware.
The memory referred to in the various embodiments above may be volatile memory or non-volatile memory, or may include both volatile and non-volatile memory.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
If the functions are implemented in the form of software functional units and sold or used as a standalone product, they may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, or the portion thereof that substantially contributes to the prior art, may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods according to the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a portable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, or any other medium capable of storing program code.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (29)

1. A voice control method, applied to a voice control system, wherein the voice control system comprises at least a first electronic device and a second electronic device each having a voice control function, and the method comprises:
the first electronic device and the second electronic device each receiving a first voice instruction input by a user, and the first electronic device responding to the first voice instruction;
the second electronic device recording sound and storing the recording data, wherein the recording is used to capture a second voice instruction input by the user;
the second electronic device sending the recording data of the second electronic device to the first electronic device; and
the first electronic device responding to the second voice instruction according to the recording data of the first electronic device and/or the recording data of the second electronic device;
wherein the recording data of the first electronic device comprises recording data of the second voice instruction input by the user as recorded by the first electronic device.
2. The method of claim 1, further comprising:
the first electronic device sending a sound pickup instruction to the second electronic device, wherein the sound pickup instruction instructs the second electronic device to return the recording data of the second electronic device.
3. The method of claim 1 or 2, wherein the second electronic device recording sound comprises:
the second electronic device recording sound when or after the second electronic device receives the first voice instruction input by the user.
4. The method of claim 3, further comprising:
the first electronic device recording sound when or after the first electronic device receives the first voice instruction input by the user, wherein the recording is used to capture the second voice instruction input by the user.
5. The method of any one of claims 1 to 4, wherein the first voice instruction is used to wake up the voice control function of the first electronic device and/or the second electronic device.
6. The method of any one of claims 1 to 5, further comprising:
the first electronic device and the second electronic device each determining, according to audio quality information of the first voice instruction as received by that device, that the first electronic device is the responding device of the voice control system.
7. The method of any one of claims 1 to 6, wherein after the first electronic device responds to the first voice instruction and before the second voice instruction input by the user is recorded, the method further comprises:
during recording by the first electronic device and the second electronic device, if the first electronic device does not detect a second voice instruction input by the user within a preset time period, the first electronic device deleting the stored recording data and continuing to record; the first electronic device sending a multi-turn dialogue pause instruction to the second electronic device, wherein the multi-turn dialogue pause instruction indicates that the multi-turn dialogue is temporarily suspended; and the second electronic device deleting the stored recording data and continuing to record.
8. The method of any one of claims 1 to 7, further comprising:
the first electronic device receiving audio quality information of the recording data of the second electronic device, sent by the second electronic device.
9. The method of any one of claims 1 to 8, wherein the first electronic device responding to the second voice instruction according to the recording data of the first electronic device and/or the recording data of the second electronic device comprises:
the first electronic device determining an optimal sound pickup device in the voice control system according to the audio quality information of the recording data of the first electronic device and the audio quality information of the recording data of the second electronic device;
when the optimal sound pickup device is the first electronic device, the first electronic device responding to the second voice instruction according to the recording data of the first electronic device, or according to the recording data of the first electronic device and the recording data of the second electronic device;
when the optimal sound pickup device is the second electronic device, the first electronic device responding to the second voice instruction according to the recording data of the second electronic device, or according to the recording data of the second electronic device and the recording data of the first electronic device;
wherein the audio quality information represents the audio quality of the recording data.
10. The method of any one of claims 1 to 8, wherein the first electronic device responding to the second voice instruction according to the recording data of the first electronic device and/or the recording data of the second electronic device comprises:
the first electronic device responding to the second voice instruction according to audio content information of the recording data of the first electronic device and/or audio content information of the recording data of the second electronic device;
wherein the audio content information represents the audio content of the recording data.
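As an illustrative aside (not part of the claims), the selection logic of claims 6 and 9 — each device reports audio quality information and the responder picks the optimal sound pickup device — could be sketched as follows. The device names, the SNR-based quality metric, and the data structure are assumptions for illustration only; the claims do not fix any particular quality measure.

```python
from dataclasses import dataclass

@dataclass
class Recording:
    device_id: str   # which electronic device produced this recording
    snr_db: float    # audio quality information, here a signal-to-noise ratio
    samples: bytes   # the recording data itself

def pick_best_device(recordings):
    """Return the id of the device whose recording has the best audio quality."""
    best = max(recordings, key=lambda r: r.snr_db)
    return best.device_id

# The responder compares its own recording against the one the peer sent back.
recs = [Recording("first_device", 18.5, b"..."),
        Recording("second_device", 24.0, b"...")]
print(pick_best_device(recs))  # prints: second_device
```

Under this sketch, the responding device would then recognize the second voice instruction from the winning recording (or from a combination of both recordings, as claim 9 allows).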
11. A voice control method, applied to a first electronic device of a voice control system, wherein the voice control system further comprises at least a second electronic device, and the method comprises:
the first electronic device receiving a first voice instruction input by a user and responding to the first voice instruction;
the first electronic device receiving recording data of the second electronic device sent by the second electronic device, wherein the recording data of the second electronic device comprises recording data of a second voice instruction input by the user as recorded by the second electronic device; and
the first electronic device responding to the second voice instruction according to the recording data of the first electronic device and/or the recording data of the second electronic device, wherein the recording data of the first electronic device comprises recording data of the second voice instruction input by the user as recorded by the first electronic device.
12. The method of claim 11, further comprising:
the first electronic device sending a sound pickup instruction to the second electronic device, wherein the sound pickup instruction instructs the second electronic device to return the recording data of the second electronic device.
13. The method of claim 12, further comprising:
the first electronic device recording sound when or after the first electronic device receives the first voice instruction input by the user, wherein the recording is used to capture the second voice instruction input by the user.
14. The method of any one of claims 11 to 13, wherein the first voice instruction is used to wake up the voice control function of the first electronic device and/or the second electronic device.
15. The method of any one of claims 11 to 14, further comprising:
the first electronic device determining that the first electronic device is the responding device of the voice control system according to audio quality information of the first voice instruction as received by the first electronic device and audio quality information of the first voice instruction as received by the second electronic device.
16. The method of any one of claims 11 to 15, wherein after the first electronic device responds to the first voice instruction and before the second voice instruction input by the user is recorded, the method further comprises:
during recording by the first electronic device, if the first electronic device does not detect a second voice instruction input by the user within a preset time period, the first electronic device deleting the stored recording data and continuing to record; and the first electronic device sending a multi-turn dialogue pause instruction to the second electronic device, wherein the multi-turn dialogue pause instruction indicates that the multi-turn dialogue is temporarily suspended.
17. The method of any one of claims 11 to 16, further comprising:
the first electronic device receiving audio quality information of the recording data of the second electronic device, sent by the second electronic device.
18. The method of any one of claims 11 to 17, wherein the first electronic device responding to the second voice instruction according to the recording data of the first electronic device and/or the recording data of the second electronic device comprises:
the first electronic device determining an optimal sound pickup device in the voice control system according to the audio quality information of the recording data of the first electronic device and the audio quality information of the recording data of the second electronic device;
when the optimal sound pickup device is the first electronic device, the first electronic device responding to the second voice instruction according to the recording data of the first electronic device;
when the optimal sound pickup device is the second electronic device, the first electronic device responding to the second voice instruction according to the recording data of the second electronic device, or according to the recording data of the second electronic device and the recording data of the first electronic device;
wherein the audio quality information represents the audio quality of the recording data.
19. The method of any one of claims 11 to 17, wherein the first electronic device responding to the second voice instruction according to the recording data of the first electronic device and/or the recording data of the second electronic device comprises:
the first electronic device responding to the second voice instruction according to audio content information of the recording data of the first electronic device and/or audio content information of the recording data of the second electronic device;
wherein the audio content information represents the audio content of the recording data.
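As an illustrative aside (not part of the claims), the multi-turn pause behaviour of claims 7 and 16 — no second voice instruction within a preset period, so the responder deletes its buffered recording, keeps recording, and tells the peer to pause the multi-turn dialogue — could be sketched as follows. The timeout value, the `peer` interface with its `pause_multi_turn_dialogue()` method, and the explicit `now` clock parameter are all assumptions made for illustration; the claims specify only the behaviour, not an implementation.

```python
PRESET_TIMEOUT_S = 8.0  # assumed value; the claims only say "a preset time period"

class Responder:
    """Sketch of the responding device's multi-turn pause logic (claims 7 and 16)."""

    def __init__(self, peer, timeout_s=PRESET_TIMEOUT_S):
        self.peer = peer                 # assumed interface: pause_multi_turn_dialogue()
        self.timeout_s = timeout_s
        self.buffer = bytearray()        # stored recording data
        self.last_instruction_at = 0.0   # time of the last detected voice instruction

    def on_audio(self, chunk: bytes, contains_instruction: bool, now: float):
        if contains_instruction:
            self.last_instruction_at = now
        elif now - self.last_instruction_at > self.timeout_s:
            self.buffer.clear()                    # delete the stored recording data
            self.peer.pause_multi_turn_dialogue()  # notify the second device
            self.last_instruction_at = now         # restart the preset period
        self.buffer.extend(chunk)                  # ...and continue recording

class PeerStub:
    """Minimal stand-in for the second electronic device."""
    def __init__(self):
        self.paused = False
    def pause_multi_turn_dialogue(self):
        self.paused = True

peer = PeerStub()
r = Responder(peer, timeout_s=8.0)
r.on_audio(b"\x01", contains_instruction=True, now=0.0)
r.on_audio(b"\x02", contains_instruction=False, now=10.0)  # preset period exceeded
print(peer.paused, len(r.buffer))  # prints: True 1
```

Per claim 7, the second device performs the mirror-image step on receiving the pause instruction: it likewise deletes its stored recording data and continues recording.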
20. A voice control method, applied to a second electronic device of a voice control system, wherein the voice control system further comprises at least a first electronic device, and the method comprises:
the second electronic device recording sound and storing the recording data, wherein the recording is used to capture a second voice instruction input by a user; and
the second electronic device sending the recording data of the second electronic device to the first electronic device, wherein the recording data of the second electronic device comprises recording data of the second voice instruction input by the user as recorded by the second electronic device, and the recording data is used by the first electronic device to respond to the second voice instruction after the first electronic device has responded to a first voice instruction.
21. The method of claim 20, further comprising:
the second electronic device receiving a sound pickup instruction sent by the first electronic device, wherein the sound pickup instruction instructs the second electronic device to return the recording data of the second electronic device.
22. The method of claim 20 or 21, wherein the second electronic device recording sound comprises:
the second electronic device recording sound when or after the second electronic device receives a first voice instruction input by the user.
23. The method of any one of claims 20 to 22, further comprising:
the second electronic device determining that the first electronic device is the responding device of the voice control system according to audio quality information of the first voice instruction as received by the second electronic device and audio quality information of the first voice instruction as received by the first electronic device.
24. The method of any one of claims 20 to 23, wherein after the first electronic device responds to the first voice instruction, the method further comprises:
during recording by the second electronic device, the second electronic device receiving a multi-turn dialogue pause instruction sent by the first electronic device, wherein the multi-turn dialogue pause instruction indicates that the multi-turn dialogue is temporarily suspended; and the second electronic device deleting the stored recording data and continuing to record.
25. The method of any one of claims 20 to 24, further comprising:
the second electronic device sending audio quality information of the recording data of the second electronic device to the first electronic device.
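As an illustrative aside (not part of the claims), the second device's side of the exchange in claims 20, 21, and 25 — return recording data together with its audio quality information in reply to a sound pickup instruction — could be sketched as a simple serialized message. The JSON wire format, field names, and SNR quality metric are assumptions for illustration; the claims do not prescribe any message format.

```python
import base64
import json

def build_pickup_response(device_id: str, recording: bytes, snr_db: float) -> str:
    """Build the reply a second device might send to the first device:
    its recording data plus audio quality information (assumed format)."""
    return json.dumps({
        "device_id": device_id,
        "recording_b64": base64.b64encode(recording).decode("ascii"),
        "snr_db": snr_db,  # audio quality information for the recording data
    })

msg = build_pickup_response("second_device", b"\x00\x7f\x00", 21.5)
print(json.loads(msg)["snr_db"])  # prints: 21.5
```

On receipt, the first device would decode the recording data and compare `snr_db` against its own recording's quality, as in claims 18 and 25.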
26. An electronic device, comprising: one or more processors and a memory;
wherein the memory is coupled to the one or more processors and is configured to store computer program code, the computer program code comprising computer instructions which, when executed by the one or more processors, cause the electronic device to perform the voice control method of any one of claims 11 to 19, or cause the electronic device to perform the voice control method of any one of claims 20 to 25.
27. A computer storage medium comprising computer instructions which, when run on an electronic device, cause the electronic device to perform the voice control method of any one of claims 11 to 19, or cause the electronic device to perform the voice control method of any one of claims 20 to 25.
28. A computer program product which, when run on a computer, causes the computer to perform the voice control method of any one of claims 11 to 19, or causes the computer to perform the voice control method of any one of claims 20 to 25.
29. A voice control system, comprising at least a first electronic device and a second electronic device having a voice control function, wherein the voice control system is configured to perform the voice control method of any one of claims 1 to 10.
CN202110130831.0A 2021-01-29 2021-01-29 Voice control method and electronic equipment Pending CN114822525A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110130831.0A CN114822525A (en) 2021-01-29 2021-01-29 Voice control method and electronic equipment
PCT/CN2021/142083 WO2022161077A1 (en) 2021-01-29 2021-12-28 Speech control method, and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110130831.0A CN114822525A (en) 2021-01-29 2021-01-29 Voice control method and electronic equipment

Publications (1)

Publication Number Publication Date
CN114822525A true CN114822525A (en) 2022-07-29

Family

ID=82526078

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110130831.0A Pending CN114822525A (en) 2021-01-29 2021-01-29 Voice control method and electronic equipment

Country Status (2)

Country Link
CN (1) CN114822525A (en)
WO (1) WO2022161077A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024078236A1 (en) * 2022-10-11 2024-04-18 华为技术有限公司 Recording control method, electronic device, and medium

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102299330B1 (en) * 2014-11-26 2021-09-08 삼성전자주식회사 Method for voice recognition and an electronic device thereof
CN107622652B (en) * 2016-07-15 2020-10-02 青岛海尔智能技术研发有限公司 Voice control method of household appliance system and household appliance control system
US10559309B2 (en) * 2016-12-22 2020-02-11 Google Llc Collaborative voice controlled devices
WO2018148315A1 (en) * 2017-02-07 2018-08-16 Lutron Electronics Co., Inc. Audio-based load control system
CN111326151A (en) * 2018-12-14 2020-06-23 上海诺基亚贝尔股份有限公司 Apparatus, method and computer-readable medium for voice interaction
CN111369994B (en) * 2020-03-16 2023-08-29 维沃移动通信有限公司 Voice processing method and electronic equipment
CN112002319A (en) * 2020-08-05 2020-11-27 海尔优家智能科技(北京)有限公司 Voice recognition method and device of intelligent equipment


Also Published As

Publication number Publication date
WO2022161077A1 (en) 2022-08-04

Similar Documents

Publication Publication Date Title
WO2021000876A1 (en) Voice control method, electronic equipment and system
CN110784830B (en) Data processing method, Bluetooth module, electronic device and readable storage medium
CN111046680B (en) Translation method and electronic equipment
CN110347269B (en) Empty mouse mode realization method and related equipment
CN111369988A (en) Voice awakening method and electronic equipment
CN111819533B (en) Method for triggering electronic equipment to execute function and electronic equipment
WO2020019176A1 (en) Method for updating wake-up voice of voice assistant by terminal, and terminal
CN111742539B (en) Voice control command generation method and terminal
CN112312366B (en) Method, electronic equipment and system for realizing functions through NFC (near field communication) tag
WO2021052139A1 (en) Gesture input method and electronic device
CN115589051B (en) Charging method and terminal equipment
WO2022161077A1 (en) Speech control method, and electronic device
CN109285563B (en) Voice data processing method and device in online translation process
WO2020051852A1 (en) Method for recording and displaying information in communication process, and terminals
CN114120987B (en) Voice wake-up method, electronic equipment and chip system
WO2022007757A1 (en) Cross-device voiceprint registration method, electronic device and storage medium
CN113380240B (en) Voice interaction method and electronic equipment
CN114116610A (en) Method, device, electronic equipment and medium for acquiring storage information
WO2020034104A1 (en) Voice recognition method, wearable device, and system
CN113867520A (en) Device control method, electronic device, and computer-readable storage medium
CN114691844A (en) Conversation task management method and device and electronic equipment
CN115480250A (en) Voice recognition method and device, electronic equipment and storage medium
CN114041102A (en) Service providing method and device
CN113973152A (en) Unread message quick reply method and electronic equipment
CN115706755A (en) Echo cancellation method, electronic device, and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination