CN111681654A - Voice control method and device, electronic equipment and storage medium - Google Patents

Voice control method and device, electronic equipment and storage medium

Info

Publication number
CN111681654A
Authority
CN
China
Prior art keywords
signal
electronic device
voice control
sound
human voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010436761.7A
Other languages
Chinese (zh)
Inventor
王杰
陈孝良
李智勇
常乐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing SoundAI Technology Co Ltd
Original Assignee
Beijing SoundAI Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing SoundAI Technology Co Ltd
Priority to CN202010436761.7A
Publication of CN111681654A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/223: Execution procedure of a spoken command

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The disclosure provides a voice control method and apparatus, an electronic device, and a storage medium, and belongs to the technical field of artificial intelligence. The method includes: performing sound collection through a sound collection device in a scenario where a voice control module of the electronic device is in a dormant state and the electronic device is currently connected to the sound collection device; in response to the sound collection device collecting a first human voice signal, receiving the first human voice signal sent by the sound collection device and waking up the voice control module; and executing, through the voice control module, an operation corresponding to the first human voice signal based on the first human voice signal. The voice control module of the electronic device is woken up directly upon receiving the first human voice signal collected by the sound collection device, without requiring a specific wake-up word. Moreover, after the sound collection device collects the first human voice signal, it reports the signal directly to the electronic device, which solves the problem of poor voice control efficiency at the sound collection device and improves the voice control effect.

Description

Voice control method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of artificial intelligence technologies, and in particular, to a voice control method and apparatus, an electronic device, and a storage medium.
Background
With the development of artificial intelligence, more and more electronic devices have a voice control function; that is, a user can control the electronic device to perform operations through voice. Many users, such as office workers, students, and drivers in cities, own Bluetooth headsets, and how a user controls an electronic device to perform operations while the device is connected to a Bluetooth headset is a key focus in the industry.
In the related art, when the electronic device is connected to a Bluetooth headset, the user first wakes up the headset with a specific wake-up word; the headset then collects sound and matches the collected human voice signal against an offline command word library stored locally. When a command word is matched, the Bluetooth headset sends the operation command corresponding to the human voice signal to the electronic device, and the electronic device performs the corresponding operation according to that command.
However, because the user must wake up the Bluetooth headset with a wake-up word before it collects the voice signal, the interaction does not follow the natural flow of human conversation; the voice control process is cumbersome and its efficiency is low. In addition, the offline command word library in the Bluetooth headset contains only a limited set of command words, so the headset can match only a small number of operation commands. The user can therefore control the electronic device to perform only a few specific operations and cannot obtain a humanized and intelligent voice interaction experience, so the voice control effect is poor.
Disclosure of Invention
The embodiments of the present disclosure provide a voice control method and apparatus, an electronic device, and a storage medium, which can simplify the voice control process and improve voice control efficiency. The technical solution is as follows:
In one aspect, a voice control method is provided, the method including:
performing sound collection through a sound collection device in a scenario where a voice control module of an electronic device is in a dormant state and the electronic device is currently connected to the sound collection device;
in response to the sound collection device collecting a first human voice signal, receiving the first human voice signal sent by the sound collection device and waking up the voice control module; and
executing, through the voice control module, an operation corresponding to the first human voice signal based on the first human voice signal.
In one possible implementation, the executing, based on the first human voice signal, an operation corresponding to the first human voice signal includes:
in response to the first human voice signal including a command word, performing intention recognition on the first human voice signal to obtain intention information of the first human voice signal; and
in response to the intention information being used to trigger the electronic device to perform a target operation, performing the target operation.
In another possible implementation manner, the method further includes:
determining a target application program for executing the intention information according to the intention information;
determining that the intention information is used to trigger the electronic device to perform a target operation in response to the target application being included on the electronic device.
In another possible implementation manner, before performing an operation corresponding to the first vocal signal based on the first vocal signal, the method further includes:
acquiring first voiceprint information of the first human voice signal; and
in response to the first voiceprint information matching second voiceprint information preset in the electronic device, executing the operation corresponding to the first human voice signal based on the first human voice signal.
In another possible implementation manner, before performing an operation corresponding to the first vocal signal based on the first vocal signal, the method further includes:
in response to no second human voice signal being collected within a first preset duration before the first human voice signal is collected, executing the operation corresponding to the first human voice signal based on the first human voice signal; or,
in response to a third human voice signal being collected within a second preset duration before the first human voice signal is collected and third voiceprint information of the third human voice signal matching first voiceprint information of the first human voice signal, executing the operation corresponding to the first human voice signal based on the first human voice signal.
In another possible implementation manner, before performing an operation corresponding to the first vocal signal based on the first vocal signal, the method further includes:
acquiring a first time at which a display screen of the electronic device was last touched;
determining a time difference between a current second time and the first time;
and executing the operation corresponding to the first human voice signal based on the first human voice signal in response to the time difference being greater than a third preset time length.
In another possible implementation manner, the performing, by the voice control module, an operation corresponding to the first vocal signal based on the first vocal signal includes:
sending the first human voice signal to a server through the voice control module, and receiving an operation instruction corresponding to the first human voice signal returned by the server;
and executing the target operation corresponding to the operation instruction.
In another aspect, a voice control apparatus is provided, the apparatus comprising:
a sound collection module, configured to perform sound collection through the sound collection device in a scenario where a voice control module of the electronic device is in a dormant state and the electronic device is currently connected to the sound collection device;
a wake-up module, configured to, in response to the sound collection device collecting a first human voice signal, receive the first human voice signal sent by the sound collection device and wake up the voice control module; and
an execution module, configured to execute, through the voice control module, an operation corresponding to the first human voice signal based on the first human voice signal.
In a possible implementation manner, the execution module is further configured to perform intention recognition on a first vocal signal in response to a command word included in the first vocal signal, so as to obtain intention information of the first vocal signal;
and responding to the intention information for triggering the electronic equipment to execute the target operation, and executing the target operation.
In another possible implementation manner, the execution module is further configured to determine, according to the intention information, a target application program for executing the intention information; determining that the intention information is used to trigger the electronic device to perform a target operation in response to the target application being included on the electronic device.
In another possible implementation manner, the apparatus further includes:
an acquisition module configured to acquire first voiceprint information of the first vocal signal;
the execution module is further configured to, in response to the first voiceprint information matching second voiceprint information preset in the electronic device, execute the step of executing the operation corresponding to the first human voice signal based on the first human voice signal.
In another possible implementation, the execution module is further configured to, in response to no second human voice signal being collected within a first preset duration before the first human voice signal is collected, execute the step of executing the operation corresponding to the first human voice signal based on the first human voice signal; or,
in response to a third human voice signal being collected within a second preset duration before the first human voice signal is collected and third voiceprint information of the third human voice signal matching first voiceprint information of the first human voice signal, execute the step of executing the operation corresponding to the first human voice signal based on the first human voice signal.
In another possible implementation manner, the execution module is further configured to acquire a first time when a display screen of the electronic device is touched last time; determining a time difference between a current second time and the first time; and executing the operation corresponding to the first human voice signal based on the first human voice signal in response to the time difference being greater than a third preset time length.
In another possible implementation manner, the execution module is further configured to send the first human voice signal to a server through the voice control module, and receive an operation instruction corresponding to the first human voice signal returned by the server; and executing the target operation corresponding to the operation instruction.
In another aspect, an electronic device is provided, and the electronic device includes a processor and a memory, where the memory stores at least one instruction, and the instruction is loaded and executed by the processor to implement the operations performed in the voice control method in any one of the above possible implementations.
In another aspect, a computer-readable storage medium is provided, where at least one instruction is stored, and the instruction is loaded and executed by a processor to implement the operations performed by the electronic device in the voice control method in any one of the above possible implementation manners.
In another aspect, a computer program product is provided, which includes at least one computer program, and when being executed by a processor, is configured to implement the operations performed by an electronic device in the voice control method in any one of the above possible implementation manners.
The technical scheme provided by the embodiment of the disclosure has the following beneficial effects:
In the embodiments of the present disclosure, sound collection is performed through the sound collection device in a scenario where the voice control module of the electronic device is in a dormant state and the electronic device is currently connected to the sound collection device; in response to the sound collection device collecting a first human voice signal, the first human voice signal sent by the sound collection device is received, the voice control module is woken up, and the operation corresponding to the first human voice signal is executed through the voice control module based on the first human voice signal. The voice control module of the electronic device is woken up directly upon receiving the first human voice signal collected by the sound collection device, without requiring a specific wake-up word. This simplifies the voice control process, improves voice control efficiency, and makes voice control more natural. Moreover, after collecting the first human voice signal, the sound collection device reports it directly to the electronic device, so the sound collection device does not need to match the signal against operation commands. This solves the problem of poor voice control caused by the sound collection device being able to match only a small number of operation commands, and improves the voice control effect.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present disclosure, the drawings needed in the description of the embodiments are briefly introduced below. The drawings in the following description are only some embodiments of the present disclosure; other drawings can be obtained from them by those skilled in the art without creative effort.
FIG. 1 is a schematic diagram of an implementation environment provided by an embodiment of the present disclosure;
FIG. 2 is a flowchart of a voice control method provided by an embodiment of the present disclosure;
FIG. 3 is a flowchart of another voice control method provided by an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of a voice control method provided by an embodiment of the present disclosure;
FIG. 5 is a flowchart of another voice control method provided by an embodiment of the present disclosure;
FIG. 6 is a block diagram of a voice control apparatus provided by an embodiment of the present disclosure;
FIG. 7 is a block diagram of another voice control apparatus provided by an embodiment of the present disclosure;
FIG. 8 is a schematic structural diagram of an electronic device provided by an embodiment of the present disclosure.
Detailed Description
To make the objects, technical solutions and advantages of the present disclosure more apparent, embodiments of the present disclosure will be described in detail with reference to the accompanying drawings.
FIG. 1 is a schematic diagram of an implementation environment provided by embodiments of the present disclosure. Referring to fig. 1, the implementation environment includes an electronic device 101 and a sound collection device 102. The electronic device 101 and the sound collection device 102 are connected via a wireless or wired network. The wireless connection may be a bluetooth connection, an NFC (Near Field Communication) connection, or the like.
The sound collection device 102 may be a headset (e.g., a Bluetooth headset), a microphone, or another collection device equipped with a MIC (microphone) component. The electronic device 101 may be a computer, a mobile phone, a tablet computer, an intelligent robot, a smart speaker, a smart home device, a smart toy, a vehicle-mounted terminal, a television box, or another device having a voice control function.
In the embodiment of the present disclosure, in a scenario where the voice control module of the electronic device 101 is in a dormant state and the electronic device 101 is currently connected to the sound collection device 102, the sound collection device 102 does not need to be woken up by a specific wake-up word; after collecting the first human voice signal, it sends the signal directly to the electronic device 101. The electronic device 101 directly wakes up the voice control module and, based on the first human voice signal, executes the operation corresponding to the first human voice signal.
In one possible implementation, the electronic device 101 directly recognizes an operation instruction of the first vocal signal, and then performs an operation corresponding to the first vocal signal based on the operation instruction. In another possible implementation, the operating instruction of the first vocal signal is recognized by the server. Correspondingly, the implementation environment further comprises a server 103; the electronic device 101 may be installed with a target application served by the server 103, and the target application transmits the first vocal signal to the server 103, and the server 103 recognizes an operation instruction of the first vocal signal. The target application may be a system application on the electronic device 101, or a third party application. For example, the target application may be an intelligent voice assistant.
It should be noted that the operation corresponding to the first human voice signal may be any operation that the electronic device can perform. In one possible implementation, the operation is a query performed by the electronic device, for example, "query a route" or "query knowledge". In another possible implementation, the operation is the electronic device giving feedback on the intention information, for example, "chat with the user". In another possible implementation, the operation is the electronic device opening a target application, for example, "open the XX application".
Fig. 2 is a flowchart of a voice control method according to an embodiment of the present disclosure. Referring to fig. 2, the embodiment includes:
step 201, sound collection is performed through a sound collection device in a scene that a voice control module of the electronic device is in a dormant state and the electronic device is currently connected with the sound collection device.
Step 202, responding to the first human voice signal collected by the sound collection device, receiving the first human voice signal sent by the sound collection device, and waking up the voice control module.
And 203, executing an operation corresponding to the first human voice signal based on the first human voice signal through the voice control module.
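As an illustration only (not part of the claimed embodiments), the following sketch shows how steps 201 to 203 might be organized on the electronic device side; the class and function names are hypothetical and assume the sound collection device pushes the first human voice signal over an existing connection.

```python
# Illustrative sketch of steps 201-203 on the electronic device side.
# All class and function names here are hypothetical; the disclosure does not
# prescribe a concrete API.

class VoiceControlModule:
    def __init__(self):
        self.awake = False

    def wake_up(self) -> None:
        # Step 202: switch the module from the dormant state to the awake state.
        self.awake = True

    def execute(self, voice_signal: bytes) -> None:
        # Step 203: perform the operation corresponding to the voice signal,
        # e.g. local intention recognition or forwarding the signal to a server.
        print("executing operation for %d bytes of audio" % len(voice_signal))


def on_voice_signal_received(module: VoiceControlModule, first_voice_signal: bytes) -> None:
    """Called when the connected sound collection device reports a human voice
    signal while the voice control module is dormant (steps 201-202)."""
    if not module.awake:
        module.wake_up()            # no wake-up word is required
    module.execute(first_voice_signal)
```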
In one possible implementation, the executing, based on the first human voice signal, an operation corresponding to the first human voice signal includes:
in response to the first human voice signal including a command word, performing intention recognition on the first human voice signal to obtain intention information of the first human voice signal; and
in response to the intention information being used to trigger the electronic device to perform a target operation, performing the target operation.
In another possible implementation manner, the method further includes:
determining a target application program for executing the intention information according to the intention information;
determining that the intention information is used to trigger the electronic device to perform a target operation in response to the target application being included on the electronic device.
In another possible implementation manner, before performing an operation corresponding to the first vocal signal based on the first vocal signal, the method further includes:
acquiring first voiceprint information of the first human voice signal; and
in response to the first voiceprint information matching second voiceprint information preset in the electronic device, executing the operation corresponding to the first human voice signal based on the first human voice signal.
In another possible implementation manner, before performing an operation corresponding to the first vocal signal based on the first vocal signal, the method further includes:
in response to no second human voice signal being collected within a first preset duration before the first human voice signal is collected, executing the operation corresponding to the first human voice signal based on the first human voice signal; or,
in response to a third human voice signal being collected within a second preset duration before the first human voice signal is collected and third voiceprint information of the third human voice signal matching first voiceprint information of the first human voice signal, executing the operation corresponding to the first human voice signal based on the first human voice signal.
In another possible implementation manner, before performing an operation corresponding to the first vocal signal based on the first vocal signal, the method further includes:
acquiring a first time at which a display screen of the electronic device was last touched;
determining a time difference between a current second time and the first time;
and executing the operation corresponding to the first human voice signal based on the first human voice signal in response to the time difference being greater than a third preset time length.
In another possible implementation manner, the performing, by the voice control module, an operation corresponding to the first vocal signal based on the first vocal signal includes:
sending the first human voice signal to a server through the voice control module, and receiving an operation instruction corresponding to the first human voice signal returned by the server;
and executing the target operation corresponding to the operation instruction.
In the embodiments of the present disclosure, sound collection is performed through the sound collection device in a scenario where the voice control module of the electronic device is in a dormant state and the electronic device is currently connected to the sound collection device; in response to the sound collection device collecting a first human voice signal, the first human voice signal sent by the sound collection device is received, the voice control module is woken up, and the operation corresponding to the first human voice signal is executed through the voice control module based on the first human voice signal. The voice control module of the electronic device is woken up directly upon receiving the first human voice signal collected by the sound collection device, without requiring a specific wake-up word. This simplifies the voice control process, improves voice control efficiency, and makes voice control more natural. Moreover, after collecting the first human voice signal, the sound collection device reports it directly to the electronic device, so the sound collection device does not need to match the signal against operation commands. This solves the problem of poor voice control caused by the sound collection device being able to match only a small number of operation commands, and improves the voice control effect.
Fig. 3 is a flowchart of another voice control method provided in the embodiments of the present disclosure. In this embodiment, the description takes as an example the case where intention recognition of the first human voice signal is performed by the electronic device. Referring to fig. 3, this embodiment includes the following steps:
step 301, in a scenario that a voice control module of the electronic device is in a dormant state and the electronic device is currently connected to a sound collection device, the sound collection device collects sound.
In this step, regardless of whether the sound collection device is in a sleep state or an awake state, the sound collection device collects sound and, after collecting a sound signal, determines whether the sound signal is a human voice signal. In response to the sound signal being a human voice signal, the collected signal is referred to as the first human voice signal for ease of distinction, and step 302 is performed. In response to the sound signal not being a human voice signal, the sound collection device may continue to collect sound until the first human voice signal is collected, and step 302 is then performed.
In one possible implementation, the sound collection device may determine whether the sound signal is a human voice signal through fourth voiceprint information of the sound signal. Accordingly, the step of the sound collection device determining whether the sound signal is a human voice signal may be: the sound collection device acquires fourth voiceprint information of the sound signal and determines the type of the fourth voiceprint information; in response to the type of the fourth voiceprint information being human, it determines that the sound signal is a human voice signal; in response to the type of the fourth voiceprint information not being human, it determines that the sound signal is not a human voice signal.
In another possible implementation, the sound collection device may determine whether the sound signal is a human voice signal by determining whether the sound signal includes a natural language signal. Accordingly, the step of the sound collection device determining whether the sound signal is a human voice signal may be: the sound collection device performs natural language signal detection on the sound signal; in response to a natural language signal being detected, it determines that the sound signal is a human voice signal; in response to no natural language signal being detected, it determines that the sound signal is not a human voice signal. The natural language signal may be in Chinese, English, Japanese, or another language.
In the embodiment of the present disclosure, because a natural language signal belongs to the human voice signals uttered by people, collecting it improves the effectiveness of the first human voice signal and the intelligence of the voice control of the electronic device.
In one possible implementation, referring to fig. 4, the sound collection device may collect sound through VAD (Voice Activity Detection). VAD can identify and eliminate silent periods in the sound signal so that only the voice signal in non-silent periods is obtained, which improves the effectiveness of the collected human voice signal and also saves the bandwidth used to transmit the human voice signal to the electronic device.
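As an illustration only, the following sketch approximates VAD-style filtering on the sound collection device with a simple short-time energy threshold; the frame size and threshold are assumed example values, and a real implementation would use a proper VAD algorithm.

```python
# Hypothetical sketch of VAD-style filtering: only frames in non-silent periods
# are kept and forwarded to the electronic device, saving bandwidth.

import struct

FRAME_BYTES = 320          # 10 ms of 16-bit mono PCM at 16 kHz (assumption)
ENERGY_THRESHOLD = 500.0   # silence threshold; a tuning value, assumed

def frame_rms(frame: bytes) -> float:
    """Root-mean-square energy of one 16-bit little-endian PCM frame."""
    samples = struct.unpack("<%dh" % (len(frame) // 2), frame)
    return (sum(s * s for s in samples) / max(len(samples), 1)) ** 0.5

def active_frames(pcm_frames):
    """Yield only frames whose energy exceeds the silence threshold."""
    for frame in pcm_frames:
        if frame_rms(frame) > ENERGY_THRESHOLD:
            yield frame
```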
It should be noted that the electronic device includes a multi-dimensional sensor, and the electronic device may determine whether the electronic device is currently connected to the sound collection device through the multi-dimensional sensor.
The electronic equipment can determine that the voice control module is in a dormant state through the state of the voice control module; the electronic device can also determine that the voice control module is in the dormant state according to the state of the electronic device.
In one possible implementation, in response to the voice control module not currently executing an operation corresponding to a human voice signal, the electronic device determines that the voice control module of the electronic device is in a dormant state.
In the embodiment of the present disclosure, sound collection in the current environment is performed only when the voice control module is not executing an operation corresponding to a human voice signal, which prevents the newly collected human voice signal from conflicting with an operation the electronic device is already executing and improves the intelligence of sound collection.
In another possible implementation, the electronic device determines that its voice control module is in a dormant state in response to the electronic device itself being in a dormant state. The electronic device being in a dormant state may mean that its display screen is in a screen-off state, or that the electronic device is not currently performing an operation.
In the embodiment of the present disclosure, performing sound collection in the current environment only while the electronic device is dormant prevents the collected human voice signal from conflicting with an operation the electronic device is executing, and improves the intelligence of sound collection.
In another possible implementation, the electronic device being in a dormant state may mean that it is not currently running an audio application; for example, the electronic device is not currently playing a song.
In the embodiment of the present disclosure, performing sound collection in the current environment only while the electronic device is dormant prevents the collected human voice signal from conflicting with an audio application that the electronic device is running, and improves the intelligence of sound collection.
In the embodiment of the present disclosure, the electronic device may acquire, through a target collection module, the sound signal in the environment where it is currently located. That sound signal may be a sound wave signal generated by the vibration of any object: for example, a human voice signal uttered by a living being, or a sound signal generated by object friction, air flow, and the like.
Step 302, the sound collection device collects a first vocal signal and sends the first vocal signal to the electronic device.
In one possible implementation, the sound collection device collects the first human voice signal and sends the complete first human voice signal to the electronic device. In another possible implementation, the sound collection device may send the first human voice signal to the electronic device in real time during collection. Accordingly, the step may be: when the sound collection device collects a sound signal and determines that it is a human voice signal, it transmits the audio to the electronic device, and when a subsequent sound signal is detected not to be a human voice signal, it stops the transmission.
After the sound collection device collects the first human voice signal, it sends the signal directly to the electronic device; the sound collection device therefore does not need to be woken up by a specific wake-up word, which simplifies the voice control process, improves voice control efficiency, and makes voice control more natural.
It should be noted that the judgment of the human voice signal may also be performed by the electronic device, that is, the sound signal is collected by the sound collecting device and then sent to the electronic device. Accordingly, step 302 may be replaced with: the sound collection device collects a sound signal and sends the sound signal to the electronic device.
Step 303, the electronic device receives the first personal sound signal sent by the sound collection device, and wakes up the voice control module.
In response to the sound signal being collected and sent to the electronic device by the sound collection device, this step may be: the electronic device receives the sound signal sent by the sound collection device and determines whether the sound signal is a human voice signal; in response to the sound signal being a human voice signal, it wakes up the voice control module; in response to the sound signal not being a human voice signal, the voice control module remains in the dormant state.
After the electronic device wakes up the voice control module, intention recognition of the first human voice signal can be performed directly, that is, step 304 is performed. In another possible implementation, with continued reference to fig. 4, the electronic device may first verify whether the first human voice signal is the voice of the owner of the electronic device; in response to the first human voice signal being the owner's voice, step 304 is performed; in response to the first human voice signal not being the owner's voice, the first human voice signal is discarded.
The step of the electronic device verifying whether the first human voice signal is the owner's voice may be: the electronic device acquires first voiceprint information of the first human voice signal and compares it with second voiceprint information preset in the electronic device; in response to the first voiceprint information matching the second voiceprint information, it determines that the first human voice signal is the owner's voice; in response to the first voiceprint information not matching the second voiceprint information, the first human voice signal is discarded.
In the embodiment of the present disclosure, matching the first voiceprint information of the first human voice signal against the second voiceprint information preset in the electronic device improves the security of the electronic device.
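As an illustration only, the following sketch shows one way the owner check could be implemented, assuming the first and second voiceprint information are embedding vectors compared by cosine similarity; the similarity measure and threshold are assumptions, since the disclosure does not fix a matching method.

```python
# Hypothetical sketch of the owner-verification step: the collected first
# voiceprint is compared with the preset second voiceprint.

import math

def cosine_similarity(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def is_owner(first_voiceprint, preset_voiceprint, threshold=0.8) -> bool:
    """Return True if the collected voiceprint matches the preset one."""
    return cosine_similarity(first_voiceprint, preset_voiceprint) >= threshold

def handle_voice_signal(first_voiceprint, preset_voiceprint, signal, execute) -> None:
    """`execute` is the callback that runs step 304; the signal is discarded
    when the voiceprints do not match."""
    if is_owner(first_voiceprint, preset_voiceprint):
        execute(signal)
```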
In another possible implementation, in a scenario where the electronic device is connected to the sound collection device, if the user has not been speaking and then speaks suddenly, the speech is likely intended to control the electronic device by voice; if the user has been speaking continuously, the user is likely chatting with others, and the electronic device may discard the first human voice signal. The process may be: in response to no second human voice signal being collected within a first preset duration before the first human voice signal is collected, the electronic device determines that the first human voice signal is intended to trigger it to perform an operation, and step 304 is performed.
The first preset duration may be any value between 1 s and 10 s, for example, 3 s, 4 s, or 5 s. In the embodiment of the present disclosure, the first preset duration is not specifically limited and may be set and changed as needed. The second human voice signal may be a voice signal uttered by the user or by a person other than the user.
In the embodiment of the present disclosure, in response to no human voice being collected within the first preset duration before the first human voice signal is collected, the electronic device determines that the first human voice signal is not the user's chat content or the user talking to themselves, which improves the effectiveness of the first human voice signal and increases the efficiency of voice control.
In another possible implementation, in a scenario where the electronic device is connected to the sound collection device, if the user is chatting with other users, the electronic device detects human voice signals that do not come from the same user and discards the first human voice signal; if the user is not chatting with other users, the electronic device detects voice signals from the same user, and those signals can trigger the electronic device to perform the target operation. The process may be:
in response to a third human voice signal being collected within a second preset duration before the first human voice signal is collected and third voiceprint information of the third human voice signal matching first voiceprint information of the first human voice signal, step 304 is performed; in response to the third voiceprint information of a third human voice signal collected within the second preset duration not matching the first voiceprint information of the first human voice signal, the first human voice signal is discarded.
In the embodiment of the present disclosure, matching the voiceprint information of the first human voice signal against that of the third human voice signal collected by the electronic device determines that the first human voice signal is not the user's chat content, which improves the effectiveness of the first human voice signal and increases the efficiency of voice control.
In another possible implementation, in a scenario where the electronic device is connected to the sound collection device, if the user has recently been touching the display screen of the electronic device, it is convenient for the user to trigger operations through the display screen, so speech at that moment is more likely to be chatting than an attempt to trigger the electronic device, and the first human voice signal is discarded. Accordingly, the process may be:
the electronic device acquires the first time at which the display screen was last touched and determines the time difference between the current second time and the first time; in response to the time difference being greater than a third preset duration, step 304 is performed; in response to the time difference not being greater than the third preset duration, the first human voice signal is discarded.
In the embodiment of the present disclosure, the electronic device performs step 304 only when it determines that the first human voice signal is likely intended to trigger an operation, which improves the effectiveness of the first human voice signal and the accuracy of subsequent voice control.
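As an illustration only, the following sketch expresses the three gating checks above as separate predicates; the preset durations are assumed example values, and the disclosure presents these checks as alternative implementations rather than a fixed combination.

```python
# Hypothetical sketch of the gating checks that decide whether the first human
# voice signal should proceed to step 304. All timestamps are in seconds.

FIRST_PRESET_S = 3.0    # e.g. any value between 1 s and 10 s
SECOND_PRESET_S = 5.0   # assumed example value
THIRD_PRESET_S = 10.0   # assumed example value

def no_recent_speech(now: float, last_other_voice_time) -> bool:
    """Check 1: no second human voice signal within the first preset duration."""
    return last_other_voice_time is None or now - last_other_voice_time > FIRST_PRESET_S

def same_speaker_recently(now: float, last_voice_time, voiceprints_match: bool) -> bool:
    """Check 2: a third human voice signal was collected within the second preset
    duration and its voiceprint matches the first voiceprint."""
    return (last_voice_time is not None
            and now - last_voice_time <= SECOND_PRESET_S
            and voiceprints_match)

def screen_not_recently_touched(now: float, last_touch_time) -> bool:
    """Check 3: the time difference since the last screen touch exceeds the
    third preset duration."""
    return last_touch_time is None or now - last_touch_time > THIRD_PRESET_S
```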
In the embodiment of the present disclosure, waking up the voice control module means that the electronic device switches the voice control module from the dormant state to the awake state.
In one possible implementation, the awake state of the voice control module corresponds to the display screen of the electronic device being in a bright-screen state; accordingly, the electronic device waking up the voice control module includes: the electronic device switches its display screen from the screen-off state to the bright-screen state.
In another possible implementation, the awake state of the voice control module is a state in which the electronic device receives the first human voice signal; accordingly, the electronic device waking up the voice control module includes: the electronic device switches from a state of not receiving the first human voice signal to a state of receiving the first human voice signal.
In another possible implementation, the awake state of the voice control module is a state in which the voice control module can execute the operation corresponding to the first human voice signal; accordingly, the electronic device waking up the voice control module includes: the electronic device switches the voice control module from the dormant state to a state in which it can execute the operation corresponding to the first human voice signal.
Step 304, in response to the first human voice signal including a command word, the electronic device performs intention recognition on the first human voice signal through the voice control module to obtain intention information of the first human voice signal.
The electronic device converts the first human voice signal into first text information; performs semantic understanding on the first text information to obtain semantic information of the first human voice signal; determines, according to the semantic information, whether the first human voice signal includes a command word; and, in response to the first human voice signal including a command word, determines the intention information of the first human voice signal based on the semantic information. In response to the first human voice signal not including a command word, the electronic device discards the first human voice signal.
In one possible implementation, the electronic device may perform semantic understanding on the first text information through NLP (Natural Language Processing). Accordingly, the electronic device performing semantic understanding on the first text information to obtain the semantic information of the first human voice signal includes: the electronic device sends the first text information to the NLP module; the NLP module receives the first text information, performs semantic understanding on it, determines the corresponding semantic information, and returns the semantic information to the electronic device; the electronic device receives the semantic information returned by the NLP module, thereby obtaining the semantic information of the first human voice signal.
In another possible implementation, with continued reference to fig. 4, the electronic device performs speech recognition on the first human voice signal through an ASR (Automatic Speech Recognition) engine, which converts speech into text information. Accordingly, the electronic device converting the first human voice signal into the first text information includes: the electronic device sends the first human voice signal to the ASR engine; the ASR engine receives the first human voice signal and converts it into the first text information; and the electronic device acquires the first text information.
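As an illustration only, the following sketch shows the step 304 pipeline, assuming an ASR engine exposing a transcribe method and an NLP module exposing an understand method that returns a dictionary with the detected words and intention; these interfaces and the command lexicon are assumptions, not part of the disclosure.

```python
# Hypothetical sketch of step 304: ASR converts the first human voice signal to
# text, NLP produces semantic/intention information, and the signal is
# discarded when no command word is found.

COMMAND_LEXICON = {"navigation", "order", "taxi", "route query"}  # illustrative

def recognize_intention(voice_signal, asr_engine, nlp_engine):
    first_text = asr_engine.transcribe(voice_signal)            # speech -> text
    semantics = nlp_engine.understand(first_text)                # text -> semantics
    command_words = [w for w in semantics.get("words", []) if w in COMMAND_LEXICON]
    if not command_words:
        return None                                              # discard the signal
    return semantics.get("intention")                            # intention information
```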
In the embodiment of the present disclosure, when the semantic information of the first human voice signal includes a command word, the electronic device determines that the first human voice signal is intended to trigger an operation, which improves the effectiveness of the first human voice signal and the efficiency of voice control.
In one possible implementation, after converting the first human voice signal into the first text information, the electronic device may directly perform semantic understanding on the first text information to obtain the semantic information; or it may remove invalid words from the first text information to obtain second text information and perform semantic understanding on the second text information to obtain the semantic information.
In one possible implementation, the invalid words include isolated words; correspondingly, the electronic device removing the invalid words from the first text information includes: the electronic device detects isolated words in the first text information and, in response to detecting an isolated word, removes it to obtain the second text information.
In another possible implementation, the invalid words include filler words (modal particles); correspondingly, the electronic device removing the invalid words from the first text information includes: the electronic device detects filler words in the first text information and, in response to detecting a filler word, removes it to obtain the second text information.
In the embodiment of the present disclosure, removing the invalid words from the first text information improves the validity of the first text information and increases the efficiency of voice control.
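As an illustration only, the following sketch removes invalid words before semantic understanding; the word lists stand in for the isolated words and filler words mentioned above and are assumptions for illustration.

```python
# Hypothetical sketch of producing the second text information by removing
# invalid words (isolated words and filler words) from the first text information.

FILLER_WORDS = {"um", "uh", "er", "hmm"}    # filler/modal words (assumed list)
ISOLATED_WORDS = {"the", "a", "an"}         # isolated words (assumed list)

def remove_invalid_words(first_text: str) -> str:
    """Return the second text information with invalid words removed."""
    kept = [w for w in first_text.split()
            if w.lower() not in FILLER_WORDS | ISOLATED_WORDS]
    return " ".join(kept)
```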
In one possible implementation, with continued reference to fig. 4, the electronic device determines whether the first human voice signal includes a command word and performs intention recognition on the first human voice signal only if it does; if no command word is included, the first human voice signal is discarded.
In one possible implementation, as long as the first human voice signal includes a command word, the electronic device performs intention recognition on the first human voice signal through the voice control module to obtain the intention information of the first human voice signal. In another possible implementation, the electronic device performs intention recognition through the voice control module only if the first human voice signal includes a command word from the command word library of the electronic device.
The command word library of the electronic device includes the command words required for at least one application of the electronic device to execute an operation. For example, if the electronic device includes a food delivery application, the command word library includes "order"; if the electronic device includes a taxi-hailing application, it includes "taxi"; if the electronic device includes a navigation application, it includes "navigation", "route query", and so on.
In the embodiment of the present disclosure, intention recognition is performed only if the first human voice signal includes a command word from the command word library of the electronic device. This takes the application situation of the electronic device into account, improves the effectiveness of the first human voice signal, and improves the accuracy of subsequent voice control.
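As an illustration only, the following sketch builds a command word library from the command words required by the applications installed on the electronic device, following the examples above; the application names and words are assumptions.

```python
# Hypothetical sketch of a command word library assembled from installed
# applications, and a simple membership check used before intention recognition.

APP_COMMAND_WORDS = {
    "food_delivery_app": {"order"},
    "taxi_app": {"taxi"},
    "navigation_app": {"navigation", "route query"},
}

def build_command_lexicon(installed_apps):
    """Union of the command words for all installed applications."""
    lexicon = set()
    for app in installed_apps:
        lexicon |= APP_COMMAND_WORDS.get(app, set())
    return lexicon

def contains_command_word(text: str, lexicon) -> bool:
    return any(word in text for word in lexicon)
```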
Step 305: in response to the intention information being used to trigger the electronic device to perform a target operation, the electronic device performs the target operation.
In this step, with continued reference to fig. 4, the intention information may be a chatting or fallback intention, or an intention to trigger the electronic device to perform a target operation. Therefore, after acquiring the intention information, the electronic device determines whether the intention information is used to trigger it to perform a target operation; in response to the intention information being used to trigger the target operation, the electronic device performs the target operation; in response to the intention information not being used to trigger the target operation, the first human voice signal is discarded.
In one possible implementation, the electronic device may determine whether the intention information is used to trigger the target operation according to whether it can execute the operation corresponding to the intention information, that is, whether the corresponding application program is installed. Accordingly, the step of determining whether the intention information is used to trigger the electronic device to perform the target operation may be: the electronic device determines, according to the intention information, a target application program for executing the intention information; in response to the target application program being included on the electronic device, it determines that the intention information is used to trigger the target operation; in response to the target application program not being included on the electronic device, it determines that the intention information is not used to trigger the target operation.
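As an illustration only, the following sketch maps intention information to a target application and allows the target operation only when that application is installed; the mapping and names are assumptions.

```python
# Hypothetical sketch of step 305's check: derive the target application from
# the intention information and verify that it is installed.

INTENTION_TO_APP = {
    "order_food": "food_delivery_app",
    "call_taxi": "taxi_app",
    "query_route": "navigation_app",
}

def target_application_for(intention, installed_apps):
    """Return the target application that should execute the intention, or
    None when the intention does not trigger the electronic device."""
    app = INTENTION_TO_APP.get(intention)
    if app is not None and app in installed_apps:
        return app
    return None   # chatting/fallback intentions, or missing application: discard
```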
In the embodiment of the present disclosure, the electronic device is triggered to perform an operation only when the intention information of the first human voice signal is intended to trigger one; first human voice signals with chatting, fallback, and similar intentions are discarded. This improves the effectiveness of the first human voice signal and the efficiency of voice control.
In the embodiments of the present disclosure, sound collection is performed through the sound collection device in a scenario where the voice control module of the electronic device is in a dormant state and the electronic device is currently connected to the sound collection device; in response to the sound collection device collecting a first human voice signal, the first human voice signal sent by the sound collection device is received, the voice control module is woken up, and the operation corresponding to the first human voice signal is executed through the voice control module based on the first human voice signal. The voice control module of the electronic device is woken up directly upon receiving the first human voice signal collected by the sound collection device, without requiring a specific wake-up word. This simplifies the voice control process, improves voice control efficiency, and makes voice control more natural. Moreover, after collecting the first human voice signal, the sound collection device reports it directly to the electronic device, so the sound collection device does not need to match the signal against operation commands. This solves the problem of poor voice control caused by the sound collection device being able to match only a small number of operation commands, and improves the voice control effect.
Fig. 5 is a flowchart of another voice control method provided in the embodiments of the present disclosure. In this embodiment, the description takes as an example the case where intention recognition of the first human voice signal is performed by the server. Referring to fig. 5, this embodiment includes the following steps:
step 501, in a scene that a voice control module of the electronic device is in a dormant state and the electronic device is currently connected with a sound collection device, the sound collection device collects sound.
This step is the same as step 301, and is not described herein again.
Step 502, the sound collection device collects a first vocal signal and sends the first vocal signal to the electronic device.
This step is the same as step 302 and will not be described herein again.
Step 503, the electronic device receives the first personal sound signal sent by the sound collection device, and wakes up the voice control module.
This step is the same as step 303 and is not described herein again.
Step 504, the electronic device sends the first personal sound signal to the server through the voice control module.
Step 505, the server receives the first vocal signal and determines an operation instruction corresponding to the first vocal signal.
The steps of determining the operation instruction of the first vocal signal by the server and determining the operation instruction corresponding to the first vocal signal by the electronic device are similar, and are not described herein again.
Step 506, the server sends the operation instruction to the electronic device.
In step 507, the electronic device receives the operation instruction and executes a target operation corresponding to the operation instruction.
In the embodiment of the present disclosure, the server recognizes the first human voice signal to obtain the operation instruction and returns the operation instruction to the electronic device, which executes the target operation according to it. This not only saves the resources of the electronic device but can also improve accuracy.
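As an illustration only, the following sketch shows steps 504 to 507 from the electronic device side, assuming the first human voice signal is posted to a server endpoint that returns an operation instruction as JSON; the URL, fields, and transport are assumptions, not defined by the disclosure.

```python
# Hypothetical sketch of steps 504-507: forward the first human voice signal to
# the server and perform the target operation named in the returned instruction.

import json
import urllib.request

SERVER_URL = "http://example.com/voice/recognize"   # placeholder endpoint

def execute_via_server(first_voice_signal: bytes) -> dict:
    request = urllib.request.Request(
        SERVER_URL,
        data=first_voice_signal,
        headers={"Content-Type": "application/octet-stream"},
    )
    with urllib.request.urlopen(request) as response:       # steps 504-506
        instruction = json.loads(response.read().decode("utf-8"))
    perform_target_operation(instruction)                     # step 507
    return instruction

def perform_target_operation(instruction: dict) -> None:
    # Placeholder: dispatch to the application indicated by the instruction.
    print("executing:", instruction.get("operation"))
```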
Fig. 6 is a block diagram of a voice control apparatus according to an embodiment of the present disclosure. Referring to fig. 6, the apparatus includes:
the sound collection module 601 is configured to collect sound through the sound collection device in a scene that the voice control module of the electronic device is in a dormant state and the electronic device is currently connected with the sound collection device.
And the awakening module 602 is configured to respond to the first personal sound signal acquired by the sound acquisition device, receive the first personal sound signal sent by the sound acquisition device, and awaken the voice control module.
The execution module 603 is configured to execute, by the voice control module, the operation corresponding to the first human voice signal based on the first human voice signal.
In a possible implementation manner, the execution module 603 is further configured to, in response to the first human voice signal including a command word, perform intent recognition on the first human voice signal to obtain intent information of the first human voice signal;
and, in response to the intent information being used to trigger the electronic device to execute a target operation, execute the target operation.
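As an illustration of this implementation, the following sketch shows how the execution module might check for a command word in the transcribed first human voice signal and perform intent recognition on it. The command-word list, the recognize_intent function, and the intent-information fields are hypothetical examples, not definitions from the disclosure.

```python
# Hypothetical command-word list; the disclosure does not enumerate command words.
COMMAND_WORDS = {"open", "play", "call", "turn on", "turn off"}


def contains_command_word(text: str) -> bool:
    """Check whether the transcribed human voice signal contains a command word."""
    return any(word in text.lower() for word in COMMAND_WORDS)


def recognize_intent(text: str) -> dict:
    """Placeholder intent recognition: map the utterance to intent information."""
    if "play" in text.lower():
        return {"intent": "play_media", "triggers_device_operation": True}
    return {"intent": "chitchat", "triggers_device_operation": False}


def handle_utterance(transcribed_text: str) -> None:
    if not contains_command_word(transcribed_text):
        return  # no command word, so this human voice signal is not acted upon
    intent_info = recognize_intent(transcribed_text)
    if intent_info["triggers_device_operation"]:
        print("executing target operation for intent:", intent_info["intent"])
```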
In another possible implementation manner, the execution module 603 is further configured to determine, according to the intent information, a target application program for executing the intent information; and, in response to the target application program being included on the electronic device, determine that the intent information is used to trigger the electronic device to execute the target operation.
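The following sketch illustrates this implementation under the assumption that intent information can be mapped to a target application through a lookup table; the INTENT_TO_APP table and the installed-application set are hypothetical.

```python
# Hypothetical mapping from intent to the target application that can execute it.
INTENT_TO_APP = {
    "play_media": "com.example.music",
    "navigate": "com.example.maps",
}


def intent_triggers_target_operation(intent: str, installed_apps: set) -> bool:
    """The intent triggers a device operation only when its target application
    is included on the electronic device."""
    target_app = INTENT_TO_APP.get(intent)
    return target_app is not None and target_app in installed_apps


# Usage (illustrative):
# intent_triggers_target_operation("play_media", {"com.example.music"})  -> True
# intent_triggers_target_operation("navigate", {"com.example.music"})    -> False
```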
In another possible implementation, referring to fig. 7, the apparatus further includes:
an obtaining module 604, configured to obtain first voiceprint information of the first human voice signal;
The execution module 603 is further configured to, in response to the first voiceprint information matching second voiceprint information preset in the electronic device, perform the step of executing the operation corresponding to the first human voice signal based on the first human voice signal.
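The disclosure does not prescribe how the first voiceprint information is compared with the preset second voiceprint information; one common, assumed approach is to compare voiceprint embedding vectors with cosine similarity against a threshold, as sketched below. The threshold value is purely illustrative.

```python
import math

MATCH_THRESHOLD = 0.8  # assumed value for illustration


def cosine_similarity(a: list, b: list) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def voiceprints_match(first_voiceprint: list, preset_voiceprint: list) -> bool:
    """Return True when the collected voiceprint matches the preset voiceprint."""
    return cosine_similarity(first_voiceprint, preset_voiceprint) >= MATCH_THRESHOLD
```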
In another possible implementation manner, the execution module 603 is further configured to, in response to no second human voice signal being collected within a first preset duration before the first human voice signal is collected, perform the step of executing the operation corresponding to the first human voice signal based on the first human voice signal; or,
in response to a third human voice signal being collected within a second preset duration before the first human voice signal is collected and third voiceprint information of the third human voice signal matching the first voiceprint information of the first human voice signal, perform the step of executing the operation corresponding to the first human voice signal based on the first human voice signal.
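A minimal sketch of these two alternative pre-conditions is given below. The preset durations are assumed values, and whether the earlier voiceprint matches is passed in as a precomputed flag rather than recomputed here.

```python
from typing import Optional

FIRST_PRESET_DURATION = 5.0    # seconds, assumed for illustration
SECOND_PRESET_DURATION = 10.0  # seconds, assumed for illustration


def should_execute(now: float,
                   previous_signal_time: Optional[float],
                   previous_signal_matches_voiceprint: bool) -> bool:
    """Decide whether the operation for the first human voice signal should be executed."""
    # Case 1: no other human voice signal within the first preset duration before
    # the first human voice signal, so the utterance is likely addressed to the device.
    if previous_signal_time is None or now - previous_signal_time > FIRST_PRESET_DURATION:
        return True
    # Case 2: a third human voice signal within the second preset duration whose
    # voiceprint matches that of the first human voice signal, i.e. the same user
    # is continuing to speak to the device.
    if now - previous_signal_time <= SECOND_PRESET_DURATION and previous_signal_matches_voiceprint:
        return True
    return False


# Usage (hypothetical):
# should_execute(now=100.0, previous_signal_time=92.0,
#                previous_signal_matches_voiceprint=True)  -> True
```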
In another possible implementation manner, the execution module 603 is further configured to obtain a first time at which the display screen of the electronic device was last touched, determine a time difference between a current second time and the first time, and, in response to the time difference being greater than a third preset duration, perform the step of executing the operation corresponding to the first human voice signal based on the first human voice signal.
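The screen-touch check of this implementation can be sketched as follows; the intuition is that if the display screen was touched recently, the user is probably operating the device by hand, so the human voice signal may not be a voice command. The third preset duration used here is an assumed value for illustration.

```python
import time

THIRD_PRESET_DURATION = 30.0  # seconds, assumed for illustration


def idle_long_enough(last_touch_time: float) -> bool:
    """Return True when the time since the last screen touch exceeds the third preset duration."""
    second_time = time.time()                     # current second time
    time_difference = second_time - last_touch_time
    return time_difference > THIRD_PRESET_DURATION
```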
In another possible implementation manner, the execution module 603 is further configured to send the first human voice signal to the server through the voice control module, receive an operation instruction corresponding to the first human voice signal returned by the server, and execute the target operation corresponding to the operation instruction.
In the embodiment of the present disclosure, sound collection is performed by the sound collection device in a scene where the voice control module of the electronic device is in a dormant state and the electronic device is currently connected to the sound collection device; in response to the sound collection device collecting a first human voice signal, the first human voice signal sent by the sound collection device is received, the voice control module is woken up, and the operation corresponding to the first human voice signal is executed by the voice control module based on the first human voice signal. The voice control module of the electronic device is woken up directly upon receiving the first human voice signal collected by the sound collection device, without requiring a specific wake-up word. This shortens the voice control procedure, improves the efficiency of voice control, and makes voice control more natural. In addition, after collecting the first human voice signal, the sound collection device reports it directly to the electronic device, so the sound collection device does not need to match the signal against operation commands. This solves the problem of poor voice control efficiency caused by the sound collection device being able to match only a small number of operation commands, and improves the voice control effect.
All the above optional technical solutions may be combined arbitrarily to form the optional embodiments of the present disclosure, and are not described herein again.
It should be noted that, when the voice control apparatus provided in the foregoing embodiment performs voice control, the division into the above functional modules is merely used as an example. In practical applications, the above functions may be allocated to different functional modules as needed; that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. In addition, the voice control apparatus provided in the foregoing embodiment and the voice control method embodiments belong to the same concept; for the specific implementation process, reference may be made to the method embodiments, and details are not described herein again.
Fig. 8 shows a block diagram of an electronic device 800 according to an exemplary embodiment of the present disclosure. The electronic device 800 may be a smart phone, a tablet computer, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a notebook computer, or a desktop computer. The electronic device 800 may also be referred to by other names, such as user equipment, portable electronic device, laptop electronic device, or desktop electronic device.
In general, the electronic device 800 includes: a processor 801 and a memory 802.
The processor 801 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and so forth. The processor 801 may be implemented in at least one hardware form of a DSP (Digital Signal Processing), an FPGA (Field-Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 801 may also include a main processor and a coprocessor, where the main processor is a processor for processing data in an awake state, and is also called a Central Processing Unit (CPU); a coprocessor is a low power processor for processing data in a standby state. In some embodiments, the processor 801 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content required to be displayed on the display screen. In some embodiments, the processor 801 may further include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.
Memory 802 may include one or more computer-readable storage media, which may be non-transitory. Memory 802 may also include high speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 802 is used to store at least one instruction for execution by processor 801 to implement the voice control methods provided by method embodiments in the present disclosure.
In some embodiments, the electronic device 800 may further optionally include: a peripheral interface 803 and at least one peripheral. The processor 801, memory 802 and peripheral interface 803 may be connected by bus or signal lines. Various peripheral devices may be connected to peripheral interface 803 by a bus, signal line, or circuit board. Specifically, the peripheral device includes: at least one of a radio frequency circuit 804, a touch screen display 805, a camera assembly 806, an audio circuit 807, a positioning assembly 808, and a power supply 809.
The peripheral interface 803 may be used to connect at least one peripheral related to I/O (Input/Output) to the processor 801 and the memory 802. In some embodiments, the processor 801, memory 802, and peripheral interface 803 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 801, the memory 802, and the peripheral interface 803 may be implemented on separate chips or circuit boards, which are not limited by this embodiment.
The radio frequency circuit 804 is used for receiving and transmitting RF (Radio Frequency) signals, also called electromagnetic signals. The radio frequency circuit 804 communicates with communication networks and other communication devices via electromagnetic signals. The radio frequency circuit 804 converts an electrical signal into an electromagnetic signal for transmission, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 804 includes: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so forth. The radio frequency circuit 804 may communicate with other electronic devices via at least one wireless communication protocol. The wireless communication protocols include, but are not limited to: metropolitan area networks, mobile communication networks of various generations (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 804 may also include NFC (Near Field Communication) related circuits, which is not limited in this disclosure.
The display screen 805 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display screen 805 is a touch display screen, the display screen 805 also has the ability to capture touch signals on or above its surface. The touch signal may be input to the processor 801 as a control signal for processing. At this time, the display screen 805 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, there may be one display screen 805, disposed on the front panel of the electronic device 800; in other embodiments, there may be at least two display screens 805, respectively disposed on different surfaces of the electronic device 800 or in a folding design; in still other embodiments, the display screen 805 may be a flexible display screen disposed on a curved surface or a folded surface of the electronic device 800. The display screen 805 may even be arranged in a non-rectangular irregular pattern, that is, an irregularly-shaped screen. The display screen 805 may be made of materials such as LCD (Liquid Crystal Display) or OLED (Organic Light-Emitting Diode).
The camera assembly 806 is used to capture images or video. Optionally, the camera assembly 806 includes a front camera and a rear camera. Generally, the front camera is disposed on the front panel of the electronic device, and the rear camera is disposed on the rear surface of the electronic device. In some embodiments, there are at least two rear cameras, each being any one of a main camera, a depth-of-field camera, a wide-angle camera, and a telephoto camera, so that the main camera and the depth-of-field camera can be fused to realize a background blurring function, and the main camera and the wide-angle camera can be fused to realize panoramic shooting, VR (Virtual Reality) shooting, or other fusion shooting functions. In some embodiments, the camera assembly 806 may also include a flash. The flash may be a single-color-temperature flash or a dual-color-temperature flash. A dual-color-temperature flash is a combination of a warm-light flash and a cold-light flash, and may be used for light compensation at different color temperatures.
The audio circuit 807 may include a microphone and a speaker. The microphone is used for collecting sound waves of a user and the environment, converting the sound waves into electric signals, and inputting the electric signals to the processor 801 for processing or inputting the electric signals to the radio frequency circuit 804 to realize voice communication. For the purpose of stereo sound collection or noise reduction, a plurality of microphones may be provided at different portions of the electronic device 800. The microphone may also be an array microphone or an omni-directional pick-up microphone. The speaker is used to convert electrical signals from the processor 801 or the radio frequency circuit 804 into sound waves. The loudspeaker can be a traditional film loudspeaker or a piezoelectric ceramic loudspeaker. When the speaker is a piezoelectric ceramic speaker, the speaker can be used for purposes such as converting an electric signal into a sound wave audible to a human being, or converting an electric signal into a sound wave inaudible to a human being to measure a distance. In some embodiments, the audio circuitry 807 may also include a headphone jack.
The positioning component 808 is configured to determine the current geographic location of the electronic device 800 to implement navigation or LBS (Location Based Service). The positioning component 808 may be a positioning component based on the GPS (Global Positioning System) of the United States, the BeiDou system of China, the GLONASS system of Russia, or the Galileo system of the European Union.
The power supply 809 is used to supply power to the various components in the electronic device 800. The power supply 809 may use alternating current, direct current, disposable batteries, or rechargeable batteries. When the power supply 809 includes a rechargeable battery, the rechargeable battery may support wired or wireless charging. The rechargeable battery may also be used to support fast charging technology.
In some embodiments, the electronic device 800 also includes one or more sensors 810. The one or more sensors 810 include, but are not limited to: acceleration sensor 811, gyro sensor 812, pressure sensor 813, fingerprint sensor 814, optical sensor 815 and proximity sensor 816.
The acceleration sensor 811 may detect the magnitude of acceleration in three coordinate axes of a coordinate system established with the electronic device 800. For example, the acceleration sensor 811 may be used to detect the components of the gravitational acceleration in three coordinate axes. The processor 801 may control the touch screen 805 to display the user interface in a landscape view or a portrait view according to the gravitational acceleration signal collected by the acceleration sensor 811. The acceleration sensor 811 may also be used for acquisition of motion data of a game or a user.
The gyro sensor 812 may detect a body direction and a rotation angle of the electronic device 800, and the gyro sensor 812 may cooperate with the acceleration sensor 811 to acquire a 3D motion of the user on the electronic device 800. From the data collected by the gyro sensor 812, the processor 801 may implement the following functions: motion sensing (such as changing the UI according to a user's tilting operation), image stabilization at the time of photographing, game control, and inertial navigation.
Pressure sensors 813 may be disposed on the side bezel of electronic device 800 and/or underneath touch display 805. When the pressure sensor 813 is disposed on the side frame of the electronic device 800, the holding signal of the user to the electronic device 800 can be detected, and the processor 801 performs left-right hand recognition or shortcut operation according to the holding signal collected by the pressure sensor 813. When the pressure sensor 813 is disposed at a lower layer of the touch display screen 805, the processor 801 controls the operability control on the UI interface according to the pressure operation of the user on the touch display screen 805. The operability control comprises at least one of a button control, a scroll bar control, an icon control and a menu control.
The fingerprint sensor 814 is used to collect a fingerprint of the user, and the processor 801 identifies the identity of the user according to the fingerprint collected by the fingerprint sensor 814, or the fingerprint sensor 814 identifies the identity of the user according to the collected fingerprint. Upon identifying that the user's identity is a trusted identity, the processor 801 authorizes the user to perform relevant sensitive operations, including unlocking the screen, viewing encrypted information, downloading software, making payments, changing settings, and the like. The fingerprint sensor 814 may be disposed on the front, back, or side of the electronic device 800. When a physical button or a vendor logo is provided on the electronic device 800, the fingerprint sensor 814 may be integrated with the physical button or the vendor logo.
The optical sensor 815 is used to collect the ambient light intensity. In one embodiment, the processor 801 may control the display brightness of the touch screen 805 based on the ambient light intensity collected by the optical sensor 815. Specifically, when the ambient light intensity is high, the display brightness of the touch display screen 805 is increased; when the ambient light intensity is low, the display brightness of the touch display 805 is turned down. In another embodiment, the processor 801 may also dynamically adjust the shooting parameters of the camera assembly 806 based on the ambient light intensity collected by the optical sensor 815.
The proximity sensor 816, also known as a distance sensor, is typically disposed on the front panel of the electronic device 800. The proximity sensor 816 is used to capture the distance between the user and the front surface of the electronic device 800. In one embodiment, when the proximity sensor 816 detects that the distance between the user and the front surface of the electronic device 800 gradually decreases, the processor 801 controls the touch display screen 805 to switch from the screen-on state to the screen-off state; when the proximity sensor 816 detects that the distance between the user and the front surface of the electronic device 800 gradually increases, the processor 801 controls the touch display screen 805 to switch from the screen-off state to the screen-on state.
Those skilled in the art will appreciate that the configuration shown in fig. 8 does not constitute a limitation of electronic device 800, and may include more or fewer components than shown, or combine certain components, or employ a different arrangement of components.
In embodiments of the present disclosure, there is also provided a computer-readable storage medium, such as a memory, comprising instructions executable by a processor in an electronic device for implementing the method of voice control as in the various embodiments above. For example, the computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
In an embodiment of the present disclosure, a computer program product is also provided, which includes at least one computer program for implementing the method of speech control as in the above embodiments when executed by a processor.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The above description is intended to be exemplary only and not to limit the present disclosure, and any modification, equivalent replacement, or improvement made without departing from the spirit and scope of the present disclosure is to be considered as the same as the present disclosure.

Claims (10)

1. A method for voice control, the method comprising:
performing sound collection through a sound collection device in a scene where a voice control module of an electronic device is in a dormant state and the electronic device is currently connected to the sound collection device;
in response to the sound collection device collecting a first human voice signal, receiving the first human voice signal sent by the sound collection device, and waking up the voice control module; and
executing, by the voice control module, an operation corresponding to the first human voice signal based on the first human voice signal.
2. The method according to claim 1, wherein executing the operation corresponding to the first human voice signal based on the first human voice signal comprises:
in response to the first human voice signal comprising a command word, performing intent recognition on the first human voice signal to obtain intent information of the first human voice signal; and
in response to the intent information being used to trigger the electronic device to execute a target operation, executing the target operation.
3. The method of claim 2, further comprising:
determining, according to the intent information, a target application program for executing the intent information; and
in response to the target application program being included on the electronic device, determining that the intent information is used to trigger the electronic device to execute the target operation.
4. The method according to claim 1, wherein before executing the operation corresponding to the first human voice signal based on the first human voice signal, the method further comprises:
obtaining first voiceprint information of the first human voice signal; and
in response to the first voiceprint information matching second voiceprint information preset in the electronic device, performing the step of executing the operation corresponding to the first human voice signal based on the first human voice signal.
5. The method according to claim 1, wherein before executing the operation corresponding to the first human voice signal based on the first human voice signal, the method further comprises:
in response to no second human voice signal being collected within a first preset duration before the first human voice signal is collected, performing the step of executing the operation corresponding to the first human voice signal based on the first human voice signal; or
in response to a third human voice signal being collected within a second preset duration before the first human voice signal is collected and third voiceprint information of the third human voice signal matching first voiceprint information of the first human voice signal, performing the step of executing the operation corresponding to the first human voice signal based on the first human voice signal.
6. The method according to claim 1, wherein before executing the operation corresponding to the first human voice signal based on the first human voice signal, the method further comprises:
obtaining a first time at which a display screen of the electronic device was last touched;
determining a time difference between a current second time and the first time; and
in response to the time difference being greater than a third preset duration, performing the step of executing the operation corresponding to the first human voice signal based on the first human voice signal.
7. The method according to claim 1, wherein executing, by the voice control module, the operation corresponding to the first human voice signal based on the first human voice signal comprises:
sending the first human voice signal to a server through the voice control module, and receiving an operation instruction corresponding to the first human voice signal returned by the server; and
executing a target operation corresponding to the operation instruction.
8. A voice control apparatus, characterized in that the apparatus comprises:
a sound collection module, configured to perform sound collection through a sound collection device in a scene where a voice control module of an electronic device is in a dormant state and the electronic device is currently connected to the sound collection device;
a wake-up module, configured to, in response to the sound collection device collecting a first human voice signal, receive the first human voice signal sent by the sound collection device and wake up the voice control module; and
an execution module, configured to execute, by the voice control module, an operation corresponding to the first human voice signal based on the first human voice signal.
9. An electronic device, comprising a processor and a memory, wherein at least one instruction is stored in the memory, and wherein the instruction is loaded and executed by the processor to implement the operations performed by the voice control method according to any one of claims 1 to 7.
10. A computer-readable storage medium having stored therein at least one instruction, which is loaded and executed by a processor to perform operations performed by the voice control method of any one of claims 1 to 7.