CN112201246A - Intelligent control method and device based on voice, electronic equipment and storage medium - Google Patents

Intelligent control method and device based on voice, electronic equipment and storage medium

Info

Publication number
CN112201246A
Authority
CN
China
Prior art keywords
voice
keyword
scene
user
control instruction
Prior art date
Legal status
Granted
Application number
CN202011308093.6A
Other languages
Chinese (zh)
Other versions
CN112201246B (en)
Inventor
He Hailiang (何海亮)
Current Assignee
Shenzhen Oribo Technology Co Ltd
Original Assignee
Shenzhen Oribo Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen Oribo Technology Co Ltd filed Critical Shenzhen Oribo Technology Co Ltd
Priority to CN202011308093.6A
Publication of CN112201246A
Application granted
Publication of CN112201246B
Legal status: Active

Classifications

    • G10L 15/22 (Speech recognition): Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 17/06 (Speaker identification or verification): Decision making techniques; pattern matching strategies
    • G10L 17/08: Use of distortion metrics or a particular distance between probe pattern and reference templates
    • G10L 2015/223: Execution procedure of a spoken command

Abstract

The embodiment of the application discloses a voice-based intelligent control method and apparatus, an electronic device, and a storage medium, which relate to the field of voice recognition and comprise the following steps: acquiring mode information of a current scene; detecting whether first voice information input by a user comprises a keyword corresponding to the mode information; when the detection result indicates that the keyword is included, determining a corresponding first control instruction according to the keyword; and executing the first control instruction. According to the embodiment of the application, the mode information of the current scene is obtained, and when the first voice information is detected to comprise the keyword corresponding to the mode information, the first control instruction can be determined according to the keyword. A user can thus input a control instruction through a keyword corresponding to the mode information, the situation that the user needs to speak the wake-up word before each voice instruction is avoided, and the use experience of the user is effectively improved.

Description

Intelligent control method and device based on voice, electronic equipment and storage medium
Technical Field
The present application relates to the field of speech recognition technologies, and in particular, to a speech-based intelligent control method and apparatus, an electronic device, and a storage medium.
Background
With the gradual development of voice recognition technology, more and more terminal devices have a voice interaction function and can understand a user's voice information to realize control of the terminal device. However, current terminal devices usually require the user to input a preset wake-up word to wake the device before interaction between the user and the terminal device can take place. Whenever the user needs to input a control instruction, the wake-up word must be input first, which increases the interaction cost and reduces the interaction experience of the user. How to optimize the voice interaction process so as to improve the voice interaction experience of the user is a problem to be solved urgently.
Disclosure of Invention
In view of the foregoing, the present application provides a voice-based intelligent control method, apparatus, electronic device and storage medium to improve the foregoing problems.
In a first aspect, an embodiment of the present application provides a voice-based intelligent control method, where the method includes: acquiring mode information of a current scene; detecting whether first voice information input by a user comprises a keyword corresponding to the mode information; when the detection result indicates that the keyword is included, determining a corresponding first control instruction according to the keyword; and executing the first control instruction.
Further, before obtaining the mode information of the current scene, the method includes: acquiring second voice information input by the user; if the second voice information comprises a preset wake-up word, performing a wake-up operation, wherein the preset wake-up word is different from the keyword; and acquiring a scene corresponding to the second voice information and taking the scene as the current scene.
Further, each scene comprises at least one piece of mode information, and each piece of mode information comprises at least one keyword and a first control instruction corresponding to the keyword.
Further, detecting whether the first voice information input by the user comprises a keyword corresponding to the mode information includes: performing voiceprint recognition on the first voice information to obtain a target voiceprint feature; acquiring the similarity between the target voiceprint feature and a specified voiceprint feature, wherein the specified voiceprint feature is the voiceprint feature of the second voice information; and if the similarity is greater than a first preset threshold, detecting whether the first voice information input by the user comprises the keyword corresponding to the mode information.
Further, detecting whether a keyword corresponding to the mode information is included in the first voice information input by the user includes: analyzing first voice information input by a user based on a preset acoustic model to acquire acoustic feature similarity of the first voice information and a keyword corresponding to mode information; and if the acoustic feature similarity is greater than a second preset threshold, judging that the first voice message comprises a keyword corresponding to the mode information.
Further, when it is detected that the first voice information does not include the keyword corresponding to the mode information, whether the first voice information contains the preset wake-up word is detected; if yes, semantic parsing is performed on the first voice information to acquire a second control instruction; and the second control instruction is executed.
Further, after semantic parsing is performed on the first voice information to obtain the second control instruction, the method includes: acquiring a scene type corresponding to the second control instruction; and if the scene type is different from the current scene, switching the current scene into a scene corresponding to the scene type.
In a second aspect, an embodiment of the present application provides a voice-based intelligent control apparatus, including: an acquisition module for acquiring the mode information of the current scene; a detection module for detecting whether the first voice information input by the user comprises a keyword corresponding to the mode information; an instruction determining module for determining a corresponding first control instruction according to the keyword when the detection result indicates that the keyword is included; and an execution module for executing the first control instruction.
In a third aspect, the present application provides an electronic device, comprising: one or more processors; a memory; one or more applications, wherein the one or more applications are stored in the memory and configured to be executed by the one or more processors, the one or more applications being configured to perform the method of the first aspect.
In a fourth aspect, the present application provides a computer-readable storage medium having program code stored therein, the program code being invoked by a processor to perform the method of the first aspect.
The embodiments of the application disclose a voice-based intelligent control method and apparatus, an electronic device, and a storage medium, which can acquire the current scene of the terminal device so that a user can issue a control instruction in the current scene through a keyword alone, avoiding the situation that the user needs to speak a wake-up word before sending each voice instruction. Specifically, mode information of a current scene is obtained, where the mode information is configuration information used for voice interaction in the current scene; whether first voice information input by a user comprises a keyword corresponding to the mode information is detected, where each keyword can correspond to a control instruction of the current scene, so that the terminal device can acquire the user's control instruction by detecting the keyword; when the detection result indicates that the keyword is included, a corresponding first control instruction is determined according to the keyword; and the first control instruction is executed. The terminal device can therefore acquire the control instruction input by the user by detecting the keyword, instead of recognizing the control instruction in the voice only after detecting the wake-up word, and the user can issue the corresponding control instruction through the keyword. The situation that the user needs to speak the wake-up word before each voice instruction is avoided, and the use experience of the user is effectively improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 shows a schematic diagram of an application environment suitable for the embodiment of the present application.
Fig. 2 shows a flowchart of a method of a voice-based intelligent control method according to an embodiment of the present application.
Fig. 3 is a flowchart illustrating a method of a voice-based intelligent control method according to another embodiment of the present application.
Fig. 4 is a flowchart illustrating step S350 of a speech-based intelligent control method according to another embodiment of the present application.
Fig. 5 is a method flowchart of a voice-based intelligent control method according to another embodiment of the present application.
Fig. 6 is a block diagram of a voice-based intelligent control device according to an embodiment of the present application.
fig. 7 shows a block diagram of an electronic device according to an embodiment of the present application.
Fig. 8 illustrates a storage unit for storing or carrying program codes for implementing the voice-based intelligent control method according to the embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
With the continuous development of voice interaction technology, more and more terminal devices can interact by recognizing user voice data. In voice interaction, in order to reduce false wake-ups, one voice wake-up usually corresponds to one voice recognition; that is, the user needs to input a preset wake-up word each time a voice control command is input, and when the voice input by the user matches the preset wake-up word, the terminal device performs voice recognition on the voice input by the user to acquire the control command corresponding to the voice. For example, when a user wants to interact with the device, the user needs to speak "AA" to wake the terminal device from the sleep state and then speak a control command such as "please play music", or speak the control command together with the wake-up word, such as "AA, please play music"; if "please play music" is spoken directly, the device cannot recognize the user's voice. Therefore, whenever the user needs to input a control command, the wake-up word must be input first, which reduces the interaction experience of the user.
In order to solve the above problems, the inventor has long studied and proposed the voice-based intelligent control method, apparatus, electronic device, and storage medium of the embodiments of the present application. The method acquires the mode information of the current scene, detects whether the first voice information input by the user comprises a keyword corresponding to the mode information, determines a corresponding first control instruction according to the keyword when the keyword is detected, and executes the first control instruction. According to the embodiments of the application, the mode information of the current scene can be acquired, and when the first voice information is detected to comprise the keyword corresponding to the mode information, the first control instruction is determined according to the keyword, so that a user can issue the corresponding control instruction through the keyword alone. The situation that the user needs to speak the wake-up word before each voice instruction is avoided, and the use experience of the user is effectively improved. Moreover, the terminal device can execute the first control instruction directly in the wake-free state, which improves the control efficiency of the terminal device to a certain extent.
In order to better understand the voice-based intelligent control method, apparatus, electronic device, and storage medium provided in the embodiments of the present application, an application environment suitable for the embodiments of the present application is described below.
Referring to fig. 1, fig. 1 is a schematic diagram illustrating an application environment suitable for the embodiments of the present application. The voice-based intelligent control method provided by the embodiments of the present application can be applied to the polymorphic interaction system 100 shown in fig. 1. The polymorphic interaction system 100 includes a terminal device 101 and a server 102, the server 102 being communicatively coupled to the terminal device 101. The server 102 may be an independent server, a server cluster, a local server, or a cloud server, which is not specifically limited herein.
The terminal device 101 may be any of various electronic devices with a voice interaction apparatus, including but not limited to smart home devices, smart speakers, smart gateways, robots, vehicle-mounted devices, smartphones, tablet computers, laptop computers, desktop computers, wearable electronic devices, and the like. Specifically, the terminal device 101 may include a voice input module such as a microphone, a voice output module such as a speaker, and a processor. The voice interaction apparatus may be built into the terminal device 101, or may be an independent module that communicates with the terminal device 101 through an API or by other means. As one mode, a character input module for inputting characters, an image input module for inputting images, a video input module for inputting videos, and the like may also be provided on the terminal device 101 to implement multi-modal interaction.
The terminal device 101 may have a client application installed, through which the user can communicate with the server 102. Specifically, the server 102 has a corresponding server application installed; the user may register a user account with the server 102 through the client application and communicate with the server 102 based on that account. For example, a user logs into the user account in the client application and, based on that account, may input text information, voice information, image information, video information, and the like; after receiving the information input by the user, the client application may send the information to the server 102, so that the server 102 can receive, process, and store it, and the server 102 may also return corresponding output information to the terminal device 101 according to the received information.
In some embodiments, after acquiring the reply information corresponding to the information input by the user, the terminal device 101 may display the reply information on its display screen or on another image output device connected to it. As one mode, while the reply information is displayed, the corresponding audio may also be played through the speaker of the terminal device 101 or another connected audio output device, and the text or graphics corresponding to the reply information may be shown on the display screen of the terminal device 101, thereby implementing multi-modal interaction with the user across images, voice, text, and the like.
In some embodiments, the server 102 may provide a recognition service for the voice data received at the terminal device 101 to obtain a text representation of the voice data input by the user, and obtain a representation of the user's intention based on the text representation, thereby generating a corresponding control instruction, and returning the control instruction to the terminal device 101. And the terminal equipment 101 provides services for the user according to the corresponding operation of the control instruction. Such as playing a song, making a call, setting an alarm clock, etc.
In some embodiments, the terminal device 101 may be a device such as a smart speaker, a smart gateway, or a smart home panel, and may be connected to at least one controlled device, where the controlled devices may include but are not limited to an air conditioner, a floor heating device, a fresh-air system, curtains, lamps, a television, a refrigerator, and an electric fan; the terminal device and the smart home devices may be connected through Bluetooth, WiFi, or ZigBee. After acquiring the information input by the user, the terminal device 101 identifies it, and when determining that the information is a control instruction for a controlled device, controls the controlled device to respond according to the control instruction. For example, if a smart gateway is connected to a curtain and the user inputs the voice command "AA, open the curtain", the smart gateway recognizes the voice command after detecting the wake-up word "AA" and controls the curtain to open.
In some embodiments, the means for processing the information input by the user may also be disposed on the terminal device 101, so that the terminal device 101 can interact with the user without relying on establishing communication with the server 102, and in this case, the polymorphic interaction system 100 may only include the terminal device 101.
The above application environments are only examples for facilitating understanding, and it is to be understood that the embodiments of the present application are not limited to the above application environments.
Embodiments of the present application will be described in detail below with reference to the accompanying drawings.
Referring to fig. 2, fig. 2 is a flowchart illustrating a method of a voice-based intelligent control method according to an embodiment of the present application, where the method includes steps S210 to S240.
Step S210: acquiring mode information of the current scene.
The mode information is configuration information for voice interaction corresponding to the current working scene of the terminal equipment. Specifically, the mode information corresponding to each scene may include one or more of a keyword that can be used as a control command, an acoustic detection model for detecting voice information, a voice signal strength threshold for responding, a duration for waiting for a user to input voice information, and the like.
The terminal device may store at least one scene and mode information corresponding to each scene, and a correspondence between the scene and the mode information may be stored in the terminal device or the server after being preset by a user, may also be preset and stored by default when the terminal device leaves a factory, and may also be sent to the terminal device after being preset by the server, which is not limited herein.
In some embodiments, a plurality of scenes may be set for the terminal device according to the actual use requirements of the user. As one way, a plurality of scenes may be set according to different functions. For example, the scene is set as an application scene of music playing, news playing, shopping, ticket booking, smart home device control and the like. Alternatively, a plurality of scenes may be set according to different interactive objects. For example, the scene may be set as a multi-person interactive scene and a single-person interactive scene according to the number of interactive objects, or may be set as an interactive scene for the elderly, children, adults, and the like according to the properties of the interactive objects. As still another way, a plurality of work scenes may also be set according to external environment information. For example, the scene may be set to a use scene such as a morning scene, an evening scene, a night scene, or the like according to time. The setting of the scene is not limited herein.
In some embodiments, one mode information may be set for each scene, and when the terminal device operates in the scene, voice interaction is performed based on the mode information corresponding to the scene. For example, the keywords that can be used as control commands in the mode information of different scenes may be different, so as to implement that the keywords are input by voice in different scenes to execute the corresponding control commands in the scenes. For another example, the voice signal strength thresholds that can respond in the mode information of different scenes may be different, and specifically, a higher voice signal strength threshold may be set in the mode information of the multi-person interactive scene to prevent false wake-up, and a lower voice signal strength threshold may be set in the mode information of the single-person interactive scene to improve the sensitivity of response.
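For illustration, the per-scene mode information described above can be sketched as a small data structure. The following Python sketch is an assumption of this description rather than part of the disclosed implementation; all field names and values are illustrative.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class ModeInfo:
    """Hypothetical per-scene voice-interaction configuration."""
    keywords: Dict[str, str]   # keyword -> identifier of its control instruction
    acoustic_model_id: str     # acoustic model used to detect these keywords
    signal_threshold: float    # minimum voice signal strength that triggers a response
    wait_timeout: float        # seconds to wait for the user's voice input

# A scene may carry one or several pieces of mode information.
SCENES: Dict[str, List[ModeInfo]] = {
    "music_playback": [
        ModeInfo(
            keywords={"cut song": "player.next", "next song": "player.next"},
            acoustic_model_id="kw_model_music",
            signal_threshold=40.0,  # single-person scene: lower threshold, higher sensitivity
            wait_timeout=10.0,
        ),
    ],
    "multi_person": [
        ModeInfo(
            keywords={"turn on light": "ambience_lamp.on"},
            acoustic_model_id="kw_model_multi",
            signal_threshold=55.0,  # noisy scene: higher threshold to reduce false wake-ups
            wait_timeout=5.0,
        ),
    ],
}
```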
In some embodiments, each scene includes at least one piece of mode information, and each piece of mode information includes at least one keyword corresponding to a control instruction used for control. As one mode, for a scene of discontinuous conversation, after the waiting time following device wake-up exceeds a preset duration, the control instruction input by the user can still be obtained by detecting the keyword, without sending the user's voice stream to the server and performing voice recognition and semantic analysis on the entire utterance to obtain the control instruction the user intends. Therefore, for a discontinuous conversation scene, wake-free command control can be realized while power consumption is reduced.
As one mode, multiple pieces of mode information can be set for each scene; by further subdividing the current working scene, several different modes of voice interaction with the user can exist within the same working scene. For example, for the scenes obtained by function-based setting, multiple types of mode information may be set for each scene according to different interaction objects, with each piece of mode information corresponding to different keywords. For example, the smart-home-control scene may include separate mode information for children, adults, and so on; the devices over which each mode has control authority differ, and so do the corresponding keywords. Specifically, if the user sets that children are not allowed to use entertainment facilities such as the television and the projector, the child mode information does not include the keyword "turn on television" as a control command, while the adult mode has control authority over the entertainment facilities and its keywords include "turn on television".
In the present application, by acquiring the mode information of the scene, voice interaction can be carried out according to different mode information in different scenes, thereby improving the user's voice interaction experience. Further, the terminal device may obtain the mode information of the current scene in various ways.
In some embodiments, the terminal device may obtain mode information of a current scene by obtaining current environment information, where the current environment information may be illumination information obtained by a sensor of the terminal device, current time information obtained by a clock of the terminal device, or noise information of the current scene obtained by the sound collection device.
In other embodiments, the terminal device may obtain the mode information of the current scene by obtaining the voice information input by the user, specifically, please refer to the following embodiments.
In still other embodiments, a selection control may be configured on the terminal device. As one mode, the user may directly select the current working scene through a working-scene selection control, so as to obtain the mode information corresponding to that scene. Alternatively, the user may select, via a mode selection control, the way in which the mode information of the current scene is obtained. For example, when the user selects the voice mode, the terminal device acquires the mode information of the current scene by acquiring voice information input by the user; when the user selects both the voice mode and the environment mode, the mode information of the current scene may be acquired by combining the acquired voice information with the current environment information. For example, with both modes selected, when the user inputs "AA, play a song by Zhou Jielun", the device recognizes that the user intends to play music, and on acquiring from the clock that the current time is night, may obtain the mode information corresponding to playing music at night.
Step S220: detecting whether the first voice information input by the user includes a keyword corresponding to the mode information.
After the mode information of the current scene is acquired, the terminal device can continuously acquire the voice signal input by the user through the voice acquisition module. As one mode, a voice detection module may be disposed in the terminal device, which uses Voice Activity Detection (VAD) to detect the voice signal input by the user and picked up by a voice acquisition module such as a microphone. Optionally, the integrity of the acquired voice signal can be determined through time-delay compensation after the voice signal is detected, so as to avoid missing part of the voice signal.
In some embodiments, after the first voice information input by the user is acquired, certain preprocessing operation may be performed on the first voice information, and then detection may be performed to determine whether the first voice information includes a keyword corresponding to the mode information. The preprocessing operation may include noise suppression processing, echo cancellation processing, signal enhancement processing, and the like, and the accuracy of detection may be improved through the preprocessing operation.
Specifically, by acquiring the mode information of the current scene, the terminal device may detect only the keywords corresponding to the mode information of the current scene. It can be understood that the greater the distinctiveness between keywords, the higher the accuracy of keyword detection; distinctiveness may be reflected in the length of a keyword, the differences between the syllables it contains, and the like. In this way, the number of keywords to be detected in each scene is small, so the required power consumption is low, and reducing the number of keywords to be detected also reduces, to a certain extent, the recognition error rate caused by low distinctiveness among keywords.
In some embodiments, each scene may include a plurality of mode information, and the mode information corresponding to the current scene may be determined according to the acquired first voice information input by the user. For example, in a scene controlled by the smart home device, multiple types of mode information may be set according to different interactive objects, each mode information corresponds to a different keyword, a current interactive object may be determined by recognizing the voice information, and then a keyword corresponding to the current mode information is determined according to the interactive object, so as to detect whether the first voice information includes the keyword. For example, the scene controlled by the smart home device may include mode information of children, adults, and the like, the mode information of children does not have the control authority of the entertainment facility, that is, does not have keywords related to the entertainment facility, and when it is determined that the interactive object is a child by detecting the obtained voice, keywords related to the entertainment facility such as "turn on television" are not detected.
The method for detecting whether the first voice information input by the user includes the keyword corresponding to the mode information may be implemented in various manners, for example, using an acoustic model, based on template matching, using a neural network, and the like. The embodiment of the present application does not limit this.
In some embodiments, the first voice information input by the user may be analyzed based on a preset acoustic model to obtain acoustic feature similarity between the first voice information and a keyword corresponding to the mode information, and whether the keyword corresponding to the mode information is included in the first voice information is determined according to the acoustic feature similarity, and if the acoustic feature similarity is greater than a second preset threshold, it is determined that the keyword corresponding to the mode information is included in the first voice information.
The second preset threshold is a preset value: the higher it is, the higher the accuracy of keyword detection and the lower the response sensitivity of the terminal device; the lower it is, the lower the detection accuracy and, correspondingly, the higher the response sensitivity. For example, when the keyword is "cut song" and the second preset threshold is high, the voice information is determined to include the keyword only if the user clearly and accurately says the words "cut song"; in some cases, even if the user says the keyword, the acoustic feature similarity may be low because of accent or noise, and it is determined that the keyword is not included. When the second preset threshold is low, some similar voice information may be determined to include the keyword; for example, the similar-sounding "qige" may be erroneously determined to include "cut song", and "cut song" spoken with an accent will also be determined to include it.
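As a minimal sketch of this decision rule, assuming the similarity scores have already been produced by the acoustic model (the function and names below are illustrative, not from the patent):

```python
from typing import Dict, Optional

def detect_keyword(similarities: Dict[str, float],
                   second_preset_threshold: float) -> Optional[str]:
    """Return the best-matching keyword of the current mode information if its
    acoustic feature similarity exceeds the second preset threshold, else None.

    A higher threshold raises detection accuracy but lowers response
    sensitivity; a lower threshold does the opposite, as described above.
    """
    if not similarities:
        return None
    keyword, score = max(similarities.items(), key=lambda kv: kv[1])
    return keyword if score > second_preset_threshold else None
```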
It can be understood that, since the keywords may be different in the mode information of different scenes, the modes of different scenes may correspond to different acoustic models, where each acoustic model includes an acoustic feature corresponding to each keyword in the mode information of the scene.
As a mode, the speech signal intensity thresholds corresponding to the mode information of different scenes may be different, and when the obtained speech signal intensity input by the user is greater than the intensity threshold corresponding to the scene, the keyword is detected through the acoustic model corresponding to the scene. For example, a higher voice signal intensity threshold value can be set in the mode information of the multi-person interaction scene to reduce the probability that the terminal device is mistakenly awakened in a noisy environment, and a lower voice signal intensity threshold value can be set in the mode information of the single-person interaction scene to improve the sensitivity of the response.
As one approach, the signal strength threshold corresponding to each keyword in each acoustic model may be the same. As another mode, each keyword in the acoustic model corresponds to a different signal intensity threshold, and specifically, a higher signal intensity threshold may be set for a keyword with poor distinctiveness, and a lower signal intensity threshold may be set for a keyword with good distinctiveness, so that a lower false wake-up probability and a higher response sensitivity are further considered in each scene.
The acoustic model may be a model stored in the terminal device or a model in the server.
As one mode, when the acoustic model is pre-stored in the terminal device, the terminal device can perform voice recognition on the first voice information without communicating with the server, and keyword detection can still be achieved when the terminal device has a poor network signal or no network connection. Specifically, based on the acoustic model, the acoustic features in the first voice information may be extracted, and the similarity between them and the acoustic features corresponding to the keyword may be calculated. For example, the mel-frequency cepstrum coefficients extracted from the first voice signal may be used as the acoustic features, and the maximum likelihood ratio of the acoustic features between the first voice information and the keyword corresponding to the mode information may be used as the acoustic feature similarity. Specifically, each feature point of the acoustic features in the first voice information may be obtained and compared for similarity against each feature point of the acoustic features corresponding to the keyword, and the similarities of all feature points are then integrated to obtain a maximum likelihood value as the acoustic feature similarity.
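A rough on-device sketch of this comparison, assuming MFCC features and using a dynamic-time-warping alignment cost in place of the maximum-likelihood integration described above (librosa is used here only for illustration; the normalization is an assumption of this sketch):

```python
import numpy as np
import librosa

def mfcc_similarity(audio: np.ndarray, sr: int,
                    template_mfcc: np.ndarray) -> float:
    """Score an input utterance against a stored keyword template.

    MFCCs are extracted from the input and aligned to the template with
    dynamic time warping; the average alignment cost is squashed into a
    similarity in (0, 1]. DTW stands in for the feature-point-by-feature-point
    likelihood integration described in the text.
    """
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13)
    cost, _ = librosa.sequence.dtw(X=mfcc, Y=template_mfcc, metric="euclidean")
    avg_cost = cost[-1, -1] / (mfcc.shape[1] + template_mfcc.shape[1])
    return float(1.0 / (1.0 + avg_cost))
```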
Alternatively, a first acoustic model may be set in the terminal device and a second acoustic model in the server, where the first acoustic model has lower detection accuracy than the second acoustic model. When the acoustic feature similarity between the first voice information and the keyword corresponding to the mode information is judged, based on the first acoustic model, to be greater than a specified value, the first voice information is sent to the second acoustic model for further detection. In this way, higher-precision acoustic detection is performed only when the similarity detected by the first acoustic model exceeds the specified value, which reduces the power consumption required for accurate detection while improving detection accuracy.
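The two-model arrangement amounts to a cascade; a sketch under assumed interfaces (coarse_model, server_client, and their score methods are placeholders, not a real API):

```python
def cascaded_detect(audio, coarse_model, server_client,
                    specified_value: float, accept_threshold: float):
    """Two-stage keyword detection sketch.

    The lower-accuracy first acoustic model runs locally; only when its
    similarity exceeds the specified value is the audio forwarded to the
    higher-accuracy second acoustic model on the server.
    """
    keyword, coarse_score = coarse_model.score(audio)    # local, low power
    if coarse_score <= specified_value:
        return None                                      # rejected locally, no network traffic
    fine_score = server_client.score(audio, keyword)     # remote, higher precision
    return keyword if fine_score > accept_threshold else None
```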
Step S230: if the keyword is detected, the corresponding first control instruction is determined according to the keyword.
The first control instruction may be an instruction for controlling the terminal device that acquires the first voice information, or an instruction for controlling other controlled devices connected to the terminal device. Specifically, the control instruction may include a controlled device and a business skill of the controlled device, and the business skill may be a control skill, a query skill, or the like according to different scenarios. The first control command is not limited herein. The user can directly input the first control instruction by inputting the keyword corresponding to the mode information through voice, and does not need to input a wake-up word to wake up the terminal equipment every time the control instruction needs to be input, and then identify the control instruction in the voice information of the user.
In some embodiments, the first control instruction may be determined only by a keyword. Specifically, the terminal device stores a corresponding relationship between the keyword and the control instruction in the mode information of each scene, and the corresponding relationship may be stored in the terminal device after being preset by a user, or may be stored in the terminal device locally or in a server after being preset by default when the terminal device leaves a factory. According to the detected keyword corresponding to the mode information included in the first voice information and the corresponding relation between the keyword and the control instruction stored in the terminal equipment, the terminal equipment can determine the first control instruction corresponding to the keyword in the first voice information input by the user. By the method, the terminal equipment can acquire the control instruction corresponding to the first voice information without performing semantic analysis on the first voice information, so that the efficiency of determining the first control instruction is improved. Therefore, even if the terminal device is located in a poor network environment or a disconnected network environment, the first control instruction can be determined according to the keyword.
As one way, different keywords may correspond to the same control instruction; that is, one control instruction may correspond to several different keywords with similar semantics, so that the user can use different keywords to trigger the same control instruction. For example, in a music-playing scene, the keywords corresponding to the control instruction for switching to the next song may be "cut song" and "next song". It should be noted that the same keyword may correspond to different control instructions in different scenes. For example, the control instruction corresponding to the keyword "turn on light" in the work scene is to turn on the smart desk lamp on the desk, while in the entertainment scene it is to turn on the colored ambience lamp of the room.
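Since the keyword-to-instruction correspondence is stored rather than parsed, determining the first control instruction reduces to a table lookup that also works offline; a sketch with illustrative table contents:

```python
from typing import Dict, Optional

# Hypothetical per-scene keyword -> control-instruction tables. Several
# keywords may share one instruction, and the same keyword may map to
# different instructions in different scenes.
INSTRUCTION_TABLE: Dict[str, Dict[str, str]] = {
    "music_playback": {"cut song": "player.next", "next song": "player.next"},
    "work":           {"turn on light": "desk_lamp.on"},
    "entertainment":  {"turn on light": "ambience_lamp.on"},
}

def resolve_instruction(scene: str, keyword: str) -> Optional[str]:
    """Determine the first control instruction from the detected keyword,
    with no semantic parsing and no server round-trip."""
    return INSTRUCTION_TABLE.get(scene, {}).get(keyword)
```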
In other embodiments, after detecting that the first voice information includes a keyword corresponding to the mode information, the terminal device may perform semantic recognition on the first voice information to obtain the first control instruction corresponding to the keyword. Specifically, after the first voice information is converted into text by Automatic Speech Recognition (ASR), Natural Language Understanding (NLU) is performed on the text to parse the first voice information, and the first control instruction is determined according to the parsing result.
As one way, the terminal device may be provided with a multi-turn interaction mode; with this mode turned on, the first control instruction may be determined by combining the keyword with the interaction content that preceded the first voice information. The multi-turn interaction mode is a continuous conversation mode in which user voice can be acquired and responded to continuously. For example, suppose the current scene is determined to be video playing from the voice information "I want to watch a variety show", and the keyword is set to "play". If the voice "play a sketch featuring actor A" is then acquired, the control command is recognized as playing actor A's sketch and executed; if the voice "play actor B" is acquired next, the terminal device can combine it with the previous interaction content and determine that a sketch featuring actor B should be played, rather than a variety show featuring actor B. In this way, the user does not need to repeat the earlier voice information, and interaction efficiency is improved.
Step S240: executing the first control instruction.
In some embodiments, the first control instruction may be an instruction for controlling the terminal device that acquires the first voice information, and the terminal device may directly execute the first control instruction. For example, the terminal device is an intelligent sound box, the first control instruction is music playing, and the terminal device directly executes the first control instruction to play music.
In other embodiments, the first control instruction may also be an instruction for controlling other controlled devices connected to the terminal device, where the other controlled devices may be devices locally connected to the terminal device in a bluetooth, WiFi, or ZigBee manner, or may also be WiFi devices connected to the terminal device under the same WiFi. The terminal device may send the first control instruction to the controlled device corresponding to the first control instruction, and instruct the controlled device to execute the first control instruction.
In some embodiments, the terminal device may further acquire the execution result of the first control instruction and output response information based on it, where the response information can be at least one of sound, image, or a sound-light combination. For example, in a smart home control scene, a user inputs voice information containing the keyword "open the air conditioner" to a control panel; the control panel determines the control instruction corresponding to the keyword and sends it to the air conditioner, and after the air conditioner performs the corresponding operation, the execution result can be fed back to the control panel, which can inform the user of the successful opening through a voice prompt, or through vibration, flashing lights, and the like. As one mode, when execution of the first control instruction fails, response information may also be output to report the failure to the user. For example, when the air conditioner is not connected to the control panel, the user may be informed that the air conditioner is not connected and the input control command cannot be executed.
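A sketch of the execution step, covering both local execution and forwarding to a connected controlled device, with response feedback on success or failure (the device interfaces below are assumptions of this sketch):

```python
def execute_instruction(instruction: str, local_device, controlled_devices: dict) -> None:
    """Execute the first control instruction locally or forward it to the
    controlled device it targets, then report the execution result."""
    target, _, action = instruction.partition(".")    # e.g. "air_conditioner.on"
    if target == "self":
        ok = local_device.perform(action)
    elif target in controlled_devices:
        ok = controlled_devices[target].send(action)  # e.g. over Bluetooth/WiFi/ZigBee
    else:
        local_device.say(f"{target} is not connected; the command cannot be executed.")
        return
    local_device.say("Done." if ok else "Sorry, the command failed.")
```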
It should be noted that, for parts not described in detail in this embodiment, reference may be made to the foregoing embodiments, and details are not described herein again.
According to the voice-based intelligent control method provided by this embodiment, after the mode information of the current scene is acquired, whether the first voice information input by the user includes a keyword corresponding to the mode information can be detected; when the keyword is detected, the corresponding first control instruction is determined according to the keyword and executed. By detecting keywords corresponding to the mode information in the voice information and determining the control instruction from the keyword, the terminal device can execute the first control instruction directly in the wake-free state, which improves its control efficiency to a certain extent; the user can issue the corresponding control instruction through the keyword alone, the situation that the user needs to speak the wake-up word before each voice instruction is avoided, and the use experience of the user is effectively improved.
Referring to fig. 3, a method flowchart of a voice-based intelligent control method according to an embodiment of the present application is shown; as can be seen from fig. 3, the method includes steps S310 to S370.
Step S310: acquiring second voice information input by the user.
In the dormant state, the voice acquisition module of the terminal device can continuously pick up external sound, and when the user utters second voice information, the terminal device acquires it. The dormant state is the state before the terminal device interacts with the user by voice; when the user performs no interaction within a preset waiting time after interacting with the terminal device, the terminal device can also switch back into the dormant state, in which most functional modules stop working and power consumption is low.
Step S320: if the second voice information comprises the preset wake-up word, performing the wake-up operation.
After the second voice information input by the user is obtained, preset wake-up word detection can be performed on it based on the wake-word detection model. The preset wake-up word can be one preset by default when the terminal device leaves the factory or one set by the user, and the preset wake-up word and the keywords are different words. For example, the preset wake-up word may be a word such as "AA" that is unrelated to any control command, while a keyword is a word that expresses the user's control intention, such as "turn on the light" or "turn on the air conditioner". When the second voice information includes the preset wake-up word, the terminal device in the dormant state can switch to the awake state, in which acquired voice information can be further recognized.
In some embodiments, detection may be based on a wake-word detection model local to the terminal device, so as to save the power the terminal device would otherwise consume for real-time voice detection in the sleep state. Specifically, after the acoustic features of the second voice information are extracted based on the wake-word detection model, the acoustic feature similarity between the second voice information and the preset wake-up word is calculated, and whether the second voice information includes the preset wake-up word is then judged. When that acoustic feature similarity is greater than a preset wake-up threshold, it can be determined that the second voice information includes the preset wake-up word, and the terminal device performs the wake-up operation.
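The wake-up decision itself is a single threshold test; a minimal sketch, assuming the wake-word detection model exposes a similarity score:

```python
def should_wake(audio, wake_model, wake_threshold: float) -> bool:
    """Return True when the acoustic feature similarity between the second
    voice information and the preset wake-up word exceeds the preset
    wake-up threshold (model interface assumed)."""
    return wake_model.score(audio) > wake_threshold
```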
It should be noted that the wake-word detection model used to detect the preset wake-up word works on the same principle as the acoustic model used in step S220 to acquire the acoustic feature similarity between the first voice information and the keywords of the mode information. The difference is that the wake-word detection model is the same model in the sleep state and across scenes, and continuously performs preset wake-up word detection on acquired voice information, whereas the acoustic model in step S220 differs under the mode information of different scenes, and keyword detection is performed based on the acoustic model corresponding to a scene only while the terminal device is in that scene.
Step S330: acquiring the scene corresponding to the second voice information and taking it as the current scene.
In some embodiments, the obtained second voice information input by the user may be subjected to voice recognition and semantic analysis to obtain a scene corresponding to the second voice information, and the scene is taken as a current scene.
As one mode, the corresponding relationship between the rule template and the scene may be preset, after the terminal device collects the second voice information input by the user, the terminal device may send the second voice information to the server, convert the voice of the second voice information into a text by the server through ASR, and execute NLU on the text to acquire the scene corresponding to the second voice information. Specifically, the scene corresponding to the second speech information may be identified by matching with a rule template, text classification, information extraction, and the like.
Step S340: acquiring mode information of the current scene.
Step S350: detecting whether the first voice information input by the user includes a keyword corresponding to the mode information.
Referring to fig. 4, in some embodiments, step S350 may include steps S351 to S353.
Step S351: performing voiceprint recognition on the first voice information to acquire a target voiceprint feature.
Here, different voice information corresponds to different voiceprint features, and each voiceprint feature corresponds to a user to be responded to. Optionally, voiceprint feature recognition may be performed on the acquired voice based on a trained neural network model; the manner of voiceprint recognition is not limited here.
As one mode, when the first voice information acquired by the terminal device only contains one voiceprint feature, the voiceprint feature is taken as a target voiceprint feature. As another mode, when the voice information acquired by the terminal device includes a plurality of different voiceprint features, a plurality of first voice information and a target voiceprint feature corresponding to each first voice information may be acquired.
Step S352: acquiring the similarity between the target voiceprint feature and the specified voiceprint feature.
Wherein the specified voiceprint feature can be a voiceprint feature of the second speech information. The terminal device can match the target voiceprint feature with the specified voiceprint feature to obtain the similarity between the target voiceprint feature and the specified voiceprint feature. As one mode, when the voice information acquired by the terminal device includes a plurality of different target voiceprint features, the similarity between each target voiceprint feature and the specified voiceprint feature is calculated respectively.
In some embodiments, the specified voiceprint feature may be a preset voiceprint feature; by storing the voiceprint features of at least one preset user in advance, only the voice information of a preset user is responded to, so that other users cannot interact freely, which further saves recognition power consumption.
Step S353: if the similarity is greater than the first preset threshold, detecting whether the first voice information input by the user comprises a keyword corresponding to the mode information.
When the similarity is greater than the first preset threshold, whether the first voice information input by the user includes a keyword corresponding to the mode information is detected; when the similarity is smaller than the first preset threshold, keyword detection is not performed on the first voice information.
Specifically, when the voice information acquired by the terminal device includes a plurality of different voiceprint features, only the first voice information whose similarity with the specified voiceprint feature is greater than a first preset threshold is detected. By only detecting the voice of the user matched with the specified voiceprint feature, namely only interacting the user sending the second voice information, the interaction threshold can be improved, so that other users cannot interact, the probability of interruption by other users is reduced, and the problem that the interaction is easily interrupted due to indiscriminate recognition and response of voice signals is solved.
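The voiceprint gate in steps S351 to S353 can be sketched as a similarity test between speaker embeddings; cosine similarity and the embedding extractor are assumptions of this sketch, since the text does not fix a particular voiceprint method:

```python
import numpy as np

def voiceprint_gate(target_embedding: np.ndarray,
                    specified_embedding: np.ndarray,
                    first_preset_threshold: float) -> bool:
    """Proceed to keyword detection only when the target voiceprint feature is
    similar enough to the specified voiceprint feature (that of the second
    voice information). When several voiceprints are present, each one is
    gated independently."""
    sim = float(np.dot(target_embedding, specified_embedding) /
                (np.linalg.norm(target_embedding) * np.linalg.norm(specified_embedding)))
    return sim > first_preset_threshold
```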
In some embodiments, when the similarity is greater than a first preset threshold, it may be further detected whether the first voice message includes stop response information, where the stop response information is used to indicate that a user corresponding to the second voice message stops interacting. If the stop response information is detected, the terminal device does not perform voiceprint recognition on the first voice information any more, and does not perform similarity judgment, that is, all the acquired first voice information is subjected to indiscriminate response. In this way, after the user stops interacting, the terminal device can interact with other users, thereby improving the usability of the system.
Step S360: if the keyword is detected, determining the corresponding first control instruction according to the keyword.
Step S370: executing the first control instruction.
It should be noted that, for parts not described in detail in this embodiment, reference may be made to the foregoing embodiments, and details are not described herein again.
According to the voice-based intelligent control method provided by this embodiment, after the second voice information input by the user is acquired, if it includes the preset wake-up word, the wake-up operation is performed, the scene corresponding to the second voice information is acquired and taken as the current scene, and the mode information of the current scene is obtained; whether the first voice information input by the user includes a keyword corresponding to the mode information can then be detected, and when the keyword is detected, the corresponding first control instruction is determined according to the keyword and executed. With this method, the current scene can be obtained by recognizing the voice information input by the user, without the user being aware of it, and the wake-free voice control commands input by the user are responded to based on the scene's mode information; when wake-free commands for multiple scenes coexist, confusion between the keywords of different scenes can be avoided to a certain extent.
Referring to fig. 5, a flowchart of a voice-based intelligent control method according to an embodiment of the present application is shown. As can be seen from fig. 5, the method includes steps S410 to S470.
Step S410: and acquiring mode information of the current scene.
In some embodiments, before obtaining the mode information of the current scene, second voice information input by a user may be obtained, and if the second voice information includes a preset wake-up word, a wake-up operation is performed, where the preset wake-up word is different from the keyword, and a scene corresponding to the second voice information is obtained, and the scene is taken as the current scene.
In some embodiments, each scene includes at least one mode information, and each mode information includes at least one keyword and a first control instruction corresponding to the keyword.
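For illustration only, this scene-to-mode-information relationship can be pictured as a nested mapping; the scene names, keywords, and instruction identifiers below are hypothetical examples rather than values taken from the disclosure:

    # Each scene holds at least one piece of mode information; each piece of
    # mode information holds at least one keyword and the first control
    # instruction that keyword triggers.
    SCENES = {
        "music": [
            {"keywords": ["previous song", "next song", "pause"],
             "instruction": "MUSIC_TRANSPORT"},
            {"keywords": ["volume up", "volume down"],
             "instruction": "MUSIC_VOLUME"},
        ],
        "video": [
            {"keywords": ["fast forward", "rewind"],
             "instruction": "VIDEO_SEEK"},
        ],
    }

    def keywords_for(scene: str) -> list:
        # All wake-free keywords that are valid in the given scene.
        return [kw for mode in SCENES.get(scene, []) for kw in mode["keywords"]]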
Step S420: whether a keyword corresponding to the mode information is included in the first voice information input by the user is detected.
In some embodiments, after the mode information of the current scene is acquired, first voice information input by a user can be acquired, and voiceprint recognition is performed on the first voice information to acquire a target voiceprint feature to judge the similarity between the target voiceprint feature and a specified voiceprint feature, wherein the specified voiceprint feature is a voiceprint feature of the second voice information; if the similarity is larger than a first preset threshold, detecting whether the first voice information input by the user comprises a keyword corresponding to the mode information.
When it is detected that the first voice information input by the user includes a keyword corresponding to the mode information, the corresponding first control instruction is determined according to the keyword, i.e., the flow proceeds to step S430; when it is detected that the first voice information does not include such a keyword, whether the first voice information contains a preset wake-up word is detected, i.e., the flow proceeds to step S450.
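A minimal sketch of this branch, reusing the hypothetical SCENES mapping sketched above (the wake-up word "AA", the helper names, and the substring matching are assumptions of this sketch, not the claimed method):

    WAKE_WORDS = ["AA"]  # hypothetical preset wake-up word

    def handle_first_voice(text: str, current_scene: str):
        # Step S420: look for a keyword belonging to the current scene's modes.
        for mode in SCENES.get(current_scene, []):
            for keyword in mode["keywords"]:
                if keyword in text:
                    return mode["instruction"]      # steps S430/S440
        # Step S450: no keyword found, fall back to wake-up word detection.
        if any(wake in text for wake in WAKE_WORDS):
            return "SEMANTIC_PARSE"                 # hand over to steps S460/S470
        return None                                 # no response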
Step S430: and determining a corresponding first control instruction according to the keyword.
Step S440: and executing the first control instruction.
Step S450: whether the first voice message contains a preset awakening word is detected.
The preset awakening word can be an awakening word preset by default when the terminal device leaves a factory, or can be an awakening word set by a user, and the preset awakening word detection can be performed on the first voice information based on the awakening word detection model. Optionally, the preset wake-up word may be the same as or different from a wake-up word required for waking up the terminal device in the sleep state.
Step S460: if yes, performing semantic analysis on the first voice information to acquire a second control instruction;
If it is detected that the first voice information contains the preset wake-up word, the first voice information is converted into text through ASR (automatic speech recognition), and NLU (natural language understanding) is performed on the text to obtain the second control instruction through semantic parsing. Optionally, when semantic parsing of the first voice information fails to yield a second control instruction, response information may be output to interact with the user and further determine the control instruction the user intends. For example, when the user inputs "AA, Beautiful Souls", the terminal device cannot determine whether the user wants to download the game "Beautiful Souls" or play the movie "Beautiful Souls"; the terminal device may output the voice message "Do you want to play the movie Beautiful Souls?" and determine the user's intended control instruction according to the user's next input.
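The clarification fallback might be sketched as follows, where nlu_candidates is a toy stand-in for the ASR and NLU components, and print/input stand in for speech output and the user's next utterance; none of these names come from the disclosure:

    def nlu_candidates(text: str) -> list:
        # Toy stand-in for semantic parsing: list the intents an utterance
        # could plausibly mean. A real NLU component would rank intents.
        if "beautiful souls" in text.lower():
            return ["play the movie Beautiful Souls",
                    "download the game Beautiful Souls"]
        return []

    def parse_or_clarify(text: str):
        # Step S460 with the clarification fallback.
        candidates = nlu_candidates(text)
        if not candidates:
            return None                    # parsing failed outright
        if len(candidates) == 1:
            return candidates[0]           # unambiguous second control instruction
        print("Do you want to " + candidates[0] + "?")  # stand-in for TTS output
        answer = input()                                # stand-in for the next voice input
        return candidates[0] if answer.lower().startswith("y") else candidates[1]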
In some embodiments, after semantic parsing is performed on the first voice information to obtain the second control instruction, a scene category corresponding to the second control instruction may also be acquired; if the scene category is different from the current scene, the current scene is switched to the scene corresponding to the scene category. Since the scene category is obtained by semantic parsing of the voice information, the scene can be switched without the user perceiving it.
As a mode, the terminal device may further store preset wake-up words, control instructions, and a one-to-one correspondence relationship between the control instructions and the scenes, that is, each preset wake-up word corresponds to one control instruction, and each control instruction corresponds to one scene. Specifically, when it is detected that the first voice message includes a preset wake-up word, a control instruction corresponding to the preset wake-up word is used as a second control instruction, and a category of a scene corresponding to the second control instruction is obtained.
It should be noted that the preset wake-up word is valid, i.e., detectable, in all scenes, whereas the keywords corresponding to one scene cannot be recognized in other scenes. For example, if it is detected in the video scene that the voice information input by the user includes "play music", the music scene is taken as the current scene category and the current scene is switched to the music scene; control can then be performed with keywords such as "previous song" corresponding to the mode information of the music scene. If, however, the terminal device acquires voice information containing "previous song" while still in the video scene, it does not respond.
As one mode, each scene category may carry an identification of whether its interaction is continuous or non-continuous, the identification being acquired together with the scene category. When the scene category corresponding to the second control instruction differs from the current scene, the current scene is identified as continuous interaction, and the scene category of the second control instruction is non-continuous, only the second control instruction is executed and the current scene is not switched. The current scene is switched to the scene corresponding to the scene category only when the two differ and both are persistent (continuous-interaction) scenes. For example, if the current scene is a continuously interactive music playing scene and the second control instruction "turn on the light" corresponds to a non-continuous smart-device control scene, the terminal device still maintains the music playing scene after the light is turned on; that is, the user can keep interacting with the keywords of that scene and need not re-enter the music playing scene by voice. In this way, the continuous interaction between the user and the terminal device is not interrupted by a control instruction of a non-continuous scene, frequent scene switching is avoided, and the interaction cost is reduced.
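A sketch of this switching rule (the scene names and the flag table are hypothetical; the disclosure fixes only the rule, not a data layout):

    # Hypothetical continuous/non-continuous identification per scene category.
    CONTINUOUS = {
        "music_playback": True,
        "video": True,
        "smart_device_control": False,
    }

    def next_scene(current_scene: str, instruction_scene: str) -> str:
        # Switch only when the second control instruction's scene category
        # differs from the current scene and both are persistent
        # (continuous-interaction) scenes; otherwise execute the instruction
        # without leaving the current scene.
        if (instruction_scene != current_scene
                and CONTINUOUS.get(current_scene, False)
                and CONTINUOUS.get(instruction_scene, False)):
            return instruction_scene
        return current_scene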
Step S470: and executing the second control instruction.
For the step of executing the second control instruction, reference may be made to the execution of the first control instruction in step S240 above.
In some embodiments, after the mode information of the current scene is acquired, whether the first voice information contains a preset wake-up word may be detected first; if so, semantic parsing is performed on the first voice information to obtain a second control instruction, and the second control instruction is executed; if not, whether the first voice information input by the user includes a keyword corresponding to the mode information is detected, and when such a keyword is detected, a corresponding first control instruction is determined according to the keyword and executed.
In other embodiments, after the first voice information is acquired, whether it includes a keyword corresponding to the mode information and whether it contains a preset wake-up word may be detected simultaneously. Since the preset wake-up word and the keywords are different words, the keyword detection and the wake-up word detection cannot both succeed at the same time: if a keyword is detected, the flow proceeds to step S430; if the preset wake-up word is detected, the flow proceeds to step S460. By detecting the preset wake-up word and the keywords simultaneously, the detection efficiency for the voice input by the user can be improved and a response can be made more quickly.
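Because the two vocabularies are disjoint, the two detections can safely run concurrently; a rough sketch under that assumption (the thread pool and substring matching are illustrative choices, not part of the disclosure):

    from concurrent.futures import ThreadPoolExecutor

    def detect_both(text: str, keywords: list, wake_words: list):
        # Run keyword detection and wake-up word detection concurrently and
        # return (matched_keyword, matched_wake_word); at most one is non-None
        # because the two vocabularies share no words.
        def find(vocab):
            return next((word for word in vocab if word in text), None)

        with ThreadPoolExecutor(max_workers=2) as pool:
            keyword_future = pool.submit(find, keywords)
            wake_future = pool.submit(find, wake_words)
            return keyword_future.result(), wake_future.result()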
It should be noted that, for parts not described in detail in this embodiment, reference may be made to the foregoing embodiments, and details are not described herein again.
According to the voice-based intelligent control method provided by this embodiment, after the mode information of the current scene is acquired, whether first voice information input by a user includes a keyword corresponding to the mode information can be detected; when it does, a corresponding first control instruction is determined according to the keyword and executed. When the detection result is no, whether the first voice information contains a preset wake-up word is detected; if so, semantic parsing is performed on the first voice information to obtain a second control instruction, and the second control instruction is executed. In this way, the user can issue a control command directly through a keyword corresponding to the mode information of the scene, can issue more diversified control commands through voice information carrying the wake-up word, and can switch scenes through such voice information. Besides the preset keywords, this satisfies the user's more diversified voice interaction requirements and further improves the use experience.
Referring to fig. 6, a block diagram of a voice-based intelligent control device according to an embodiment of the present application is shown, where the device 600 includes: an acquisition module 610, a detection module 620, an instruction determination module 630, and an execution module 640.
The obtaining module 610 is configured to obtain mode information of a current scene.
Further, before the mode information of the current scene is obtained, the apparatus 600 further includes: a second voice acquisition module, a wake-up execution module, and a scene acquisition module.
And the second voice acquisition module is used for acquiring second voice information input by the user.
And the awakening execution module is used for executing awakening operation if the second voice information comprises a preset awakening word, wherein the preset awakening word is different from the keyword.
And the scene acquisition module is used for acquiring a scene corresponding to the second voice information and taking the scene as the current scene.
Further, each scene comprises at least one mode information, and each mode information comprises at least one keyword and a first control instruction corresponding to the keyword.
The detecting module 620 is configured to detect whether the first voice message input by the user includes a keyword corresponding to the mode information.
Further, the detecting module 620 further includes: a voiceprint recognition submodule, a similarity acquisition submodule, and a first judgment submodule.
And the voiceprint recognition submodule is used for performing voiceprint recognition on the first voice information to acquire a target voiceprint feature.
And the similarity acquisition submodule is used for acquiring the similarity between the target voiceprint feature and a specified voiceprint feature, wherein the specified voiceprint feature is the voiceprint feature of the second voice information.
And the first judgment submodule is used for detecting, if the similarity is greater than a first preset threshold, whether the first voice information input by the user includes a keyword corresponding to the mode information.
Further, the detecting module 620 further includes: a voice analysis submodule and a second judgment submodule.
The voice analysis submodule is used for analyzing the first voice information input by the user based on a preset acoustic model to acquire the acoustic feature similarity between the first voice information and a keyword corresponding to the mode information;
and the second judgment submodule is used for judging that the first voice information includes a keyword corresponding to the mode information if the acoustic feature similarity is greater than a second preset threshold.
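As an illustrative sketch of these two submodules (how the preset acoustic model scores an utterance against each keyword is outside this sketch, and the threshold value is an assumption):

    SECOND_PRESET_THRESHOLD = 0.8  # assumed value

    def judge_keyword(acoustic_scores: dict):
        # `acoustic_scores` maps each keyword of the current mode information
        # to the acoustic feature similarity produced by the preset acoustic
        # model. The first voice information is judged to include a keyword
        # only when the best similarity exceeds the second preset threshold.
        if not acoustic_scores:
            return None
        keyword, score = max(acoustic_scores.items(), key=lambda item: item[1])
        return keyword if score > SECOND_PRESET_THRESHOLD else None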
The instruction determining module 630 is configured to determine a corresponding first control instruction according to the keyword when the keyword is detected.
And the execution module 640 is configured to execute the first control instruction.
Further, the apparatus 600 further includes: a wake-up word detection module, a semantic analysis module, and a second execution module.
And the wake-up word detection module is used for detecting whether the first voice information contains a preset wake-up word when it is detected that the first voice information does not include a keyword corresponding to the mode information.
And the semantic analysis module is used for performing semantic analysis on the first voice information to acquire a second control instruction if the first voice information contains the preset wake-up word.
And the second execution module is used for executing the second control instruction.
Further, after the first voice information is semantically parsed to obtain the second control instruction, the apparatus 600 further includes: a scene acquisition module and a scene switching module.
And the scene acquisition module is used for acquiring the scene type corresponding to the second control instruction.
And the scene switching module is used for switching the current scene into a scene corresponding to the scene type if the scene type is different from the current scene.
It should be noted that, as will be clear to those skilled in the art, for convenience and brevity of description, the specific working processes of the above-described apparatuses and modules may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, the coupling between the modules may be electrical, mechanical or other type of coupling. In addition, functional modules in the embodiments of the present application may be integrated into one processing module, or each of the modules may exist alone physically, or two or more modules are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode.
In summary, the embodiments of the present application disclose a voice-based intelligent control method and apparatus, an electronic device, and a storage medium, relating to the field of voice recognition. By acquiring the mode information of the current scene, when it is detected that the first voice information includes a keyword corresponding to the mode information, the first control instruction can be determined according to the keyword, and the terminal device can execute the first control instruction directly in a wake-free state, improving the control efficiency of the terminal device to a certain extent. The user can issue a corresponding control instruction through the keyword alone, avoiding the need to speak the wake-up word before every voice instruction and effectively improving the use experience of the user.
An electronic device provided by the present application will be described with reference to fig. 7.
Referring to fig. 7, based on the foregoing voice-based intelligent control method and apparatus, an embodiment of the present application further provides another electronic device 700 capable of executing the foregoing voice-based intelligent control method. The electronic device 700 includes one or more processors 710 (only one is shown in the figure) and a memory 720 coupled to each other. The memory 720 stores a program capable of executing the content of the foregoing embodiments, and the processor 710 can execute the program stored in the memory 720; the memory 720 also stores the apparatus described in the foregoing embodiments.
The processor 710 may include one or more processing cores. Using various interfaces and lines, the processor 710 connects the various parts of the electronic device 700, and performs the various functions of the electronic device 700 and processes data by running or executing the instructions, programs, code sets, or instruction sets stored in the memory 720 and invoking the data stored in the memory 720. Optionally, the processor 710 may be implemented in hardware in at least one of Digital Signal Processing (DSP), Field-Programmable Gate Array (FPGA), and Programmable Logic Array (PLA) form. The processor 710 may integrate one or a combination of a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a modem, and the like. The CPU mainly handles the operating system, the user interface, application programs, and the like; the GPU renders and draws display content; and the modem handles wireless communication. It can be understood that the modem may also not be integrated into the processor 710 and may instead be implemented by a separate communication chip.
The memory 720 may include a Random Access Memory (RAM) or a Read-Only Memory (ROM). The memory 720 may be used to store instructions, programs, code sets, or instruction sets. The memory 720 may include a program storage area and a data storage area, wherein the program storage area may store instructions for implementing an operating system, instructions for implementing at least one function (such as a touch function, a sound playing function, or a video playing function), instructions for implementing the foregoing method embodiments, and the like. The data storage area may store data created by the electronic device 700 during use (such as phone books, audio and video data, and chat log data), and the like.
It will be understood by those skilled in the art that the structure shown in fig. 7 is merely an illustration and is not intended to limit the structure of the electronic device. For example, electronic device 700 may also include more or fewer components than shown in FIG. 7, or have a different configuration than shown in FIG. 7.
Refer to fig. 8, which is a block diagram illustrating a computer-readable storage medium according to an embodiment of the present application. The computer-readable medium 800 has stored therein a program code that can be called by a processor to execute the method described in the above-described method embodiments.
The computer-readable storage medium 800 may be an electronic memory such as a flash memory, an EEPROM (electrically erasable programmable read only memory), an EPROM, a hard disk, or a ROM. Alternatively, the computer-readable storage medium 800 includes a non-volatile computer-readable storage medium. The computer readable storage medium 800 has storage space for program code 810 to perform any of the method steps of the method described above. The program code can be read from or written to one or more computer program products. The program code 810 may be compressed, for example, in a suitable form.
It should be noted that, in this document, the terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a(n) …" does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present application.
While the present embodiments have been described with reference to the accompanying drawings, it is to be understood that the invention is not limited to the precise embodiments described above, which are meant to be illustrative and not restrictive, and that various changes may be made therein by those skilled in the art without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (10)

1. A voice-based intelligent control method, characterized in that the method comprises:
acquiring mode information of a current scene;
detecting whether first voice information input by a user comprises a keyword corresponding to the mode information;
when the detection result is yes, determining a corresponding first control instruction according to the keyword;
and executing the first control instruction.
2. The method of claim 1, wherein prior to said obtaining mode information for the current scene, the method further comprises:
acquiring second voice information input by a user;
if the second voice message comprises a preset awakening word, executing awakening operation, wherein the preset awakening word is different from the keyword;
and acquiring a scene corresponding to the second voice information, and taking the scene as the current scene.
3. The method of claim 2, wherein each scene comprises at least one mode information, and each mode information comprises at least one keyword and a first control command corresponding to the keyword.
4. The method according to claim 2, wherein the detecting whether the first voice message input by the user includes a keyword corresponding to the mode information comprises:
performing voiceprint recognition on the first voice information to acquire a target voiceprint characteristic;
acquiring the similarity between the target voiceprint feature and a specified voiceprint feature, wherein the specified voiceprint feature is the voiceprint feature of the second voice message;
and if the similarity is greater than a first preset threshold value, detecting whether the first voice information input by the user comprises a keyword corresponding to the mode information.
5. The method according to any one of claims 1 to 4, wherein the detecting whether the first voice message input by the user includes a keyword corresponding to the mode information includes:
analyzing the first voice information input by a user based on a preset acoustic model to acquire acoustic feature similarity of the first voice information and a keyword corresponding to the mode information;
and if the acoustic feature similarity is larger than a second preset threshold, judging that the first voice message comprises a keyword corresponding to the mode information.
6. The method of claim 1, further comprising:
when detecting that the first voice message does not have the keyword corresponding to the mode information, detecting whether the first voice message contains the preset awakening word;
if yes, performing semantic analysis on the first voice information to acquire a second control instruction;
and executing the second control instruction.
7. The method of claim 6, wherein after the semantically parsing the first voice information to obtain a second control instruction, the method further comprises:
acquiring a scene type corresponding to the second control instruction;
and if the scene type is different from the current scene, switching the current scene into a scene corresponding to the scene type.
8. A voice-based intelligent control apparatus, the apparatus comprising:
the acquisition module is used for acquiring the mode information of the current scene;
the detection module is used for detecting whether the first voice information input by the user comprises a keyword corresponding to the mode information;
the instruction determining module is used for determining a corresponding first control instruction according to the keyword when the detection result is that the keyword exists;
and the execution module is used for executing the first control instruction.
9. An electronic device, comprising:
one or more processors;
a memory;
one or more applications, wherein the one or more applications are stored in the memory and configured to be executed by the one or more processors, the one or more applications configured to perform the method of any of claims 1-7.
10. A computer-readable storage medium, having stored thereon program code that can be invoked by a processor to perform the method according to any one of claims 1 to 7.
CN202011308093.6A 2020-11-19 2020-11-19 Intelligent control method and device based on voice, electronic equipment and storage medium Active CN112201246B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011308093.6A CN112201246B (en) 2020-11-19 2020-11-19 Intelligent control method and device based on voice, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112201246A true CN112201246A (en) 2021-01-08
CN112201246B CN112201246B (en) 2023-11-28

Family

ID=74034327

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011308093.6A Active CN112201246B (en) 2020-11-19 2020-11-19 Intelligent control method and device based on voice, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112201246B (en)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170116994A1 (en) * 2015-10-26 2017-04-27 Le Holdings(Beijing)Co., Ltd. Voice-awaking method, electronic device and storage medium
CN105872688A (en) * 2016-03-31 2016-08-17 乐视控股(北京)有限公司 Voice control method and device of smart television
CN108520743A (en) * 2018-02-02 2018-09-11 百度在线网络技术(北京)有限公司 Sound control method, smart machine and the computer-readable medium of smart machine
CN111819626A (en) * 2018-03-07 2020-10-23 华为技术有限公司 Voice interaction method and device
CN109545206A (en) * 2018-10-29 2019-03-29 百度在线网络技术(北京)有限公司 Voice interaction processing method, device and the smart machine of smart machine
CN110047481A (en) * 2019-04-23 2019-07-23 百度在线网络技术(北京)有限公司 Method for voice recognition and device
CN110232916A (en) * 2019-05-10 2019-09-13 平安科技(深圳)有限公司 Method of speech processing, device, computer equipment and storage medium
CN110579977A (en) * 2019-09-17 2019-12-17 珠海格力电器股份有限公司 control method and device of electrical equipment and computer readable storage medium
CN111223490A (en) * 2020-03-12 2020-06-02 Oppo广东移动通信有限公司 Voiceprint awakening method and device, equipment and storage medium
CN111816192A (en) * 2020-07-07 2020-10-23 云知声智能科技股份有限公司 Voice equipment and control method, device and equipment thereof

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022155788A1 (en) * 2021-01-19 2022-07-28 深圳市品茂电子科技有限公司 Ambient feature active sensing based control module
CN112765335B (en) * 2021-01-27 2024-03-08 上海三菱电梯有限公司 Voice call system
CN112765335A (en) * 2021-01-27 2021-05-07 上海三菱电梯有限公司 Voice calling landing system
CN113160821A (en) * 2021-04-30 2021-07-23 中天智领(北京)科技有限公司 Control method and device based on voice recognition
CN113381864B (en) * 2021-05-25 2023-09-19 福建星网视易信息系统有限公司 Digital audiovisual site control method, system and storage medium
CN113381864A (en) * 2021-05-25 2021-09-10 福建星网视易信息系统有限公司 Digital audio-visual place control method, system and storage medium
WO2023273321A1 (en) * 2021-06-29 2023-01-05 荣耀终端有限公司 Voice control method and electronic device
CN113488042A (en) * 2021-06-29 2021-10-08 荣耀终端有限公司 Voice control method and electronic equipment
CN113643711A (en) * 2021-08-03 2021-11-12 常州匠心独具智能家居股份有限公司 Voice system based on offline mode and online mode for intelligent furniture
CN113643711B (en) * 2021-08-03 2024-04-19 常州匠心独具智能家居股份有限公司 Voice system based on offline mode and online mode for intelligent furniture
CN113778226A (en) * 2021-08-26 2021-12-10 江西恒必达实业有限公司 Infrared AI intelligent glasses based on speech recognition technology control intelligence house
CN116564316A (en) * 2023-07-11 2023-08-08 北京边锋信息技术有限公司 Voice man-machine interaction method and device
CN116564316B (en) * 2023-07-11 2023-11-03 北京边锋信息技术有限公司 Voice man-machine interaction method and device

Also Published As

Publication number Publication date
CN112201246B (en) 2023-11-28

Similar Documents

Publication Publication Date Title
CN112201246B (en) Intelligent control method and device based on voice, electronic equipment and storage medium
KR102543693B1 (en) Electronic device and operating method thereof
KR102421255B1 (en) Electronic device and method for controlling voice signal
KR102309540B1 (en) Server for seleting a target device according to a voice input, and controlling the selected target device, and method for operating the same
US9098467B1 (en) Accepting voice commands based on user identity
CN112513833A (en) Electronic device and method for providing artificial intelligence service based on presynthesized dialog
KR102406718B1 (en) An electronic device and system for deciding a duration of receiving voice input based on context information
KR102628211B1 (en) Electronic apparatus and thereof control method
CN110808044B (en) Voice control method and device for intelligent household equipment, electronic equipment and storage medium
CN111312235A (en) Voice interaction method, device and system
US11862153B1 (en) System for recognizing and responding to environmental noises
CN111768783A (en) Voice interaction control method, device, electronic equipment, storage medium and system
CN109955270B (en) Voice option selection system and method and intelligent robot using same
CN111223490A (en) Voiceprint awakening method and device, equipment and storage medium
CN112767916A (en) Voice interaction method, device, equipment, medium and product of intelligent voice equipment
CN112420044A (en) Voice recognition method, voice recognition device and electronic equipment
CN112562670A (en) Intelligent voice recognition method, intelligent voice recognition device and intelligent equipment
CN111862943B (en) Speech recognition method and device, electronic equipment and storage medium
CN111210824B (en) Voice information processing method and device, electronic equipment and storage medium
KR20210001082A (en) Electornic device for processing user utterance and method for operating thereof
JP2023553995A (en) Combining device- or assistant-specific hotwords in a single utterance
CN113611316A (en) Man-machine interaction method, device, equipment and storage medium
CN111292731A (en) Voice information processing method and device, electronic equipment and storage medium
CN109658924B (en) Session message processing method and device and intelligent equipment
EP3573053A1 (en) System including electronic device of processing user's speech and method of controlling speech recognition on electronic device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant