CN112201246B - Intelligent control method and device based on voice, electronic equipment and storage medium

Intelligent control method and device based on voice, electronic equipment and storage medium

Info

Publication number
CN112201246B
CN112201246B (application CN202011308093.6A)
Authority
CN
China
Prior art keywords
voice
scene
voice information
control instruction
user
Prior art date
Legal status
Active
Application number
CN202011308093.6A
Other languages
Chinese (zh)
Other versions
CN112201246A (en)
Inventor
何海亮
Current Assignee
Shenzhen Oribo Technology Co Ltd
Original Assignee
Shenzhen Oribo Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen Oribo Technology Co Ltd
Priority to CN202011308093.6A
Publication of CN112201246A
Application granted
Publication of CN112201246B


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223 Execution procedure of a spoken command
    • G10L17/00 Speaker identification or verification
    • G10L17/06 Decision making techniques; Pattern matching strategies
    • G10L17/08 Use of distortion metrics or a particular distance between probe pattern and reference templates

Abstract

The embodiments of the present application disclose a voice-based intelligent control method, apparatus, electronic device and storage medium, relating to the field of speech recognition. The method includes: acquiring mode information of a current scene; detecting whether first voice information input by a user includes a keyword corresponding to the mode information; when such a keyword is detected, determining a corresponding first control instruction according to the keyword; and executing the first control instruction. By acquiring the mode information of the current scene, the embodiments of the present application can determine the first control instruction from the keyword whenever the first voice information contains a keyword corresponding to the mode information. A user can thus issue control instructions through the keywords alone, avoiding the need to speak a wake-up word before every voice instruction, which effectively improves the user experience.

Description

Intelligent control method and device based on voice, electronic equipment and storage medium
Technical Field
The present application relates to the field of speech recognition technologies, and in particular, to a speech-based intelligent control method, apparatus, electronic device, and storage medium.
Background
With the gradual development of speech recognition technology, more and more terminal devices have voice interaction functions and can be controlled by understanding the user's speech. However, current terminal devices generally require the user to input a preset wake-up word to wake up the device before any interaction can take place. Because the user must speak the wake-up word every time a control instruction is to be input, the interaction cost is increased and the user's interaction experience is degraded. How to optimize the voice interaction process to improve the user's voice interaction experience is a problem to be solved.
Disclosure of Invention
In view of the above, the present application provides a voice-based intelligent control method, apparatus, electronic device and storage medium to alleviate the above problems.
In a first aspect, an embodiment of the present application provides a voice-based intelligent control method, the method including: acquiring mode information of a current scene; detecting whether first voice information input by a user includes a keyword corresponding to the mode information; when such a keyword is detected, determining a corresponding first control instruction according to the keyword; and executing the first control instruction.
Further, before acquiring the mode information of the current scene, the method includes: acquiring second voice information input by the user; if the second voice information includes a preset wake-up word, performing a wake-up operation, where the preset wake-up word is different from the keyword; and acquiring a scene corresponding to the second voice information and taking that scene as the current scene.
Further, each scene includes at least one piece of mode information, and each piece of mode information includes at least one keyword and the first control instruction corresponding to that keyword.
Further, detecting whether the first voice information input by the user includes a keyword corresponding to the mode information includes: performing voiceprint recognition on the first voice information to obtain a target voiceprint feature; obtaining the similarity between the target voiceprint feature and a specified voiceprint feature, where the specified voiceprint feature is the voiceprint feature of the second voice information; and, if the similarity is greater than a first preset threshold, detecting whether the first voice information input by the user includes a keyword corresponding to the mode information.
Further, detecting whether the first voice information input by the user includes a keyword corresponding to the mode information includes: analyzing the first voice information input by the user based on a preset acoustic model to obtain the acoustic feature similarity between the first voice information and the keywords corresponding to the mode information; and, if the acoustic feature similarity is greater than a second preset threshold, determining that the first voice information includes a keyword corresponding to the mode information.
Further, when it is detected that the first voice information does not contain a keyword corresponding to the mode information, detecting whether the first voice information contains a preset wake-up word; if it does, performing semantic parsing on the first voice information to obtain a second control instruction; and executing the second control instruction.
Further, after performing semantic parsing on the first voice information to obtain a second control instruction, the method includes: acquiring a scene category corresponding to the second control instruction; if the scene category is different from the current scene, switching the current scene into a scene corresponding to the scene category.
In a second aspect, an embodiment of the present application provides a voice-based intelligent control device, including: an acquisition module for acquiring the mode information of the current scene; a detection module for detecting whether the first voice information input by the user includes a keyword corresponding to the mode information; an instruction determining module for determining a corresponding first control instruction according to the keyword when such a keyword is detected; and an execution module for executing the first control instruction.
In a third aspect, the present application provides an electronic device, comprising: one or more processors; a memory; one or more applications, wherein the one or more applications are stored in the memory and configured to be executed by the one or more processors, the one or more applications configured to perform the method of the first aspect described above.
In a fourth aspect, the present application provides a computer readable storage medium having stored therein program code which is callable by a processor to perform the method of the first aspect described above.
The embodiments of the present application disclose a voice-based intelligent control method, device, electronic device and storage medium that can acquire the current scene of the terminal device, in which a user can issue a control instruction through keywords alone, avoiding the need to speak a wake-up word before every voice instruction. Specifically, mode information of the current scene is acquired, where the mode information is the configuration information for voice interaction in the current scene; it is detected whether the first voice information input by the user includes a keyword corresponding to the mode information, where each keyword can correspond to a control instruction of the current scene, so the terminal device can obtain the user's control instruction by detecting keywords; when such a keyword is detected, a corresponding first control instruction is determined according to the keyword; and the first control instruction is executed. In this way, the terminal device obtains the control instruction input by the user by detecting keywords, rather than recognizing control instructions in speech only after a wake-up word has been detected. Since the user can issue the corresponding control instruction through a keyword alone, there is no need to speak the wake-up word before every voice instruction, which effectively improves the user experience.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 illustrates a schematic diagram of an application environment suitable for use with embodiments of the present application.
Fig. 2 is a flowchart of a voice-based intelligent control method according to an embodiment of the present application.
Fig. 3 is a flowchart of a voice-based intelligent control method according to another embodiment of the present application.
Fig. 4 is a flowchart illustrating step S350 in a voice-based intelligent control method according to another embodiment of the present application.
Fig. 5 is a flowchart of a voice-based intelligent control method according to still another embodiment of the present application.
Fig. 6 shows a block diagram of a voice-based intelligent control device according to an embodiment of the present application.
fig. 7 shows a block diagram of an electronic device according to an embodiment of the present application.
Fig. 8 illustrates a memory unit for storing or carrying program codes for implementing the voice-based intelligent control method according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
With the continuous development of voice interaction technology, more and more terminal devices can interact by recognizing user voice data. To reduce false wake-ups, voice interaction usually pairs one voice wake-up with one voice recognition: the user must input a preset wake-up word each time a voice control command is given, and only when the voice input by the user matches the preset wake-up word does the terminal device perform speech recognition on the input to obtain the corresponding control command. For example, when a user wants to interact with a device, the user must first say "AA" to wake the terminal device from the sleep state and then say the control command "please play music", or say the wake-up word together with the command, as in "AA, please play music"; if the user directly says "please play music", the device cannot recognize the user's speech. Since the user must input the wake-up word before every control instruction, the user's interaction experience is degraded.
To solve the above problems, the inventors, after long study, propose the voice-based intelligent control method, device, electronic device and storage medium of the embodiments of the present application. The embodiments of the present application acquire the mode information of the current scene, detect whether the first voice information input by the user includes a keyword corresponding to the mode information, and, when such a keyword is detected, determine a corresponding first control instruction according to the keyword and execute the first control instruction. Because the first control instruction is determined from the keyword whenever the first voice information includes a keyword corresponding to the mode information, the user can issue the corresponding control instruction through the keyword alone, avoiding the need to speak a wake-up word before every voice instruction and effectively improving the user experience. Moreover, the terminal device can execute the first control instruction directly in the wake-up-free state, which improves the control efficiency of the terminal device to a certain extent.
In order to better understand the intelligent control method, the device, the electronic equipment and the storage medium based on the voice provided by the embodiment of the application, an application environment suitable for the embodiment of the application is described below.
Referring to fig. 1, fig. 1 shows a schematic diagram of an application environment suitable for the embodiments of the present application. The voice-based intelligent control method provided by the embodiments of the present application can be applied to the multimodal interactive system 100 shown in fig. 1. The multimodal interactive system 100 includes a terminal device 101 and a server 102, and the server 102 is communicatively connected to the terminal device 101. The server 102 may be a separate server, a server cluster, a local server, or a cloud server, which is not specifically limited herein.
The terminal device 101 may be any of various electronic devices with a voice interaction means, including but not limited to smart home devices, smart speakers, smart gateways, robots, vehicle-mounted devices, smart phones, tablet computers, laptop computers, desktop computers, wearable electronic devices, and the like. Specifically, the terminal device 101 may include a voice input module such as a microphone, a voice output module such as a speaker, and a processor. The voice interaction means may be built into the terminal device 101 or may be a separate module communicating with the terminal device 101 via an API or other means. As one way, the terminal device 101 may further be provided with a text input module for inputting text, an image input module for inputting images, a video input module for inputting videos, and the like, to achieve multimodal interaction.
The terminal device 101 may be provided with a client application through which a user communicates with the server 102. Specifically, the server 102 runs a corresponding server-side application; the user may register a user account with the server 102 through the client application and communicate with the server 102 based on that account. For example, the user logs in to the user account in the client application and inputs text, voice, image or video information through it; after receiving the information input by the user, the client application sends it to the server 102, which receives, processes, and stores the information, and may also return corresponding output information to the terminal device 101.
In some embodiments, after acquiring the reply information corresponding to the information input by the user, the terminal device 101 may display the reply information on its own display screen or on another connected image output device. As one way, while the reply information is displayed, corresponding audio may be played through a speaker of the terminal device 101 or another connected audio output device, or text or graphics corresponding to the reply information may be shown on the display screen, thereby realizing multimodal interaction with the user across images, voice, text, and the like.
In some embodiments, the server 102 may provide recognition services for voice data received by the terminal device 101, obtaining a text representation of the user's speech and deriving the user's intent from that text, thereby generating corresponding control instructions that are returned to the terminal device 101. The terminal device 101 then performs the operation corresponding to the control instruction to serve the user, such as playing songs, making a call, or setting an alarm clock.
In some embodiments, the terminal device 101 may be a device such as a smart speaker, a smart gateway, or a smart home panel, and may be connected to at least one controlled device. The controlled devices may include, but are not limited to, smart home devices such as air conditioners, floor heating, fresh air systems, curtains, lamps, televisions, refrigerators, and electric fans, and may be connected to the terminal device by Bluetooth, WiFi, ZigBee, or the like. After acquiring information input by the user, the terminal device 101 identifies it, and when it is determined to be a control instruction for a controlled device, controls that device to respond accordingly. For example, if a smart gateway is connected to a curtain and the user inputs the voice command "AA, open the curtain", the smart gateway recognizes the voice command after detecting the wake-up word "AA" and controls the curtain to open.
In some embodiments, the means for processing the information input by the user may also be provided on the terminal device 101, so that the terminal device 101 can interact with the user without establishing communication with the server 102; in this case the multimodal interactive system 100 may include only the terminal device 101.
The above application environments are merely examples for facilitating understanding, and it is to be understood that embodiments of the present application are not limited to the above application environments.
Embodiments of the present application will be described in detail below with reference to the accompanying drawings.
Referring to fig. 2, fig. 2 is a flowchart of a method for voice-based intelligent control according to an embodiment of the application, and the method includes steps S210 to S240.
Step S210: and acquiring the mode information of the current scene.
The mode information is configuration information for voice interaction corresponding to the current working scene of the terminal equipment. Specifically, the mode information corresponding to each scene may include one or more of a keyword usable as a control command, an acoustic detection model for detecting voice information, a voice signal strength threshold for responding, a time period for waiting for a user to input voice information, and the like.
The terminal device may store at least one scene and mode information corresponding to each scene, where the corresponding relationship between the scene and the mode information may be preset by a user and then stored in the terminal device or the server, or may be preset and stored by default when the terminal device leaves the factory, or may be preset by the server and then sent to the terminal device, and is not limited herein.
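To make the scene-to-mode-information correspondence described above concrete, here is a minimal sketch in Python. It is illustrative only: the field names, example scenes, keywords, and threshold values are assumptions, not taken from the patent.

```python
from dataclasses import dataclass

@dataclass
class ModeInfo:
    """Per-scene voice-interaction configuration (illustrative fields)."""
    keywords: dict            # keyword -> control instruction identifier
    acoustic_model: str       # detection model used in this scene
    signal_threshold: float   # minimum speech signal strength to respond
    wait_timeout_s: float     # how long to wait for the user's speech

# Hypothetical scene -> mode information table; in the patent this mapping
# may be preset by the user, set at the factory, or pushed from the server.
SCENE_MODES = {
    "music_playing": ModeInfo(
        keywords={"cut song": "NEXT_TRACK", "next song": "NEXT_TRACK"},
        acoustic_model="am_music_v1",
        signal_threshold=0.3,
        wait_timeout_s=8.0,
    ),
    "smart_home_adult": ModeInfo(
        keywords={"turn on television": "TV_ON",
                  "turn on the air conditioner": "AC_ON"},
        acoustic_model="am_home_v1",
        signal_threshold=0.5,
        wait_timeout_s=5.0,
    ),
}

def get_mode_info(current_scene: str) -> ModeInfo:
    """Step S210: acquire the mode information of the current scene."""
    return SCENE_MODES[current_scene]
```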
In some embodiments, multiple scenes may be set for the terminal device according to the user's actual needs. As one way, multiple scenes may be set according to different functions, for example application scenes such as music playing, news playing, shopping, ticket booking, and smart home device control. As another way, multiple scenes may be set according to different interactive objects. For example, scenes may be set as multi-person and single-person interaction scenes according to the number of interacting users, or as interaction scenes for the elderly, children, adults, and so on according to the nature of the interacting users. As still another way, multiple working scenes may be set according to external environment information; for example, usage scenes such as an early morning scene, an evening scene, and a night scene may be set according to the time of day. The setting of scenes is not limited here.
In some embodiments, one piece of mode information may be set for each scene, and when the terminal device operates in that scene, voice interaction is performed based on the corresponding mode information. For example, the keywords usable as control commands may differ between the mode information of different scenes, so that voice input of a keyword in a given scene executes the control command corresponding to that scene. For another example, the speech signal strength threshold required for a response may differ between scenes: a higher threshold may be set in the mode information of a multi-person interaction scene to prevent false triggering, and a lower threshold may be set in the mode information of a single-person interaction scene to improve response sensitivity.
In some embodiments, each scene includes at least one piece of mode information, and each piece of mode information includes at least one keyword corresponding to a control instruction. As one way, in a discontinuous dialogue scene, even after the device has been woken and the wait for speech has exceeded the preset time, the control instruction input by the user can still be obtained by detecting keywords; there is no need to send the user's voice stream to the server or to run speech recognition and semantic parsing over the whole utterance to recover the intended control instruction. Thus, for a discontinuous dialogue scene, wake-word-free command control can be achieved while reducing power consumption.
As one way, multiple pieces of mode information can be set for each scene; by further subdividing the current working scene, several different modes of voice interaction with the user can coexist within the same working scene. For example, for scenes defined by function, multiple pieces of mode information may be set per scene according to different interactive objects, each with its own keywords. The smart home control scene may include separate mode information for children and adults, with different controllable devices and different keywords. Specifically, a user may configure that children are not allowed to use entertainment facilities such as televisions and projectors: the children's mode information then does not include the keyword "turn on television" as a usable control command, while the adult mode retains control authority over entertainment facilities and its keywords include "turn on television".
According to the embodiment of the application, by acquiring the mode information of the scene, the voice interaction in the scene can be performed according to different mode information in different scenes, so that the voice interaction experience of a user is improved. Further, the terminal device may also obtain the mode information of the current scene in a plurality of ways.
In some embodiments, the terminal device may acquire the mode information of the current scene by acquiring current environmental information, where the current environmental information may be illumination information acquired by a sensor of the terminal device, current time information acquired by a clock of the terminal device, or noise information of the current scene acquired by the sound acquisition device.
In other embodiments, the terminal device may obtain the mode information of the current scene by obtaining the voice information input by the user, specifically, please refer to the subsequent examples.
In still other embodiments, a selection control may be configured on the terminal device. As one way, the user may directly select the current working scene through a working-scene selection control, thereby obtaining the mode information corresponding to that scene. Alternatively, the user may select through a mode selection control how the mode information of the current scene is obtained. For example, when the user selects the voice mode, the terminal device obtains the mode information of the current scene from the voice information input by the user; when the user selects both the voice mode and the environment mode, the mode information may be obtained by combining the acquired voice information with the current environment information. For instance, if the user inputs "AA, play a song", the device recognizes that the user intends to play music, and if the clock shows that it is currently night, the mode information corresponding to playing music at night can be obtained.
Step S220: and detecting whether the first voice information input by the user comprises keywords corresponding to the mode information.
After the mode information of the current scene is acquired, the terminal device can continuously acquire the voice signal input by the user through the voice acquisition module. As one way, a voice detection module may be provided in the terminal device to detect the voice signal obtained by a voice acquisition module such as a microphone, using voice activity detection (VAD). Optionally, after a voice signal is detected, the integrity of the acquired signal can be ensured through delay compensation, so that no part of the voice signal is missed.
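As a rough illustration of the voice detection module just mentioned, the sketch below uses a crude short-time-energy gate as a stand-in for VAD, with a hangover that plays the role of the delay compensation described above. Real systems use trained statistical or neural detectors; the threshold and frame handling here are assumptions.

```python
import numpy as np

def is_voice_active(frame: np.ndarray, energy_threshold: float = 1e-3) -> bool:
    # Flag a frame as speech when its mean short-time energy exceeds
    # an assumed threshold.
    return float(np.mean(frame.astype(np.float64) ** 2)) > energy_threshold

def collect_utterance(frames, hangover: int = 10) -> np.ndarray:
    # Accumulate frames from speech onset until `hangover` consecutive
    # quiet frames; keeping the trailing frames approximates the delay
    # compensation that prevents the end of the utterance being cut off.
    voiced, quiet_run = [], 0
    for frame in frames:
        if is_voice_active(frame):
            voiced.append(frame)
            quiet_run = 0
        elif voiced:
            voiced.append(frame)
            quiet_run += 1
            if quiet_run >= hangover:
                break
    return np.concatenate(voiced) if voiced else np.array([])
```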
In some embodiments, after the first voice information input by the user is obtained, a certain preprocessing operation may be performed on the first voice information, and then detection is performed to determine whether the first voice information includes a keyword corresponding to the mode information. The preprocessing operation may include noise suppression processing, echo cancellation processing, signal enhancement processing, and the like, and the accuracy of detection may be improved through the preprocessing operation.
Specifically, by acquiring the mode information of the current scene, the terminal device may detect only the keywords corresponding to that mode information. It will be appreciated that the more distinguishable the keywords are from one another, the higher the accuracy of keyword detection; distinguishability may be reflected in the length of a keyword, the differences between the syllables of each keyword, and so on. In this way, the number of keywords to be detected in each scene is smaller, so less power is consumed, and with fewer candidate keywords the recognition error rate caused by poorly distinguishable keywords is also reduced to a certain extent.
In some embodiments, each scene may include multiple pieces of mode information, and the piece corresponding to the current situation may be determined from the acquired first voice information. For example, in the smart home control scene, several types of mode information may be set for different interactive objects, each with its own keywords; the current interactive object can be determined by recognizing the voice information, the keywords of the current mode information are then determined from that object, and only those keywords are checked in the first voice information. For instance, the smart home control scene may include mode information for children and for adults, where the children's mode has no control authority over entertainment facilities, i.e. contains no keywords related to them; when the interactive object is determined to be a child from the acquired voice, keywords related to entertainment facilities such as "turn on television" are not detected at all.
The method for detecting whether the first voice information input by the user includes the keyword corresponding to the mode information may be implemented in various manners, for example, using an acoustic model, matching based on a template, using a neural network, and the like. The embodiment of the present application is not limited thereto.
In some embodiments, the first voice information input by the user may be analyzed based on a preset acoustic model, so as to obtain an acoustic feature similarity of the keyword corresponding to the first voice information and the mode information, determine whether the first voice information includes the keyword corresponding to the mode information according to the acoustic feature similarity, and if the acoustic feature similarity is greater than a second preset threshold, determine that the first voice information includes the keyword corresponding to the mode information.
The second preset threshold is a preset value. The higher it is, the higher the accuracy of keyword detection and the lower the response sensitivity of the terminal device; the lower it is, the lower the detection accuracy and, correspondingly, the higher the response sensitivity. For example, when the keyword is "cut song" and the second preset threshold is high, the voice information is judged to contain the keyword only if the user speaks the two words clearly and accurately; in some cases, even if the user does say the keyword, accent or noise may keep the acoustic feature similarity low, and the keyword is judged absent. When the second preset threshold is low, some similar-sounding voice information may be judged to include the keyword: for example, "qigo" may be wrongly judged to include "cut song", while accented pronunciations of "cut song" are correctly judged to include it.
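The accept/reject rule around the second preset threshold can be sketched as follows. `score_similarity` is a hypothetical placeholder for the preset acoustic model (one concrete scorer is sketched after the MFCC discussion below), and the default threshold value is an assumption.

```python
def detect_keyword(first_voice, mode_info, second_threshold: float = 0.8):
    """Step S220 as a sketch: return the best-matching keyword of the
    current mode information when its acoustic feature similarity
    exceeds the second preset threshold, otherwise None."""
    best_kw, best_score = None, 0.0
    for kw in mode_info.keywords:
        score = score_similarity(first_voice, kw, mode_info.acoustic_model)
        if score > best_score:
            best_kw, best_score = kw, score
    return best_kw if best_score > second_threshold else None

def score_similarity(first_voice, keyword, acoustic_model) -> float:
    # Placeholder for the scene's acoustic model; not specified by the patent.
    raise NotImplementedError
```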
It may be appreciated that, since the keywords in the mode information of different scenes may be different, the modes of different scenes may correspond to different acoustic models, where each acoustic model includes an acoustic feature corresponding to each keyword in the mode information of the scene.
As one way, the threshold of the intensity of the voice signal corresponding to the mode information of different scenes may be different, and when the intensity of the acquired voice signal input by the user is greater than the threshold of the intensity corresponding to the scene, the keyword is detected through the acoustic model corresponding to the scene. For example, a higher voice signal strength threshold value can be set in the mode information of the multi-person interaction scene to reduce the probability of the terminal device being awakened by mistake in a noisy environment, and a lower voice signal strength threshold value can be set in the mode information of the single person interaction scene to improve the response sensitivity.
As one way, the signal strength thresholds corresponding to the individual keywords in an acoustic model may all be the same. As another way, each keyword may have its own signal strength threshold: a higher threshold can be set for poorly distinguishable keywords and a lower threshold for well distinguishable ones, balancing a low false-trigger probability against high response sensitivity within each scene.
The acoustic model may be a model stored in the terminal device or a model in the server.
As one way, when the acoustic model is pre-stored in the terminal device, the terminal device can perform voice recognition on the first voice information without communicating with the server, so keyword detection still works when the network signal is poor or absent. Specifically, acoustic features are extracted from the first voice information based on the acoustic model, and the similarity between these features and the acoustic features of each keyword is calculated. For example, Mel-frequency cepstral coefficients extracted from the first voice signal may serve as the acoustic features, and the maximum likelihood ratio between the first voice information and the keywords of the mode information may serve as the acoustic feature similarity. Concretely, each feature point of the acoustic features in the first voice information may be compared for similarity against the corresponding feature points of the keyword, and the similarities of all feature points combined into a maximum likelihood value used as the acoustic feature similarity.
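One concrete, self-contained way to score acoustic feature similarity is sketched below using librosa. Note a substitution: the patent scores keywords with a maximum likelihood ratio, while this sketch uses a length-normalised DTW distance over MFCC frames mapped into (0, 1]; the sample rate and MFCC count are assumptions.

```python
import librosa
import numpy as np

def mfcc_similarity(utterance: np.ndarray, template: np.ndarray,
                    sr: int = 16000) -> float:
    # Extract MFCC features for the utterance and a keyword template.
    m1 = librosa.feature.mfcc(y=utterance, sr=sr, n_mfcc=13)
    m2 = librosa.feature.mfcc(y=template, sr=sr, n_mfcc=13)
    # Dynamic time warping aligns the two variable-length sequences;
    # D[-1, -1] is the accumulated alignment cost.
    D, _ = librosa.sequence.dtw(m1, m2, metric="euclidean")
    cost = D[-1, -1] / (m1.shape[1] + m2.shape[1])  # length-normalised cost
    return 1.0 / (1.0 + cost)                        # higher = more similar
```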
As another way, a first acoustic model may be set in the terminal device and a second acoustic model in the server, where the detection accuracy of the first acoustic model is lower than that of the second. When, based on the first acoustic model, the acoustic feature similarity between the first voice information and a keyword of the mode information is judged to be greater than a specified value, the first voice information is sent to the second acoustic model for further detection. In this way, higher-precision acoustic detection is performed only when the coarse similarity exceeds the specified value, which reduces the power consumption of accurate detection while improving its accuracy.
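The two-model cascade described in this paragraph might look like the following sketch; both scoring callables and both threshold values are assumptions.

```python
def cascade_detect(first_voice, mode_info, coarse_score, fine_score,
                   specified_value: float = 0.5,
                   second_threshold: float = 0.8):
    """Screen with the low-precision on-device first acoustic model and
    forward only promising audio to the high-precision server-side
    second acoustic model."""
    for kw in mode_info.keywords:
        if coarse_score(first_voice, kw) > specified_value:       # cheap, local
            if fine_score(first_voice, kw) > second_threshold:    # precise, server
                return kw
    return None
```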
Step S230: and when the detection is in some cases, determining a corresponding first control instruction according to the keywords.
The first control instruction may be an instruction for controlling the terminal device that acquired the first voice information, or an instruction for controlling another controlled device connected to it. Specifically, the control instruction may include the controlled device and a business skill of that device; depending on the scene, the business skill may be a control skill, a query skill, and so on. The first control instruction is not limited here. The user can input the first control instruction directly by voicing a keyword corresponding to the mode information, without having to first input a wake-up word to wake the terminal device each time a control instruction is needed before the instruction in the voice information is recognized.
In some embodiments, the first control instruction may be determined only by a keyword. Specifically, the corresponding relation between the keywords and the control instructions in the mode information of each scene is stored in the terminal equipment, and the corresponding relation can be preset by a user and then stored in the terminal equipment, or can be stored in the local or server of the terminal equipment after default setting when the terminal equipment leaves a factory. According to the detected keyword corresponding to the mode information and the corresponding relation between the keyword and the control instruction stored in the terminal device, the terminal device can determine the first control instruction corresponding to the keyword in the first voice information input by the user. By the method, the terminal equipment can acquire the control instruction corresponding to the first voice information under the condition that the first voice information is not subjected to semantic analysis, so that the efficiency of determining the first control instruction is improved. Therefore, even if the terminal device is located in a poor network environment or a network disconnection environment, the first control instruction can be determined according to the keyword.
As one way, different keywords may correspond to the same control instruction, i.e. one control instruction may correspond to several keywords with similar semantics, so the user can trigger the same instruction with different words. For example, in the music playing scene, the keywords for the instruction to switch to the next song may be both "cut song" and "next song". It should also be noted that the same keyword may correspond to different control instructions in different scenes. For example, the keyword "turn on" may correspond in the working scene to turning on the smart desk lamp on the desk, and in the entertainment scene to turning on the colored ambient lighting of the room.
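Reusing the hypothetical SCENE_MODES table from the earlier sketch, keyword-to-instruction resolution is then a plain table lookup with no semantic parsing or network round trip, and the same keyword can resolve differently per scene; all names and values below are illustrative.

```python
def first_control_instruction(keyword: str, current_scene: str):
    """Step S230 as a sketch: resolve a detected keyword through the
    stored scene-specific keyword -> instruction correspondence."""
    return SCENE_MODES[current_scene].keywords.get(keyword)

# Same keyword, different instruction depending on the scene (assumed values).
SCENE_MODES["working"] = ModeInfo(
    keywords={"turn on": "DESK_LAMP_ON"}, acoustic_model="am_work_v1",
    signal_threshold=0.4, wait_timeout_s=5.0)
SCENE_MODES["entertainment"] = ModeInfo(
    keywords={"turn on": "AMBIENT_LIGHT_ON"}, acoustic_model="am_fun_v1",
    signal_threshold=0.4, wait_timeout_s=5.0)

assert first_control_instruction("turn on", "working") == "DESK_LAMP_ON"
assert first_control_instruction("turn on", "entertainment") == "AMBIENT_LIGHT_ON"
```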
In other embodiments, after detecting that the first voice information includes a keyword corresponding to the mode information, the terminal device may perform semantic recognition on the first voice information to obtain the first control instruction corresponding to the keyword. Specifically, after the first voice information is converted into text through automatic speech recognition (ASR), natural language understanding (NLU) is performed on the text to parse the first voice information, and the first control instruction is determined from the parsing result.
As one way, the terminal device may provide a multi-round interaction mode; when it is enabled, the first control instruction can be determined by combining the keyword with the interaction content that preceded the first voice information. The multi-round interaction mode is a continuous dialogue mode in which user voice is continuously acquired and responded to. For example, suppose the current scene is determined to be video playing from the acquired voice information "I want to watch variety shows", and the keyword is set to "play". If the voice "play actor A's sketch" is then acquired, the control instruction is recognized as playing actor A's sketch and executed; when the voice "play actor B" is subsequently acquired, the terminal device can combine it with the previous interaction content and determine that the intended instruction is to play actor B's sketch, not actor B's variety show. In this way the user need not repeat earlier voice input, which improves interaction efficiency.
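The multi-round interaction mode can be approximated by carrying slots from the previous turn forward; the slot names and the merge rule here are assumptions for illustration.

```python
def resolve_with_context(action: str, slots: dict, context: dict) -> dict:
    """Fill slots missing from the current utterance with values
    remembered from the previous interaction turn."""
    merged = dict(context)
    merged.update({k: v for k, v in slots.items() if v is not None})
    return {"action": action, **merged}

# Turn 1: "play actor A's sketch" establishes the context.
context = {"content_type": "sketch", "actor": "A"}
# Turn 2: "play actor B" names no content type, so the previous one is reused.
print(resolve_with_context("play", {"actor": "B", "content_type": None}, context))
# -> {'action': 'play', 'content_type': 'sketch', 'actor': 'B'}
```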
Step S240: and executing the first control instruction.
In some embodiments, the first control instruction may be an instruction for controlling the terminal device that acquired the first voice information, and the terminal device may execute it directly. For example, if the terminal device is a smart speaker and the first control instruction is to play music, the terminal device directly executes the instruction and plays music.
In other embodiments, the first control instruction may instead be an instruction for controlling another controlled device connected to the terminal device. The controlled device may be connected locally via Bluetooth, WiFi, ZigBee, or the like, or may be a WiFi device on the same WiFi network as the terminal device. The terminal device may send the first control instruction to the corresponding controlled device and instruct it to execute the instruction.
In some embodiments, the terminal device may further obtain the execution result of the first control instruction and output response information based on it. The response information may be at least one of sound, image, or a sound-and-light combination. For example, in the smart home control scene, the user speaks voice information containing the keyword "turn on the air conditioner" to a control panel; the panel determines the corresponding control instruction and sends it to the air conditioner. After the air conditioner performs the operation, it can feed the execution result back to the panel, which may inform the user by voice prompt that the air conditioner was turned on successfully, or do so through vibration, flashing lights, and the like. As one way, when the first control instruction fails to execute, response information may be output to report the failure to the user; for example, when the air conditioner is not connected to the control panel, the user may be told that the air conditioner is not connected and the command cannot be executed.
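Putting step S240 together, the following sketch dispatches the first control instruction either to the terminal device itself or to a connected controlled device and then surfaces the result to the user; every device object and method here is an assumed stand-in, not an API defined by the patent.

```python
def execute_first_instruction(instruction: str, local_device,
                              controlled_devices: dict):
    # controlled_devices maps instruction ids to connected devices,
    # e.g. {"AC_ON": air_conditioner}; an unmapped instruction is local.
    target = controlled_devices.get(instruction)
    try:
        if target is not None:
            result = target.send(instruction)   # forward to the controlled device
        else:
            result = local_device.run(instruction)
        local_device.notify(f"done: {result}")  # voice / light / vibration feedback
    except ConnectionError:
        local_device.notify("device not connected; the command cannot be executed")
```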
It should be noted that, in this embodiment, the portions not described in detail may refer to the foregoing embodiments, and are not described herein again.
According to the voice-based intelligent control method provided by this embodiment, after the mode information of the current scene is acquired, it can be detected whether the first voice information input by the user includes a keyword corresponding to the mode information; when such a keyword is detected, a corresponding first control instruction is determined from the keyword and executed. The method detects mode-specific keywords in the voice information, determines the control instruction from the keyword, and lets the terminal device execute the first control instruction directly in the wake-up-free state, which improves control efficiency to a certain extent. Since the user can issue the corresponding control instruction through a keyword alone, there is no need to speak a wake-up word before every voice instruction, effectively improving the user experience.
Referring to fig. 3, which shows a flowchart of a voice-based intelligent control method according to an embodiment of the present application, the method includes steps S310 to S370.
Step S310: And acquiring second voice information input by the user.
In the sleep state, the voice acquisition module of the terminal device continuously captures external sound, so when the user utters the second voice information, the terminal device acquires it. The terminal device can switch to the sleep state when no interaction occurs within a preset waiting time after the user last interacted with it; in the sleep state most functional modules stop working, so power consumption is low.
Step S320: if the second voice information comprises a preset wake-up word, executing a wake-up operation.
After the second voice information input by the user is acquired, preset wake-up word detection can be performed on it based on a wake-up word detection model. The preset wake-up word may be set at the factory or by the user, and the preset wake-up word and the keywords are different words. For example, the preset wake-up word may be a word such as "AA" that is unrelated to any control command, while keywords are words such as "turn on the light" or "turn on the air conditioner" that express the user's control intent. When the second voice information includes the preset wake-up word, the terminal device in the sleep state can switch to the wake-up state, in which acquired voice information can be further recognized.
In some embodiments, detection may be performed with a wake-up word detection model local to the terminal device, saving the power that real-time voice detection in the sleep state would otherwise require. Specifically, after the acoustic features of the second voice information are extracted based on the wake-up word detection model, the similarity between those features and the acoustic features of the preset wake-up word is calculated to judge whether the second voice information includes the wake-up word. When the acoustic feature similarity is greater than a preset wake-up threshold, the second voice information is judged to include the preset wake-up word, and the terminal device performs the wake-up operation.
It should be noted that the wake-up word detection model used here works on the same principle as the acoustic model used in step S220 to obtain the acoustic feature similarity between the first voice information and the keywords of the mode information. The difference is that the wake-up word detection model is one and the same model in the sleep state and across all scenes, continuously checking acquired voice for the preset wake-up word, whereas the acoustic model of step S220 differs per scene's mode information and performs keyword detection only when the terminal device is in the corresponding scene.
Step S330: and acquiring a scene corresponding to the second voice information, and taking the scene as a current scene.
In some embodiments, the acquired second voice information input by the user may be subjected to voice recognition and semantic analysis, so as to acquire a scene corresponding to the second voice information, and the scene is used as the current scene.
As one way, the correspondence between rule templates and scenes may be preset. After the terminal device collects the second voice information input by the user, it may send that information to the server, where the server converts the speech into text through ASR and executes NLU on the text to obtain the corresponding scene. Specifically, the scene corresponding to the second voice information can be identified through rule-template matching, text classification, information extraction, and the like.
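A minimal version of rule-template matching for scene recognition might look like the sketch below; the patterns and scene names are assumptions, and a production system would combine this with text classification and information extraction as the paragraph above notes.

```python
import re

# Hypothetical rule templates: the first pattern that matches the ASR text
# of the second voice information decides the current scene.
SCENE_RULES = [
    (re.compile(r"play|song|music"), "music_playing"),
    (re.compile(r"variety|movie|episode"), "video_playing"),
    (re.compile(r"light|curtain|air conditioner"), "smart_home_adult"),
]

def scene_from_text(asr_text: str, default: str = "general") -> str:
    text = asr_text.lower()
    for pattern, scene in SCENE_RULES:
        if pattern.search(text):
            return scene
    return default

print(scene_from_text("AA, play a song"))  # -> music_playing
```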
Step S340: and acquiring the mode information of the current scene.
Step S350: and detecting whether the first voice information input by the user comprises keywords corresponding to the mode information.
Referring to fig. 4, in some embodiments, step S350 may include steps S351 to S353.
Step S351: Carrying out voiceprint recognition on the first voice information to acquire a target voiceprint feature.
Different voice information corresponds to different voiceprint features, and each voiceprint feature corresponds to a user to be responded to. Alternatively, voiceprint feature recognition may be performed on the acquired speech with a trained neural network model; the manner of voiceprint recognition is not limited here.
In one manner, when the first voice information acquired by the terminal device includes only one voiceprint feature, the voiceprint feature is taken as the target voiceprint feature. As another way, when the voice information acquired by the terminal device includes a plurality of different voiceprint features, a plurality of first voice information and a target voiceprint feature corresponding to each first voice information may be acquired.
Step S352: Obtaining the similarity between the target voiceprint feature and the specified voiceprint feature.
The specified voiceprint feature can be the voiceprint feature of the second voice information. The terminal device may match the target voiceprint feature against the specified voiceprint feature to obtain their similarity. As one way, when the voice information acquired by the terminal device contains several different target voiceprint features, the similarity between each of them and the specified voiceprint feature is calculated separately.
In some embodiments, the specified voiceprint feature may be a preset voiceprint feature. By pre-storing the voiceprint features of at least one preset user, only the voice information of those preset users is responded to, so other users cannot interact at will, which also saves recognition power consumption.
Step S353: If the similarity is greater than a first preset threshold, detecting whether the first voice information input by the user includes a keyword corresponding to the mode information.
When the similarity is larger than a first preset threshold value, detecting whether the first voice information input by the user comprises keywords corresponding to the mode information or not; and when the similarity is smaller than a first preset threshold value, keyword detection is not performed on the first voice information.
Specifically, when the voice information acquired by the terminal device contains several different voiceprint features, only the first voice information whose similarity to the specified voiceprint feature exceeds the first preset threshold is checked for keywords. By detecting only the voice of the user matching the specified voiceprint feature, i.e. interacting only with the user who uttered the second voice information, the interaction threshold is raised: other users cannot cut in, the probability of being interrupted is reduced, and the problem that indiscriminate recognition of and response to voice signals makes the interaction easy to hijack is mitigated.
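Steps S351 to S353 amount to a similarity gate on voiceprint embeddings. The sketch below assumes some external extractor has already produced fixed-length embeddings for each segment of first voice information and for the second voice information; the cosine measure and the threshold value are likewise assumptions.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def keyword_detection_candidates(target_embeddings, specified_embedding,
                                 first_threshold: float = 0.75):
    """Return the indices of the utterances whose target voiceprint
    feature is close enough to the specified voiceprint feature (the
    speaker of the second voice information); only these proceed to
    keyword detection."""
    return [i for i, emb in enumerate(target_embeddings)
            if cosine_similarity(emb, specified_embedding) > first_threshold]
```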
In some embodiments, when the similarity is greater than the first preset threshold, it may further be detected whether the first voice information contains a stop-response message, which indicates that the user corresponding to the second voice information wants to stop interacting. Once a stop-response message is detected, the terminal device no longer performs voiceprint recognition or similarity judgment on incoming first voice information, i.e. it responds to all acquired first voice information indiscriminately. In this way, after one user stops interacting, the terminal device can interact with other users, improving the usability of the system.
Step S360: And when the keyword is detected, determining a corresponding first control instruction according to the keyword.
Step S370: and executing the first control instruction.
It should be noted that, in this embodiment, the portions not described in detail may refer to the foregoing embodiments, and are not described herein again.
According to the voice-based intelligent control method provided by this embodiment, after the second voice information input by the user is acquired, a wake-up operation is performed if it includes the preset wake-up word, the scene corresponding to the second voice information is acquired and taken as the current scene, and after the mode information of the current scene is obtained it can be detected whether the first voice information input by the user includes a keyword corresponding to that mode information; when such a keyword is detected, the corresponding first control instruction is determined from it and executed. Because the current scene is obtained by recognizing voice information the user inputs anyway, it is acquired without the user noticing, and wake-word-free voice control commands are answered on the basis of that scene's mode information; this avoids, to a certain extent, keyword cross-triggering when wake-word-free control commands from several scenes coexist.
Referring to fig. 5, which shows a flowchart of a voice-based intelligent control method according to an embodiment of the present application, the method includes steps S410 to S470.
Step S410: and acquiring the mode information of the current scene.
In some embodiments, before the mode information of the current scene is acquired, second voice information input by the user may be acquired, and if the second voice information includes a preset wake-up word, a wake-up operation is performed, where the preset wake-up word is different from the keyword, and a scene corresponding to the second voice information is acquired, and the scene is taken as the current scene.
In some embodiments, each scene includes at least one pattern information, and each pattern information includes at least one keyword and a first control instruction corresponding to the keyword.
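One possible in-memory layout for scenes, mode information, keywords, and their control instructions is sketched below; all scene names, keywords, and instruction identifiers are illustrative assumptions rather than values taken from this disclosure:

```python
from typing import Optional

# scene -> mode information -> keyword -> first control instruction
SCENES = {
    "music": {
        "playback_mode": {
            "last song": "PLAYER_PREV",
            "next song": "PLAYER_NEXT",
            "pause": "PLAYER_PAUSE",
        },
    },
    "video": {
        "playback_mode": {
            "fast forward": "VIDEO_FF",
            "pause": "VIDEO_PAUSE",
        },
    },
}

def lookup_instruction(current_scene: str, utterance: str) -> Optional[str]:
    """Return the first control instruction whose keyword appears in the utterance."""
    for mode_info in SCENES.get(current_scene, {}).values():
        for keyword, instruction in mode_info.items():
            if keyword in utterance:
                return instruction
    return None
```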
Step S420: and detecting whether the first voice information input by the user comprises keywords corresponding to the mode information.
In some embodiments, after the mode information of the current scene is acquired, first voice information input by a user may be acquired, and voiceprint recognition is performed on the first voice information to acquire a target voiceprint feature, so that the similarity between the target voiceprint feature and a designated voiceprint feature can be determined, where the designated voiceprint feature is the voiceprint feature of the second voice information; if the similarity is greater than a first preset threshold, it is detected whether the first voice information input by the user includes a keyword corresponding to the mode information.
When it is detected that the first voice information input by the user includes a keyword corresponding to the mode information, a corresponding first control instruction is determined according to the keyword, that is, the flow enters step S430; when it is detected that the first voice information does not include such a keyword, it is detected whether the first voice information contains a preset wake-up word, that is, the flow enters step S450.
Step S430: and determining a corresponding first control instruction according to the keywords.
Step S440: and executing the first control instruction.
Step S450: and detecting whether the first voice information contains a preset wake-up word.
The preset wake-up word may be a wake-up word set in the terminal device at the factory, or a wake-up word set by the user, and the detection of the preset wake-up word in the first voice information may be performed based on a wake-up word detection model. Optionally, the preset wake-up word may be the same as or different from the wake-up word required to wake the terminal device from the sleep state.
Step S460: if the preset wake-up word is included, performing semantic parsing on the first voice information to acquire a second control instruction.
If it is detected that the first voice information contains the preset wake-up word, the first voice information is converted into text through ASR, and NLU is performed on the text to acquire the second control instruction through semantic parsing. Optionally, when the second control instruction cannot be obtained by semantic parsing of the first voice information, response information may be output to interact with the user so as to further determine the control instruction the user intends. For example, when the user inputs "AA, Female Ghost", the terminal device cannot determine whether the user wants to download the game "Female Ghost" or play the film "Female Ghost"; it may output the voice prompt "Do you want to watch the film Female Ghost?" and determine the intended control instruction according to the user's next input.
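A sketch of this wake-word branch is shown below; `asr_transcribe`, `nlu_parse`, and `tts_say` are illustrative stand-ins for ASR, NLU, and speech-output components, not names used in this disclosure:

```python
from typing import Callable, Optional

def handle_wake_word_utterance(audio: bytes,
                               asr_transcribe: Callable[[bytes], str],
                               nlu_parse: Callable[[str], Optional[str]],
                               tts_say: Callable[[str], None]) -> Optional[str]:
    text = asr_transcribe(audio)      # speech -> text (ASR)
    instruction = nlu_parse(text)     # text -> second control instruction (NLU)
    if instruction is None:
        # Ambiguous intent: ask a follow-up question instead of guessing,
        # then wait for the user's next input.
        tts_say("Do you want to watch the film or download the game?")
        return None
    return instruction
```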
In some embodiments, after the semantic parsing of the first voice information to acquire the second control instruction, a scene category corresponding to the second control instruction may also be acquired; if the scene category differs from the current scene, the current scene is switched to the scene corresponding to that category. Because the scene category is acquired by semantic parsing of the voice information, the scene can be switched without the user noticing.
As one approach, the terminal device may also store preset wake-up words and control instructions, together with a one-to-one correspondence between control instructions and scenes; that is, each preset wake-up word corresponds to one control instruction, and each control instruction corresponds to one scene. Specifically, when it is detected that the first voice information includes a preset wake-up word, the control instruction corresponding to that wake-up word is taken as the second control instruction, and the category of the scene corresponding to the second control instruction is acquired.
It should be noted that preset wake-up words are valid, i.e. detectable, in all scenes, whereas the keywords of each scene cannot be recognized in other scenes. For example, suppose the preset wake-up word "play music" corresponds to the music scene, whose keywords may include "last song", "next song", "pause", and so on. If the voice information input by the user in a video scene is detected to include "play music", the music scene is taken as the current scene category and the current scene is switched to the music scene, after which keywords of the music scene's mode information, such as "last song", can be used for control. If, instead, the terminal device acquires voice information containing "last song" while still in the video scene, it does not respond at all.
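This scoping rule might be sketched as follows; the word lists and return values are illustrative assumptions:

```python
WAKE_WORDS = {"play music": "music"}  # preset wake-up word -> scene it opens
SCENE_KEYWORDS = {
    "music": {"last song", "next song", "pause"},
    "video": {"fast forward", "pause"},
}

def match(utterance: str, current_scene: str):
    # Preset wake-up words are valid in every scene.
    for wake_word, scene in WAKE_WORDS.items():
        if wake_word in utterance:
            return ("switch_scene", scene)
    # Keywords are valid only in their own scene.
    if any(kw in utterance for kw in SCENE_KEYWORDS.get(current_scene, ())):
        return ("keyword", utterance)
    # e.g. "last song" heard while a video scene is current: no response.
    return ("no_response", None)
```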
As one approach, each scene category may carry an identifier marking its interaction as persistent (continuous) or non-persistent, and this identifier can be obtained along with the scene category. When the scene category corresponding to the second control instruction differs from the current scene, the identifier of the current scene is persistent, and the scene category of the second control instruction is non-persistent, the second control instruction is executed but the current scene is not switched. Only when the scene category differs from the current scene and both are persistent scenes is the current scene switched to the scene corresponding to that category. For example, if the current scene is a music-playing scene with continuous interaction and the second control instruction "turn on" corresponds to a smart-device control scene with non-continuous interaction, the terminal device can remain in the music-playing scene after the device is turned on; that is, the user can keep interacting through that scene's keywords without having to re-enter the music-playing scene by voice. In this way, control instructions belonging to non-persistent interaction scenes are prevented from interrupting the user's continuous interaction with the terminal device and forcing frequent scene switches, which further reduces the interaction cost.
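A minimal sketch of this persistence rule, with assumed field names:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Scene:
    name: str
    persistent: bool  # True for continuous-interaction scenes

def apply_instruction(current: Scene, target: Scene,
                      execute: Callable[[], None]) -> Scene:
    execute()  # the second control instruction is executed in either case
    if target.name != current.name and current.persistent and target.persistent:
        return target   # both persistent: switch the current scene
    return current      # non-persistent target: stay in the current scene
```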
Step S470: and executing the second control instruction.
The manner of executing the second control instruction may refer to step S240, in which the first control instruction is executed.
In some embodiments, after the mode information of the current scene is acquired, it may instead first be detected whether the first voice information includes the preset wake-up word; if it does, semantic parsing is performed on the first voice information to acquire a second control instruction, and the second control instruction is executed; if it does not, it is detected whether the first voice information input by the user includes a keyword corresponding to the mode information, and when such a keyword is detected, a corresponding first control instruction is determined according to the keyword and executed.
In other embodiments, after the first voice information is acquired, whether it includes a keyword corresponding to the mode information and whether it includes the preset wake-up word may be detected at the same time. Since the preset wake-up word and the keywords are different words, keyword detection and wake-up-word detection cannot both succeed on the same input; if a keyword is detected, the flow enters step S430, and if the preset wake-up word is detected, the flow enters step S460. Detecting the preset wake-up word and the keywords simultaneously improves the detection efficiency of the voice input by the user and allows a faster response.
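A sketch of this parallel dispatch, with the two detector callables assumed:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

def dispatch(utterance: str,
             detect_keyword: Callable[[str], bool],
             detect_wake_word: Callable[[str], bool]) -> str:
    # Run both detectors on the same utterance concurrently; the disjoint
    # vocabularies mean at most one of them fires.
    with ThreadPoolExecutor(max_workers=2) as pool:
        kw = pool.submit(detect_keyword, utterance)
        ww = pool.submit(detect_wake_word, utterance)
        if kw.result():
            return "step_S430"  # determine the first control instruction
        if ww.result():
            return "step_S460"  # semantic parsing for the second control instruction
    return "ignore"
```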
It should be noted that, in this embodiment, the portions not described in detail may refer to the foregoing embodiments, and are not described herein again.
According to the voice-based intelligent control method provided by this embodiment, after the mode information of the current scene is acquired, it can be detected whether the first voice information input by a user includes a keyword corresponding to the mode information; when the keyword is detected, a corresponding first control instruction is determined according to the keyword and executed. When no keyword is detected, it is detected whether the first voice information contains a preset wake-up word; if so, semantic parsing is performed on the first voice information to acquire a second control instruction, and the second control instruction is executed. With this method, a user can issue a control command directly through a keyword corresponding to the mode information of the scene, can issue more diverse control commands through voice information containing the wake-up word, and can also switch scenes according to voice information containing the wake-up word.
Referring to fig. 6, a block diagram of a voice-based intelligent control device according to an embodiment of the present application is provided, where the device 600 includes: the acquisition module 610, the detection module 620, the instruction determination module 630, and the execution module 640.
An obtaining module 610, configured to obtain mode information of a current scene.
Further, before acquiring the mode information of the current scene, the apparatus 600 further includes: the system comprises a second voice acquisition module, a wake-up execution module and a scene acquisition module.
And the second voice acquisition module is used for acquiring second voice information input by the user.
And the wake-up execution module is used for executing wake-up operation if the second voice information comprises a preset wake-up word, wherein the preset wake-up word is different from the keyword.
The scene acquisition module is used for acquiring a scene corresponding to the second voice information, and taking the scene as a current scene.
Further, each scene includes at least one mode information, and each mode information includes at least one keyword and a first control instruction corresponding to the keyword.
The detection module 620 is configured to detect whether a keyword corresponding to the mode information is included in the first voice information input by the user.
Further, the detection module 620 may include: a voiceprint recognition sub-module, a similarity acquisition sub-module, and a first judgment sub-module.
And the voiceprint recognition sub-module is used for carrying out voiceprint recognition on the first voice information to acquire target voiceprint characteristics.
The similarity acquisition sub-module is used for acquiring the similarity between the target voiceprint feature and a designated voiceprint feature, where the designated voiceprint feature is the voiceprint feature of the second voice information.
And the first judging sub-module is used for detecting whether the first voice information input by the user comprises keywords corresponding to the mode information or not if the similarity is larger than a first preset threshold value.
Further, the detection module 620 further includes: a voice analysis sub-module and a second judgment sub-module.
The voice analysis sub-module is used for analyzing the first voice information input by the user based on a preset acoustic model, so as to acquire the acoustic feature similarity between the first voice information and the keywords corresponding to the mode information;
and the second judgment sub-module is used for judging that the first voice information includes a keyword corresponding to the mode information if the acoustic feature similarity is greater than a second preset threshold.
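A sketch of this acoustic-model matching path follows; the feature representation and the similarity score are illustrative assumptions, not the acoustic model of this disclosure:

```python
from typing import Dict, Optional
import numpy as np

SECOND_PRESET_THRESHOLD = 0.8  # assumed value for illustration only

def acoustic_keyword_match(features: np.ndarray,
                           keyword_templates: Dict[str, np.ndarray]) -> Optional[str]:
    """Return the keyword whose acoustic template best matches the input features."""
    best_kw, best_score = None, 0.0
    for keyword, template in keyword_templates.items():
        # Illustrative score: cosine similarity of time-averaged feature vectors.
        a, b = features.mean(axis=0), template.mean(axis=0)
        score = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))
        if score > best_score:
            best_kw, best_score = keyword, score
    return best_kw if best_score > SECOND_PRESET_THRESHOLD else None
```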
The instruction determining module 630 is configured to determine, when the keyword is detected, a corresponding first control instruction according to the keyword.
The execution module 640 is configured to execute the first control instruction.
Further, the apparatus 600 further comprises: a wake-up word detection module, a semantic analysis module, and a second instruction execution module.
The wake-up word detection module is used for detecting whether the first voice information contains a preset wake-up word when it is detected that the first voice information does not include a keyword corresponding to the mode information.
And the semantic analysis module is used for performing semantic parsing on the first voice information to acquire a second control instruction if the preset wake-up word is contained.
And the second instruction execution module is used for executing the second control instruction.
Further, after performing semantic parsing on the first voice information to obtain the second control instruction, the apparatus 600 further includes: the scene acquisition module and the scene switching module.
The scene acquisition module is used for acquiring the scene category corresponding to the second control instruction.
And the scene switching module is used for switching the current scene into a scene corresponding to the scene category if the scene category is different from the current scene.
It should be noted that, for convenience and brevity, specific working procedures of the apparatus and modules described above may refer to corresponding procedures in the foregoing method embodiments, and are not described herein again.
In several embodiments provided by the present application, the coupling of the modules to each other may be electrical, mechanical, or other. In addition, each functional module in each embodiment of the present application may be integrated into one processing module, or each module may exist alone physically, or two or more modules may be integrated into one module. The integrated modules may be implemented in hardware or in software functional modules.
In summary, the embodiments of the application disclose a voice-based intelligent control method and apparatus, an electronic device and a storage medium, relating to the field of voice recognition. By acquiring the mode information of the current scene, the first control instruction can be determined according to the keyword when the first voice information includes a keyword corresponding to the mode information; the terminal device can execute the first control instruction directly in a wake-free state, which improves the control efficiency of the terminal device to a certain extent. Since the user can issue the corresponding control instruction through the keyword alone, the user no longer needs to speak the wake-up word before every voice instruction, which effectively improves the user experience.
An electronic device according to the present application will be described with reference to fig. 7.
Referring to fig. 7, on the basis of the foregoing voice-based intelligent control method, apparatus, and storage medium, an embodiment of the present application further provides an electronic device 700 capable of executing the foregoing voice-based intelligent control method. The electronic device 700 includes one or more processors 710 (only one is shown) and a memory 720 coupled to each other. The memory 720 stores a program capable of executing the contents of the foregoing embodiments, the processor 710 can execute the program stored in the memory 720, and the memory 720 includes the apparatus described in the foregoing embodiments.
The processor 710 may include one or more processing cores. The processor 710 uses various interfaces and lines to connect the parts of the electronic device 700, and performs the various functions of the electronic device 700 and processes data by running or executing instructions, programs, code sets, or instruction sets stored in the memory 720 and invoking data stored in the memory 720. Optionally, the processor 710 may be implemented in hardware as at least one of a digital signal processor (Digital Signal Processing, DSP), a field-programmable gate array (Field-Programmable Gate Array, FPGA), and a programmable logic array (Programmable Logic Array, PLA). The processor 710 may integrate one or a combination of a central processing unit (Central Processing Unit, CPU), a graphics processor (Graphics Processing Unit, GPU), a modem, and the like. The CPU mainly handles the operating system, the user interface, application programs, and so on; the GPU is responsible for rendering and drawing display content; and the modem handles wireless communication. It will be appreciated that the modem may also not be integrated into the processor 710 and may instead be implemented by a separate communication chip.
The Memory 720 may include a random access Memory (Random Access Memory, RAM) or a Read-Only Memory (Read-Only Memory). Memory 720 may be used to store instructions, programs, code, sets of codes, or sets of instructions. The memory 720 may include a stored program area and a stored data area, wherein the stored program area may store instructions for implementing an operating system, instructions for implementing at least one function (e.g., a touch function, a sound playing function, a video image playing function, etc.), instructions for implementing the various method embodiments described above, and the like. The storage data area may also store data created by the electronic device 700 in use (e.g., phonebook, audiovisual data, chat log data), and the like.
It will be appreciated by those of ordinary skill in the art that the configuration shown in fig. 7 is merely illustrative and is not intended to limit the configuration of the electronic device described above. For example, electronic device 700 may also include more or fewer components than shown in FIG. 7, or have a different configuration than shown in FIG. 7.
Referring to fig. 8, a block diagram of a computer readable storage medium according to an embodiment of the present application is shown. The computer readable medium 800 has stored therein program code which can be invoked by a processor to perform the methods described in the method embodiments described above.
The computer readable storage medium 800 may be an electronic memory such as a flash memory, an EEPROM (electrically erasable programmable read only memory), an EPROM, a hard disk, or a ROM. Optionally, the computer readable storage medium 800 comprises a non-volatile computer readable medium (non-transitory computer-readable storage medium). The computer readable storage medium 800 has storage space for program code 810 that performs any of the method steps described above. The program code can be read from or written to one or more computer program products. Program code 810 may be compressed, for example, in a suitable form.
It should be noted that, in this document, the terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such a process, method, article, or apparatus. Without further limitation, an element preceded by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises that element.
From the above description of the embodiments, it will be clear to those skilled in the art that the above-described embodiment method may be implemented by means of software plus a necessary general hardware platform, but of course may also be implemented by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) comprising instructions for causing a terminal (which may be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) to perform the method according to the embodiments of the present application.
The embodiments of the present application have been described above with reference to the accompanying drawings, but the present application is not limited to the above-described embodiments, which are merely illustrative and not restrictive, and many forms may be made by those having ordinary skill in the art without departing from the spirit of the present application and the scope of the claims, which are to be protected by the present application.

Claims (6)

1. An intelligent control method based on voice, which is characterized by comprising the following steps:
acquiring mode information of a current scene;
detecting whether the first voice information input by the user comprises a keyword corresponding to the mode information;
when the keyword is detected, determining a corresponding first control instruction according to the keyword;
executing the first control instruction;
when it is detected that the first voice information does not include the keyword corresponding to the mode information, detecting whether the first voice information contains a preset wake-up word;
if so, carrying out semantic analysis on the first voice information to acquire a second control instruction;
executing the second control instruction;
before the mode information of the current scene is acquired, the method further comprises:
acquiring second voice information input by a user;
if the second voice information comprises a preset wake-up word, performing a wake-up operation, wherein the preset wake-up word is different from the keyword;
acquiring a scene corresponding to the second voice information, and taking the scene as the current scene;
each scene comprises at least one mode information, and each mode information comprises at least one keyword and a first control instruction corresponding to the keyword;
After the semantic parsing of the first voice information to obtain a second control instruction, the method further includes:
acquiring a scene category corresponding to the second control instruction;
and if the scene category is different from the current scene, switching the current scene into a scene corresponding to the scene category.
2. The method of claim 1, wherein detecting whether the first voice information input by the user includes a keyword corresponding to the mode information comprises:
voiceprint recognition is carried out on the first voice information so as to obtain target voiceprint characteristics;
obtaining the similarity between the target voiceprint feature and a designated voiceprint feature, wherein the designated voiceprint feature is the voiceprint feature of the second voice information;
if the similarity is larger than a first preset threshold, detecting whether the first voice information input by the user comprises keywords corresponding to the mode information.
3. The method according to any one of claims 1 to 2, wherein the detecting whether the first voice information input by the user includes a keyword corresponding to the mode information includes:
Analyzing the first voice information input by a user based on a preset acoustic model to obtain the acoustic feature similarity of the keyword corresponding to the first voice information and the mode information;
and if the acoustic feature similarity is larger than a second preset threshold, judging that the first voice information comprises keywords corresponding to the mode information.
4. An intelligent voice-based control device, the device comprising:
the acquisition module is used for acquiring the mode information of the current scene;
the detection module is used for detecting whether the first voice information input by the user comprises keywords corresponding to the mode information;
the instruction determining module is used for determining a corresponding first control instruction according to the keyword when the keyword is detected;
the execution module is used for executing the first control instruction;
the wake-up word detection module is used for detecting whether the first voice information contains a preset wake-up word or not when detecting that the first voice information does not contain a keyword corresponding to the mode information;
the semantic analysis module is used for performing semantic parsing on the first voice information to acquire a second control instruction if the preset wake-up word is contained;
The second instruction execution module is used for executing the second control instruction;
before the obtaining the mode information of the current scene, the device further comprises:
the second voice acquisition module is used for acquiring second voice information input by a user;
the wake-up execution module is used for executing wake-up operation if the second voice information comprises a preset wake-up word, wherein the preset wake-up word is different from the keyword;
the scene acquisition module is used for acquiring a scene corresponding to the second voice information, and taking the scene as the current scene;
each scene comprises at least one mode information, and each mode information comprises at least one keyword and a first control instruction corresponding to the keyword;
after the semantic parsing of the first voice information to obtain a second control instruction, the apparatus further includes:
the scene acquisition module is used for acquiring the scene category corresponding to the second control instruction;
and the scene switching module is used for switching the current scene into a scene corresponding to the scene category if the scene category is different from the current scene.
5. An electronic device, comprising:
One or more processors;
a memory;
one or more applications, wherein the one or more applications are stored in the memory and configured to be executed by the one or more processors, the one or more applications configured to perform the method of any of claims 1-3.
6. A computer readable storage medium, characterized in that the computer readable storage medium has stored therein a program code, which is callable by a processor for executing the method according to any one of claims 1 to 3.
CN202011308093.6A 2020-11-19 2020-11-19 Intelligent control method and device based on voice, electronic equipment and storage medium Active CN112201246B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011308093.6A CN112201246B (en) 2020-11-19 2020-11-19 Intelligent control method and device based on voice, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011308093.6A CN112201246B (en) 2020-11-19 2020-11-19 Intelligent control method and device based on voice, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112201246A CN112201246A (en) 2021-01-08
CN112201246B (en) 2023-11-28

Family

ID=74034327

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011308093.6A Active CN112201246B (en) 2020-11-19 2020-11-19 Intelligent control method and device based on voice, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112201246B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN218071873U (en) * 2021-01-19 2022-12-16 深圳市品茂电子科技有限公司 Environmental characteristic active sensing control module
CN112765335B (en) * 2021-01-27 2024-03-08 上海三菱电梯有限公司 Voice call system
CN113160821A (en) * 2021-04-30 2021-07-23 中天智领(北京)科技有限公司 Control method and device based on voice recognition
CN113381864B (en) * 2021-05-25 2023-09-19 福建星网视易信息系统有限公司 Digital audiovisual site control method, system and storage medium
CN113488042B (en) * 2021-06-29 2022-12-13 荣耀终端有限公司 Voice control method and electronic equipment
CN113643711B (en) * 2021-08-03 2024-04-19 常州匠心独具智能家居股份有限公司 Voice system based on offline mode and online mode for intelligent furniture
CN113778226A (en) * 2021-08-26 2021-12-10 江西恒必达实业有限公司 Infrared AI intelligent glasses based on speech recognition technology control intelligence house
CN116564316B (en) * 2023-07-11 2023-11-03 北京边锋信息技术有限公司 Voice man-machine interaction method and device

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105872688A (en) * 2016-03-31 2016-08-17 乐视控股(北京)有限公司 Voice control method and device of smart television
CN108520743A (en) * 2018-02-02 2018-09-11 百度在线网络技术(北京)有限公司 Sound control method, smart machine and the computer-readable medium of smart machine
CN109545206A (en) * 2018-10-29 2019-03-29 百度在线网络技术(北京)有限公司 Voice interaction processing method, device and the smart machine of smart machine
CN110047481A (en) * 2019-04-23 2019-07-23 百度在线网络技术(北京)有限公司 Method for voice recognition and device
CN110232916A (en) * 2019-05-10 2019-09-13 平安科技(深圳)有限公司 Method of speech processing, device, computer equipment and storage medium
CN110579977A (en) * 2019-09-17 2019-12-17 珠海格力电器股份有限公司 control method and device of electrical equipment and computer readable storage medium
CN111223490A (en) * 2020-03-12 2020-06-02 Oppo广东移动通信有限公司 Voiceprint awakening method and device, equipment and storage medium
CN111816192A (en) * 2020-07-07 2020-10-23 云知声智能科技股份有限公司 Voice equipment and control method, device and equipment thereof
CN111819626A (en) * 2018-03-07 2020-10-23 华为技术有限公司 Voice interaction method and device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170116994A1 (en) * 2015-10-26 2017-04-27 Le Holdings(Beijing)Co., Ltd. Voice-awaking method, electronic device and storage medium


Also Published As

Publication number Publication date
CN112201246A (en) 2021-01-08

Similar Documents

Publication Publication Date Title
CN112201246B (en) Intelligent control method and device based on voice, electronic equipment and storage medium
US11488591B1 (en) Altering audio to improve automatic speech recognition
US10540970B2 (en) Architectures and topologies for vehicle-based, voice-controlled devices
US9098467B1 (en) Accepting voice commands based on user identity
KR102309540B1 (en) Server for seleting a target device according to a voice input, and controlling the selected target device, and method for operating the same
US20190066677A1 (en) Voice data processing method and electronic device supporting the same
CN112513833A (en) Electronic device and method for providing artificial intelligence service based on presynthesized dialog
US9466286B1 (en) Transitioning an electronic device between device states
CN111045639B (en) Voice input method, device, electronic equipment and storage medium
CN105793923A (en) Local and remote speech processing
CN111768783B (en) Voice interaction control method, device, electronic equipment, storage medium and system
KR102628211B1 (en) Electronic apparatus and thereof control method
US11862153B1 (en) System for recognizing and responding to environmental noises
CN109955270B (en) Voice option selection system and method and intelligent robot using same
CN111223490A (en) Voiceprint awakening method and device, equipment and storage medium
CN112767916A (en) Voice interaction method, device, equipment, medium and product of intelligent voice equipment
CN112863508A (en) Wake-up-free interaction method and device
US10629199B1 (en) Architectures and topologies for vehicle-based, voice-controlled devices
CN112420044A (en) Voice recognition method, voice recognition device and electronic equipment
CN112562670A (en) Intelligent voice recognition method, intelligent voice recognition device and intelligent equipment
CN110767240B (en) Equipment control method, equipment, storage medium and device for identifying child accent
CN111862943B (en) Speech recognition method and device, electronic equipment and storage medium
CN111210824B (en) Voice information processing method and device, electronic equipment and storage medium
KR20210001082A (en) Electornic device for processing user utterance and method for operating thereof
CN113611316A (en) Man-machine interaction method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant