CN114121005A - Voice control method and device, electronic equipment and storage medium


Info

Publication number
CN114121005A
Authority
CN
China
Prior art keywords: instruction, target, voice, information, trigger condition
Legal status: Pending
Application number
CN202111433111.8A
Other languages
Chinese (zh)
Inventor
陈科鑫
冉茂松
张晓帆
Current Assignee
Guangdong Oppo Mobile Telecommunications Corp Ltd
Original Assignee
Guangdong Oppo Mobile Telecommunications Corp Ltd
Application filed by Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority to CN202111433111.8A
Publication of CN114121005A
Priority to PCT/CN2022/121695 (published as WO2023093280A1)

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/16 - Sound input; Sound output
    • G06F3/167 - Audio in a user interface, e.g. using voice commands for navigating, audio feedback
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04M - TELEPHONIC COMMUNICATION
    • H04M1/00 - Substation equipment, e.g. for use by subscribers
    • H04M1/72 - Mobile telephones; Cordless telephones, i.e. devices for establishing wireless links to base stations without route selection
    • H04M1/724 - User interfaces specially adapted for cordless or mobile telephones
    • H04M1/72403 - User interfaces specially adapted for cordless or mobile telephones with means for local support of applications that increase the functionality
    • H04M1/7243 - User interfaces specially adapted for cordless or mobile telephones with means for local support of applications that increase the functionality with interactive means for internal management of messages
    • H04M1/72433 - User interfaces specially adapted for cordless or mobile telephones with means for local support of applications that increase the functionality with interactive means for internal management of messages for voice messaging, e.g. dictaphones
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04M - TELEPHONIC COMMUNICATION
    • H04M1/00 - Substation equipment, e.g. for use by subscribers
    • H04M1/72 - Mobile telephones; Cordless telephones, i.e. devices for establishing wireless links to base stations without route selection
    • H04M1/724 - User interfaces specially adapted for cordless or mobile telephones
    • H04M1/72448 - User interfaces specially adapted for cordless or mobile telephones with means for adapting the functionality of the device according to specific conditions
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04M - TELEPHONIC COMMUNICATION
    • H04M1/00 - Substation equipment, e.g. for use by subscribers
    • H04M1/72 - Mobile telephones; Cordless telephones, i.e. devices for establishing wireless links to base stations without route selection
    • H04M1/724 - User interfaces specially adapted for cordless or mobile telephones
    • H04M1/72484 - User interfaces specially adapted for cordless or mobile telephones wherein functions are triggered by incoming communication events
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223 - Execution procedure of a spoken command
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04M - TELEPHONIC COMMUNICATION
    • H04M2201/00 - Electronic components, circuits, software, systems or apparatus used in telephone systems
    • H04M2201/42 - Graphical user interfaces

Abstract

The application discloses a voice control method, a voice control device, an electronic device and a storage medium, wherein the voice control method comprises the following steps: acquiring a voice instruction; identifying an instruction type of the voice instruction; when the instruction type of the voice instruction is a non-instant type, acquiring trigger condition information and target execution information corresponding to the voice instruction; and when the trigger condition corresponding to the trigger condition information is met, executing a target operation on a target operation interface corresponding to the target execution information. The method enables non-instant instructions to be carried out when a graphical interface is controlled by voice, and improves the user experience.

Description

Voice control method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of electronic device technologies, and in particular, to a voice control method and apparatus, an electronic device, and a storage medium.
Background
With the rapid progress of technology, electronic devices can receive voice commands uttered by users through the auditory modality and, by combining speech recognition and natural language processing technologies, complete the corresponding interaction tasks. Users can therefore complete interface interaction operations through voice input. However, in some cases a user may need a corresponding interface operation to be executed only when a certain condition is met, and in the related art this type of non-instantly triggered instruction cannot be completed well, which affects the user experience.
Disclosure of Invention
In view of the foregoing, the present application provides a voice control method, apparatus, electronic device, and storage medium.
In a first aspect, an embodiment of the present application provides a voice control method, where the method includes: acquiring a voice instruction; identifying an instruction type of the voice instruction; when the instruction type of the voice instruction is a non-instant type, acquiring trigger condition information and target execution information corresponding to the voice instruction; and when the trigger condition corresponding to the trigger condition information is met, executing target operation on a target operation interface corresponding to the target execution information.
In a second aspect, an embodiment of the present application provides a voice control apparatus, where the apparatus includes: the system comprises an instruction acquisition module, an information acquisition module and an operation execution module, wherein the instruction acquisition module is used for acquiring a voice instruction; the information acquisition module is used for acquiring trigger condition information and target execution information corresponding to the voice instruction when the instruction type of the voice instruction is a non-instant type; and the operation execution module is used for executing target operation on a target operation interface corresponding to the target execution information when the trigger condition corresponding to the trigger condition information is met.
In a third aspect, an embodiment of the present application provides an electronic device, including: one or more processors; a memory; one or more application programs, wherein the one or more application programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs configured to perform the voice control method provided by the first aspect above.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium, where a program code is stored in the computer-readable storage medium, and the program code may be called by a processor to execute the voice control method provided in the first aspect.
In a fifth aspect, the present application provides a computer program product, which includes computer programs/instructions, and is characterized in that the computer programs/instructions, when executed by a processor, implement the voice control method provided in the first aspect.
According to the scheme provided by the application, a voice command is acquired; when the command type of the voice command is a non-instant type, the trigger condition information and the target execution information corresponding to the voice command are acquired, and when the trigger condition corresponding to the trigger condition information is met, the target operation is executed on the target operation interface corresponding to the target execution information. In this way, after the trigger condition information of a non-instant voice command input by the user is identified, the corresponding interface operation is executed according to the trigger condition information, so that the non-instant voice command can be completed better and the user experience is improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. The drawings in the following description are only some embodiments of the present application, and other drawings can be obtained from them by those skilled in the art without creative effort.
Fig. 1 shows a scene schematic diagram provided in an embodiment of the present application.
Fig. 2 shows another schematic view of a scenario provided in an embodiment of the present application.
Fig. 3 is a schematic diagram illustrating an application environment provided by an embodiment of the present application.
Fig. 4 shows another schematic diagram of an application environment provided by the embodiment of the present application.
FIG. 5 shows a flow diagram of a voice control method according to one embodiment of the present application.
Fig. 6 shows a schematic view of another scenario provided in the embodiment of the present application.
FIG. 7 shows a flow diagram of a voice control method according to another embodiment of the present application.
FIG. 8 shows a flow diagram of a voice control method according to yet another embodiment of the present application.
FIG. 9 shows a flow diagram of a voice control method according to yet another embodiment of the present application.
Fig. 10 is a schematic diagram illustrating an instruction recognition model provided by an embodiment of the present application.
FIG. 11 shows a flow diagram of a voice control method according to yet another embodiment of the present application.
Fig. 12 shows a schematic structural diagram of an instruction type identification model provided in an embodiment of the present application.
FIG. 13 is a flow chart illustrating a voice control method according to yet another embodiment of the present application.
FIG. 14 shows a block diagram of a voice control device according to one embodiment of the present application.
Fig. 15 is a block diagram of an electronic device for executing a voice control method according to an embodiment of the present application.
Fig. 16 shows a storage unit according to an embodiment of the present application, configured to store or carry program code for implementing the voice control method according to the embodiments of the present application.
Detailed Description
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application.
The popularization of intelligent terminal devices has brought various conveniences to daily life. Since the birth of intelligent terminal devices, the GUI (Graphical User Interface, also referred to simply as UI) has been an important carrier for interaction between a user and a smartphone. Today, with the continuous development of voice interaction, intelligent voice interaction with terminal devices has become an important and more convenient means of human-computer interaction, and a VGUI (Voice Graphical User Interface) can provide a more convenient and direct means of service interaction, or provide a seamless interaction solution when it is inconvenient for the user to touch the GUI.
However, through long-term research the inventors have found that, in the related art, VGUI solutions focus on user instructions that are executed immediately and cannot well support non-immediate instructions. For example, in the voice control scenario shown in fig. 1, the user inputs "send a short message to mom" by voice; after analyzing the voice information, the intelligent terminal device finds the sending application on the GUI interface and then executes the operation of sending the corresponding short message, so the user's instruction is executed immediately. In the voice control scenario shown in fig. 2, the user inputs "send a short message to mom when I arrive home" by voice; here the user adds the trigger condition "when I arrive home", turning the instruction into a non-immediate voice control instruction for the graphical interface that should be triggered only when the condition is met. However, the intelligent terminal may not execute the instruction when the trigger condition is met but instead executes it immediately, so that the non-immediate voice instruction cannot be realized well, which brings inconvenience to the user.
In view of the above problems, the inventors propose the voice control method and apparatus, electronic device, and storage medium provided by the embodiments of the present application, which can handle a non-immediate voice instruction input by a user: the trigger condition information of the voice instruction is recognized, and the corresponding interface operation is then executed according to the trigger condition information, so that the non-immediate voice instruction can be completed better and the user experience can be improved. The specific voice control method is described in detail in the following embodiments.
The following first introduces an application scenario related to the embodiment of the present application.
In the embodiment of the present application, the voice control method provided in the embodiment of the present application may be executed by an electronic device. In this manner performed by the electronic device, all steps in the voice control method provided by the embodiment of the present application may be performed by the electronic device. For example, as shown in fig. 3, a voice command may be collected by a voice collecting device of the electronic device 100, and then both the collected voice command and the current user interface are transmitted to the processor, so that after the processor identifies the command type of the voice command, the steps related to the voice control method provided by the present application are executed according to the identified command type.
Moreover, the voice control method provided by the embodiment of the application can also be executed by a server (cloud). Correspondingly, in the mode executed by the server, the electronic device may collect the voice command, synchronously send the collected voice command and the current user interface to the server, and then after the server recognizes the voice command, the server triggers the electronic device to execute the target operation.
In addition, the method can be executed by the electronic device and the server in cooperation. In this manner, some steps of the voice control method provided by the embodiment of the present application are performed by the electronic device, and the other steps are performed by the server.
For example, as shown in fig. 4, the electronic device 100 may obtain a voice instruction and then deliver the voice instruction to the server 200, which identifies the instruction type of the voice instruction and, when the instruction type is a non-instant type, identifies the target execution information corresponding to the voice instruction and the trigger condition information corresponding to the target execution information, and then returns the recognition result to the electronic device 100; the electronic device 100 then executes the target operation on the target interface corresponding to the target execution information according to the trigger condition information.
It should be noted that, in this manner executed by the electronic device and the server cooperatively, the steps executed by the electronic device and the server respectively are not limited to the manner described in the above example, and in practical applications, the steps executed by the electronic device and the server respectively may be dynamically adjusted according to actual situations.
It should be noted that, in addition to the smartphone shown in fig. 1 and fig. 2, the electronic device 100 may be a vehicle-mounted device, a wearable device, a tablet computer, a notebook computer, a smart speaker, or the like. The server 200 may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, a cloud server, or the like, which is not limited herein.
The following describes a speech control method provided in an embodiment of the present application in detail with reference to the accompanying drawings.
Referring to fig. 5, fig. 5 is a flowchart illustrating a voice control method according to an embodiment of the present application. In a specific embodiment, the voice control method is applied to the voice control apparatus 400 shown in fig. 14 and the electronic device 100 (fig. 15) equipped with the voice control apparatus 400. The following will describe a specific process of this embodiment by taking an electronic device as an example, and it is understood that the electronic device applied in this embodiment may be a smart phone, a tablet computer, a smart watch, smart glasses, a notebook computer, and the like, which is not limited herein. As will be described in detail with respect to the flow shown in fig. 5, the voice control method may specifically include the following steps:
step S110: and acquiring a voice instruction.
In the embodiment of the application, the user can express the control intention of the user by inputting voice to the electronic equipment. Correspondingly, the electronic device can take the voice uttered by the user as the voice instruction. The electronic equipment can acquire voice input by a user through the audio acquisition device, so that a voice instruction is obtained. The audio acquisition device is used for acquiring audio signals. Optionally, the audio acquisition means may comprise one or more audio acquisition devices, which may be microphones.
In some embodiments, the electronic device may detect a voice instruction input by the user when the function of controlling the graphical interface by voice is turned on, and execute the steps of the voice control method provided by the present application according to the detected voice instruction. Alternatively, the electronic device may collect the voice instruction input by the user through the voice collecting device while a voice assistant is turned on. For example, when the voice assistant of the electronic device is turned on and the user inputs the voice instruction "Xiao Ou, help me open the NFC door access when I get home", the electronic device may collect the voice instruction.
In other embodiments, the electronic device may collect voice input by the user when the triggering operation of the voice control is detected, so as to obtain the voice instruction input by the user. Optionally, a control for controlling the graphical interface by voice may be displayed in a screen of the electronic device, and when a corresponding operation on the control is detected, voice collection may be started, so as to obtain a voice instruction input by the user. The operation may be a click operation, a press operation, a slide operation, and the like, which is not limited herein. Optionally, the electronic device may also collect voice input by the user when detecting the operation of the designated entity key, so as to obtain the voice instruction input by the user. Of course, the specific manner in which the electronic device obtains the voice command may not be limited.
Step S120: and when the instruction type of the voice instruction is a non-instant type, acquiring trigger condition information and target execution information corresponding to the voice instruction.
In the embodiment of the application, after the electronic device acquires the voice instruction, the instruction type of the voice instruction can be identified, so that the user's requirement for non-instant interface control can be recognized and the control required by the user can be accurately realized. The instruction type of the voice instruction can include an instant type and a non-instant type, where the instant type refers to the type of instruction that needs to be executed immediately after the electronic device obtains the voice instruction, and the non-instant type refers to the type of instruction that is not executed immediately after the electronic device acquires the voice instruction but only when the corresponding condition is met.
In some embodiments, the electronic device may recognize the acquired voice instruction to obtain the text content corresponding to the voice instruction. After the text content corresponding to the voice instruction is obtained, semantic recognition can be performed on the text content in a pre-configured manner, so as to recognize whether the instruction type of the voice instruction is an instant type or a non-instant type. The electronic device may convert the voice instruction into the corresponding text content based on a preconfigured Automatic Speech Recognition (ASR) method.
As a possible implementation manner, the electronic device may recognize text content corresponding to the voice instruction based on a pre-trained instruction type recognition model, so as to obtain an instruction type corresponding to the voice instruction. The instruction type recognition model can be obtained by training based on the text sample data marked with the instruction type.
In other embodiments, after obtaining the text content corresponding to the voice instruction, the electronic device may also perform word segmentation on the text content and then obtain the keywords in the text content according to the word segmentation result; the electronic device matches the recognized keywords with preset keywords, where the preset keywords are keywords corresponding to preset non-instant voice instructions. If any recognized keyword matches a preset keyword, the instruction type of the voice instruction is determined to be the non-instant type; if none of the recognized keywords matches a preset keyword, the instruction type of the voice instruction is determined to be the instant type. Optionally, the preset keywords include keywords related to trigger conditions, such as "at", "when", "if", and "while". Illustratively, if the text content corresponding to the voice instruction input by the user is "when the battery power is 10%, close the mobile data network", the keyword "when" in the text content matches a preset keyword, and the instruction type of the voice instruction is therefore determined to be the non-instant type.
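As an illustration of the keyword-matching approach above, a minimal sketch follows; it is not part of the original disclosure, and the preset keyword list and the simple whitespace-based "word segmentation" are assumptions for illustration only.

```python
# A minimal sketch of the keyword-matching approach described above; the preset keyword
# list and the whitespace-based "word segmentation" are illustrative assumptions.
NON_INSTANT_KEYWORDS = {"at", "when", "if", "while", "once"}  # hypothetical preset keywords

def classify_instruction_type(text: str) -> str:
    """Return 'non-instant' if any segmented word matches a preset keyword, else 'instant'."""
    words = text.lower().replace(",", " ").split()  # stand-in for a real word-segmentation step
    return "non-instant" if any(w in NON_INSTANT_KEYWORDS for w in words) else "instant"

print(classify_instruction_type("when the battery power is 10%, close the mobile data network"))
# -> non-instant
print(classify_instruction_type("send a short message to mom"))
# -> instant
```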
Of course, the specific manner in which the electronic device recognizes the type of instruction of the voice instruction may not be limited.
In the embodiment of the application, after the electronic device identifies the instruction type of the voice instruction, the electronic device can execute the subsequent voice control step according to the instruction type of the voice instruction. The electronic device may determine whether an instruction type of the voice instruction is a non-immediate type; if the instruction type of the voice instruction is a non-instant type, the triggering condition information and the target execution information corresponding to the voice instruction can be identified, so that the voice control required by the user is completed according to the triggering condition information and the target execution information. The target execution information can be understood as control information which is acquired after the electronic equipment converts the voice command and is used for representing the control intention of the user on the interface; the trigger condition information may be understood as an execution condition corresponding to the target execution information, that is, when what condition is satisfied, an operation corresponding to the control information is executed, thereby completing the control required by the user.
In some embodiments, the electronic device may acquire an instruction text corresponding to the voice instruction, and then acquire trigger condition information and target execution information included in the instruction text.
As a possible implementation manner, the electronic device may perform semantic recognition on the text content obtained by converting the voice instruction by using a preconfigured method based on the text content, and then determine the trigger condition information and the target execution information according to a semantic recognition result.
Optionally, the control intention, the control object, the object auxiliary information, and the trigger condition in the text content may be extracted in a Natural Language Understanding (NLU) manner and integrated into a quadruple in the style {action, object, information, condition}; in this manner, the result of semantic recognition is the quadruple. Here, action represents the control intention (which can also be understood as the control purpose), object represents the control object, information represents the object auxiliary information, and condition represents the trigger condition. The control intention, the control object and the object auxiliary information may form the target execution information, and the trigger condition is the trigger condition information.
Illustratively, the text content obtained by converting the voice instruction is "when mom calls, reply to mom with a short message saying I am temporarily unavailable". Based on the natural language understanding manner, the control intention can be understood as "send a short message", the control object is "mom", the object auxiliary information is "temporarily unavailable", and the trigger condition is "when mom calls"; the quadruple is recorded as: {(send short message), (mom), (temporarily unavailable), (mom incoming call)}.
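A possible container for such a quadruple is sketched below; the class and field names are illustrative assumptions rather than part of the original disclosure. The first three fields together form the target execution information, and the last one is the trigger condition information.

```python
# A sketch of the {action, object, information, condition} quadruple described above;
# class and field names are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class InstructionQuad:
    action: str       # control intention, e.g. "send short message"
    obj: str          # control object, e.g. "mom"
    information: str  # object auxiliary information, e.g. "temporarily unavailable"
    condition: str    # trigger condition, e.g. "mom incoming call"

quad = InstructionQuad("send short message", "mom", "temporarily unavailable", "mom incoming call")
target_execution_info = (quad.action, quad.obj, quad.information)  # target execution information
trigger_condition_info = quad.condition                            # trigger condition information
```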
As another possible implementation manner, the electronic device may also recognize text content corresponding to the voice instruction through a pre-trained instruction recognition model, so as to obtain target execution information corresponding to the voice instruction and trigger condition information corresponding to the target execution information. The instruction recognition model can be obtained by training according to a text sample which is labeled with triggering condition information and target execution information in advance.
Of course, the specific manner of specifically recognizing the target execution information and the trigger condition information corresponding to the voice command by the electronic device may not be limited.
Step S130: and when the trigger condition corresponding to the trigger condition information is met, executing target operation on a target operation interface corresponding to the target execution information.
In this embodiment of the application, when the voice instruction is a non-instant type, and the electronic device recognizes and obtains the trigger condition information and the target execution information corresponding to the voice instruction, the electronic device may execute the target execution information according to the trigger condition information. The electronic device can execute the target operation on the target operation interface corresponding to the target execution information under the condition that the trigger condition corresponding to the trigger condition information is met.
For example, referring to fig. 6, if the user inputs "set the flight mode when I get home", the trigger condition information is "arriving home" and the target execution information is "set the flight mode"; when the electronic device determines from the positioning information that the device position is the home position, it may set the switch control corresponding to the flight mode to the on state in the setting interface of the device mode.
For another example, following the quadruple recognized from the voice instruction above: {(send short message), (mom), (temporarily unavailable), (mom incoming call)}, the electronic device may, upon receiving mom's incoming call, switch to the sending interface of the short message application, write "mom" in the recipient edit box, write "temporarily unavailable" in the text edit box, and then send the short message.
In some embodiments, when the electronic device executes the target operation on the target operation interface corresponding to the target execution information, the control instruction for the target interface corresponding to the target execution information may be generated by system injection (an operation method supported by Android) or by simulating screen clicks. For example, when the trigger condition information is satisfied, a user click operation may be simulated to switch to the target interface corresponding to the target execution information, and a further user click operation may be simulated to trigger the target control corresponding to the target execution information. In this way, the corresponding target operation is executed on the target interface corresponding to the target execution information.
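The dispatch described above can be pictured roughly as follows; switch_to_interface, fill_edit_box and click_control are hypothetical stand-ins for system injection or simulated screen clicks, not real Android APIs.

```python
# A hedged sketch of the flow above: once the trigger condition is met, switch to the
# target interface and simulate the inputs/clicks of the target operation. The helper
# functions are hypothetical stand-ins, not real Android APIs.
def switch_to_interface(name: str) -> None:
    print(f"[simulated] switch to interface: {name}")

def fill_edit_box(box: str, value: str) -> None:
    print(f"[simulated] write '{value}' into {box}")

def click_control(control: str) -> None:
    print(f"[simulated] click control: {control}")

def on_trigger_met() -> None:
    """Executed only once the trigger condition (e.g. 'mom incoming call') is satisfied."""
    switch_to_interface("short message sending interface")
    fill_edit_box("recipient edit box", "mom")
    fill_edit_box("text edit box", "temporarily unavailable")
    click_control("send control")

on_trigger_met()
```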
In the embodiment of the application, if the instruction type of the voice instruction is identified to be the instant type, the target operation information corresponding to the voice instruction can be identified so as to complete the real-time voice control required by the user. The target operation information corresponding to the voice instruction may be target execution information corresponding to the voice instruction, and the manner of identifying the target execution information corresponding to the voice instruction may refer to the manner of identifying the interface target execution information in the foregoing embodiment, which is not described herein again. It should be noted that, when the instruction type of the voice instruction is an instant type, the current user interface may be subjected to real-time voice control, or other interfaces of the electronic device may be subjected to real-time voice control.
In some embodiments, after the electronic device identifies and obtains the target operation information, the electronic device may match the target operation information with the interface operable element to obtain an interface operable element in the matched interface, and execute an operation corresponding to the interface operable element in the interface. The description of the embodiments may refer to the foregoing embodiments, and will not be repeated herein.
In a possible embodiment, in a case where the instruction type is the instant type, real-time voice control may be performed on the current user interface. In the process of speaking, the voice uttered by a user may be rather casual due to speaking habits, and the voice instruction corresponding to such casual speech may not allow the electronic device to accurately determine the user's control intention. For example, if the content of the voice instruction is simply "next", its meaning may be "the next one" or it may be "download one". In an audio playback scenario, "next" is likely to mean the next one, for example playing the next song; in a software downloading scenario, "next" may mean downloading one, for example downloading an application. Therefore, in order to determine the user's real intention more accurately, the target operation information corresponding to the voice instruction can be updated according to the task scene corresponding to the current user interface to obtain a scene control instruction, and the scene control instruction is matched with the interface operable elements of the current user interface so as to determine the target operable element from them. In the above example, if the current application scene is a music playing scene, the target operation information "next" may be updated to "play the next song"; if the current application scene is an application downloading scene, the target operation information "next" may be updated to "download the application". In this way, more accurate voice control can be realized.
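A minimal sketch of this scene-based refinement follows; the scene labels and rewrite rules are illustrative assumptions, not part of the original disclosure.

```python
# A minimal sketch of the scene-based refinement described above; the scene labels and
# rewrite rules are illustrative assumptions.
SCENE_REWRITES = {
    ("next", "music playing"): "play the next song",
    ("next", "application downloading"): "download the application",
}

def refine_target_operation(operation: str, scene: str) -> str:
    """Update ambiguous target operation information according to the current task scene."""
    return SCENE_REWRITES.get((operation, scene), operation)

print(refine_target_operation("next", "music playing"))            # -> play the next song
print(refine_target_operation("next", "application downloading"))  # -> download the application
```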
The voice control method provided by the embodiment of the application identifies the instruction type of the voice instruction input by the user; after the trigger condition information of a non-instant voice instruction input by the user is identified, the corresponding interface operation is executed according to the trigger condition information, so that the non-instant voice instruction can be completed better and the user experience is improved.
Referring to fig. 7, fig. 7 is a flowchart illustrating a voice control method according to another embodiment of the present application. The voice control method is applied to the electronic device, and will be described in detail with respect to the flow shown in fig. 7, and the voice control method may specifically include the following steps:
step S210: and acquiring a voice instruction.
Step S220: and when the instruction type of the voice instruction is a non-instant type, acquiring trigger condition information and target execution information corresponding to the voice instruction.
In the embodiment of the present application, reference may be made to the contents of the other embodiments in steps S210 to S220, which are not described herein again.
Step S230: and when the trigger condition corresponding to the trigger condition information is met, matching the target execution information with an interface operable element to obtain the target operation interface, and executing the target operation on the target operation interface.
In the embodiment of the application, when the electronic device executes the voice control according to the triggering condition information and the target execution information, the target execution information and the interface operable element may be matched to obtain the target operation interface under the condition that the triggering condition corresponding to the triggering condition information is satisfied, and the target operation is executed on the target operation interface.
In some implementations, the interface operable elements of various interfaces in the electronic device can be identified in advance. An interface here refers to an interface that the electronic device can run and display, and may include a system interface, interfaces corresponding to installed application programs, and the like, which is not limited herein. The electronic device may match the identified target execution information with the interface operable elements of the pre-identified interfaces, so as to obtain the matched interface operable elements in the target operation interface as the target operable elements, and then execute the operations corresponding to these target operable elements on the target interface. For example, the short message sending interface of a short message application includes a recipient edit box and a text edit box. If the identified target execution information is "send a short message to mom saying I am temporarily unavailable", the matched interface operable elements are the recipient edit box, the text edit box and the sending control. When the electronic device executes the operations corresponding to the target operable elements on the target interface, it can write "mom" into the recipient edit box and "temporarily unavailable" into the text edit box of the short message sending interface, and then trigger the sending control to send the short message.
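One way to picture the matching step is sketched below; the pre-identified element descriptions and the simple word-overlap scoring are illustrative assumptions, not the claimed matching procedure.

```python
# A sketch of matching target execution information against pre-identified interface
# operable elements; element descriptions and the word-overlap scoring are assumptions.
from typing import Dict, List

INTERFACE_ELEMENTS: Dict[str, List[str]] = {
    # interface name -> descriptions of its pre-identified operable elements
    "short message sending interface": ["recipient edit box", "text edit box", "send control"],
    "device mode setting interface": ["flight mode switch", "mobile data switch"],
}

def match_target_interface(execution_info: str) -> str:
    """Pick the interface whose element descriptions share the most words with the execution info."""
    words = set(execution_info.lower().split())
    def score(elements: List[str]) -> int:
        return sum(len(words & set(e.split())) for e in elements)
    return max(INTERFACE_ELEMENTS, key=lambda name: score(INTERFACE_ELEMENTS[name]))

print(match_target_interface("send a short message to mom saying temporarily unavailable"))
# -> short message sending interface
```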
It will be appreciated that for an interface, a variety of user-operable interface operable elements may be included. The interface operable element can comprise a control in the interface or can be specific to the whole interface. For example, if the user intends to perform a page sliding (e.g., slide up, slide down, slide left, and slide right), or intends to perform an interface switching, or to exit an interface, then the interface operable element is the entire interface. As another example, if the user's intent is to click on a location in the interface, then the interface actionable element can be for a control in the interface. When there are a plurality of operations to be performed on the interface, there may be a plurality of interface operable elements corresponding to the interface.
In some embodiments, the electronic device identifies the interface operable element of the interface, and may include at least one of the following identification modes: identifying the interface based on a code analysis mode; identifying the interface based on the image-text identification mode; and identifying the interface based on the control classification model.
As a possible implementation manner, identifying the interface based on a code parsing manner to obtain the interface operable elements corresponding to the interface can be understood as identifying the components or controls contained in the interface by parsing its code, and the obtained interface operable elements may include the identifiers and description information of the components that can be identified. Correspondingly, identifying the interface based on the code parsing manner can be understood as obtaining the components included in the interface and the description information corresponding to the components by code parsing. The description information may include information such as name, function, and trigger operation. Optionally, the code-parsing identification of the interface can be implemented based on the Google accessibility service (AccessibilityService).
As a possible implementation manner, the interface is recognized based on an image-text recognition manner, which may include OCR (Optical Character Recognition), to recognize the components, controls, icons and the like in the interface and obtain the description information of the recognized components, controls and icons. Specifically, the positions of the components, controls and icons in the user interface can be identified in an OCR manner and then traversed to obtain all the components, controls, icons and the like in the user interface, and their description information is then determined by analyzing the image content.
As a possible implementation, the training process of the control classification model includes: acquiring a user interface; acquiring controls classified from a user interface; and training the neural network to be trained through the classified control to obtain a control classification model.
The electronic device may store interface operable elements of multiple interfaces obtained through pre-recognition, so that when voice control is performed, target execution information corresponding to the voice instruction is matched, and thus, interface operable elements in the corresponding target operation interfaces are obtained and serve as the target operable elements.
In other embodiments, the electronic device may further identify interface operable elements of multiple interfaces in the electronic device under the condition that the trigger condition corresponding to the trigger condition information is satisfied, match the target execution information with the interface operable elements to obtain a target operation interface, and execute the target operation on the target operation interface. The manner in which the electronic device identifies the interface operable element may refer to the foregoing embodiments, and is not described herein again.
In some embodiments, the trigger condition information may include a condition field and a condition parameter of the trigger condition, and the target execution information may include an execution field and an execution parameter. The condition field refers to the service type to which the trigger condition of the non-instant voice instruction belongs, for example, the battery level of the electronic device, the location of the electronic device, the time or date of the electronic device, a message notification received by the electronic device, a received incoming call, and the like. The condition parameter refers to the specific trigger state or parameter value of that service, for example, a specific battery level, a specific time, a specific date, a specific position, which message is received, which incoming call is received, and the like. The execution field refers to the service domain to which the operation actually executed by the non-instant voice instruction belongs, for example, controlling home appliances, controlling parameters of the electronic device, operating application software, and the like. The execution parameter refers to the specific operation parameter corresponding to the operation in the execution field, for example, the specific temperature to which a smart air conditioner is set, or the specific value of the device volume of the electronic device.
In this embodiment, when the electronic device matches the target execution information with the interface operable element, the execution domain and the execution parameter may be matched with the interface operable element, so as to obtain the target operable element in the target operation interface matched with the target execution information. The electronic device can determine a corresponding interface according to the execution field, and then match the interface operable element of the interface according to the execution parameter, so as to obtain the target operable element matched with the target execution information. For example, if the execution field is to control the smart air conditioner, the interface is a control interface of the smart air conditioner corresponding to the smart home application program, and then according to the execution parameter, if the execution parameter is to reduce the temperature of the smart air conditioner, it may be determined that the matched interface operable element is: and the control is used for reducing the temperature of the intelligent air conditioner.
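The four-slot decomposition described above can be sketched as a simple record; the example values below are illustrative assumptions only.

```python
# A sketch of the four-slot decomposition described above (condition field, condition
# parameter, execution field, execution parameter); the example values are illustrative.
from dataclasses import dataclass

@dataclass
class NonInstantInstruction:
    condition_field: str      # service type of the trigger, e.g. "battery level"
    condition_parameter: str  # concrete trigger value/state, e.g. "10%"
    execution_field: str      # service domain of the operation, e.g. "device parameter control"
    execution_parameter: str  # concrete operation, e.g. "turn off the mobile data network"

example = NonInstantInstruction(
    condition_field="battery level",
    condition_parameter="10%",
    execution_field="device parameter control",
    execution_parameter="turn off the mobile data network",
)
```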
In this embodiment, the target execution information is divided into the execution field and the execution parameter, so that when the interface operable elements of an interface are matched, the electronic device can match more accurate interface operable elements and the matching efficiency can be improved. Similarly, the trigger condition information is divided into the condition field and the condition parameter, so that when the electronic device executes the matched interface operable elements according to the trigger condition information, the target operation can be executed on the target operation interface more accurately when the trigger condition is met, improving the accuracy of voice control.
According to the voice control method provided by the embodiment of the application, after the corresponding trigger condition information and the target execution information are identified for the non-instant voice instruction input by the user, when the trigger condition corresponding to the trigger condition information is met, the target execution information is matched with the interface operable element, so that the interface operation required by the user can be quickly and accurately determined, and then the corresponding interface operation is executed, so that the non-instant voice instruction can be better completed, and the user experience is further improved.
Referring to fig. 8, fig. 8 is a flow chart illustrating a voice control method according to another embodiment of the present application. The voice control method is applied to the electronic device, and will be described in detail with respect to the flow shown in fig. 8, where the voice control method may specifically include the following steps:
step S310: and acquiring a voice instruction.
Step S320: and when the instruction type of the voice instruction is a non-instant type, acquiring trigger condition information and target execution information corresponding to the voice instruction.
In the embodiment of the present application, step S310 and step S320 may refer to the contents of other embodiments, and are not described herein again.
Step S330: and matching the target execution information with interface operable elements to obtain the target operation interface.
Different from the previous embodiment, in the embodiment of the present application, after acquiring the trigger condition information and the target execution information corresponding to the voice instruction, the electronic device may match the target execution information with the interface operable element to obtain the target operation interface, so that when the trigger condition corresponding to the trigger condition information is met, the target operation is executed on the target operation interface corresponding to the target execution information. The manner in which the electronic device matches the target execution information with the interface operable element may refer to the content of the previous embodiment, which is not described herein again.
Step S340: and generating a corresponding control instruction according to the trigger condition information and the target operation interface.
In the embodiment of the application, after the electronic device identifies the trigger condition information and the target operation interface, it may synthesize a control instruction from the trigger condition information and the target operation interface and deliver the synthesized control instruction to the graphical interface for execution. The electronic device can generate the corresponding control instruction according to the trigger condition information and the matched interface operable elements in the target operation interface. Optionally, an IFTTT-style ("if this then that") instruction generation manner may be adopted to generate the corresponding control instruction from the trigger condition information and the target operation interface.
Step S350: and executing the control instruction, wherein the control instruction is used for executing target operation on a target operation interface corresponding to the target execution information when the trigger condition corresponding to the trigger condition information is met.
In the embodiment of the application, because the control instruction is a non-instant instruction, it resides in the application background together with the specific trigger condition information and the interface operable elements of the target operation interface. In addition, the state of the trigger condition information is monitored in real time, for example the actual states of the condition field and the condition parameter described in the foregoing embodiment, and the corresponding execution module (for executing the target execution information) stays in a standby state in memory. When information satisfying the trigger condition is detected, the execution module immediately executes the operations corresponding to the interface operable elements in the target operation interface, thereby completing the non-instant voice control process.
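The resident monitoring behaviour can be pictured roughly as below; get_condition_state and execute_on_target_interface are hypothetical stand-ins for a real state query and for the interface execution module, and polling is only one possible way to monitor the trigger state.

```python
# A hedged sketch of the background behaviour above: the non-instant control instruction
# stays resident, the trigger state is polled, and the execution module runs the target
# operation once the condition is met. The two helpers are hypothetical stand-ins.
import time

def get_condition_state(condition_field: str) -> str:
    return "at home"  # placeholder for a real state query (position, battery, incoming call, ...)

def execute_on_target_interface(target_operation: str) -> None:
    print(f"[simulated] executing on target operation interface: {target_operation}")

def monitor_and_execute(condition_field: str, condition_parameter: str,
                        target_operation: str, poll_seconds: float = 1.0) -> None:
    while get_condition_state(condition_field) != condition_parameter:
        time.sleep(poll_seconds)                    # standby: keep monitoring the trigger state
    execute_on_target_interface(target_operation)   # trigger met: execute immediately

monitor_and_execute("device position", "at home", "set the flight-mode switch to on")
```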
According to the voice control method provided by the embodiment of the application, after the corresponding trigger condition information and the target execution information are identified for the non-instant voice instruction input by the user, the target execution information is matched with the interface operable element, so that the interface operation required by the user can be quickly and accurately determined, and then the corresponding interface operation is executed according to the trigger condition information, so that the non-instant voice instruction can be better completed, and the user experience is further improved.
Referring to fig. 9, fig. 9 is a flowchart illustrating a voice control method according to still another embodiment of the present application. The voice control method is applied to the electronic device, and will be described in detail with respect to the flow shown in fig. 9, and the voice control method may specifically include the following steps:
step S410: and acquiring a voice instruction.
Step S420: and when the instruction type of the voice instruction is a non-instant type, acquiring an instruction text corresponding to the voice instruction.
Step S430: and inputting the instruction text corresponding to the voice instruction into a pre-trained instruction recognition model to obtain the trigger condition information and the target execution information contained in the instruction text, wherein the instruction recognition model is obtained by training based on a hierarchical reinforcement learning manner.
In the embodiment of the application, when the electronic device determines that the instruction type of the voice instruction is the non-immediate type and identifies the trigger condition information and the target execution information corresponding to the voice instruction, the electronic device may input the instruction text corresponding to the voice instruction into an instruction recognition model trained in advance in a hierarchical reinforcement learning manner, so as to obtain the trigger condition information and the target execution information contained in the instruction text. Because several recognition tasks, for the trigger condition information and for the target execution information, need to be handled, training the instruction recognition model in a hierarchical reinforcement learning manner can improve the recognition accuracy.
In some embodiments, the instruction recognition model includes a first sub-module corresponding to the trigger condition information, a second sub-module corresponding to the target execution information, and a cooperative control module. The first sub-module is used to perform the recognition task for the trigger condition information, and the second sub-module is used to perform the recognition task for the target execution information; the cooperative control module is used to decide the actions of the corresponding recognition tasks, and specifically can decide the execution order of the sub-modules' work tasks and assign the work tasks to the sub-modules. In this manner, the trigger condition information and the target execution information can be identified more accurately by the hierarchical reinforcement learning model.
In this way, the training process of the instruction recognition model includes: creating a first sub-module corresponding to an identification task for identifying trigger condition information, a second sub-module corresponding to the identification task for identifying target execution information and a cooperative control module for coordinating the identification task, wherein the first sub-module and the second sub-module are used for deciding the action of the identification task corresponding to the first sub-module and the second sub-module, and the decision priority of the cooperative control module is higher than the decision priority of the first sub-module and the second sub-module; and performing deep reinforcement learning training on the cooperative control module, the first sub-module and the second sub-module based on the text sample labeled with the triggering condition information, the target execution information and the recognition sequence of the recognition task to obtain a trained instruction recognition model.
As a possible implementation manner, performing deep reinforcement learning training on the cooperative control module, the first sub-module, and the second sub-module based on a text sample labeled with trigger condition information, target execution information, and a recognition order of the recognition task to obtain the trained instruction recognition model may include: inputting the text sample to the cooperative control module, the first submodule and the second submodule to obtain output results of the first submodule and the second submodule and a coordinated execution sequence of the cooperative control module; determining a first reward corresponding to the first sub-module based on the output result of the first sub-module and the marked triggering condition information of the text sample; determining a second reward corresponding to the second sub-module based on the output result of the second sub-module and the labeled target execution information of the text sample; determining a third reward corresponding to the cooperative control module based on the execution sequence coordinated by the cooperative control module and the identification sequence marked by the text sample; and carrying out deep reinforcement learning training on the first submodule based on the first reward, carrying out deep reinforcement learning training on the second submodule based on the second reward, and carrying out deep reinforcement learning training on the cooperative control module based on the third reward until a preset termination condition is met, so as to obtain the trained instruction recognition model.
In this embodiment, performing deep reinforcement learning training on the first sub-module based on the first reward, on the second sub-module based on the second reward, and on the cooperative control module based on the third reward constitutes the model training, and the algorithm used for the reinforcement learning training is not limited; for example, an Advantage Actor-Critic (A2C) algorithm, an Asynchronous Advantage Actor-Critic (A3C) algorithm, or a Deep Q-Network (DQN) algorithm may be used.
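A compact sketch of this hierarchical training loop is given below; the recognisers, the ordering policy and the update steps are placeholders rather than any particular deep reinforcement learning algorithm, and the empty-result penalty p = 0.5 is an assumed value.

```python
# A compact sketch of the hierarchical training loop described above: the cooperative
# control module chooses which sub-module acts next, and each module is updated from its
# own reward. Recognisers, ordering policy and update steps are placeholders only.
import random
from typing import Dict, List

class SubModule:
    def __init__(self, slot: str):
        self.slot = slot
    def act(self, text: str) -> str:
        return ""          # placeholder recognition of this module's slot from the text
    def update(self, reward: float) -> None:
        pass               # placeholder for a deep reinforcement learning update

class Coordinator:
    def choose(self, remaining: List[str]) -> str:
        return random.choice(remaining)  # placeholder for the learned ordering policy
    def update(self, reward: float) -> None:
        pass

def train_step(text: str, labels: Dict[str, str], gold_order: List[str],
               subs: Dict[str, SubModule], coord: Coordinator, p: float = 0.5) -> None:
    remaining, chosen_order = list(subs), []
    while remaining:
        slot = coord.choose(remaining)     # cooperative control module assigns the next task
        remaining.remove(slot)
        chosen_order.append(slot)
        out = subs[slot].act(text)
        # per-sub-module reward: 1 if correct, -p if empty, -1 otherwise
        reward = 1.0 if out == labels[slot] else (-p if out == "" else -1.0)
        subs[slot].update(reward)
    coord.update(1.0 if chosen_order == gold_order else -1.0)  # reward for the coordinated order

subs = {s: SubModule(s) for s in ["trigger condition", "target execution"]}
train_step("when mom calls, reply that I am temporarily unavailable",
           {"trigger condition": "mom incoming call", "target execution": "send short message to mom"},
           ["trigger condition", "target execution"], subs, Coordinator())
```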
The following describes an instruction recognition model in the embodiment of the present application, taking as an example that the trigger condition information includes a condition field and a condition parameter, and the target execution information includes an execution field and an execution parameter. The condition field, the condition parameter, the execution field and the execution parameter are defined as four slots. The definitions of the condition field, the condition parameter, the execution field, and the execution parameter can refer to the contents of the foregoing embodiments, and are not described herein again.
The instruction recognition model is shown in fig. 10, where Agent refers to an agent in the reinforcement learning sense, which needs to be trained with training data. The instruction recognition model includes a top-level Agent and four bottom-level Agents, and each bottom-level Agent is responsible for recognizing the information of one of the four slots to be recognized; the top-level Agent decides, according to its own state, which bottom-level Agent should be selected for recognition, that is, it controls the working order of the bottom-level Agents. The top-level Agent can be understood as the cooperative control module, and the bottom-level Agents can be understood as the first sub-module and the second sub-module, with the first sub-module and the second sub-module each corresponding to two of the bottom-level Agents.
The state of the top-level Agent describes its current work completion status, and the top-level Agent's job is to assign work to the bottom-level Agents; therefore the state of the top-level Agent can be expressed by the following formula, where SH represents the state of the top-level Agent, and st1, st2, st3 and st4 represent the states of the four bottom-level Agents. The state of a bottom-level Agent stores, among other things, the recognition confidence of the slot currently to be recognized, the recognized content, and the Action (the execution step of the reinforcement learning Agent). Namely:
SH={st1,st2,st3,st4}
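Purely as an illustration, the composition of this state could be represented as below; the field names (confidence, content, action) follow the description of the bottom-level Agent state above and are assumptions rather than a prescribed data layout.

```python
# Illustrative composition of the top-level Agent state SH from the states of
# the four bottom-level Agents; the field names are assumptions.
def bottom_state(confidence=0.0, content=None, action=None):
    return {"confidence": confidence, "content": content, "action": action}

def top_state(st1, st2, st3, st4):
    return (st1, st2, st3, st4)   # SH = {st1, st2, st3, st4}

sh = top_state(bottom_state(), bottom_state(), bottom_state(), bottom_state())
```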
The training goal of a bottom-level Agent is to recognize the information it is responsible for (condition field, condition parameter, execution field or execution parameter) as accurately as possible, so during training the Reward of a bottom-level Agent (Reward refers to the feedback on the Agent's execution result in reinforcement learning) is defined as follows:
Reward(st, gt) = 1, if st = gt
Reward(st, gt) = -p, if st is null
Reward(st, gt) = -1, otherwise
wherein st represents the execution state of the bottom-level Agent and gt is the information labeled in the sample text; when the execution state is completely consistent with the labeled state, the training Reward is 1; when the execution state is null, the training Reward is -p, where p is a configured null penalty; in all other cases the training Reward is -1. A bottom-level Agent computes a training Reward each time it takes an action during training.
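Read literally, this reward can be sketched in a few lines of Python; the null penalty p is a tunable hyperparameter, and the default value 0.5 used below is an assumption.

```python
# Hedged sketch of the bottom-level Agent reward defined above.
def bottom_reward(st, gt, null_penalty=0.5):
    if st == gt:            # execution state matches the labeled state
        return 1.0
    if st is None:          # empty (null) recognition result
        return -null_penalty
    return -1.0             # any other, i.e. wrong, result

assert bottom_reward("air conditioner switch", "air conditioner switch") == 1.0
assert bottom_reward(None, "air conditioner switch", null_penalty=0.3) == -0.3
```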
Furthermore, because the information in the bottom-level slots is mutually constrained and mutually dependent, correctly recognizing the information of one slot can narrow the search range for another slot still to be recognized. For example, if the execution parameter is recognized as adjusting the on-off state of an air conditioner, the execution field can only be the smart home control application in the electronic device; and if the execution parameter is recognized as adjusting the volume, the execution field can only be the volume adjustment button on the graphical interface of the electronic device. Therefore, the training goal of the top-level Agent is to schedule the execution order of the bottom-level Agents as reasonably and efficiently as possible and to recognize the information-rich slots first. During training, the Reward of the top-level Agent is defined as:
Reward_top(i) = Σ_{k=i-N..i} r_k(st, gt)

wherein r_k(st, gt) represents the accumulated training Reward when the execution state of the top-level Agent is st and the true label is gt; that is, the Reward of the top-level Agent is the sum of the effective accumulated Rewards of the bottom-level Agents over the N steps traced back up to the i-th step. When the bottom-level Agent selected by the top-level Agent at a given step is the same as the bottom-level Agent in the label (that is, the execution order is the same as the labeled execution order), that step counts as an effective accumulation; otherwise the Reward of that step is recorded as 0.
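As an illustrative reading of this definition (not a normative formula from the application), the accumulation could be computed as follows, counting a step only when the scheduled bottom-level Agent matches the labeled one.

```python
# Hedged sketch of the top-level Agent reward: sum the bottom-level Rewards of
# the last N steps up to step i, counting a step only when the Agent chosen by
# the top-level Agent matches the labeled one; other steps contribute 0.
def top_reward(step_records, i, n):
    total = 0.0
    for k in range(max(0, i - n), i + 1):
        chosen_agent, labeled_agent, reward = step_records[k]
        total += reward if chosen_agent == labeled_agent else 0.0
    return total

records = [("condition field", "condition field", 1.0),
           ("execution field", "condition parameter", 1.0)]
print(top_reward(records, i=1, n=1))   # 1.0: only the first step is effective
```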
In addition, the instruction recognition model may further include an information specification module. If the probability of a bottom-level Agent's recognition result at some step is lower than a set threshold, the information specification module can match the slot information in that recognition result against the interface operable elements to verify whether the recognition result is reasonable. For example, if the cloud music application is recognized as the object to be "switched to vibration mode", the information specification module can match it against the interface operable elements and determine that it is a software application rather than a device terminal, so the operation cannot be performed; the information specification module then marks the recognition result as incomplete and requiring further processing, and feeds it back to the top-level Agent. In this way, the accuracy of recognition can be improved.
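A minimal sketch of such a check, assuming a confidence threshold and a set of currently operable interface elements (both illustrative), might look like this:

```python
# Hedged sketch of the information specification module: when a slot is
# recognized with low confidence, check it against the operable elements of
# the current interface and flag the result for further processing.
def review_slot(slot_value, confidence, operable_elements, threshold=0.6):
    if confidence >= threshold:
        return {"value": slot_value, "needs_review": False}
    matched = slot_value in operable_elements
    return {"value": slot_value, "needs_review": not matched}

elements = {"volume slider", "vibration mode switch", "play button"}
# A music app is not an operable element for "vibration mode", so it is flagged.
print(review_slot("cloud music app", 0.4, elements))
```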
Of course, the specific manner of performing the deep reinforcement learning training on the instruction recognition model is not limited.
Step S440: and when the trigger condition corresponding to the trigger condition information is met, executing target operation on a target operation interface corresponding to the target execution information.
In the embodiment of the present application, step S440 may refer to the contents of other embodiments, which are not described herein again.
According to the voice control method provided by this embodiment of the application, for a non-instant voice instruction input by the user, the corresponding trigger condition information and target execution information are recognized by an instruction recognition model trained in advance in a hierarchical reinforcement learning manner, and the target operation is then executed on the target operation interface corresponding to the target execution information when the trigger condition corresponding to the trigger condition information is met. In this way, the interface operation required by the user can be determined quickly and accurately, and the corresponding interface operation is executed according to the trigger condition information, so that non-instant voice instructions can be better fulfilled and user experience is further improved. In addition, because the instruction recognition model is trained in a hierarchical reinforcement learning manner, the accuracy of recognizing the trigger condition information and the target execution information can be ensured, which further improves the accuracy of voice control.
Referring to fig. 11, fig. 11 is a flowchart illustrating a voice control method according to yet another embodiment of the present application. The voice control method is applied to the electronic device, and will be described in detail with respect to the flow shown in fig. 11, where the voice control method may specifically include the following steps:
step S510: and acquiring a voice instruction.
In this embodiment of the present application, step S510 may refer to the contents of the foregoing embodiments and is not described herein again.
Step S520: and inputting the instruction text corresponding to the voice instruction into a pre-trained instruction type recognition model to obtain the instruction type of the voice instruction, wherein the instruction type recognition model is used for recognizing the instruction type corresponding to the input instruction text as an even type or a non-instant type.
In this embodiment of the application, when the instruction type of the voice instruction is recognized, the recognition can be performed based on a pre-trained instruction type recognition model. The text vector of the instruction text corresponding to the voice instruction can be input into the pre-trained instruction type recognition model to obtain the instruction type of the voice instruction. The instruction type recognition model may be a deep learning network for binary classification of text intention (instant or non-instant) based on a BERT (Bidirectional Encoder Representations from Transformers) model.
In some embodiments, referring to fig. 12, the instruction type recognition model consists of an encoding module, a decoding module, and a BERT text classification model, where the BERT model may be a publicly available pre-trained semantic model. After the instruction text is input into the instruction type recognition model, the encoding module first converts the instruction text into the input format accepted by the BERT text classification model; the encoded text vector is then fed into the BERT network for classification; finally, the decoding module decodes the classification result vector to obtain the classification result, which is either the instant type or the non-instant type.
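One possible realization of this encode–classify–decode chain (an assumption for illustration; the application does not prescribe a specific library) uses a pretrained Chinese BERT sequence classifier from the Hugging Face transformers library. The classification head would of course need to be fine-tuned on instruction texts labeled as instant or non-instant before its outputs are meaningful.

```python
# Hedged sketch: "instant vs. non-instant" classification with a pretrained
# Chinese BERT model via Hugging Face transformers. The model name and the
# label mapping are illustrative assumptions.
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertForSequenceClassification.from_pretrained("bert-base-chinese", num_labels=2)
LABELS = {0: "instant", 1: "non-instant"}

def classify_instruction(text: str) -> str:
    inputs = tokenizer(text, return_tensors="pt", truncation=True)   # encoding module
    logits = model(**inputs).logits                                   # BERT classification
    return LABELS[int(logits.argmax(dim=-1))]                         # decoding module

print(classify_instruction("Switch my phone to vibration mode when the meeting starts"))
```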
In some embodiments, before inputting the instruction text corresponding to the voice instruction into the instruction type recognition model, the electronic device may further perform preset correction processing on the instruction text corresponding to the voice instruction; and then, inputting the instruction text after the preset correction processing into the instruction type recognition model to obtain the instruction type of the voice instruction.
Optionally, the preset correction processing may include, but is not limited to, lexical correction based on edit distance, correction of common words based on a Bayesian method, and the like.
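As a rough sketch of the edit-distance-based correction mentioned above (the vocabulary and cutoff are illustrative assumptions, and difflib's similarity ratio is used here as a lightweight stand-in for a true edit distance):

```python
# Hedged sketch of edit-distance style lexical correction on the instruction
# text; the vocabulary is a hypothetical list of domain terms.
import difflib

DOMAIN_VOCABULARY = ["vibration mode", "do not disturb", "air conditioner"]

def correct_term(term: str) -> str:
    candidates = difflib.get_close_matches(term, DOMAIN_VOCABULARY, n=1, cutoff=0.7)
    return candidates[0] if candidates else term

print(correct_term("vibrtion mode"))   # -> "vibration mode"
```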
Step S530: and when the instruction type of the voice instruction is a non-instant type, acquiring trigger condition information and target execution information corresponding to the voice instruction.
Step S540: and when the trigger condition corresponding to the trigger condition information is met, executing target operation on a target operation interface corresponding to the target execution information.
In the embodiment of the present application, step S530 and step S540 may refer to the contents of the foregoing embodiments, and are not described herein again.
According to the voice control method provided by this embodiment of the application, the instruction type of a voice instruction is recognized by a pre-trained instruction type recognition model; then, for a non-instant voice instruction input by the user, the trigger condition information and the target execution information are recognized and the target operation is executed according to them. In this way, the type of the voice instruction can be better recognized, both instant and non-instant voice instructions can be fulfilled, and user experience is further improved.
Next, the voice control method according to the foregoing embodiments is described with reference to fig. 13.
As shown in fig. 13, after the electronic device obtains a voice instruction, the voice instruction is parsed to obtain the corresponding voice text, and the instruction type is then judged. If the type is recognized as non-instant, the trigger condition information and the target execution information are recognized through instruction recognition; an instruction is then synthesized from the target execution information and the trigger condition information and handed to the graphical interface for execution. If the type is recognized as instant, the interface operable elements are matched directly to obtain the corresponding target operable element, which is synthesized into an instruction and handed to the graphical interface for execution.
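The flow of fig. 13 can be summarized with the hedged dispatch sketch below; every helper passed in is a placeholder standing in for the classification, recognition, matching and execution components discussed earlier, not an implementation from this application.

```python
# Hedged sketch of the dispatch flow of fig. 13; every helper passed in is a
# placeholder for the corresponding component (type classification, instruction
# recognition, element matching, instruction execution).
def handle_voice_instruction(text, classify, recognize, match_elements, execute):
    if classify(text) == "non-instant":
        trigger_info, execution_info = recognize(text)         # instruction recognition
        execute({"when": trigger_info, "do": execution_info})   # synthesized deferred instruction
    else:
        target_element = match_elements(text)                   # direct element matching
        execute({"when": None, "do": target_element})           # immediate instruction

handle_voice_instruction(
    "Switch to vibration mode when the meeting starts",
    classify=lambda t: "non-instant",
    recognize=lambda t: ("the meeting starts", "vibration mode switch"),
    match_elements=lambda t: None,
    execute=print,
)
```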
Referring to fig. 14, a block diagram of a voice control apparatus 400 according to an embodiment of the present disclosure is shown. The voice control apparatus 400 is applied to the above-mentioned electronic device, and the voice control apparatus 400 includes: an instruction acquisition module 410, an information acquisition module 420, and an operation execution module 430. The instruction acquisition module 410 is configured to acquire a voice instruction; the information acquisition module 420 is configured to acquire the trigger condition information and the target execution information corresponding to the voice instruction when the instruction type of the voice instruction is a non-instant type; the operation execution module 430 is configured to execute the target operation on the target operation interface corresponding to the target execution information when the trigger condition corresponding to the trigger condition information is met.
In some embodiments, the operation execution module 430 may be specifically configured to: and when the trigger condition corresponding to the trigger condition information is met, matching the target execution information with an interface operable element to obtain the target operation interface, and executing the target operation on the target operation interface.
As a possible implementation, the operation executing module 430 may specifically be configured to: matching the target execution information with interface operable elements to obtain the matched interface operable elements in the target operation interface as target operable elements; and executing the operation corresponding to the target operable element on the target interface.
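One hedged way to realize this matching of the target execution information against the interface operable elements is simple fuzzy string matching, as in the sketch below; the element names and the similarity cutoff are assumptions, and a production implementation could equally use semantic matching.

```python
# Hedged sketch: choose the interface operable element that best matches the
# target execution information by fuzzy string similarity.
import difflib

def match_operable_element(execution_info, operable_elements, cutoff=0.5):
    best = difflib.get_close_matches(execution_info, operable_elements, n=1, cutoff=cutoff)
    return best[0] if best else None

elements = ["vibration mode switch", "volume slider", "brightness slider"]
print(match_operable_element("vibration mode", elements))   # -> "vibration mode switch"
```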
In some embodiments, the voice control apparatus 400 may further include an interface recognition module. The interface recognition module is configured to match the target execution information with the interface operable elements to obtain the target operation interface, before the target operation is executed on the target operation interface corresponding to the target execution information when the trigger condition corresponding to the trigger condition information is met.
As a possible implementation, the operation executing module 430 may specifically be configured to: generating a corresponding control instruction according to the trigger condition information and the target operation interface; and executing the control instruction, wherein the control instruction is used for executing target operation on a target operation interface corresponding to the target execution information when the trigger condition corresponding to the trigger condition information is met.
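A hedged sketch of this "generate now, execute when the condition is met" pattern is given below; the registry of pending instructions and the condition callback are assumptions made purely for illustration.

```python
# Hedged sketch of generating a control instruction now and executing it later,
# once the trigger condition is reported as met; the registry and callback are
# assumptions made for illustration.
pending_instructions = []

def generate_control_instruction(trigger_condition, target_interface, target_operation):
    pending_instructions.append({
        "condition": trigger_condition,
        "interface": target_interface,
        "operation": target_operation,
    })

def on_condition_met(condition):
    for instruction in list(pending_instructions):
        if instruction["condition"] == condition:
            print(f"executing '{instruction['operation']}' on '{instruction['interface']}'")
            pending_instructions.remove(instruction)

generate_control_instruction("the meeting starts", "sound settings page", "enable vibration mode")
on_condition_met("the meeting starts")
```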
In some embodiments, the information obtaining module 420 may be specifically configured to: acquiring an instruction text corresponding to the voice instruction; and acquiring trigger condition information and target execution information contained in the instruction text.
In a possible implementation, the information obtaining module 420 may be specifically configured to: and inputting an instruction text corresponding to the voice instruction into a pre-trained instruction recognition model to obtain triggering condition information and target execution information contained in the instruction text, wherein the instruction recognition model is obtained by training based on a layered reinforcement learning mode.
Optionally, the instruction identification model includes a first sub-module corresponding to the trigger condition information, a second sub-module corresponding to the target execution information, and a cooperative control module. The speech control apparatus 400 may also include a model training module. The model training module may be to: creating a first sub-module corresponding to an identification task for identifying trigger condition information, a second sub-module corresponding to an identification task for identifying target execution information, and a cooperative control module for coordinating the identification task, wherein the first sub-module and the second sub-module are used for deciding actions of the identification tasks corresponding to the first sub-module and the second sub-module, and the decision priority of the cooperative control module is higher than the decision priority of the first sub-module and the second sub-module; and performing deep reinforcement learning training on the cooperative control module, the first sub-module and the second sub-module based on the text sample labeled with the triggering condition information, the target execution information and the recognition sequence of the recognition task to obtain the trained instruction recognition model.
Further, the model training module may be specifically configured to: inputting the text sample to the cooperative control module, the first submodule and the second submodule to obtain output results of the first submodule and the second submodule and a coordinated execution sequence of the cooperative control module; determining a first reward corresponding to the first sub-module based on the output result of the first sub-module and the marked triggering condition information of the text sample; determining a second reward corresponding to the second sub-module based on the output result of the second sub-module and the labeled target execution information of the text sample; determining a third reward corresponding to the cooperative control module based on the execution sequence coordinated by the cooperative control module and the identification sequence marked by the text sample; and carrying out deep reinforcement learning training on the first submodule based on the first reward, carrying out deep reinforcement learning training on the second submodule based on the second reward, and carrying out deep reinforcement learning training on the cooperative control module based on the third reward until a preset termination condition is met, so as to obtain the trained instruction recognition model.
In some embodiments, the voice control apparatus 400 may further include a type recognition module. The type recognition module is configured to input the instruction text corresponding to the voice instruction into a pre-trained instruction type recognition model to obtain the instruction type of the voice instruction, before the trigger condition information and the target execution information corresponding to the voice instruction are acquired when the instruction type of the voice instruction is a non-instant type, wherein the instruction type recognition model is used for recognizing whether the instruction type corresponding to the input instruction text is an instant type or a non-instant type.
In a possible implementation manner, the type identification module may be further specifically configured to: performing preset correction processing on an instruction text corresponding to the voice instruction; and inputting the preset corrected instruction text into a pre-trained instruction type recognition model to obtain the instruction type of the voice instruction.
In some embodiments, the information obtaining module 420 may be further configured to, after the voice instruction is obtained, identify target operation information corresponding to the voice instruction if the instruction type of the voice instruction is an instant type; the operation executing module 430 may be further configured to, in response to the target operation information, execute an operation corresponding to the target operation information on an interface corresponding to the target operation information.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described apparatuses and modules may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, the coupling between the modules may be electrical, mechanical or other type of coupling.
In addition, functional modules in the embodiments of the present application may be integrated into one processing module, or each of the modules may exist alone physically, or two or more modules are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode.
In summary, according to the scheme provided by the application, by obtaining the voice instruction, when the instruction type of the voice instruction is a non-instant type, the trigger condition information and the target execution information corresponding to the voice instruction are obtained, and when the trigger condition corresponding to the trigger condition information is met, the target operation is executed on the target operation interface corresponding to the target execution information. Therefore, after the trigger condition information of the non-instant voice command input by the user is identified, the corresponding interface operation is executed according to the trigger condition information, so that the non-instant voice command can be better completed, and the user experience is further improved.
Referring to fig. 15, a block diagram of an electronic device according to an embodiment of the present application is shown. The electronic device 100 may be an electronic device capable of running an application, such as a smart phone, a tablet computer, a smart watch, smart glasses, and a notebook computer. The electronic device 100 in the present application may include one or more of the following components: a processor 110, a memory 120, and one or more applications, wherein the one or more applications may be stored in the memory 120 and configured to be executed by the one or more processors 110, the one or more programs configured to perform a method as described in the aforementioned method embodiments.
Processor 110 may include one or more processing cores. The processor 110 connects various parts within the overall electronic device 100 using various interfaces and lines, and performs the various functions of the electronic device 100 and processes data by running or executing instructions, programs, code sets or instruction sets stored in the memory 120 and calling data stored in the memory 120. Optionally, the processor 110 may be implemented in hardware in the form of at least one of a Digital Signal Processor (DSP), a Field-Programmable Gate Array (FPGA), and a Programmable Logic Array (PLA). The processor 110 may integrate one or more of a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a modem, and the like, where the CPU mainly handles the operating system, the user interface, application programs and the like; the GPU is responsible for rendering and drawing display content; and the modem handles wireless communication. It can be understood that the modem may also not be integrated into the processor 110 and may instead be implemented by a separate communication chip.
The memory 120 may include a Random Access Memory (RAM) or a Read-Only Memory (ROM). The memory 120 may be used to store instructions, programs, code sets, or instruction sets. The memory 120 may include a program storage area and a data storage area, where the program storage area may store instructions for implementing an operating system, instructions for implementing at least one function (such as a touch function, a sound playing function, or an image playing function), instructions for implementing the various method embodiments described above, and the like. The data storage area may also store data created by the electronic device 100 during use (such as a phone book, audio and video data, and chat log data), and the like.
Referring to fig. 16, a block diagram of a computer-readable storage medium according to an embodiment of the present application is shown. The computer-readable medium 800 has stored therein a program code that can be called by a processor to execute the method described in the above-described method embodiments.
The computer-readable storage medium 800 may be an electronic memory such as a flash memory, an EEPROM (electrically erasable programmable read only memory), an EPROM, a hard disk, or a ROM. Alternatively, the computer-readable storage medium 800 includes a non-volatile computer-readable storage medium. The computer readable storage medium 800 has storage space for program code 810 to perform any of the method steps of the method described above. The program code can be read from or written to one or more computer program products. The program code 810 may be compressed, for example, in a suitable form.
The embodiment of the present application further provides a computer program product, which includes a computer program/instruction, and is characterized in that the computer program/instruction, when executed by a processor, implements the voice control method provided in the foregoing embodiment.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not necessarily depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (16)

1. A method for voice control, the method comprising:
acquiring a voice instruction;
when the instruction type of the voice instruction is a non-instant type, acquiring trigger condition information and target execution information corresponding to the voice instruction;
and when the trigger condition corresponding to the trigger condition information is met, executing target operation on a target operation interface corresponding to the target execution information.
2. The method according to claim 1, wherein when the trigger condition corresponding to the trigger condition information is satisfied, executing a target operation on a target interface corresponding to the target execution information includes:
and when the trigger condition corresponding to the trigger condition information is met, matching the target execution information with an interface operable element to obtain the target operation interface, and executing the target operation on the target operation interface.
3. The method according to claim 2, wherein the matching the target execution information with interface operable elements to obtain the target operation interface, and executing a target operation on the target operation interface comprises:
matching the target execution information with interface operable elements to obtain the matched interface operable elements in the target operation interface as target operable elements;
and executing the operation corresponding to the target operable element on the target interface.
4. The method according to claim 1, wherein before the executing a target operation on a target operation interface corresponding to the target execution information when the trigger condition corresponding to the trigger condition information is satisfied, the method further comprises:
and matching the target execution information with interface operable elements to obtain the target operation interface.
5. The method according to claim 4, wherein when the trigger condition corresponding to the trigger condition information is satisfied, executing a target operation on a target operation interface corresponding to the target execution information includes:
generating a corresponding control instruction according to the trigger condition information and the target operation interface;
and executing the control instruction, wherein the control instruction is used for executing target operation on a target operation interface corresponding to the target execution information when the trigger condition corresponding to the trigger condition information is met.
6. The method according to claim 1, wherein the acquiring trigger condition information and target execution information corresponding to the voice instruction comprises:
acquiring an instruction text corresponding to the voice instruction;
and acquiring trigger condition information and target execution information contained in the instruction text.
7. The method according to claim 6, wherein the acquiring trigger condition information and target execution information contained in the instruction text comprises:
and inputting an instruction text corresponding to the voice instruction into a pre-trained instruction recognition model to obtain triggering condition information and target execution information contained in the instruction text, wherein the instruction recognition model is obtained by training based on a layered reinforcement learning mode.
8. The method according to claim 7, wherein the instruction recognition model includes a first sub-module corresponding to trigger condition information, a second sub-module corresponding to target execution information, and a cooperative control module, and the training process of the instruction recognition model includes:
creating a first sub-module corresponding to an identification task for identifying trigger condition information, a second sub-module corresponding to an identification task for identifying target execution information, and a cooperative control module for coordinating the identification task, wherein the first sub-module and the second sub-module are used for deciding actions of the identification tasks corresponding to the first sub-module and the second sub-module, and the decision priority of the cooperative control module is higher than the decision priority of the first sub-module and the second sub-module;
and performing deep reinforcement learning training on the cooperative control module, the first sub-module and the second sub-module based on the text sample labeled with the triggering condition information, the target execution information and the recognition sequence of the recognition task to obtain the trained instruction recognition model.
9. The method according to claim 8, wherein the deep reinforcement learning training of the cooperative control module, the first sub-module, and the second sub-module based on the text samples labeled with trigger condition information, target execution information, and the recognition order of the recognition task to obtain the trained instruction recognition model comprises:
inputting the text sample to the cooperative control module, the first submodule and the second submodule to obtain output results of the first submodule and the second submodule and a coordinated execution sequence of the cooperative control module;
determining a first reward corresponding to the first sub-module based on the output result of the first sub-module and the marked triggering condition information of the text sample;
determining a second reward corresponding to the second sub-module based on the output result of the second sub-module and the labeled target execution information of the text sample;
determining a third reward corresponding to the cooperative control module based on the execution sequence coordinated by the cooperative control module and the identification sequence marked by the text sample;
and carrying out deep reinforcement learning training on the first submodule based on the first reward, carrying out deep reinforcement learning training on the second submodule based on the second reward, and carrying out deep reinforcement learning training on the cooperative control module based on the third reward until a preset termination condition is met, so as to obtain the trained instruction recognition model.
10. The method according to any one of claims 1 to 9, wherein before the obtaining of the trigger condition information and the target execution information corresponding to the voice instruction when the instruction type of the voice instruction is a non-immediate type, the method further comprises:
and inputting the instruction text corresponding to the voice instruction into a pre-trained instruction type recognition model to obtain the instruction type of the voice instruction, wherein the instruction type recognition model is used for recognizing whether the instruction type corresponding to the input instruction text is an instant type or a non-instant type.
11. The method according to claim 10, wherein before the inputting the instruction text corresponding to the voice instruction into a pre-trained instruction type recognition model to obtain the instruction type of the voice instruction, the method further comprises:
performing preset correction processing on an instruction text corresponding to the voice instruction;
the step of inputting the instruction text corresponding to the voice instruction into a pre-trained instruction type recognition model to obtain the instruction type of the voice instruction comprises:
and inputting the preset corrected instruction text into a pre-trained instruction type recognition model to obtain the instruction type of the voice instruction.
12. The method of any of claims 1-9, wherein after the obtaining the voice instruction, the method further comprises:
if the instruction type of the voice instruction is an instant type, identifying target operation information corresponding to the voice instruction;
and responding to the target operation information, and executing the operation corresponding to the target operation information on the interface corresponding to the target operation information.
13. A voice control apparatus, characterized in that the apparatus comprises: an instruction acquisition module, an information acquisition module and an operation execution module, wherein,
the instruction acquisition module is used for acquiring a voice instruction;
the information acquisition module is used for acquiring trigger condition information and target execution information corresponding to the voice instruction when the instruction type of the voice instruction is a non-instant type;
and the operation execution module is used for executing target operation on a target operation interface corresponding to the target execution information when the trigger condition corresponding to the trigger condition information is met.
14. An electronic device, comprising:
one or more processors;
a memory;
one or more applications, wherein the one or more applications are stored in the memory and configured to be executed by the one or more processors, the one or more programs configured to perform the method of any of claims 1-12.
15. A computer-readable storage medium, having stored thereon program code that can be invoked by a processor to perform the method according to any one of claims 1 to 12.
16. A computer program product comprising computer program/instructions, characterized in that the computer program/instructions, when executed by a processor, implement the steps of the method according to any of claims 1-12.
CN202111433111.8A 2021-11-29 2021-11-29 Voice control method and device, electronic equipment and storage medium Pending CN114121005A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202111433111.8A CN114121005A (en) 2021-11-29 2021-11-29 Voice control method and device, electronic equipment and storage medium
PCT/CN2022/121695 WO2023093280A1 (en) 2021-11-29 2022-09-27 Speech control method and apparatus, electronic device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111433111.8A CN114121005A (en) 2021-11-29 2021-11-29 Voice control method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114121005A true CN114121005A (en) 2022-03-01

Family

ID=80371572

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111433111.8A Pending CN114121005A (en) 2021-11-29 2021-11-29 Voice control method and device, electronic equipment and storage medium

Country Status (2)

Country Link
CN (1) CN114121005A (en)
WO (1) WO2023093280A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023093280A1 (en) * 2021-11-29 2023-06-01 Oppo广东移动通信有限公司 Speech control method and apparatus, electronic device, and storage medium

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4840149B2 (en) * 2007-01-12 2011-12-21 ヤマハ株式会社 Sound signal processing apparatus and program for specifying sound generation period
CN103973870B (en) * 2013-01-28 2017-02-08 联想(北京)有限公司 Information processing device and information processing method
CN105049963A (en) * 2015-07-31 2015-11-11 小米科技有限责任公司 Terminal control method and device, and terminal
CN108597499B (en) * 2018-04-02 2020-09-25 联想(北京)有限公司 Voice processing method and voice processing device
CN110459222A (en) * 2019-09-06 2019-11-15 Oppo广东移动通信有限公司 Sound control method, phonetic controller and terminal device
CN113641114A (en) * 2020-04-27 2021-11-12 青岛海尔空调器有限总公司 Environment control method and system for intelligent getting-up scene
CN112415908A (en) * 2020-11-26 2021-02-26 珠海格力电器股份有限公司 Intelligent device control method and device, readable storage medium and computer device
CN114121005A (en) * 2021-11-29 2022-03-01 Oppo广东移动通信有限公司 Voice control method and device, electronic equipment and storage medium


Also Published As

Publication number Publication date
WO2023093280A1 (en) 2023-06-01

Similar Documents

Publication Publication Date Title
JP6811758B2 (en) Voice interaction methods, devices, devices and storage media
CN107644642B (en) Semantic recognition method and device, storage medium and electronic equipment
CN112513833A (en) Electronic device and method for providing artificial intelligence service based on presynthesized dialog
CN110998720A (en) Voice data processing method and electronic device supporting the same
CN111261151B (en) Voice processing method and device, electronic equipment and storage medium
US20190221208A1 (en) Method, user interface, and device for audio-based emoji input
CN111105800B (en) Voice interaction processing method, device, equipment and medium
CN111312235A (en) Voice interaction method, device and system
WO2020024620A1 (en) Voice information processing method and device, apparatus, and storage medium
CN110287303B (en) Man-machine conversation processing method, device, electronic equipment and storage medium
CN111312233A (en) Voice data identification method, device and system
WO2023082703A1 (en) Voice control method and apparatus, electronic device, and readable storage medium
CN110992955A (en) Voice operation method, device, equipment and storage medium of intelligent equipment
WO2022134110A1 (en) Speech comprehension method and device
WO2023093280A1 (en) Speech control method and apparatus, electronic device, and storage medium
CN111210824B (en) Voice information processing method and device, electronic equipment and storage medium
KR20210001082A (en) Electornic device for processing user utterance and method for operating thereof
WO2023103917A1 (en) Speech control method and apparatus, and electronic device and storage medium
WO2023103918A1 (en) Speech control method and apparatus, and electronic device and storage medium
KR20210066651A (en) Electronic device and Method for controlling the electronic device thereof
CN113643706B (en) Speech recognition method, device, electronic equipment and storage medium
CN115062131A (en) Multi-mode-based man-machine interaction method and device
CN109255131B (en) Translation method, translation device, translation terminal and storage medium
CN111883126A (en) Data processing mode selection method and device and electronic equipment
CN112527975A (en) Human-computer interaction method and device, intelligent robot and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination