CN114049892A - Voice control method and device and electronic equipment

Info

Publication number
CN114049892A
CN114049892A (application CN202111340937.XA)
Authority
CN
China
Prior art keywords: information, user interface, control, current user, target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111340937.XA
Other languages
Chinese (zh)
Inventor
曾理 (Zeng Li)
张晓帆 (Zhang Xiaofan)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Douku Software Technology Co Ltd
Original Assignee
Hangzhou Douku Software Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Douku Software Technology Co Ltd
Priority to CN202111340937.XA
Publication of CN114049892A
Priority to PCT/CN2022/107786 (WO2023082703A1)
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/08 Speech classification or search
    • G10L 15/16 Speech classification or search using artificial neural networks
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 15/26 Speech to text systems
    • G10L 2015/223 Execution procedure of a spoken command

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The embodiments of the present application disclose a voice control method and apparatus and an electronic device. The method comprises the following steps: acquiring a voice instruction; identifying a current user interface to acquire operable information of the current user interface, wherein the current user interface is the user interface displayed when the voice instruction is received; matching the voice instruction with the operable information of the current user interface so as to determine target operable information from the operable information of the current user interface; and executing the target operation in response to the determined target operable information. In this way, a voice instruction triggered by the user in real time for the interface being viewed can be better fulfilled by matching the voice instruction with the operable information corresponding to the interface (the current user interface) displayed when the voice instruction is acquired.

Description

Voice control method and device and electronic equipment
Technical Field
The present application relates to the field of computer technologies, and in particular, to a voice control method and apparatus, and an electronic device.
Background
Combined with artificial intelligence technology, a virtual personal assistant (voice assistant) enables an electronic device to receive voice commands issued by a user through the auditory modality and complete the corresponding interactive tasks. In many cases, however, a user only forms a clear interaction intention after seeing an interactive interface, and wants to operate directly on that interface or on the objects in it. A voice assistant cannot reliably fulfill such voice commands, triggered in real time for the interface the user is viewing.
Disclosure of Invention
In view of the foregoing, the present application provides a voice control method, an apparatus, and an electronic device to address the foregoing problems.
In a first aspect, the present application provides a method for voice control, the method comprising: acquiring a voice instruction; identifying a current user interface to acquire operable information of the current user interface, wherein the current user interface is a user interface displayed when the voice instruction is received; matching the voice instruction with the operable information of the current user interface so as to determine target operable information from the operable information of the current user interface; and executing the target operation in response to the determined target operable information.
In a second aspect, the present application provides a voice-controlled apparatus, the apparatus comprising: the voice instruction conversion unit is used for acquiring a voice instruction; the operation information acquisition unit is used for identifying a current user interface to acquire operable information of the current user interface, wherein the current user interface is a user interface displayed when the voice instruction is received; the target information acquisition unit is used for matching the voice instruction with the operable information of the current user interface so as to determine target operable information from the operable information of the current user interface; an operation execution unit for executing a target operation in response to the determined target operable information.
In a third aspect, the present application provides an electronic device comprising one or more processors and a memory; one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs configured to perform the methods described above.
In a fourth aspect, the present application provides a computer-readable storage medium having program code stored therein, wherein the program code, when run, performs the method described above.
According to the voice control method and apparatus and the electronic device, after a voice instruction and the operable information corresponding to the current user interface are obtained, the voice instruction is matched with the operable information corresponding to the current user interface so as to determine target operable information, and the target operation is executed in response to the determined target operable information. In this way, a voice instruction triggered by the user in real time for the interface being viewed can be better fulfilled by matching the voice instruction with the operable information corresponding to the interface (the current user interface) displayed when the voice instruction is acquired.
Drawings
In order to illustrate the technical solutions in the embodiments of the present application more clearly, the drawings required for describing the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present application, and those skilled in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic diagram illustrating an application scenario of a speech control method according to an embodiment of the present application;
fig. 2 is a schematic diagram illustrating an application scenario of another speech control method proposed in an embodiment of the present application;
fig. 3 is a flowchart illustrating a voice control method according to an embodiment of the present application;
FIG. 4 is a flow chart illustrating a voice control method according to another embodiment of the present application;
FIG. 5 is a schematic diagram illustrating a gridding arrangement of controls in an embodiment of the present application;
fig. 6 shows a flow of a voice control method according to still another embodiment of the present application;
fig. 7 shows a flow of a voice control method according to another embodiment of the present application;
FIG. 8 is a flow chart illustrating a voice control method according to another embodiment of the present application;
FIG. 9 is a flow chart illustrating a voice control method according to yet another embodiment of the present application;
FIG. 10 is a flow chart illustrating a voice control method according to yet another embodiment of the present application;
FIG. 11 is a flow chart illustrating a voice control method according to yet another embodiment of the present application;
fig. 12 is a block diagram illustrating a structure of a target recognition apparatus according to an embodiment of the present application;
fig. 13 is a block diagram illustrating an electronic device according to the present application;
fig. 14 is a storage unit for storing or carrying program codes for implementing a voice control method according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the drawings in the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments given herein without creative effort shall fall within the protection scope of the present application.
The popularization of intelligent terminal devices has brought many conveniences to daily life. Combined with artificial intelligence technology, a virtual personal assistant (voice assistant) enables an electronic device to receive voice commands issued by a user through the auditory modality and complete the corresponding interactive tasks. In many cases, however, a user only forms a clear interaction intention after seeing an interactive interface, and wants to operate directly on that interface or on the objects in it.
However, the inventors found in research that existing voice assistants cannot reliably fulfill voice commands triggered by the user in real time for the interface being viewed. Specifically, such a voice assistant can usually recognize preset voice commands well and complete the corresponding operations; but if the command issued by the user is not among the preset voice commands, the electronic device cannot understand the user's intention and therefore cannot execute the corresponding operation.
Therefore, the inventors propose in the present application a voice control method and apparatus and an electronic device in which, after a voice instruction and the operable information corresponding to the current user interface are acquired, the voice instruction is matched with that operable information to determine target operable information, and the target operation is executed in response to the determined target operable information. In this way, a voice instruction triggered by the user in real time for the interface being viewed can be better fulfilled by matching the voice instruction with the operable information corresponding to the interface (the current user interface) displayed when the voice instruction is acquired.
The following first introduces an application scenario related to the embodiment of the present application.
In the embodiments of the present application, the voice control method may be executed by an electronic device, in which case all steps of the method are performed by the electronic device. For example, as shown in fig. 1, a voice instruction may be acquired by the voice acquisition device of the electronic device 100, and the acquired voice instruction and the current user interface are both transmitted to the processor; the processor then identifies the current user interface in real time to obtain its operable information, and executes the steps of the voice control method provided by the present application using the acquired voice instruction and that operable information.
Moreover, the voice control method provided by the embodiment of the application can also be executed by a server (cloud). Correspondingly, in the mode executed by the server, the electronic device may collect the voice command, synchronously send the collected voice command and the current user interface to the server, then the server identifies the current user interface in real time to obtain the operable information of the current user interface, and then the server triggers the electronic device to execute the target operation.
In addition, the method may be executed cooperatively by the electronic device and a server: the electronic device performs some of the steps of the voice control method provided by the embodiments of the present application, and the server performs the remaining steps.
For example, as shown in fig. 2, the electronic device 100 may acquire the voice instruction and the operable information corresponding to the current user interface and execute the target operation based on the target operable information, while the matching of the voice instruction with the operable information to determine the target operable information is performed by the server 200.
It should be noted that, in this manner executed by the electronic device and the server cooperatively, the steps executed by the electronic device and the server respectively are not limited to the manner described in the above example, and in practical applications, the steps executed by the electronic device and the server respectively may be dynamically adjusted according to actual situations.
It should be noted that the electronic device 100 may also be a vehicle-mounted device, a wearable device, a tablet computer, a notebook computer, a smart speaker, or the like, in addition to the smartphone shown in fig. 1 and fig. 2. The server 200 may be an independent physical server, or a server cluster or distributed system composed of multiple physical servers.
Embodiments of the present application will be described with reference to the accompanying drawings.
Referring to fig. 3, a voice control method provided in the present application includes:
s110: and acquiring a voice instruction.
In the embodiments of the present application, the user can express his or her control intention by voice. Correspondingly, the electronic device can take the voice uttered by the user as the voice instruction.
S120: and identifying a current user interface to acquire operable information of the current user interface, wherein the current user interface is a user interface displayed when the voice instruction is received.
In this embodiment, the current user interface is the user interface displayed when the voice instruction is received. For example, it may be the desktop of the electronic device, or the interface of an application program currently running on the electronic device. When the electronic device displays the desktop with an application interface floating on top of it, the current user interface may be the desktop, the application interface, or both. For example, if the electronic device currently displays only the desktop, the current user interface may be the desktop; if a video playing interface is displayed floating on the desktop, the current user interface may be the video playing interface.
It should be noted that one interface may support a plurality of operations. An operation may target a certain control in the interface or the interface as a whole. For example, sliding the page (up, down, left, or right), switching interfaces, or exiting the currently displayed interface are operations on the entire interface; clicking a location in the interface is an operation on a control in the interface. In the embodiments of the present application, the operable information corresponding to an interface describes the operations that can be performed on that interface; when several operations can be performed on the interface, the interface may correspond to several pieces of operable information.
The current user interface may be identified in at least one of the following ways: based on code parsing; based on image-text recognition; or based on a control classification model.
As one mode, the training process of the control classification model includes: acquiring a user interface; acquiring controls classified from the user interface; and training a neural network model with the classified controls to obtain the control classification model.
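The embodiments do not specify a training framework or a control class set. Purely as a hedged illustration, the sketch below shows how a trained control classification model might be invoked on-device, assuming a TensorFlow Lite image classifier; the label list and the preprocessing of the cropped control image are hypothetical, and the training itself is out of scope here.

    import org.tensorflow.lite.Interpreter
    import java.io.File
    import java.nio.ByteBuffer

    // Hypothetical label set; the embodiments do not enumerate control classes.
    val controlLabels = listOf("button", "search_box", "playback_control", "text_field", "icon")

    // Classify one cropped control image (already preprocessed into the model's
    // expected input buffer) with a trained control classification model.
    fun classifyControl(modelFile: File, controlImage: ByteBuffer): String {
        val interpreter = Interpreter(modelFile)
        val scores = Array(1) { FloatArray(controlLabels.size) }
        interpreter.run(controlImage, scores)   // forward pass of the trained model
        interpreter.close()
        var best = 0
        for (i in scores[0].indices) if (scores[0][i] > scores[0][best]) best = i
        return controlLabels[best]
    }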
S130: and matching the voice instruction with the operable information of the current user interface so as to determine target operable information from the operable information of the current user interface.
After the electronic device obtains the voice command, as a way, the voice command may be converted into corresponding control information. In the embodiment of the present application, the control information may be understood as information that is obtained by converting a voice instruction by the electronic device and is used for representing a control intention of a user.
In addition, in the embodiment of the present application, the content included in the control information may have various forms, and correspondingly, there are various ways of acquiring the control information.
As one mode, the text content converted from the voice command may be used as the control information. In this way, after receiving the voice instruction, the electronic device may convert it into the corresponding text content based on a pre-configured automatic speech recognition (ASR) model. For example, if the received voice command is "open album", the control information obtained by converting the voice command includes "open album".
Alternatively, semantic recognition may be performed on the text content converted from the voice command, and the result of the semantic recognition may be used as the control information. Optionally, the intention, the control object, and the object auxiliary information may be extracted from the text based on natural language understanding (NLU) and integrated into a triple of the form {action, object, information}, the semantic recognition result being the triple. Here action represents the intention, which can be understood as the control purpose, object represents the control object, and information represents the object auxiliary information. For example, if the text content converted from the voice command is "play Chen Qing Ling" (a TV series title), the user intention is "play", the control object is "Chen Qing Ling", the object auxiliary information is empty, and the triple is recorded as {play, Chen Qing Ling, Φ}. For another example, if the text content is "help me search for Gu Dong Ju Zhong Ju" (another series title), the intention is "search", the control object is "Gu Dong Ju Zhong Ju", the object auxiliary information is empty, and the triple is {search, Gu Dong Ju Zhong Ju, Φ}.
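As a concrete illustration of the {action, object, information} structure, here is a minimal, rule-based sketch in Kotlin; a real implementation would use a trained NLU model, and the verb list here is a hypothetical placeholder.

    // Control information in the {action, object, information} form described above.
    data class ControlTriple(val action: String, val obj: String, val info: String?)

    // Hypothetical action vocabulary; real NLU would not rely on a fixed list.
    val knownActions = listOf("play", "search", "open", "download")

    // Toy extraction: split a leading action verb off the instruction text.
    fun extractTriple(instructionText: String): ControlTriple? {
        val action = knownActions.firstOrNull { instructionText.startsWith(it) } ?: return null
        val remainder = instructionText.removePrefix(action).trim()
        return ControlTriple(action, remainder, info = null)   // auxiliary info left empty (Φ)
    }

    fun main() {
        // Prints: ControlTriple(action=play, obj=Chen Qing Ling, info=null)
        println(extractTriple("play Chen Qing Ling"))
    }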
It should be noted that, owing to pronunciation habits, the voice uttered by a user may be rather casual, and the voice instruction corresponding to such casual speech may not allow the electronic device to determine the user's control intention accurately. For example, the Chinese phrase "下一个" corresponding to a voice command may mean either "the next one" or "download one". In an audio playing scenario it most likely means the next item, for example playing the next song; in a software downloading scenario it most likely means downloading one, for example downloading an application.
In order to determine the user's real intention more accurately, as one mode, after the voice instruction is obtained, the voice instruction is updated according to the task scene corresponding to the current user interface to obtain a scene control instruction, and the scene control instruction is matched with the operable information of the current user interface so as to determine the target operable information. For example, after the electronic device obtains a voice command whose converted text content is the ambiguous phrase above, it may detect the task scene corresponding to the current user interface. If the task scene is determined to be an audio playing scene, the instruction may be updated so that the text content becomes "play the next song", and the resulting scene control instruction is to play the next song. If the task scene is determined to be an application downloading scene, the instruction may be updated so that the text content becomes "download a music playing program", and the resulting scene control instruction is to download a music playing program.
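A minimal sketch of such scene-dependent rewriting, assuming a scene label is already available for the current user interface; the scene set and the rewriting rules are illustrative, not taken from the embodiments.

    // Hypothetical task scenes for the current user interface.
    enum class TaskScene { AUDIO_PLAYING, APP_DOWNLOADING, OTHER }

    // Rewrite an ambiguous instruction ("next one") according to the scene.
    fun toSceneInstruction(text: String, scene: TaskScene): String =
        if (text == "next one") when (scene) {
            TaskScene.AUDIO_PLAYING -> "play the next song"
            TaskScene.APP_DOWNLOADING -> "download a music playing program"
            TaskScene.OTHER -> text
        } else text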
It should be noted that, when the control information takes the form of the aforementioned triple, the electronic device may obtain the triple from the scene control instruction derived from the task scene corresponding to the current user interface.
After the control information and the operable information corresponding to the current user interface are obtained, the control information and the operable information of the current user interface can be matched, so that the target operable information can be determined from the operable information corresponding to the current user interface.
It should be noted that, as described above, the control information may take various forms in the embodiments of the present application, and for different forms the way of determining a successful match with the operable information differs accordingly. If the control information is in the form of a triple, the control object included in the control information is matched with the operable information, and when the control object is the same as the operable information, the control information to which the control object belongs is determined to match that operable information successfully. If the control information is the text content directly converted from the voice command, a piece of operable information can be determined to match the control information successfully when it is detected that the operable information is included in the text content.
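A minimal sketch of the two matching rules just described, reusing the ControlTriple type from the earlier sketch; the OperableInfo shape is an assumption.

    // One piece of operable information: a control's identifier and description.
    data class OperableInfo(val controlId: String, val description: String)

    // Triple form: the control object must equal the operable information's description.
    fun matchTriple(triple: ControlTriple, candidates: List<OperableInfo>): OperableInfo? =
        candidates.firstOrNull { it.description == triple.obj }

    // Plain-text form: a candidate matches when its description occurs in the text.
    fun matchPlainText(instructionText: String, candidates: List<OperableInfo>): OperableInfo? =
        candidates.firstOrNull { it.description.isNotEmpty() && instructionText.contains(it.description) }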
Here the target operable information is the operable information that successfully matches the user's operation intention. After the target operable information is obtained, a control instruction for the current user interface can be generated. It should be noted that the control instruction may be generated based on a mode supported by the electronic device, for example by system injection (an operation mode supported by Android) or by simulating a screen click.
S140: and executing the target operation in response to the determined target operable information.
As one way, after the target operational information is determined, control instructions executable by the electronic device may be generated according to the target operational information, and the target operation may be executed by executing the control instructions. For example, if the target operable information is description information for performing a designated overall operation on the interface, a control instruction for performing the designated overall operation is generated, and the control instruction is executed.
In the voice control method provided by this embodiment, after a voice instruction and the operable information corresponding to the current user interface are acquired, the voice instruction is matched with that operable information to determine target operable information, and the target operation is executed in response to the determined target operable information. In this way, a voice instruction triggered by the user in real time for the interface being viewed can be better fulfilled by matching the voice instruction with the operable information corresponding to the interface (the current user interface) displayed when the voice instruction is acquired.
Referring to fig. 4, a voice control method provided in the present application includes:
s210: and acquiring a voice instruction.
S220: and identifying a current user interface based on a code analysis mode to acquire operable information of the current user interface, wherein the current user interface is a user interface displayed when the voice instruction is received.
In this embodiment, the current user interface may be identified based on code parsing to obtain its corresponding operable information. Here, identifying the current user interface can be understood as recognizing the controls included in it, and the obtained operable information may include the identifier and description information of each recognized control. Correspondingly, identifying the current user interface based on code parsing can be understood as obtaining, from the interface code, the controls included in the current user interface and the description information corresponding to each control.
S230: detecting whether the operable information contains operable information successfully matched with the voice instruction.
S240: if the operable information contains operable information successfully matched with the voice instruction, taking that operable information as the target operable information.
S250: if the operable information contains no operable information successfully matched with the voice instruction, identifying the current user interface based on image-text recognition to obtain operable information, and obtaining the target operable information from the operable information so obtained.
As one mode, identifying the current user interface based on image-text recognition to obtain operable information, and obtaining the target operable information from it, includes: if the operable information obtained by image-text recognition of the current user interface includes operable information successfully matched with the voice instruction, taking that operable information as the target operable information; if not, identifying the current user interface based on a control classification model to obtain operable information, and obtaining the target operable information from the operable information identified by the control classification model.
Optionally, the operable information obtained by image-text recognition of the current user interface includes the position and description information of each recognized control, and executing the target operation in response to determining the target operable information includes: in response to the target operable information being determined from the operable information obtained by image-text recognition, taking the control corresponding to the target operable information as the target control; and generating a control instruction for the target control based on a simulated click and the control purpose of the voice instruction, and executing the control instruction.
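A minimal sketch of a simulated screen click at a control position found by image-text recognition, assuming an enabled Android AccessibilityService (API 24+); the embodiments name simulated clicks but not a specific API.

    import android.accessibilityservice.AccessibilityService
    import android.accessibilityservice.GestureDescription
    import android.graphics.Path

    // Dispatch a short tap at (x, y), e.g. the centre of the target control.
    fun AccessibilityService.simulateClick(x: Float, y: Float) {
        val path = Path().apply { moveTo(x, y) }
        val tap = GestureDescription.Builder()
            .addStroke(GestureDescription.StrokeDescription(path, 0L, 50L))
            .build()
        dispatchGesture(tap, null, null)
    }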
Optionally, the operable information obtained by identifying the current user interface based on code parsing includes the identifier and description information of each recognized control, and executing the target operation in response to determining the target operable information includes: in response to the target operable information being determined from the operable information obtained by code parsing, taking the control corresponding to the target operable information as the target control; and generating a control instruction based on the control purpose corresponding to the voice instruction and the identifier of the target control.
S260: executing the target operation in response to the determined target operable information.
By one approach, the operable information includes the position and description information of each control identified from the current user interface, and the method further comprises: if the voice instruction includes position information, matching the positions of the controls contained in the operable information against the position information contained in the voice instruction; taking the control whose position successfully matches the position information as the target control; and executing the target operation in response to determining the target control. In this way, after obtaining the voice instruction, the electronic device may query whether the voice instruction includes any entry of a position-information lexicon, and if so, determine that the voice instruction includes position information. The position information may be the top left corner, the top right corner, the bottom left corner, the bottom right corner, the X-th row, and so on. For example, if the text content converted from the voice instruction is "help me open the 3rd program in the 1st row", it can be determined that the voice instruction includes position information.
In this manner, each control identified from the current user interface corresponds to a position, which may initially be in coordinate form. The identified controls may then be arranged into a grid based on their positions, so that each control corresponds to one piece of gridded position information. The gridded position information can be understood as the row and column of the control in the current user interface, or the corner of the current user interface at which the control is located.
As shown in fig. 5, for the current user interface 10 displayed by the electronic device, the controls obtained by identifying the interface 10 may include controls whose description information includes, respectively, lunar calendar, weather, clock, settings, album, memo, video, and camera. The positions of the identified controls are also obtained in coordinate form, each position comprising an abscissa and an ordinate. By sorting the controls according to their abscissas and ordinates, their relative arrangement from left to right and from top to bottom can be recovered, so that the electronic device can determine the grid arrangement shown in fig. 5.
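A minimal sketch of this gridding step in Kotlin: controls are grouped into rows by ordinate and sorted within each row by abscissa. The row-grouping tolerance is an assumption; the embodiments do not give one.

    // A control identified from the interface, with its position in coordinate form.
    data class IdentifiedControl(val description: String, val x: Int, val y: Int)

    // Group controls into rows by ordinate, then sort each row by abscissa.
    fun toGrid(controls: List<IdentifiedControl>, rowTolerance: Int = 20): List<List<IdentifiedControl>> =
        controls.sortedBy { it.y }
            .fold(mutableListOf<MutableList<IdentifiedControl>>()) { rows, c ->
                val lastRow = rows.lastOrNull()
                if (lastRow != null && c.y - lastRow.first().y <= rowTolerance) lastRow.add(c)
                else rows.add(mutableListOf(c))
                rows
            }
            .map { row -> row.sortedBy { it.x } }

    // "the 3rd program in the 1st row" then resolves to toGrid(controls)[0][2].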
According to the voice control method provided by this embodiment, a voice instruction triggered by the user in real time for the interface being viewed can be better fulfilled by matching the voice instruction with the operable information corresponding to the interface (the current user interface) displayed when the voice instruction is acquired. In addition, in this embodiment the operable information includes the identifier and description information of each control recognized in the current user interface, so that the control information can be matched with the description information to determine the target control from the controls included in the current user interface and to generate the control instruction corresponding to the target control.
Referring to fig. 6, a voice control method provided in the present application includes:
s310: and acquiring an instruction text corresponding to the voice instruction, and acquiring a control purpose, a control object and object auxiliary information based on the instruction text.
S320: and acquiring operable information corresponding to a current user interface, wherein the current user interface is an interface displayed when the voice instruction is acquired, the operable information comprises an identifier and description information of a control identified from the current user interface, and the operable information is information for describing an operation corresponding to the current user interface.
S330: and matching the control object with the operable information.
S340: and taking the operable information successfully matched with the control object as target operable information.
S350: and executing the target operation in response to the determined target operable information.
As a mode, after determining the target operable information, a control corresponding to the target operable information may be used as a target control, and a control instruction corresponding to the target control is generated based on a control purpose corresponding to the voice instruction and an identifier of the target control.
Optionally, in this embodiment, the control object in the triple enables the electronic device to determine more accurately which control in the current user interface the user wants to control. The control object may therefore be matched with the description information, so that the user's control intention is executed more accurately. When the control object is matched against the description information included in the operable information, the description information successfully matched with the control object may be used directly as the target description information, and the control corresponding to the target description information as the target control (the operable information to which the target description information belongs being the target operable information).
For example, the triple converted from the user's voice instruction is {play, Chen Qing Ling, Φ}, and the obtained operable information corresponding to the current user interface includes control 1, whose description information includes "Chen Qing Ling", and control 2, whose description information includes "Gu Dong Ju Zhong Ju". After the control object ("Chen Qing Ling") is matched against the description information of control 1 and of control 2 respectively, it can be determined that the description information of control 1 matches the control object successfully, and control 1 can then be determined as the target control.
According to the voice control method provided by this embodiment, a voice instruction triggered by the user in real time for the interface being viewed can be better fulfilled by matching the voice instruction with the operable information corresponding to the interface (the current user interface) displayed when the voice instruction is acquired. In addition, in this embodiment the voice instruction is converted into the corresponding instruction text, and the control purpose, the control object, and the object auxiliary information are extracted from the instruction text, so that the control object can be matched with the description information of the controls, improving the accuracy of the determined target control.
Referring to fig. 7, a voice control method provided in the present application includes:
s410: and acquiring a voice instruction.
S420: and identifying the current user interface based on a code analysis mode to acquire operable information, wherein the operable information at least comprises an identification and description information of a control which can be identified based on the code analysis mode, and the current user interface is an interface displayed when the voice instruction is acquired.
Optionally, in the embodiments of the present application, the current user interface may be identified in the code parsing manner through accessibility service access (e.g., the Android accessibility framework).
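A minimal sketch of collecting operable information through the accessibility node tree, assuming an Android AccessibilityService whose rootInActiveWindow provides the current window; taking the identifier and description from the node fields shown is an assumption.

    import android.view.accessibility.AccessibilityNodeInfo

    // Walk the node tree and collect (identifier, description) pairs for
    // clickable controls, i.e. the operable information of the interface.
    fun collectOperableInfo(node: AccessibilityNodeInfo?, out: MutableList<Pair<String, String>>) {
        if (node == null) return
        if (node.isClickable) {
            val description = (node.text ?: node.contentDescription ?: "").toString()
            out.add((node.viewIdResourceName ?: "") to description)
        }
        for (i in 0 until node.childCount) collectOperableInfo(node.getChild(i), out)
    }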
S430: if the operable information contains operable information successfully matched with the control information, taking that operable information as the target description information, taking the control corresponding to the target description information as the target control, and generating a control instruction corresponding to the target control based on the control purpose corresponding to the voice instruction and the identifier of the target control.
S440: and executing the control instruction.
As shown in fig. 8, the flow of the voice control method according to the present embodiment will be described.
As shown in fig. 8, speech recognition may first be performed on the acquired speech to obtain the instruction text (i.e., the text content), and the instruction text is then processed based on natural language understanding to obtain the triple by conversion. While the user's speech is being processed, the current user interface is processed synchronously: interactive-interface understanding may be performed on it, which may include parsing the element code of the interface or understanding a screenshot corresponding to the current user interface, so as to obtain an interface element list containing the operable information corresponding to the interface. After the triple and the interface element list are obtained, they can be matched, and operation generation is then performed. It can be understood that the generated operation may be an operation determined based on the target control, and a control instruction is then generated based on the generated operation.
According to the voice control method provided by this embodiment, a voice instruction triggered by the user in real time for the interface being viewed can be better fulfilled by matching the voice instruction with the operable information corresponding to the interface (the current user interface) displayed when the voice instruction is acquired. In addition, in this embodiment the current user interface is identified by code parsing, so the controls in the current user interface need not be labeled and recorded in advance but can be recognized dynamically in real time. This improves the flexibility of interface identification and, because no advance labeling and recording is needed, also reduces storage-space occupation and labor cost.
Referring to fig. 9, a voice control method provided in the present application includes:
s510: and acquiring a voice instruction.
S520: and identifying the current user interface based on a code analysis mode to acquire first operable information, wherein the first operable information at least comprises an identifier and description information of a control which can be identified based on the code analysis mode, and the current user interface is an interface displayed when the voice instruction is acquired.
Optionally, the first operable information may include, in addition to the identifier and description information of each recognizable control, the type and size of each recognized control.
S530: and detecting whether the operable information successfully matched with the control information exists in the first operable information.
S531: if the first operable information contains operable information successfully matched with the control information, taking that operable information as the target description information, taking the control corresponding to the target description information as the target control, and generating a control instruction corresponding to the target control based on the control purpose corresponding to the voice instruction and the identifier of the target control.
S540: if the first operable information contains no operable information successfully matched with the control information, identifying the current user interface based on image-text recognition to obtain second operable information, the second operable information at least comprising the position and description information of each control identifiable by image-text recognition.
The image-text Recognition mode may include an OCR (Optical Character Recognition) mode.
S550: and detecting whether the second operable information has operable information which is successfully matched with the control information.
S551: if the second operable information contains operable information successfully matched with the control information, taking that operable information as the target description information, taking the control corresponding to the target description information as the target control, and generating a control instruction corresponding to the target control based on a simulated click and the control purpose of the instruction text.
S560: and executing the control instruction.
Furthermore, as a mode, the method provided in this embodiment further includes:
S552: if the second operable information contains no operable information successfully matched with the control information, and the control purpose corresponding to the voice instruction is a designated playback control operation, identifying the playback-control-class controls included in the current user interface based on a control recognition mode.
S553: acquiring the target control from the playback-control-class controls included in the current user interface based on the control purpose corresponding to the control information, and generating a control instruction corresponding to the target control based on the control purpose corresponding to the voice instruction.
As one mode, acquiring the operable information corresponding to the current user interface (the interface displayed when the voice instruction is acquired) includes: if the control purpose corresponding to the voice instruction is a designated playback control operation, identifying the operable information of the playback-control-class controls included in the current user interface based on a control recognition mode.
It should be noted that, for some applications or interfaces, the information of all controls may not be obtainable by code parsing, and the control the user intends to operate may be exactly one of the unrecognized controls, in which case the electronic device cannot carry out the user's intention smoothly. Further identifying controls by image-text recognition in this case makes it more likely that the control intended by the user is obtained. Moreover, image-text recognition determines the position and description information of a control from the text configured in the control, and some playback-control-class controls (such as play, pause, share, and favorite) have no text configured, so controls of these classes cannot be recognized by image-text recognition. Recognizing playback-control-class icon controls by a control recognition mode therefore further raises the probability of obtaining the control the user intends to operate.
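The three-tier lookup of this embodiment, sketched as a fallback chain; the recognizer functions are placeholders for the code parsing, image-text recognition, and control recognition mechanisms named above.

    // A candidate target control with its description information.
    data class Candidate(val description: String)

    // Code parsing first, then image-text (OCR) recognition, then, when the
    // control purpose is a playback control operation, control recognition.
    fun findTargetControl(
        controlObject: String,
        isPlaybackPurpose: Boolean,
        byCodeParsing: () -> List<Candidate>,
        byImageTextRecognition: () -> List<Candidate>,
        byControlRecognition: () -> List<Candidate>,
    ): Candidate? =
        byCodeParsing().firstOrNull { it.description == controlObject }
            ?: byImageTextRecognition().firstOrNull { it.description == controlObject }
            ?: if (isPlaybackPurpose)
                byControlRecognition().firstOrNull { it.description == controlObject }
               else null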
Next, a speech control method according to the present embodiment will be described with reference to fig. 10.
As shown in fig. 10, the electronic device may begin with voice wake-up detection. It can be understood that, if the electronic device always converted the user's speech in real time and matched it against the operable information of the interface, resources would be wasted and meaningless operations might be triggered. Therefore, the user may trigger the electronic device to start executing the voice control method of the embodiments of the present application through predetermined voice content. After receiving the predetermined voice content, the electronic device determines that the user expects to trigger corresponding operations by voice instruction, takes the content uttered by the user after the predetermined voice content as the voice instruction, and performs automatic speech recognition on it to convert the voice instruction into the instruction text.
Here, natural language understanding can be performed on the instruction text to obtain the triple. The user interface (the current user interface) is then acquired and matched with the triple to obtain a control instruction, which is executed. The specific matching may use the description information obtained in any of the three ways involved in this embodiment.
If the electronic device does not receive any speech from the user within a specified time span after acquiring the predetermined voice content, a timeout is determined and the voice control method is exited.
It should be noted that, in the embodiments of the present application, where the description information of controls can be obtained in multiple manners, the electronic device may, when computing resources are sufficient, trigger at least two of these manners simultaneously and determine the target control from the description information obtained by each of them in parallel. If the description information acquired in one manner has already been used to determine the target control, the concurrently triggered processes that are determining the target control from description information acquired in the other manners are stopped.
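A minimal sketch of racing several recognition modes and stopping the rest once one yields a target, using Kotlin coroutines; the recognizers are placeholders for the modes described above.

    import kotlinx.coroutines.*

    // Launch all recognizers concurrently; return the first non-null result and
    // cancel the remaining recognition jobs, or return null if all fail.
    suspend fun <T : Any> firstSuccessful(recognizers: List<suspend () -> T?>): T? =
        coroutineScope {
            val winner = CompletableDeferred<T?>()
            val jobs = recognizers.map { recognize ->
                launch { recognize()?.let { winner.complete(it) } }
            }
            // If every recognizer finishes without a match, settle on null.
            launch { jobs.joinAll(); winner.complete(null) }
            winner.await().also { jobs.forEach { job -> job.cancel() } }
        }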
According to the voice control method provided by this embodiment, a voice instruction triggered by the user in real time for the interface being viewed can be better fulfilled by matching the voice instruction with the operable information corresponding to the interface (the current user interface) displayed when the voice instruction is acquired. In addition, in this embodiment, if the first operable information contains no operable information successfully matched with the control information, the current user interface can be further identified by image-text recognition to obtain the second operable information, from which the target control is obtained; combining identification by code parsing with identification by image-text recognition allows the voice instruction to be executed more accurately. Moreover, if the second operable information also contains no operable information successfully matched with the control information, the control intention of the user's voice instruction can be further determined in combination with a control recognition mode, further improving the accuracy of executing the voice instruction.
Referring to fig. 11, a voice control method provided in the present application includes:
s610: and acquiring a voice instruction.
S620: and acquiring operable information corresponding to a current user interface, wherein the current user interface is an interface displayed when the voice instruction is acquired, the operable information is information for describing an operation corresponding to the current user interface, and the operable information at least comprises description information of a control in the current user interface.
S630: and acquiring the similarity between the voice instruction and the description information of the control included in the current user interface.
S640: and taking the description information of which the corresponding similarity meets the similarity condition as the target description information.
S650: and executing the target operation in response to the determined target description information.
It should be noted that, in the embodiments of the present application, the voice command may be converted into the control information described in the foregoing embodiments, and the similarity between the control information and the description information is then compared. Since the control information can be implemented in several forms, different forms of control information correspond to different ways of determining the similarity.
As one mode, if the control information is the text content converted from the voice instruction, the similarity may be determined from the number of identical characters that the control information and the description information of a control have in common: the more identical characters they share, the greater the similarity.
Alternatively, if the control information is a triple extracted from the text content converted from the voice command, obtaining the similarity may include calculating the vector distance between the control object in the triple and the control description information and taking the calculated distance as the similarity. In this way, the text vector corresponding to the control object and the text vector corresponding to the description information are obtained first, and the vector distance between them is then calculated, for example by Euclidean distance or cosine distance. The text vectors corresponding to the control object and to the description information may be computed in a related-art manner, for example with a trained deep neural network model, which is not described in detail in this embodiment.
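A minimal sketch of the two similarity measures just described; how the text vectors are produced (e.g., by a trained deep neural network model) is out of scope here.

    import kotlin.math.sqrt

    // Plain-text form: count the distinct characters two strings share.
    fun sharedCharacterCount(a: String, b: String): Int =
        a.toSet().intersect(b.toSet()).size

    // Triple form: cosine similarity between the text vector of the control
    // object and the text vector of a control's description information.
    fun cosineSimilarity(v1: FloatArray, v2: FloatArray): Float {
        require(v1.size == v2.size) { "vectors must have the same dimension" }
        var dot = 0f; var n1 = 0f; var n2 = 0f
        for (i in v1.indices) {
            dot += v1[i] * v2[i]
            n1 += v1[i] * v1[i]
            n2 += v2[i] * v2[i]
        }
        return dot / (sqrt(n1) * sqrt(n2))
    }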
According to the voice control method provided by this embodiment, a voice instruction triggered by the user in real time for the interface being viewed can be better fulfilled by matching the voice instruction with the operable information corresponding to the interface (the current user interface) displayed when the voice instruction is acquired. Moreover, in this embodiment the target control is obtained by means of similarity, which further improves the probability that the voice instruction is executed successfully.
Referring to fig. 12, the present application provides a voice control apparatus 700, where the apparatus 700 includes:
a voice instruction converting unit 710, configured to obtain a voice instruction;
an operation information obtaining unit 720, configured to obtain operable information corresponding to a current user interface, where the current user interface is the interface displayed when the voice instruction is obtained, and the operable information describes the operations corresponding to the current user interface;
a target information obtaining unit 730, configured to match the voice instruction with the operable information of the current user interface, so as to determine target operable information from the operable information of the current user interface; and
an operation execution unit 740, configured to execute the target operation in response to the determined target operable information.
In one implementation, identifying the current user interface includes at least one of the following: identifying the current user interface based on code parsing; identifying the current user interface based on image-text recognition; and identifying the current user interface based on a control classification model. Optionally, the training process of the control classification model includes: acquiring a user interface; acquiring the controls classified from the user interface; and training the neural network model to be trained with the classified controls to obtain the control classification model, as sketched below.
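By way of illustration only, the following minimal sketch shows what such a training step could look like, assuming PyTorch is available and the classified controls exist as labeled image crops. The tiny convolutional network, the four hypothetical control categories, and the random placeholder tensors are assumptions of the sketch, not the embodiment's actual architecture or data:

import torch
import torch.nn as nn

NUM_CLASSES = 4  # hypothetical control categories (button, slider, ...)

model = nn.Sequential(  # stand-in for the neural network model to be trained
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(8),
    nn.Flatten(), nn.Linear(16 * 8 * 8, NUM_CLASSES),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Placeholder batch standing in for control crops cut from captured interfaces.
crops = torch.randn(32, 3, 64, 64)
labels = torch.randint(0, NUM_CLASSES, (32,))

for epoch in range(5):  # train to obtain the control classification model
    optimizer.zero_grad()
    loss = loss_fn(model(crops), labels)
    loss.backward()
    optimizer.step()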
In one manner, the operation information obtaining unit 720 is specifically configured to identify the current user interface by code parsing to obtain the operable information of the current user interface. In this manner, the target information obtaining unit 730 is specifically configured to: if some of the operable information is successfully matched with the voice instruction, take that operable information as the target operable information; and if none of the operable information is successfully matched with the voice instruction, identify the current user interface by image-text recognition to obtain operable information, and obtain the target operable information from the operable information identified in that way.
Optionally, the target information obtaining unit 730 is further specifically configured to: if the operable information obtained by image-text recognition of the current user interface contains operable information successfully matched with the control instruction, take that operable information as the target operable information; and if it does not, identify the current user interface based on the control classification model to obtain operable information, and obtain the target operable information from the operable information identified by the control classification model.
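The three recognition modes thus form a fallback cascade. A minimal sketch of that cascade follows; the recognizer stubs and the naive substring matcher are hypothetical stand-ins, not names from the embodiment:

def resolve_target(voice_instruction, recognizers, match):
    # Try each recognizer in priority order (code parsing, then image-text
    # recognition, then the control classification model) and return the
    # first successfully matched operable information.
    for recognize in recognizers:
        target = match(voice_instruction, recognize())
        if target is not None:
            return target
    return None

# Hypothetical usage with stub recognizers and a naive substring matcher.
code_parse = lambda: [{"id": "btn_next", "desc": "Next episode"}]
ocr = lambda: []           # image-text recognition stub
classifier = lambda: []    # control classification model stub
match = lambda text, cands: next(
    (c for c in cands if c["desc"].lower() in text.lower()), None)
print(resolve_target("open next episode", [code_parse, ocr, classifier], match))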
In one manner, the operable information obtained by image-text recognition of the current user interface includes the position and the description information of each recognized control. In this manner, the operation execution unit 740 is specifically configured to: in response to target operable information being determined from the operable information identified by image-text recognition, take the control corresponding to the target operable information as the target control; and generate a control instruction for the target control by simulating a click according to the control purpose of the voice instruction, and execute the control instruction.
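As an illustration of the simulated-click branch, the following sketch assumes an Android-style environment where a tap can be injected with the adb shell input tap command; the use of adb and the coordinates are assumptions of the sketch, and the bounding box is taken to come from image-text recognition:

import subprocess

def simulate_click(bounds):
    # Tap the center of the target control's bounding box
    # (left, top, right, bottom), here via adb's input-injection command.
    left, top, right, bottom = bounds
    x, y = (left + right) // 2, (top + bottom) // 2
    subprocess.run(["adb", "shell", "input", "tap", str(x), str(y)], check=True)

# Hypothetical usage: image-text recognition located a "Next episode" button.
simulate_click((840, 1620, 1040, 1700))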
In one manner, the operable information obtained by code parsing of the current user interface includes the identifier and the description information of each recognized control. In this manner, the operation execution unit 740 is specifically configured to: in response to target operable information being determined from the operable information identified by code parsing, take the control corresponding to the target operable information as the target control; and generate a control instruction based on the control purpose corresponding to the voice instruction and the identifier of the target control.
In one manner, the operation execution unit 740 is specifically configured to: if the target operable information is description information for performing a specified overall operation on the interface, generate a control instruction for the specified overall operation and execute it.
In one manner, the target information obtaining unit 730 is specifically configured to: obtain the instruction text corresponding to the voice instruction, and obtain the control purpose, the control object and the object accessory information from the instruction text; match the control object with the operable information; and take the operable information successfully matched with the control object as the target operable information.
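By way of illustration, a minimal rule-based sketch of this extraction follows. The verb list and the single pattern are assumptions, since the embodiment does not fix the extraction method (a trained language-understanding model could equally produce the triple):

import re

PURPOSE_VERBS = ["click", "open", "play", "increase", "decrease", "close"]

def extract_triple(instruction_text):
    # Split the instruction text into (control purpose, control object,
    # object accessory information); returns None when no verb matches.
    pattern = rf"^({'|'.join(PURPOSE_VERBS)})\s+(.*?)(?:\s+(?:to|by|at)\s+(.*))?$"
    m = re.match(pattern, instruction_text.strip().lower())
    if not m:
        return None
    return m.group(1), m.group(2), m.group(3)

print(extract_triple("increase volume to 70%"))  # ('increase', 'volume', '70%')
print(extract_triple("open settings"))           # ('open', 'settings', None)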
In one approach, the operable information includes the position and the description information of each control identified from the current user interface. The operation execution unit 740 is further specifically configured to: if the voice instruction includes position information, match the positions of the controls contained in the operable information with the position information contained in the voice instruction; take the control whose position is successfully matched with the position information as the target control; and execute the target operation in response to the target control being determined.
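As an illustration of this position branch, the following sketch assumes the spoken position has already been normalized to region words such as "top" and "right", and that each control carries the bounding box found during recognition; the screen size and the controls are hypothetical:

def in_region(bounds, region, screen_w=1080, screen_h=2400):
    # Decide whether the center of a (left, top, right, bottom) bounding box
    # lies in the named half of the screen.
    left, top, right, bottom = bounds
    cx, cy = (left + right) / 2, (top + bottom) / 2
    return {"top": cy < screen_h / 2, "bottom": cy >= screen_h / 2,
            "left": cx < screen_w / 2, "right": cx >= screen_w / 2}[region]

controls = [{"desc": "Play", "bounds": (60, 2200, 260, 2300)},
            {"desc": "Search", "bounds": (880, 80, 1040, 180)}]
# "click the button in the top right" -> region words "top" and "right"
matches = [c for c in controls
           if in_region(c["bounds"], "top") and in_region(c["bounds"], "right")]
print(matches[0]["desc"])  # Search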
In one manner, the operation information obtaining unit 720 is specifically configured to: if the control purpose corresponding to the control instruction is to perform a specified playback control operation, identify the playback-control-class controls included in the current user interface based on the control classification model, so as to obtain the operable information of the current user interface.
In one manner, the target information obtaining unit 730 is specifically configured to: update the voice instruction according to the task scene corresponding to the current user interface to obtain a scene control instruction; and match the scene control instruction with the operable information of the current user interface, so as to determine the target operable information from the operable information of the current user interface.
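By way of illustration, the following sketch shows one way such an update could work; the scene-to-template table is a hypothetical example, since the embodiment only requires that the updated instruction reflect the task scene of the current user interface:

SCENE_TEMPLATES = {
    "video_playback": "{} the video",   # e.g. "pause" -> "pause the video"
    "music_playback": "{} the song",
}

def to_scene_instruction(voice_instruction, scene):
    # Rewrite a terse instruction with the task scene of the current
    # interface; unknown scenes leave the instruction unchanged.
    template = SCENE_TEMPLATES.get(scene)
    return template.format(voice_instruction) if template else voice_instruction

print(to_scene_instruction("pause", "video_playback"))  # "pause the video"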
In one approach, the operable information includes at least the description information of the controls in the current user interface. The target information obtaining unit 730 is specifically configured to: obtain the similarity between the voice instruction and the description information of the controls included in the current user interface; and take the description information whose similarity satisfies the similarity condition as the target description information. The operation execution unit 740 is further specifically configured to execute the target operation in response to the determined target description information.
With the voice control apparatus provided in this embodiment, after a voice instruction is acquired together with the operable information corresponding to the current user interface, the voice instruction is matched with that operable information to determine the target operable information, and the target operation is executed in response to the determined target operable information. A voice instruction triggered by the user in real time for the interface being viewed can thus be carried out more reliably by matching the instruction against the operable information of the interface (the current user interface) displayed when the instruction was acquired.
It will be clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the apparatus and units described above may refer to the corresponding processes in the foregoing method embodiments and are not repeated here. In the embodiments provided herein, the coupling between modules may be electrical. In addition, the functional modules in the embodiments of the present application may be integrated into one processing module, or each module may exist alone physically, or two or more modules may be integrated into one module. The integrated module may be implemented in hardware or as a software functional module.
An electronic device provided by the present application will be described below with reference to fig. 13.
Referring to fig. 13, based on the voice control method and apparatus above, an embodiment of the present application further provides an electronic device 1000 capable of executing the voice control method. The electronic device 1000 includes one or more processors 102 (only one shown), a memory 104, a camera 106, and an audio acquisition device 108, coupled to one another. The memory 104 stores program code implementing the foregoing embodiments, and the processor 102 can execute the program code stored in the memory 104.
The processor 102 may include one or more processing cores. The processor 102 connects various parts of the electronic device 1000 using various interfaces and lines, and performs the various functions of the electronic device 1000 and processes data by running or executing the instructions, programs, code sets, or instruction sets stored in the memory 104 and invoking the data stored in the memory 104. Optionally, the processor 102 may be implemented in hardware in at least one of the forms of a Digital Signal Processor (DSP), a Field-Programmable Gate Array (FPGA), and a Programmable Logic Array (PLA). The processor 102 may integrate one or more of a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a modem, and the like, where the CPU mainly handles the operating system, the user interface, application programs, and so on; the GPU is responsible for rendering and drawing display content; and the modem handles wireless communication. It is understood that the modem may also not be integrated into the processor 102 and may instead be implemented by a separate communication chip. In one approach, the processor 102 may be a neural network chip, for example an embedded neural-network processing unit (NPU).
The memory 104 may include a Random Access Memory (RAM) or a Read-Only Memory (ROM). The memory 104 may be used to store instructions, programs, code, code sets, or instruction sets. The memory 104 may include a program storage area and a data storage area, where the program storage area may store instructions for implementing the operating system, instructions for implementing at least one function (such as a touch function, a sound playing function, or an image playing function), instructions for implementing the foregoing method embodiments, and the like.
Furthermore, the electronic device 1000 may further include a network module 110 and a sensor module 112 in addition to the aforementioned components.
The network module 110 is used for information interaction between the electronic device 1000 and other devices, for example transmitting device control instructions, manipulation request instructions, status information acquisition instructions, and the like. When the electronic device 1000 is embodied as different devices, the corresponding network module 110 may differ.
The sensor module 112 may include at least one sensor, including but not limited to: a level sensor, a light sensor, a motion sensor, a pressure sensor, an infrared heat sensor, a distance sensor, an acceleration sensor, and other sensors.
The pressure sensor may detect the pressure generated by pressing on the electronic device 1000, that is, the pressure generated by contact or pressing between the user and the electronic device, for example between the user's ear and the mobile terminal. The pressure sensor may thus be used to determine whether contact or pressing has occurred between the user and the electronic device 1000, as well as the magnitude of the pressure.
The acceleration sensor may detect the magnitude of acceleration in each direction (generally, three axes), may detect the magnitude and direction of gravity when stationary, and may be used for applications that recognize the attitude of the electronic device 1000 (such as landscape/portrait switching, related games, and magnetometer attitude calibration) and for vibration-recognition functions (such as a pedometer or tap detection). In addition, the electronic device 1000 may also be configured with other sensors such as a gyroscope, a barometer, a hygrometer, and a thermometer, which are not described here again.
The audio acquisition device 108 is used to acquire audio signals. Optionally, the audio acquisition device 108 may include a plurality of audio acquisition elements, which may be microphones.
In one approach, the network module of the electronic device 1000 is a radio frequency module, which receives and transmits electromagnetic waves and converts between electromagnetic waves and electrical signals, so as to communicate with a communication network or other devices. The radio frequency module may include various existing circuit elements for performing these functions, such as an antenna, a radio frequency transceiver, a digital signal processor, an encryption/decryption chip, a Subscriber Identity Module (SIM) card, memory, and so on. For example, the radio frequency module may interact with an external device through transmitted or received electromagnetic waves, such as by sending instructions to a target device.
Referring to fig. 14, a block diagram of a computer-readable storage medium according to an embodiment of the present application is shown. The computer-readable medium 800 stores program code that can be called by a processor to execute the methods described in the above method embodiments.
The computer-readable storage medium 800 may be an electronic memory such as a flash memory, an EEPROM (Electrically Erasable Programmable Read-Only Memory), an EPROM, a hard disk, or a ROM. Optionally, the computer-readable storage medium 800 includes a non-volatile computer-readable storage medium. The computer-readable storage medium 800 has storage space for program code 810 that performs any of the method steps described above. The program code may be read from or written into one or more computer program products. The program code 810 may, for example, be compressed in a suitable form.
In summary, with the voice control method, apparatus, and electronic device provided by the present application, after the voice instruction and the operable information corresponding to the current user interface are acquired, the voice instruction is matched with that operable information to determine the target operable information, and the target operation is executed in response to the determined target operable information. A voice instruction triggered by the user in real time for the interface being viewed can thus be carried out more reliably by matching the instruction against the operable information of the interface (the current user interface) displayed when the instruction was acquired.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced, and such modifications and substitutions do not depart the corresponding technical solutions from the spirit and scope of the embodiments of the present application.

Claims (17)

1. A method for voice control, the method comprising:
acquiring a voice instruction;
identifying a current user interface to acquire operable information of the current user interface, wherein the current user interface is a user interface displayed when the voice instruction is received;
matching the voice instruction with the operable information of the current user interface so as to determine target operable information from the operable information of the current user interface;
and executing the target operation in response to the determined target operable information.
2. The method of claim 1, wherein identifying the current user interface comprises at least one of:
identifying the current user interface based on a code parsing mode;
identifying the current user interface based on an image-text recognition mode; and
identifying the current user interface based on a control classification model.
3. The method of claim 2, wherein identifying the current user interface to obtain the operable information of the current user interface comprises:
identifying the current user interface based on a code parsing mode to acquire the operable information of the current user interface;
and wherein matching the voice instruction with the operable information of the current user interface to determine the target operable information from the operable information of the current user interface comprises:
if operable information successfully matched with the voice instruction exists in the operable information, taking the operable information successfully matched with the voice instruction as the target operable information; and
if no operable information successfully matched with the voice instruction exists in the operable information, identifying the current user interface based on an image-text recognition mode to obtain operable information, and obtaining the target operable information based on the operable information obtained by identifying the current user interface in the image-text recognition mode.
4. The method of claim 3, wherein identifying the current user interface based on the image-text recognition mode to obtain operable information, and obtaining the target operable information based on the operable information obtained by identifying the current user interface in the image-text recognition mode, comprises:
if the operable information obtained by identifying the current user interface in the image-text recognition mode comprises operable information successfully matched with the control instruction, taking the operable information successfully matched with the control instruction as the target operable information; and
if no operable information successfully matched with the control instruction exists in the operable information obtained by identifying the current user interface in the image-text recognition mode, identifying the current user interface based on a control classification model to obtain operable information, and obtaining the target operable information based on the operable information obtained by identification through the control classification model.
5. The method of claim 3, wherein the operable information obtained by identifying the current user interface in the image-text recognition mode comprises the position and description information of the identified control, and wherein performing the target operation in response to the determined target operable information comprises:
in response to target operable information being determined from the operable information obtained through image-text recognition, taking the control corresponding to the target operable information as a target control; and
generating a control instruction corresponding to the target control based on a simulated click and the control purpose of the voice instruction, and executing the control instruction.
6. The method of claim 3, wherein the operable information obtained by identifying the current user interface based on the code parsing mode comprises the identifier and description information of the identified control, and wherein performing the target operation in response to the determined target operable information comprises:
in response to target operable information being determined from the operable information obtained through code parsing, taking the control corresponding to the target operable information as a target control; and
generating a control instruction based on the control purpose corresponding to the voice instruction and the identifier of the target control.
7. The method of claim 2, wherein the training process of the control classification model comprises:
acquiring a user interface;
acquiring controls classified from the user interface;
and training the neural network model to be trained with the classified controls to obtain the control classification model.
8. The method of claim 1, wherein performing the target operation in response to the determined target operable information comprises:
if the target operable information is description information for performing a specified overall operation on the interface, generating a control instruction for the specified overall operation and executing the control instruction.
9. The method of claim 1, wherein matching the voice instruction with the operable information of the current user interface to determine the target operable information from the operable information of the current user interface comprises:
acquiring an instruction text corresponding to the voice instruction, and acquiring a control purpose, a control object, and object accessory information based on the instruction text;
matching the control object with the operable information; and
taking the operable information successfully matched with the control object as the target operable information.
10. The method of claim 1, wherein the operable information includes the position and description information of a control identified from the current user interface, the method further comprising:
if the voice instruction comprises position information, matching the position of the control contained in the operable information with the position information contained in the voice instruction;
taking the control of the position successfully matched with the position information as a target control;
in response to determining the target control, performing a target operation.
11. The method of claim 1, wherein identifying the current user interface to obtain the operable information of the current user interface comprises:
if the control purpose corresponding to the control instruction is to perform a specified playback control operation, identifying controls of a playback control class included in the current user interface based on the control classification model, so as to obtain the operable information of the current user interface.
12. The method of claim 1, wherein matching the voice instruction with the operable information of the current user interface to determine the target operable information from the operable information of the current user interface comprises:
updating the voice instruction according to a task scene corresponding to the current user interface to obtain a scene control instruction;
and matching the scene control instruction with the operable information of the current user interface so as to determine target operable information from the operable information of the current user interface.
13. The method of claim 1, wherein the operable information includes at least description information of a control in the current user interface, and matching the voice instruction with the operable information of the current user interface to determine the target operable information from the operable information of the current user interface comprises:
acquiring similarity between the voice instruction and description information of a control included in a current user interface;
using the description information of which the corresponding similarity meets the similarity condition as target description information;
wherein performing the target operation in response to the determined target operable information comprises:
and executing the target operation in response to the determined target description information.
14. A voice control apparatus, characterized in that the apparatus comprises:
the voice instruction conversion unit is used for acquiring a voice instruction;
the operation information acquisition unit is used for identifying a current user interface to acquire operable information of the current user interface, wherein the current user interface is a user interface displayed when the voice instruction is received;
the target information acquisition unit is used for matching the voice instruction with the operable information of the current user interface so as to determine target operable information from the operable information of the current user interface;
an operation execution unit for executing a target operation in response to the determined target operable information.
15. An electronic device comprising one or more processors and memory;
one or more programs stored in the memory and configured to be executed by the one or more processors, the one or more programs configured to perform the method of any of claims 1-13.
16. A computer-readable storage medium, having program code stored therein, wherein the method of any of claims 1-13 is performed when the program code is run.
17. A computer program product comprising computer program/instructions, characterized in that the computer program/instructions, when executed by a processor, implement the steps of the method of any of claims 1-13.
CN202111340937.XA 2021-11-12 2021-11-12 Voice control method and device and electronic equipment Pending CN114049892A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202111340937.XA CN114049892A (en) 2021-11-12 2021-11-12 Voice control method and device and electronic equipment
PCT/CN2022/107786 WO2023082703A1 (en) 2021-11-12 2022-07-26 Voice control method and apparatus, electronic device, and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111340937.XA CN114049892A (en) 2021-11-12 2021-11-12 Voice control method and device and electronic equipment

Publications (1)

Publication Number Publication Date
CN114049892A true CN114049892A (en) 2022-02-15

Family

ID=80208702

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111340937.XA Pending CN114049892A (en) 2021-11-12 2021-11-12 Voice control method and device and electronic equipment

Country Status (2)

Country Link
CN (1) CN114049892A (en)
WO (1) WO2023082703A1 (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114676673A (en) * 2022-03-28 2022-06-28 广东工业大学 Form input method, device and equipment of text data based on voice recognition
WO2023082703A1 (en) * 2021-11-12 2023-05-19 杭州逗酷软件科技有限公司 Voice control method and apparatus, electronic device, and readable storage medium
WO2023087934A1 (en) * 2021-11-19 2023-05-25 杭州逗酷软件科技有限公司 Voice control method, apparatus, device, and computer storage medium
CN116382615A (en) * 2023-03-17 2023-07-04 深圳市同行者科技有限公司 Method, system and related equipment for operating APP (application) through voice
CN118197305A (en) * 2024-03-14 2024-06-14 阿波罗智联(北京)科技有限公司 Voice control method and device and electronic equipment
WO2024179203A1 (en) * 2023-03-02 2024-09-06 华为技术有限公司 Voice control method and electronic device

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116863913B (en) * 2023-06-28 2024-03-29 上海仙视电子科技有限公司 Voice-controlled cross-screen interaction control method

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20110010427U (en) * 2010-04-29 2011-11-04 최상덕 bon_geul
CN107608652B (en) * 2017-08-28 2020-05-22 三星电子(中国)研发中心 Method and device for controlling graphical interface through voice
CN110085224B (en) * 2019-04-10 2021-06-01 深圳康佳电子科技有限公司 Intelligent terminal whole-course voice control processing method, intelligent terminal and storage medium
CN112346695A (en) * 2019-08-09 2021-02-09 华为技术有限公司 Method for controlling equipment through voice and electronic equipment
CN112306447A (en) * 2019-08-30 2021-02-02 北京字节跳动网络技术有限公司 Interface navigation method, device, terminal and storage medium
CN114049892A (en) * 2021-11-12 2022-02-15 杭州逗酷软件科技有限公司 Voice control method and device and electronic equipment


Also Published As

Publication number Publication date
WO2023082703A1 (en) 2023-05-19


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination