CN111309283A - Voice control method and device for user interface, electronic equipment and storage medium - Google Patents


Info

Publication number
CN111309283A
Authority
CN
China
Prior art keywords
information
control
controlled
user interface
text
Prior art date
Legal status
Granted
Application number
CN202010220645.1A
Other languages
Chinese (zh)
Other versions
CN111309283B (en)
Inventor
李扬 (Li Yang)
王雷 (Wang Lei)
李士岩 (Li Shiyan)
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202010220645.1A
Publication of CN111309283A
Application granted
Publication of CN111309283B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00: Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/16: Sound input; Sound output
    • G06F 3/167: Audio in a user interface, e.g. using voice commands for navigating, audio feedback
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/26: Speech to text systems

Abstract

The application discloses a voice control method and apparatus for a user interface, an electronic device, and a storage medium, relating to voice technology within the field of computer technology. The specific implementation scheme is as follows: voice information input by a user is first acquired and converted into text information; a control instruction is then generated according to the text information and feature information in the user interface to be controlled; finally, in response to the control instruction, the control to be controlled that corresponds to the control instruction in the user interface to be controlled is triggered, so that the user can complete interface operations on the user interface to be controlled by voice. Operating the user interface by voice in this way lets the user interact with the service device more directly, providing a more accommodating and relaxed service experience.

Description

Voice control method and device for user interface, electronic equipment and storage medium
Technical Field
The present application relates to a voice technology in the field of computer technologies, and in particular, to a method and an apparatus for controlling a user interface with voice, an electronic device, and a storage medium.
Background
With the rapid development of voice technology, the voice technology has wide application in the field of electronic devices (e.g., the field of service devices).
Existing interaction with service devices generally adopts a touch-only mode, a voice-only mode, or a simple superposition of the two. In addition, voice interaction generally requires the user to speak a specific wake-up word before the voice service will run.
As a result, the current means of controlling a service device, and especially the user interface on a service device, is limited: the business logic requirements of multi-round interaction can only be met by combining voice with a touch mode.
Disclosure of Invention
The voice control method and apparatus for a user interface, electronic device, and storage medium provided by the present application can control the user interface to be controlled directly according to voice information input by the user, and can thereby meet the user's complex business logic requirements.
In a first aspect, an embodiment of the present application provides a method for controlling a user interface with voice, including:
determining text information according to the acquired voice information;
generating a control instruction according to the text information and feature information in the user interface to be controlled, wherein the feature information corresponds to each control in the user interface to be controlled;
and responding to the control instruction, and triggering a control to be controlled corresponding to the control instruction in the user interface to be controlled.
In this embodiment, the voice information input by the user is acquired and converted into text information, a control instruction is generated according to the text information and the feature information in the user interface to be controlled, and, in response to the control instruction, the control to be controlled corresponding to the control instruction in the user interface to be controlled is triggered, so that the user can complete interface operations on the user interface to be controlled by voice. Operating the user interface by voice in this way lets the user interact with the service device more directly, providing a more accommodating and relaxed service experience.
In a second aspect, an embodiment of the present application provides a voice control apparatus for a user interface, including:
the acquisition module is used for acquiring voice information;
the recognition module is used for determining text information according to the voice information;
the generating module is used for generating a control instruction according to the text information and feature information in the user interface to be controlled, wherein the feature information corresponds to each control in the user interface to be controlled;
and the control module is used for responding to the control instruction and triggering the control to be controlled corresponding to the control instruction in the user interface to be controlled.
In this embodiment, the voice information input by the user is acquired and converted into text information, a control instruction is generated according to the text information and the feature information in the user interface to be controlled, and, in response to the control instruction, the control to be controlled corresponding to the control instruction in the user interface to be controlled is triggered, so that the user can complete interface operations on the user interface to be controlled by voice. Operating the user interface by voice in this way lets the user interact with the service device more directly, providing a more accommodating and relaxed service experience.
In a third aspect, an embodiment of the present application provides an electronic device, including:
at least one processor; and
a memory communicatively coupled to the at least one processor;
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a method of voice control of a user interface as claimed in any one of the first aspects.
In a fourth aspect, the present application provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the voice control method of the user interface according to any one of the first aspect.
In a fifth aspect, an embodiment of the present application provides a program product, where the program product includes: a computer program stored in a readable storage medium, the computer program being readable from the readable storage medium by at least one processor of a server, execution of the computer program by the at least one processor causing the server to perform the method of voice control of a user interface of any of the first aspects.
According to the technology of the present application, the user can complete interface operations on the user interface to be controlled by voice and interact with the service device more directly, which in turn provides the user with a more accommodating and relaxed service experience.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. Wherein:
FIG. 1 is a diagram of an application scenario in which a voice control method of a user interface according to an embodiment of the present application may be implemented;
FIG. 2 is a schematic diagram according to a first embodiment of the present application;
FIG. 3 is a schematic diagram according to a second embodiment of the present application;
FIG. 4 is an interface interaction diagram of the second embodiment;
FIG. 5 is another interface interaction diagram of the second embodiment;
FIG. 6 is a further interface interaction diagram of the second embodiment;
FIG. 7 is a schematic illustration according to a third embodiment of the present application;
FIG. 8 is a schematic illustration according to a fourth embodiment of the present application;
FIG. 9 is a schematic illustration according to a fifth embodiment of the present application;
fig. 10 is a block diagram of an electronic device for implementing a voice control method of a user interface according to an embodiment of the present application.
Detailed Description
The following describes exemplary embodiments of the present application with reference to the accompanying drawings, including various details of the embodiments to aid understanding; these details should be regarded as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications may be made to the embodiments described herein without departing from the scope and spirit of the present application. Descriptions of well-known functions and constructions are omitted below for clarity and conciseness.
Existing interaction with service devices generally adopts a touch-only mode, a voice-only mode, or a simple superposition of the two; voice interaction additionally requires the user to speak a specific wake-up word before the voice service will run. The current means of controlling a service device, and especially the user interface on a service device, is therefore limited, and the business logic requirements of multi-round interaction can only be met by combining voice with a touch mode.
To solve these technical problems, the present application provides a voice control method and apparatus for a user interface, an electronic device, and a storage medium: voice information input by a user is first acquired and converted into text information, a control instruction is generated according to the text information and feature information in the user interface to be controlled, and finally, in response to the control instruction, the control to be controlled corresponding to the control instruction in the user interface to be controlled is triggered, so that the user can complete interface operations on the user interface to be controlled by voice. Operating the user interface by voice in this way lets the user interact with the service device more directly, providing a more accommodating and relaxed service experience.
Fig. 1 is a diagram of an application scenario of the voice control method of a user interface according to an embodiment of the present application. As shown in fig. 1, the method provided by this embodiment may be applied to interaction between a user 100 and an electronic device 200: the user 100 controls, by voice, the user interface to be controlled that is displayed on the electronic device 200. The electronic device 200 may be a robot, a service terminal, a computer, a smart phone, and the like. Furthermore, multiple rounds of interaction with the user 100 may be carried out in the user interface of the electronic device 200 by displaying a "digital human" avatar.
For example, when the electronic device 200 is a banking terminal, the user 100 may control the loan application interface displayed on the terminal by voice: the loan amount is filled in, a loan term is selected, and the selected information is confirmed entirely by voice control, so that the user completes the interaction with the loan application interface through voice input alone.
For another example, when the electronic device 200 is a food service terminal, the user 100 may control the ordering interface displayed on the terminal by voice: the number of diners is filled in, dishes are selected, and the selected information is confirmed by voice control, so that the user completes the interaction with the ordering interface through voice input alone.
Fig. 2 is a schematic diagram according to a first embodiment of the present application. As shown in fig. 2, the voice control method for a user interface provided in this embodiment includes:
and S101, determining text information according to the acquired voice information.
In this step, a microphone may be disposed on the electronic device to acquire the voice information input by the user. After acquiring the voice information, the electronic device can recognize it by speech recognition to determine the corresponding text information.
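As a concrete illustration of this step (not part of the patent text), the sketch below captures microphone audio and converts it to text in Python. The third-party SpeechRecognition package and the Google Web Speech API used here are stand-ins chosen for the example; the patent does not name any particular recognition engine.

```python
# A minimal sketch of S101, assuming the third-party SpeechRecognition package;
# the patent itself does not prescribe a recognition engine.
import speech_recognition as sr

def acquire_text_information() -> str:
    recognizer = sr.Recognizer()
    with sr.Microphone() as source:          # microphone disposed on the device
        recognizer.adjust_for_ambient_noise(source)
        audio = recognizer.listen(source)    # acquire the user's voice information
    # convert the voice information into text information (Mandarin assumed here)
    return recognizer.recognize_google(audio, language="zh-CN")
```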
And S102, generating a control instruction according to the text information and the characteristic information in the user interface to be controlled.
Specifically, the control instruction may be generated according to the recognized text information and the feature information in the user interface to be controlled. The feature information corresponds to each control in the user interface to be controlled and may be, for example, the name or attribute information of each control.
In one possible implementation, each user interface to be controlled corresponds to a data list that stores the feature information of each of its controls. After the text information is obtained, the data list can be searched according to the text information to determine the control to be controlled, and the specific operation on that control is then determined from the text information, thereby generating the control instruction.
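A minimal sketch of such a data list and its retrieval might look as follows; all names, the three control types, and the sample entries are illustrative assumptions, not structures defined by the patent.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ControlFeature:
    name: str          # name of the control, e.g. "loan amount" (feature information)
    control_type: str  # assumed types: "input_box", "click", "select"
    control_id: str    # handle used later to trigger the control

# one data list per user interface to be controlled (sample entries are made up)
LOAN_UI_FEATURES = [
    ControlFeature("loan amount", "input_box", "amount_input"),
    ControlFeature("loan term", "select", "term_select"),
    ControlFeature("confirm", "click", "confirm_button"),
]

def find_control(text: str, features: list) -> Optional[ControlFeature]:
    """Retrieve the control whose name keywords appear in the recognized text."""
    for feature in features:
        if feature.name in text:
            return feature
    return None
```

For example, `find_control("the loan amount is 50 thousand", LOAN_UI_FEATURES)` would return the input-box entry, matching the first-text keyword behavior described in the second embodiment below.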
And S103, responding to the control instruction, and triggering a control to be controlled corresponding to the control instruction in the user interface to be controlled.
After the control instruction is generated, the electronic device responds to it and triggers the control to be controlled that corresponds to the control instruction in the user interface to be controlled. For example, when the control instruction is an instruction to click a "confirm" button, the control corresponding to the "confirm" button is triggered in response, realizing voice control of each control in the user interface to be controlled.
In this embodiment, by acquiring the voice information input by the user, converting it into text information, generating a control instruction according to the text information and the feature information in the user interface to be controlled, and triggering, in response to the control instruction, the corresponding control to be controlled, the user can complete interface operations on the user interface to be controlled by voice. Operating the user interface by voice in this way lets the user interact with the service device more directly, providing a more accommodating and relaxed service experience.
Fig. 3 is a schematic diagram according to a second embodiment of the present application. As shown in fig. 3, the voice control method for a user interface provided in this embodiment includes:
s201, determining text information according to the acquired voice information.
In this step, a microphone may be disposed on the electronic device to acquire the voice information input by the user. After acquiring the voice information, the electronic device can recognize it by speech recognition to determine the corresponding text information.
S202, determining a control to be controlled according to the text information and the characteristic information in the user interface to be controlled.
Specifically, the control instruction may be generated according to the recognized text information and the feature information in the user interface to be controlled. The feature information corresponds to each control in the user interface to be controlled and may be, for example, the name or attribute information of each control.
In one possible implementation, each user interface to be controlled corresponds to a data list that stores the feature information of each of its controls; after the text information is obtained, the control to be controlled corresponding to the text information can be determined by searching the data list according to the text information.
For example, the feature information may include the names of the controls in the user interface to be controlled, and the control to be controlled is determined according to a first text in the text information and those names, where the first text contains keywords from the name of the control to be controlled. Matching the first text against keywords in the control names accurately locates the target control that the user intends to operate.
And S203, generating a control instruction according to the text information and the control type of the control to be controlled.
And S204, responding to the control instruction, and triggering a control to be controlled corresponding to the control instruction in the user interface to be controlled.
After the control to be controlled is determined, a control instruction can be generated according to the text information and the control type of the control to be controlled, and, in response to the control instruction, the corresponding control to be controlled in the user interface to be controlled is triggered.
If the control type of the control to be controlled is an input-box control, an input instruction is determined according to a second text in the text information, and, in response to the input instruction, the second text is entered in the to-be-input box in the user interface to be controlled. In other words, once the control to be controlled is identified as an input-box control, the second text in the text information is used to fill the to-be-input box, allowing the user to fill in its content simply by speaking.
These steps are described in detail below, taking a loan application interface displayed on a banking terminal as the user interface to be controlled.
Fig. 4 is an interface interaction diagram of the second embodiment, and fig. 5 is another interface interaction diagram of the second embodiment. As shown in figs. 4-5, the current interface displays a "loan amount" input box together with a "confirm" button and a "cancel" button.
When the user says "the loan amount is 50 thousand", the control to be controlled can be located as the "loan amount" input box according to the first text "loan amount" in the text information; an input instruction is then determined according to the second text "50 thousand", and "50000" is entered in the "loan amount" input box through the input instruction.
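The step from the spoken "50 thousand" (5万 in the original Chinese) to the digits "50000" implies a normalization step that the patent does not spell out. A toy version for simple single-unit amounts, offered purely as an assumption, is:

```python
# Toy normalization of spoken amounts such as "5万" (50 thousand) into digit
# strings; only single-unit forms are handled, purely for illustration.
CN_UNITS = {"万": 10_000, "千": 1_000, "百": 100}

def normalize_amount(second_text: str) -> str:
    for unit, factor in CN_UNITS.items():
        if second_text.endswith(unit) and second_text[:-1].isdigit():
            return str(int(second_text[:-1]) * factor)
    return second_text  # already digits, or a form this toy parser cannot handle

assert normalize_amount("5万") == "50000"
```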
If the control type of the control to be controlled is a click control, a click instruction is determined according to the first text, and, in response to the click instruction, the control to be clicked that corresponds to the click instruction in the user interface to be controlled is triggered. Thus, once the control to be controlled is identified as a click control, the control to be clicked corresponding to the first text is obtained and a click instruction is generated, allowing the user to press buttons in the user interface to be controlled by voice.
With continued reference to fig. 5, after "50000" has been entered in the "loan amount" input box, the user continues by saying "confirm", so that the "confirm" button in the loan application interface is located according to the text information and triggered by the generated click instruction.
If the control type of the control to be controlled is a selection control, a selection instruction is determined according to the second text in the text information, and, in response to the selection instruction, the option corresponding to the second text is selected in the to-be-selected box in the user interface to be controlled. Thus, once the control to be controlled is identified as a selection control, the second text in the text information is used to choose among the options of the to-be-selected box, allowing the user to make selections by voice.
Fig. 6 is a further interface interaction diagram of the second embodiment. As shown in fig. 6, in the loan term selection interface the user continues by saying "the loan term is 6 months". The control to be controlled is located as the "loan term" selection box according to the first text "loan term" in the text information, and a selection instruction is then determined according to the second text "6 months", so that "6 months" is selected in the "loan term" selection box through the selection instruction.
In this embodiment, the voice information input by the user is acquired and converted into text information, the corresponding control to be controlled is accurately located according to the text information and the feature information in the user interface to be controlled, a control instruction is generated according to the text information and the control type of that control, and, in response to the control instruction, the corresponding control to be controlled is triggered, so that the user can complete interface operations on the user interface to be controlled by voice.
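Continuing the illustrative sketch started above (again an assumption, not the patent's implementation), generating the control instruction from the control type and dispatching it could look like the following; `ui.find`, `set_text`, `choose`, and `click` are hypothetical UI-toolkit calls.

```python
def generate_instruction(text: str, control: ControlFeature) -> dict:
    """Build a control instruction from the recognized text and the control type."""
    # treat whatever remains after the control-name keywords as the second text
    second_text = text.replace(control.name, "").strip() or text
    if control.control_type == "input_box":
        return {"action": "input", "target": control.control_id, "value": second_text}
    if control.control_type == "select":
        return {"action": "select", "target": control.control_id, "option": second_text}
    return {"action": "click", "target": control.control_id}

def respond_to_instruction(instruction: dict, ui) -> None:
    """Trigger the control to be controlled in response to the control instruction."""
    widget = ui.find(instruction["target"])      # hypothetical widget lookup
    if instruction["action"] == "input":
        widget.set_text(instruction["value"])    # fill the to-be-input box
    elif instruction["action"] == "select":
        widget.choose(instruction["option"])     # pick the option in the select box
    else:
        widget.click()                           # press the button
```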
In addition, on the basis of the above embodiment, after the text information is determined from the acquired voice information, feedback information may be determined according to the text information and language environment information, where the language environment information includes a plurality of intention information corresponding to the text information and the feedback information is the intention information with the highest service-request confidence among them; the feedback information is then output. Thus, after the user inputs voice information, a plurality of intention information is determined from the converted text information and the language environment information, and the intention information with the highest service-request confidence is selected as the feedback information. This realizes conversational interaction skills grounded in the understanding of natural speech, so that long-duration voice interaction and service flows can be completed.
Specifically, to determine the feedback information according to the text information and the language environment information, multiple rounds of historical text information corresponding to the user's historical voice input may first be obtained, then the multiple rounds of historical feedback information output in response to that historical voice input may be obtained, and finally the service-request confidence of each intention among the plurality of intention information may be determined from the historical text information and the historical feedback information. Determining the confidence of each intention from multiple rounds of history allows the language environment to be tracked and understood over a longer period, so that the true purpose behind the user's voice input is identified, the interaction of multi-round dialogue is improved, and the mechanical supplementation of multi-round dialogue in the prior art is avoided.
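As a rough sketch of how the service-request confidence of each intention could be scored against the whole dialogue history: the patent assumes an NLP model here, so the keyword counting below is only a stand-in, and the intent/cue structure is invented for the example.

```python
def intent_confidences(intents: list, history_texts: list, history_feedback: list) -> dict:
    """Score each candidate intention against the full dialogue history,
    not merely the last utterance (toy stand-in for the assumed NLP model).
    Each intent is a dict with a "name" and a non-empty list of "cues"."""
    context = " ".join(history_texts + history_feedback)
    return {
        intent["name"]: sum(cue in context for cue in intent["cues"]) / len(intent["cues"])
        for intent in intents
    }

def pick_feedback(intents: list, history_texts: list, history_feedback: list) -> str:
    """Feedback information = the intention with the highest service-request confidence."""
    scores = intent_confidences(intents, history_texts, history_feedback)
    return max(scores, key=scores.get)
```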
To illustrate the technical effect of determining the feedback information according to the text information and the language environment information, consider the following examples:
the multiple rounds of dialog of the existing mechanical complement are as follows:
the user: recommending financial products;
an electronic device: a, preparing a product;
the user: the yield is too low;
an electronic device: and B, product.
In this mechanically supplemented dialogue, when the electronic device recommends product B it is responding only to the voice information input in the last round, mechanically substituting a product stored as higher-yield. Because only the last utterance, "the yield is too low", is taken into account, there is a high possibility that the yield of the recommended product B is actually lower than that of product A. The recommendation is thus merely a mechanical supplement, not a reflection of the user's true intention.
By contrast, a multi-round dialogue in which the feedback information is determined according to the text information and the language environment information proceeds as follows:
User: recommend a financial product.
Electronic device: product A.
User: the yield is too low.
Electronic device: product C.
When the electronic device recommends product C, the recommendation reflects the user's true intention: based on natural language processing (NLP), the service-request confidence of each intention among the plurality of intention information is determined from the multiple rounds of historical text information and historical feedback information, so that the language environment is tracked and understood over a longer period.
In addition, during voice control of the user interface, the feedback information may be determined according to the text information and the language environment information, and the control instruction may then be generated according to the feedback information and the feature information in the user interface to be controlled, making the voice control of the user interface more accurate.
Fig. 7 is a schematic diagram according to a third embodiment of the present application. As shown in fig. 7, the voice control method for a user interface provided in this embodiment includes:
s301, acquiring image information of the face of the user, and generating a wake-up instruction according to the voice information and the image information.
To carry out long-duration voice interaction naturally across multiple rounds of dialogue, and thereby provide the user with a more accommodating and relaxed service experience, the voice service of the electronic device must be woken up naturally.
Specifically, the voice information input by the user and the image information of the user's face may be acquired, and a wake-up instruction generated according to the voice information and the image information, where the wake-up instruction is used to wake up the voice service. Whether to generate the wake-up instruction can be determined by the degree of match between the voice information and the image information: when the degree of match exceeds a set threshold, the wake-up condition is considered satisfied. Generating the wake-up instruction in this way enables natural conversation between the user and the electronic device across multiple rounds of dialogue, without requiring the wake-up word that the prior art demands before each round begins.
In one possible design, the wake-up instruction is generated from the voice information and the image information as follows: the lip movement frequency is determined according to the image information, and if the voice information matches the lip movement frequency, the wake-up instruction is generated. Deciding whether to wake up by comparing the lip movement frequency with the voice information thus enables natural conversation between the user and the electronic device across multiple rounds of dialogue.
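A minimal sketch of this lip-movement check follows. The per-frame features and the 0.6 threshold are assumptions; the patent only states that the voice information is matched against the lip movement frequency.

```python
def should_wake(speech_active: list, mouth_open: list, threshold: float = 0.6) -> bool:
    """Generate a wake-up decision when detected speech activity and lip
    movement agree often enough over the same window of frames.

    speech_active: per-frame booleans from a voice-activity detector.
    mouth_open:    per-frame booleans from face-landmark analysis of the image.
    Both inputs and the threshold are illustrative assumptions."""
    if not speech_active or len(speech_active) != len(mouth_open):
        return False
    agreement = sum(v == m for v, m in zip(speech_active, mouth_open))
    return agreement / len(speech_active) >= threshold
```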
And S302, determining text information according to the acquired voice information.
And S303, determining a control to be controlled according to the text information and the characteristic information in the user interface to be controlled.
And S304, generating a control instruction according to the text information and the control type of the control to be controlled.
S305, responding to the control instruction, and triggering a to-be-controlled control corresponding to the control instruction in the to-be-controlled user interface.
It should be noted that, for the specific implementation of S302 to S305 in this embodiment, reference may be made to the description of S201 to S204 in the embodiment shown in fig. 3, and details are not repeated here.
To illustrate the technical effect of the voice control method of this embodiment, the description continues with the examples shown in figs. 4 to 6.
First, consider controlling the loan application interface on a banking terminal through the interaction of the related art. To enter the loan amount, the user needs to tap the "loan amount" input box and then type "50000" on a keyboard, or hold down a voice input key, or speak the wake-up word before saying "50 thousand".
After the input is completed, the user needs to tap the "confirm" button, and then select "6 months" in the "loan term" selection box of the loan term selection interface by touch.
If voice input is to continue in the input boxes of other interfaces, the voice input key must be pressed again, or the wake-up word spoken again, so natural interaction between the user and the user interface displayed on the electronic device cannot be achieved.
By contrast, if the user interface is controlled by the voice control method of the embodiment shown in fig. 7, the flow is as follows:
with continued reference to fig. 5-6, when the user inputs "5 ten thousand loan amount" by voice, the electronic device wakes up the voice service naturally by acquiring the voice information input by the user and the image information of the user's face and then generating a wake-up command according to the voice information and the image information.
The spoken "the loan amount is 50 thousand" is then converted into text information, the control to be controlled is located as the "loan amount" input box according to the first text "loan amount", an input instruction is determined according to the second text "50 thousand", and "50000" is entered in the "loan amount" input box through the input instruction.
After "50000" has been entered in the "loan amount" input box, the user continues by saying "confirm". The electronic device keeps the voice service awake by continuously acquiring the user's voice information and facial image information and generating the wake-up instruction from them. The "confirm" button in the loan application interface is then located according to the text information and triggered by the generated click instruction.
When the interface jumps to the loan term selection, the user continues by saying "the loan term is 6 months"; the voice service likewise remains naturally awake. The control to be controlled is located as the "loan term" selection box according to the first text "loan term", and a selection instruction is determined according to the second text "6 months", so that "6 months" is selected in the "loan term" selection box through the selection instruction.
Thus, based on natural wake-up, the voice information input by the user is acquired and converted into text information, the corresponding control to be controlled is accurately located according to the text information and the feature information in the user interface to be controlled, a control instruction is generated according to the text information and the control type of that control, and finally, in response to the control instruction, the corresponding control to be controlled is triggered. The user can therefore operate the entire user interface to be controlled by voice, with no explicit wake-up and no touch, which greatly improves the service experience.
Fig. 8 is a schematic diagram according to a fourth embodiment of the present application. As shown in fig. 8, the speech control apparatus 400 of a user interface provided in this embodiment includes:
an obtaining module 401, configured to obtain voice information;
a recognition module 402, configured to determine text information according to the voice information;
a generating module 403, configured to generate a control instruction according to the text information and feature information in the user interface to be controlled, where the feature information corresponds to each control in the user interface to be controlled;
and the control module 404 is configured to trigger a to-be-controlled control corresponding to the control instruction in the to-be-controlled user interface in response to the control instruction.
In this embodiment, by acquiring the voice information input by the user, converting it into text information, generating a control instruction according to the text information and the feature information in the user interface to be controlled, and triggering, in response to the control instruction, the corresponding control to be controlled, the user can complete interface operations on the user interface to be controlled by voice. Operating the user interface by voice in this way lets the user interact with the service device more directly, providing a more accommodating and relaxed service experience.
In one possible design, the generating module 403 is specifically configured to:
determining a control to be controlled according to the text information and the characteristic information in the user interface to be controlled;
and generating a control instruction according to the text information and the control type of the control to be controlled.
In this embodiment, the voice information input by the user is acquired and converted into text information, the corresponding control to be controlled is accurately located according to the text information and the feature information in the user interface to be controlled, a control instruction is generated according to the text information and the control type of that control, and, in response to the control instruction, the corresponding control to be controlled is triggered, so that the user can complete interface operations on the user interface to be controlled by voice.
In one possible design, the feature information includes the names of the controls in the user interface to be controlled;
correspondingly, the generating module 403 is specifically configured to:
and determining the control to be controlled according to a first text in the text information and the name of each control in the user interface to be controlled, wherein the first text comprises keywords in the name of the control to be controlled.
In this embodiment, the control to be controlled is determined by matching the first text in the text information against keywords in the name of the control to be controlled, so as to accurately locate the target control that the user intends to operate.
In one possible design, the generating module 403 is specifically configured to:
if the control type of the control to be controlled is an input box control, determining an input instruction according to a second text in the text information;
correspondingly, the control module 404 is specifically configured to:
and responding to the input instruction, and inputting a second text in a to-be-input box in the to-be-controlled user interface.
In this embodiment, when the control type of the control to be controlled is determined to be an input-box control, the second text in the text information can be used to fill the to-be-input box in the user interface to be controlled, so that the user can fill in its content by voice.
In one possible design, the generating module 403 is specifically configured to:
if the control type of the control to be controlled is a selection control, determining a selection instruction according to a second text in the text information;
correspondingly, the control module 404 is specifically configured to:
and responding to the selection instruction, and selecting an option corresponding to the second text in a to-be-selected box in the to-be-controlled user interface.
In this embodiment, when the control type of the control to be controlled is determined to be a selection control, the second text in the text information can be used to choose among the options in the to-be-selected box, so that the user can make a selection in the to-be-selected box by voice.
In one possible design, the generating module 403 is specifically configured to:
if the control type of the control to be controlled is a click control, determining a click instruction according to the first text;
correspondingly, the control module 404 is specifically configured to:
and responding to the click command, and triggering a to-be-clicked control corresponding to the click command in the to-be-controlled user interface.
In this embodiment, when the control type of the control to be controlled is determined to be a click control, the control to be clicked corresponding to the first text in the text information is obtained and a click instruction is generated, so that the user can press buttons in the user interface to be controlled by voice.
In one possible design, the obtaining module 401 is further configured to acquire voice information input by the user;
the obtaining module 401 is further configured to acquire image information of the user's face;
and the generating module 403 is further configured to generate a wake-up instruction according to the voice information and the image information, where the wake-up instruction is used to wake up the voice service.
In this embodiment, by acquiring the voice information input by the user and the image information of the user's face, and generating the wake-up instruction from the two, natural conversation between the user and the electronic device across multiple rounds of dialogue can be achieved without the wake-up word that the prior art requires before each round.
In one possible design, the generating module 403 is specifically configured to:
determining lip movement frequency according to the image information;
and if the voice information is matched with the lip movement frequency, generating a wake-up instruction.
In this embodiment, whether to generate the wake-up instruction is decided by comparing the lip movement frequency with the voice information, enabling natural conversation between the user and the electronic device across multiple rounds of dialogue.
On the basis of the embodiment shown in fig. 8, fig. 9 is a schematic view according to a fifth embodiment of the present application. As shown in fig. 9, the speech control apparatus 400 of a user interface according to this embodiment further includes:
a determining module 405, configured to determine feedback information according to the text information and language environment information, where the language environment information includes a plurality of intention information corresponding to the text information and the feedback information is the intention information with the highest service-request confidence among them;
and an output module 406, configured to output the feedback information.
In this embodiment, after the user inputs voice information, a plurality of intention information is determined from the converted text information and the language environment information, and the intention information with the highest service-request confidence is selected as the feedback information. This realizes conversational interaction skills grounded in the understanding of natural speech, so that long-duration voice interaction and service flows can be completed.
In one possible design, the determining module 405 is specifically configured to:
acquiring multiple rounds of historical text information corresponding to the voice information input by the user over multiple historical rounds;
acquiring the multiple rounds of historical feedback information output in response to that historical voice input;
and determining the service-request confidence of each intention among the plurality of intention information according to the multiple rounds of historical text information and historical feedback information.
In this embodiment, the service-request confidence of each intention is determined from multiple rounds of historical text information and historical feedback information, allowing the language environment to be tracked and understood over a longer period, so that the true purpose behind the user's voice input is identified, the interaction of multi-round dialogue is improved, and the mechanical supplementation of multi-round dialogue in the prior art is avoided.
In one possible design, the generating module 403 is specifically configured to:
determining feedback information according to the text information and the language environment information;
and generating a control instruction according to the feedback information and the characteristic information in the user interface to be controlled.
In this embodiment, during voice control of the user interface, the feedback information may be determined according to the text information and the language environment information, and the control instruction may then be generated according to the feedback information and the feature information in the user interface to be controlled, making the voice control of the user interface more accurate.
The voice control apparatus of the user interface of the embodiments shown in figs. 8 to 9 may execute the steps of the above method embodiments; for the specific implementation process and technical principles, reference is made to the relevant description in those embodiments, which is not repeated here.
Fig. 10 is a block diagram of an electronic device for implementing a voice control method of a user interface according to an embodiment of the present application. As shown in fig. 10, the electronic device 500 is intended to represent various forms of digital computers, such as laptops, desktops, workstations, robots, personal digital assistants, mainframes, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the present application that are described and/or claimed herein.
As shown in fig. 10, the electronic device 500 includes one or more processors 501, a memory 502, and interfaces for connecting the components, including high-speed and low-speed interfaces. The components are interconnected by different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions executed within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output apparatus (such as a display device coupled to an interface). In other embodiments, multiple processors and/or multiple buses may be used with multiple memories, as desired. Likewise, multiple electronic devices may be connected, with each device providing some of the necessary operations. In fig. 10, one processor 501 is taken as an example.
Memory 502 is a non-transitory computer readable storage medium as provided herein. The memory stores instructions executable by the at least one processor to cause the at least one processor to perform the voice control method of the user interface provided by the present application. The non-transitory computer-readable storage medium of the present application stores computer instructions for causing a computer to perform a voice control method of a user interface provided by the present application.
The memory 502, which is a non-transitory computer readable storage medium, may be used to store non-transitory software programs, non-transitory computer executable programs, and modules, such as program instructions/modules corresponding to the voice control method of the user interface in the embodiments of the present application. The processor 501 executes various functional applications of the server and data processing, i.e., a voice control method of the user interface in the above-described method embodiment, by running non-transitory software programs, instructions, and modules stored in the memory 502.
The memory 502 may include a program storage area and a data storage area, where the program storage area may store an operating system and an application program required by at least one function, and the data storage area may store data created through the use of the electronic device, and the like. Further, the memory 502 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, the memory 502 may optionally include memory located remotely from the processor 501, connected to the electronic device shown in fig. 10 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device shown in fig. 10 may further include: an input device 503 and an output device 504. The processor 501, the memory 502, the input device 503 and the output device 504 may be connected by a bus or other means, and fig. 10 illustrates the connection by a bus as an example.
The input device 503 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic apparatus shown in fig. 10, such as a touch screen, a keypad, a mouse, a track pad, a touch pad, a pointing stick, one or more mouse buttons, a track ball, a joystick, or other input devices. The output devices 504 may include a display device, auxiliary lighting devices (e.g., LEDs), and haptic feedback devices (e.g., vibrating motors), among others. The display device may include, but is not limited to, a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) display, and a plasma display. In some implementations, the display device can be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, ASICs (application-specific integrated circuits), GPUs (graphics processing units), FPGAs (field-programmable gate arrays), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs executable and/or interpretable on a programmable system including at least one programmable processor, which may be special- or general-purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
It should be understood that the various forms of flow shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in a different order; this is not limited herein as long as the desired results of the technical solutions disclosed in the present application can be achieved.
The above-described embodiments should not be construed as limiting the scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (24)

1. A method for voice control of a user interface, comprising:
determining text information according to the acquired voice information;
generating a control instruction according to the text information and feature information in the user interface to be controlled, wherein the feature information is the feature information corresponding to each control in the user interface to be controlled;
and in response to the control instruction, triggering a control to be controlled corresponding to the control instruction in the user interface to be controlled.
2. The method according to claim 1, wherein the generating of the control instruction according to the text information and the feature information in the user interface to be controlled comprises:
determining the control to be controlled according to the text information and the feature information in the user interface to be controlled;
and generating the control instruction according to the text information and the control type of the control to be controlled.
3. The method according to claim 2, wherein the feature information comprises the names of the respective controls in the user interface to be controlled;
correspondingly, the determining the control to be controlled according to the text information and the feature information in the user interface to be controlled includes:
and determining the control to be controlled according to a first text in the text information and the name of each control in the user interface to be controlled, wherein the first text comprises keywords in the name of the control to be controlled.
4. The method according to claim 3, wherein the generating of the control instruction according to the text information and the control type of the control to be controlled comprises:
if the control type of the control to be controlled is an input box control, determining an input instruction according to a second text in the text information;
correspondingly, the triggering, in response to the control instruction, of the control to be controlled corresponding to the control instruction in the user interface to be controlled comprises:
and in response to the input instruction, inputting the second text into the corresponding input box in the user interface to be controlled.
5. The method according to claim 3, wherein the generating of the control instruction according to the text information and the control type of the control to be controlled comprises:
if the control type of the control to be controlled is a selection control, determining a selection instruction according to a second text in the text information;
correspondingly, the triggering, in response to the control instruction, of the control to be controlled corresponding to the control instruction in the user interface to be controlled comprises:
and in response to the selection instruction, selecting the option corresponding to the second text in the corresponding selection box in the user interface to be controlled.
6. The method according to claim 3, wherein the generating of the control instruction according to the text information and the control type of the control to be controlled comprises:
if the control type of the control to be controlled is a click control, determining a click instruction according to the first text;
correspondingly, the triggering, in response to the control instruction, of the control to be controlled corresponding to the control instruction in the user interface to be controlled comprises:
and in response to the click instruction, triggering the control to be clicked corresponding to the click instruction in the user interface to be controlled.
7. The method for voice control of a user interface according to any one of claims 1 to 6, further comprising, before the determining of the text information according to the acquired voice information:
acquiring the voice information input by a user;
acquiring image information of the user's face, and generating a wake-up instruction according to the voice information and the image information, wherein the wake-up instruction is used for waking up a voice service.
8. The method of claim 7, wherein the generating of the wake-up instruction according to the voice information and the image information comprises:
determining lip movement frequency according to the image information;
and if the voice information matches the lip movement frequency, generating the wake-up instruction.
9. The method for voice control of a user interface according to any one of claims 1 to 6, further comprising, after the determining of the text information according to the acquired voice information:
determining feedback information according to the text information and language environment information, wherein the language environment information comprises a plurality of pieces of intention information corresponding to the text information, and the feedback information is the piece of intention information with the highest service request confidence among the plurality of pieces of intention information;
and outputting the feedback information.
10. The method of claim 9, wherein the determining of the feedback information according to the text information and the language environment information comprises:
acquiring multiple rounds of historical text information corresponding to voice information historically input by a user over multiple rounds;
acquiring multiple rounds of historical feedback information output in response to the voice information historically input over multiple rounds;
and determining the service request confidence of each piece of intention information among the plurality of pieces of intention information according to the multiple rounds of historical text information and the multiple rounds of historical feedback information.
11. The method according to claim 10, wherein the generating of the control instruction according to the text information and the feature information in the user interface to be controlled comprises:
determining the feedback information according to the text information and the language environment information;
and generating the control instruction according to the feedback information and the feature information in the user interface to be controlled.
12. A voice control apparatus for a user interface, comprising:
the acquisition module is used for acquiring voice information;
the recognition module is used for determining text information according to the voice information;
the generating module is used for generating a control instruction according to the text information and feature information in the user interface to be controlled, wherein the feature information is the feature information corresponding to each control in the user interface to be controlled;
and the control module is used for responding to the control instruction and triggering the control to be controlled corresponding to the control instruction in the user interface to be controlled.
13. The voice control apparatus for a user interface according to claim 12, wherein the generating module is specifically configured to:
determining the control to be controlled according to the text information and the feature information in the user interface to be controlled;
and generating the control instruction according to the text information and the control type of the control to be controlled.
14. The voice control apparatus for a user interface according to claim 13, wherein the feature information comprises the names of the respective controls in the user interface to be controlled;
correspondingly, the generating module is specifically configured to:
and determining the control to be controlled according to a first text in the text information and the name of each control in the user interface to be controlled, wherein the first text comprises keywords in the name of the control to be controlled.
15. The voice control apparatus for a user interface according to claim 14, wherein the generating module is specifically configured to:
if the control type of the control to be controlled is an input box control, determining an input instruction according to a second text in the text information;
the control module is specifically configured to:
and in response to the input instruction, inputting the second text into the corresponding input box in the user interface to be controlled.
16. The voice control apparatus for a user interface according to claim 14, wherein the generating module is specifically configured to:
if the control type of the control to be controlled is a selection control, determining a selection instruction according to a second text in the text information;
the control module is specifically configured to:
and in response to the selection instruction, selecting the option corresponding to the second text in the corresponding selection box in the user interface to be controlled.
17. The voice control apparatus for a user interface according to claim 14, wherein the generating module is specifically configured to:
if the control type of the control to be controlled is a click control, determining a click instruction according to the first text;
the control module is specifically configured to:
and in response to the click instruction, triggering the control to be clicked corresponding to the click instruction in the user interface to be controlled.
18. The voice control apparatus for a user interface according to any one of claims 12 to 17, wherein the acquisition module is further used for acquiring the voice information input by a user;
the acquisition module is further used for acquiring image information of the user's face and generating a wake-up instruction according to the voice information and the image information, wherein the wake-up instruction is used for waking up a voice service.
19. The voice control apparatus for a user interface according to claim 18, wherein the generating module is specifically configured to:
determining lip movement frequency according to the image information;
and if the voice information matches the lip movement frequency, generating the wake-up instruction.
20. The voice control apparatus for a user interface according to any one of claims 12 to 17, further comprising:
the determining module is used for determining feedback information according to the text information and language environment information, wherein the language environment information comprises a plurality of pieces of intention information corresponding to the text information, and the feedback information is the piece of intention information with the highest service request confidence among the plurality of pieces of intention information;
and the output module is used for outputting the feedback information.
21. The voice control apparatus for a user interface according to claim 20, wherein the determining module is specifically configured to:
acquiring multiple rounds of historical text information corresponding to voice information historically input by a user over multiple rounds;
acquiring multiple rounds of historical feedback information output in response to the voice information historically input over multiple rounds;
and determining the service request confidence of each piece of intention information among the plurality of pieces of intention information according to the multiple rounds of historical text information and the multiple rounds of historical feedback information.
22. The voice control apparatus for a user interface according to claim 21, wherein the generating module is specifically configured to:
determining the feedback information according to the text information and the language environment information;
and generating the control instruction according to the feedback information and the feature information in the user interface to be controlled.
23. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor;
wherein the memory stores instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, enable the at least one processor to perform the method for voice control of a user interface according to any one of claims 1 to 11.
24. A non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used for causing a computer to perform the method for voice control of a user interface according to any one of claims 1 to 11.
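
The following Python sketches illustrate the claimed flow; they are illustrations only and not part of the claims. First, the name matching and control-type dispatch of claims 1 to 6, assuming a hypothetical Control structure, a whole-name keyword test, and invented instruction dictionaries:

# A sketch of claims 1-6: find the control whose name appears in the text,
# then build an instruction according to the control type. Control, the
# substring test, and the instruction dictionaries are illustrative only.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Control:
    name: str          # the control's name in the user interface to be controlled
    control_type: str  # assumed types: "input", "select", or "click"

def find_control(text: str, controls: list) -> Optional[Control]:
    # Claim 3: the first text contains a keyword of the control's name;
    # using the whole name as the keyword is an assumption here.
    for control in controls:
        if control.name in text:
            return control
    return None

def make_instruction(text: str, control: Control) -> dict:
    # Claims 4-6: dispatch on the control type; taking the "second text" to
    # be the recognized text minus the control name is an assumption.
    second_text = text.replace(control.name, "").strip()
    if control.control_type == "input":
        return {"action": "input", "target": control.name, "value": second_text}
    if control.control_type == "select":
        return {"action": "select", "target": control.name, "option": second_text}
    return {"action": "click", "target": control.name}

controls = [Control("destination", "input"), Control("confirm", "click")]
matched = find_control("destination Beijing", controls)
if matched is not None:
    print(make_instruction("destination Beijing", matched))
    # -> {'action': 'input', 'target': 'destination', 'value': 'Beijing'}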
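
Next, the wake-up condition of claims 7 and 8. The claims require only that the voice information match the lip movement frequency; the per-frame activity thresholds and the agreement ratio below are assumptions made for illustration:

# A sketch of claims 7-8: wake the voice service only when the voice signal
# matches the lip movement. The 0.1 activity thresholds and the 0.5
# agreement ratio are assumptions; the claims leave the criterion open.
def should_wake(voice_energy, lip_openings, agreement=0.5):
    frames = min(len(voice_energy), len(lip_openings))
    if frames == 0:
        return False
    # Count frames where speech activity and lip movement agree.
    agree = sum(
        1 for i in range(frames)
        if (voice_energy[i] > 0.1) == (lip_openings[i] > 0.1)
    )
    return agree / frames >= agreement

# Speech energy rises exactly when the lips move, so the service wakes.
print(should_wake([0.0, 0.6, 0.7, 0.0], [0.0, 0.3, 0.4, 0.0]))  # True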
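
Finally, the feedback selection of claims 9 to 11: the intention with the highest service request confidence, given the multi-round history, is chosen as feedback. The word-overlap confidence score below is an assumption; the claims do not fix a formula:

# A sketch of claims 9-11: score each candidate intention against the
# multi-round historical texts and feedback, and return the one with the
# highest service request confidence. The overlap score is an assumption.
def rank_intents(intents, history_texts, history_feedback):
    history_words = set()
    for entry in history_texts + history_feedback:
        history_words.update(entry.lower().split())

    def confidence(intent):
        words = intent.lower().split()
        if not words:
            return 0.0
        # Fraction of the intention's words seen in the dialogue history.
        return sum(w in history_words for w in words) / len(words)

    # Claim 9: the feedback is the intention with the highest confidence.
    return max(intents, key=confidence)

print(rank_intents(
    ["book a train ticket", "play music"],
    ["I need a ticket to Beijing"],
    ["Which train would you like?"]))
# -> book a train ticket
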
CN202010220645.1A 2020-03-25 2020-03-25 Voice control method and device of user interface, electronic equipment and storage medium Active CN111309283B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010220645.1A CN111309283B (en) 2020-03-25 2020-03-25 Voice control method and device of user interface, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111309283A true CN111309283A (en) 2020-06-19
CN111309283B CN111309283B (en) 2023-12-05

Family

ID=71150325

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010220645.1A Active CN111309283B (en) 2020-03-25 2020-03-25 Voice control method and device of user interface, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111309283B (en)

Patent Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101431573A (en) * 2007-11-08 2009-05-13 上海赢思软件技术有限公司 Method and equipment for implementing automatic customer service through human-machine interaction technology
CN105551488A (en) * 2015-12-15 2016-05-04 深圳Tcl数字技术有限公司 Voice control method and system
US20170193992A1 (en) * 2015-12-30 2017-07-06 Le Holdings (Beijing) Co., Ltd. Voice control method and apparatus
CN106205611A (en) * 2016-06-29 2016-12-07 北京智能管家科技有限公司 A kind of man-machine interaction method based on multi-modal historical responses result and system
CN106297777A (en) * 2016-08-11 2017-01-04 广州视源电子科技股份有限公司 A kind of method and apparatus waking up voice service up
CN107316643A (en) * 2017-07-04 2017-11-03 科大讯飞股份有限公司 Voice interactive method and device
CN109992237A (en) * 2018-01-03 2019-07-09 腾讯科技(深圳)有限公司 Intelligent sound apparatus control method, device, computer equipment and storage medium
US20190228212A1 (en) * 2018-01-22 2019-07-25 Beijing Baidu Netcom Science And Technology Co., Ltd. Wakeup method, apparatus and device based on lip reading, and computer readable medium
CN109741737A (en) * 2018-05-14 2019-05-10 北京字节跳动网络技术有限公司 A kind of method and device of voice control
CN109003605A (en) * 2018-07-02 2018-12-14 北京百度网讯科技有限公司 Intelligent sound interaction processing method, device, equipment and storage medium
US20200013406A1 (en) * 2018-07-03 2020-01-09 Boe Technology Group Co., Ltd. Control method for human-computer interaction device, human-computer interaction device and human-computer interaction system
CN110874201A (en) * 2018-08-29 2020-03-10 阿里巴巴集团控股有限公司 Interaction method, device, storage medium and operating system
CN109377995A (en) * 2018-11-20 2019-02-22 珠海格力电器股份有限公司 A kind of method and apparatus controlling equipment
CN109960537A (en) * 2019-03-29 2019-07-02 北京金山安全软件有限公司 Interaction method and device and electronic equipment
CN110675870A (en) * 2019-08-30 2020-01-10 深圳绿米联创科技有限公司 Voice recognition method and device, electronic equipment and storage medium
CN110737335A (en) * 2019-10-11 2020-01-31 深圳追一科技有限公司 Interaction method and device of robot, electronic equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Xu Guibao (徐贵宝): "Research on Voice Control of Internet Interaction and Its Key Technologies", Telecommunications Network Technology (电信网技术), no. 01, pages 38-42 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112102832A (en) * 2020-09-18 2020-12-18 广州小鹏汽车科技有限公司 Speech recognition method, speech recognition device, server and computer-readable storage medium
CN112102832B (en) * 2020-09-18 2021-12-28 广州小鹏汽车科技有限公司 Speech recognition method, speech recognition device, server and computer-readable storage medium
CN112201245A (en) * 2020-09-30 2021-01-08 中国银行股份有限公司 Information processing method, device, equipment and storage medium
CN112201245B (en) * 2020-09-30 2024-02-06 中国银行股份有限公司 Information processing method, device, equipment and storage medium
CN113571057A (en) * 2021-06-15 2021-10-29 北京来也网络科技有限公司 Voice control method and device combining RPA and AI
WO2023087934A1 (en) * 2021-11-19 2023-05-25 杭州逗酷软件科技有限公司 Voice control method, apparatus, device, and computer storage medium
WO2023103918A1 (en) * 2021-12-07 2023-06-15 杭州逗酷软件科技有限公司 Speech control method and apparatus, and electronic device and storage medium

Also Published As

Publication number Publication date
CN111309283B (en) 2023-12-05

Similar Documents

Publication Publication Date Title
CN111309283B (en) Voice control method and device of user interface, electronic equipment and storage medium
KR20190023341A (en) Method for operating speech recognition service and electronic device supporting the same
JP7091430B2 (en) Interaction information recommendation method and equipment
CN112533041A (en) Video playing method and device, electronic equipment and readable storage medium
CN111354360A (en) Voice interaction processing method and device and electronic equipment
CN111680517B (en) Method, apparatus, device and storage medium for training model
CN111275190A (en) Neural network model compression method and device, image processing method and processor
CN111507111B (en) Pre-training method and device of semantic representation model, electronic equipment and storage medium
CN111443801B (en) Man-machine interaction method, device, equipment and storage medium
CN112466280B (en) Voice interaction method and device, electronic equipment and readable storage medium
EP3799036A1 (en) Speech control method, speech control device, electronic device, and readable storage medium
CN112148850A (en) Dynamic interaction method, server, electronic device and storage medium
KR20220011083A (en) Information processing method, device, electronic equipment and storage medium in user dialogue
CN113325954A (en) Method, apparatus, device, medium and product for processing virtual objects
CN110767212B (en) Voice processing method and device and electronic equipment
CN112382291B (en) Voice interaction processing method and device, electronic equipment and storage medium
US20210098012A1 (en) Voice Skill Recommendation Method, Apparatus, Device and Storage Medium
CN112652304A (en) Voice interaction method and device of intelligent equipment and electronic equipment
CN112382292A (en) Voice-based control method and device
EP3901905B1 (en) Method and apparatus for processing image
US20220075952A1 (en) Method and apparatus for determining recommended expressions, device and computer storage medium
CN116339871A (en) Control method and device of terminal equipment, terminal equipment and storage medium
CN111651229A (en) Font changing method, device and equipment
CN111783872A (en) Method and device for training model, electronic equipment and computer readable storage medium
CN111738325A (en) Image recognition method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant