CN115877997B - Voice interaction method, system and storage medium for interaction elements - Google Patents

Voice interaction method, system and storage medium for interaction elements Download PDF

Info

Publication number
CN115877997B
CN115877997B (application CN202211545577.1A)
Authority
CN
China
Prior art keywords
voice
instruction
interaction
unit
long
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211545577.1A
Other languages
Chinese (zh)
Other versions
CN115877997A (en)
Inventor
余怀军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Rivotek Technology Jiangsu Co Ltd
Original Assignee
Rivotek Technology Jiangsu Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Rivotek Technology Jiangsu Co Ltd filed Critical Rivotek Technology Jiangsu Co Ltd
Priority to CN202211545577.1A priority Critical patent/CN115877997B/en
Publication of CN115877997A publication Critical patent/CN115877997A/en
Application granted granted Critical
Publication of CN115877997B publication Critical patent/CN115877997B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

The application relates to the technical field of human-computer interaction, and in particular to a voice interaction method, system, and storage medium oriented to interaction elements. The method comprises the following steps: the user wakes up a special voice mode by long-pressing an interaction element of the UI interface; voice information directed at the interaction element is received from the user in the special voice mode; the voice information is recognized and the client responds to the instruction; the special voice mode is exited. By adopting a special voice mode in which the target object of the voice interaction is hit by a touch long-press, the problem that certain elements in the UI interface are difficult to select by voice is solved; configuring a separate instruction set for each interaction element in the background improves the accuracy and convenience of voice interaction; and a training model built through machine learning displays a recommended-instruction list on the UI interface, improving the user's interaction experience and ensuring smooth voice interaction between the user and the intelligent terminal.

Description

Voice interaction method, system and storage medium for interaction elements
Technical Field
The present application relates to the technical field of human-computer interaction, and in particular to a voice interaction method, system, and storage medium oriented to interaction elements.
Background
Voice is one of the most convenient and effective means for people to acquire and exchange information. In recent years, voice interaction technology has been widely applied in both B-end and C-end scenarios, including medical care, customer service, education, smart home, mobile devices, and in-vehicle systems; voice interaction has become a core interaction mode of intelligent devices and brings great convenience to human-computer interaction in scenarios such as driving and smart home. In existing voice interaction schemes, after voice is woken up by a wake word or a key press, the interaction element is selected by voice: speech is converted into text by automatic speech recognition (ASR), semantic understanding is performed by natural language processing (NLP), the corresponding skill is matched, a voice broadcast is produced by text-to-speech (TTS), and interface display and action execution are carried out at the same time.
However, the UI interfaces of most current intelligent devices are designed around touch interaction, so their connection with voice interaction is weak, and some elements in the UI, such as pictures, long texts, foreign-language texts, and buttons, are subjects inconvenient to hit by voice and therefore difficult to select by voice. Existing voice interaction methods are limited by the complexity of interaction elements, leading to inconvenient interaction, inaccurate recognition, and a number of defects in user experience. A voice interaction method, system, and storage medium oriented to interaction elements are therefore provided.
Disclosure of Invention
Aiming at the problems of inconvenient interaction and inaccurate recognition caused by selecting interaction elements by voice in the prior art, the present application provides a voice interaction method, system, and storage medium oriented to interaction elements.
In order to achieve the above object, the present application is realized by the following technical scheme:
in one aspect, the present application provides a voice interaction method for interaction elements, where the method includes:
the user wakes up the special voice mode by long-pressing the interactive element of the UI interface;
receiving interactive element-oriented voice information sent by a user in a special voice mode;
recognizing the voice information, with the client responding to the instruction;
the special voice mode is exited.
As a preferred scheme of the present application, the special voice mode is a voice interaction mode oriented to interaction elements, where the interaction elements include all touchable elements in the UI interface of the client.
As a preferred embodiment of the present application, the method for waking up a special voice mode specifically includes:
selecting the target object of the voice interaction by long-pressing the interaction element of the UI interface, and performing selection feedback;
detecting the selected area range of the long-pressed interaction element and identifying the number of interaction elements within it; if there is only one interaction element in the selected area, judging the interaction element type; if there is more than one, long-pressing to select again;
judging the interaction element type: if the selected type supports the special voice mode, waking up the special voice mode; otherwise, not executing the wake-up.
As a preferred scheme of the present application, the long press includes a single-finger long press, a multi-finger long press, or a long-press drag, with a duration of not less than 500 ms; the selection feedback includes at least one of an audible cue, a visual change, or vibration; the selected area range of the interaction element is preset by the system and associated with the instructions.
As a preferred scheme of the present application, a corresponding instruction set is configured for each kind of interaction element, with at least 3 instructions preset per interaction element to form the set, which is stored in advance in a background database.
As a preferred scheme of the present application, the specific method of recognizing the voice information and having the client respond to the instruction includes:
recognizing the voice information based on automatic speech recognition and converting it into text information; performing semantic understanding on the text information based on natural language processing; the client pulling the instruction set corresponding to the interaction element and judging whether the voice information hits an instruction;
if hit, filling the instruction body and executing the corresponding instruction; if missed, not executing.
As a preferred embodiment of the present application, the method further comprises: collecting historical voice information and the historical response instructions for it, building a training model through machine learning, and displaying a recommended-instruction list on the UI interface the next time the special voice mode is woken up.
As a preferred scheme of the present application, exiting the special voice mode is divided into active exit and passive exit; active exit includes releasing the long press and exiting by voice, and passive exit includes interruption by another process and timeout exit.
On the other hand, the present application provides a voice interaction system oriented to interaction elements, which comprises a wake-up module, a voice recognition module, an instruction response module, and an exit module;
the wake-up module comprises a selection unit, a selection area judging unit, a type judging unit and a special voice mode wake-up unit;
the selection unit is used to select the target object of the voice interaction by long-pressing an interaction element of the UI interface and to perform selection feedback, where the long press includes a single-finger long press, a multi-finger long press, or a long-press drag with a duration of not less than 500 ms, and the selection feedback includes at least one of an audible cue, a visual change, or vibration;
the selection area judging unit is used for detecting the selection area range of the long-pressed interaction element, identifying the number of the interaction elements in the selection area range, and the selection area range of the interaction element is preset by the system and is associated with the instruction;
the type judging unit is used for judging the type of the interactive element and judging whether the selected interactive element type supports a special voice mode or not;
the special voice mode awakening unit is used for awakening a special voice mode;
the voice recognition module comprises a voice acquisition unit, a voice conversion unit and a processing unit;
the voice acquisition unit is used for acquiring voice information which is sent by a user and faces to the interactive elements;
the voice conversion unit is used for converting the voice information into text information;
the processing unit is used for carrying out semantic understanding on the text information;
the instruction response module comprises an instruction configuration unit, an instruction pulling unit and an instruction executing unit;
the instruction configuration unit is used for respectively configuring corresponding instruction sets according to different interaction elements, wherein each interaction element is provided with at least 3 instructions to form an instruction set, and the instruction set is stored in a background database in advance;
the instruction pulling unit is used for pulling an instruction set by the client and judging whether the voice information hits or not;
the instruction execution unit is used for executing a response instruction;
the instruction response module further comprises an instruction recommendation unit, wherein the instruction recommendation unit is used for collecting historical voice information and historical response instructions for the historical voice information, building a training model through machine learning, and displaying a recommendation instruction list on the UI after the special voice mode is awakened next time;
the exit module comprises an active exit unit for actively exiting the special voice mode and a passive exit unit for passively exiting the special voice mode.
The present application also provides a storage medium having stored thereon a computer program which, when executed by a processor, implements an interactive element oriented speech interaction method as described above.
Compared with the prior art, the present application has the following beneficial effects: by adopting a special voice mode in which the target object of the voice interaction is hit by a touch long-press, the problem that certain elements in the UI interface are difficult to select by voice is solved; configuring a separate instruction set for each interaction element in the background improves the accuracy and convenience of voice interaction; and a training model built through machine learning displays a recommended-instruction list on the UI interface, improving the user's interaction experience and ensuring smooth voice interaction between the user and the intelligent terminal.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art. Wherein:
FIG. 1 is a flow chart of a method in a preferred embodiment of the application;
FIG. 2 is a diagram of an instruction set data structure in a preferred embodiment of the present application;
fig. 3 is a modular block diagram of the system in a preferred embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application more clear, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings of the embodiments of the present application. It will be apparent that the described embodiments are some, but not all, embodiments of the application. All other embodiments, which are obtained by a person skilled in the art based on the described embodiments of the application, fall within the scope of protection of the application.
Voice wake-up is generally performed through a wake word or a key press, and the interaction element is selected by voice after wake-up. However, some elements in the UI are difficult to select by voice, which affects the accuracy and convenience of the interaction experience. For example: when listening to songs, adding [XX song] to the playlist is costly to read aloud and has low recognition accuracy; during desktop editing, removing [XX desktop card] is ambiguous and requires blind guessing. Aiming at the problems of inconvenient interaction and inaccurate recognition when selecting interaction elements by voice, a voice interaction method, system, and storage medium oriented to interaction elements are provided.
As shown in fig. 1, a preferred embodiment of the present application provides a voice interaction method for interaction elements, which includes:
step 1: the user wakes up the special voice mode by long-pressing the interactive element of the UI interface;
the special voice mode is a voice interaction mode facing to interaction elements, wherein the interaction elements refer to all elements capable of being touched in a UI interface of a client, such as pictures, characters, links, components (icons, buttons, popups and the like). When the interactive elements are pictures, long texts, foreign language, buttons and the like, and a main body hit by voice is inconvenient, or when a plurality of similar named interactive elements exist in a page, the accuracy and convenience of voice interaction can be improved by using a method of long-pressing the interactive elements. The long-press selection is a purposeful and definite operation, and the gesture is defined as a triggering mode of a special voice mode, so that a target user can be effectively prevented from waking up the special voice mode by mistake.
The wake-up special speech mode mainly comprises the following steps:
step 11: selecting the target object of the voice interaction by long-pressing the interaction element of the UI interface, where the long press includes a single-finger long press, a multi-finger long press, or a long-press drag; the default duration is not less than 500 ms and can be modified according to the scenario;
step 12: performing selection feedback after the long press is triggered, where the feedback includes at least one of an audible cue, a visual change, or vibration, and the feedback effect can be customized according to the scenario;
step 13: detecting the selected area range of the long-pressed interaction element, preferably selecting the area by gesture and expanding it by long-press dragging, where the selected area range of an interaction element is preset by the system and associated with the instructions; identifying the number of interaction elements in the selected area; if there is only one interaction element, judging its type; if there is more than one, long-pressing to select again; optionally, this can be extended to multiple interaction elements according to project requirements;
step 14: judging the interaction element type: if the selected type supports the special voice mode, waking up the special voice mode; otherwise, not executing the wake-up.
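The wake-up flow of steps 11 to 14 can be sketched as a single decision function. This is a minimal illustration only; the helper names (`LONG_PRESS_MS`, `SUPPORTED_TYPES`) and the tuple shape of the detected elements are assumptions, not the patent's actual implementation:

```python
# Minimal sketch of the wake-up flow (steps 11-14); names are illustrative.
LONG_PRESS_MS = 500          # default long-press threshold from the method
SUPPORTED_TYPES = {"picture", "text", "link", "button"}  # assumed type list

def try_wake_special_voice_mode(press_duration_ms, elements_in_region):
    """Return True if the special voice mode should be woken up.

    elements_in_region: list of (element_id, element_type) tuples found
    inside the selected area range of the long press.
    """
    # Step 11: the gesture must qualify as a long press.
    if press_duration_ms < LONG_PRESS_MS:
        return False
    # Step 13: exactly one interaction element must lie in the selected
    # area; otherwise the user is asked to long-press and select again.
    if len(elements_in_region) != 1:
        return False
    # Step 14: the selected element type must support the special voice mode.
    _, element_type = elements_in_region[0]
    return element_type in SUPPORTED_TYPES
```

For example, a 600 ms press on a single picture wakes the mode, while the same press spanning two elements, or on an unsupported type, does not.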
Step 2: receiving interactive element-oriented voice information sent by a user in a special voice mode;
step 3: recognizing the voice information, with the client responding to the instruction;
the response instruction mainly comprises the following steps:
step 31: presetting instruction sets in the background: a corresponding instruction set is configured for each kind of interaction element, with at least 3 instructions preset per interaction element to form the set, which is stored in advance in a background database; when general-purpose voice commands would produce too many candidate hits, presetting an element-oriented instruction set for the special voice mode improves voice interaction accuracy;
step 32: pulling the instruction set on the client: the voice information sent by the user is recognized by automatic speech recognition and converted into text information, semantic understanding is performed on the text by natural language processing, the client pulls the instruction set corresponding to the interaction element, and whether the voice information hits an instruction is judged;
step 33: if the user's voice information hits an instruction in the instruction set corresponding to the interaction element, filling the instruction body and executing the corresponding instruction; if it misses, not executing.
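Steps 32 and 33 can be sketched as follows, assuming the speech recognition and semantic-understanding stages have already reduced the utterance to a normalized text command; the function name and the dictionary returned as the filled instruction body are illustrative assumptions:

```python
def respond_to_voice(command_text, instruction_set, element_id):
    """Sketch of steps 32-33: judge whether the recognized text hits a
    preset instruction; on a hit, fill the instruction body with the
    element selected by the long press; on a miss, execute nothing."""
    command = command_text.strip().lower()
    if command in instruction_set:                      # hit
        return {"instruction": command, "target": element_id}
    return None                                         # miss: do not execute
```

With a [lyrics] instruction set, for instance, "copy lyrics" hits and is bound to the long-pressed element, while "download picture" misses and nothing is executed.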
For example, on a music playing page, long-pressing the lyrics wakes up a voice mode oriented to [these lyrics]; in this voice mode, voice instructions in the preset instruction set, such as "copy lyrics" and "play from here", are supported and recognized; if the voice information sent by the user is an instruction outside the set, such as "download picture", it is judged a miss and the corresponding instruction is not executed.
On a web browsing page, long-pressing a picture wakes up a voice mode oriented to [this picture]; only voice instructions such as "download picture" and "search by image" are supported in this mode.
On a web browsing page, long-pressing text wakes up a voice mode oriented to [this text]; only voice instructions such as "copy text" and "translate" are supported in this mode.
On a social chat page, long-pressing a chat link wakes up a voice mode oriented to [this link]; only voice instructions such as "open with browser", "add to collection folder", and "copy" are supported in this mode.
In a navigation application, long-pressing a specific address wakes up a voice mode oriented to [this address]; only voice instructions such as "navigate to this place" and "search surrounding food/hotels" are supported in this mode.
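The per-element instruction sets in the examples above can be represented as a simple mapping from element type to preset instructions. The command phrasings below follow the examples, while the dictionary layout itself is only an assumed stand-in for the background database:

```python
# Illustrative instruction-set configuration: each interaction element type
# carries at least 3 preset instructions, as the method requires.
INSTRUCTION_SETS = {
    "lyrics":  ["copy lyrics", "play from here", "search this song"],
    "picture": ["download picture", "search by image", "share picture"],
    "text":    ["copy text", "translate", "search text"],
    "link":    ["open with browser", "add to collection folder", "copy"],
    "address": ["navigate to this place", "search surrounding food",
                "search surrounding hotels"],
}

# Sanity check mirroring the "at least 3 instructions" constraint.
assert all(len(cmds) >= 3 for cmds in INSTRUCTION_SETS.values())
```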
The method further comprises: collecting historical voice information and the historical response instructions for it, building a training model through machine learning, and displaying a recommended-instruction list on the UI interface the next time the special voice mode is woken up.
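The patent builds the recommendation list with a machine-learning model trained on historical voice information. As a stand-in that shows only the input/output shape (executed-instruction history in, ranked recommendation list out), a simple frequency count works; this counter-based ranking is an assumption, not the patent's model:

```python
from collections import Counter

def recommend_instructions(history, instruction_set, top_k=3):
    """Rank an element's preset instructions by how often the user has
    executed them historically; unseen instructions rank last."""
    counts = Counter(cmd for cmd in history if cmd in instruction_set)
    # Sort by descending historical frequency, breaking ties alphabetically.
    return sorted(instruction_set,
                  key=lambda cmd: (-counts[cmd], cmd))[:top_k]
```

A user who mostly says "copy lyrics" would then see it at the top of the recommended list the next time the special voice mode wakes.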
Step 4: the special voice mode is exited.
Exiting the special voice mode is divided into active exit and passive exit; active exit includes releasing the long press and exiting by voice, and passive exit includes interruption by another process and timeout exit.
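The four exit paths can be modeled as one check per cause. The reason strings and the 10-second timeout are illustrative assumptions; the patent does not fix a timeout value:

```python
import time

SESSION_TIMEOUT_S = 10.0   # assumed timeout; the patent does not specify one

def should_exit(press_released, voice_exit_requested,
                interrupted_by_other_process, last_activity_ts, now=None):
    """Return (exit, reason): active exits are release of the long press
    and voice exit; passive exits are interruption and timeout."""
    now = time.monotonic() if now is None else now
    if press_released:
        return True, "active:stop_long_press"
    if voice_exit_requested:
        return True, "active:voice_exit"
    if interrupted_by_other_process:
        return True, "passive:interrupt"
    if now - last_activity_ts > SESSION_TIMEOUT_S:
        return True, "passive:timeout"
    return False, None
```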
Referring to fig. 3, another embodiment of the present application provides a voice interaction system oriented to interaction elements, which comprises a wake-up module, a voice recognition module, an instruction response module, and an exit module;
the wake-up module comprises a selection unit, a selection area judging unit, a type judging unit and a special voice mode wake-up unit;
the selection unit is used to select the target object of the voice interaction by long-pressing an interaction element of the UI interface and to perform selection feedback, where the long press includes a single-finger long press, a multi-finger long press, or a long-press drag with a duration of not less than 500 ms, and the selection feedback includes at least one of an audible cue, a visual change, or vibration;
the selection area judging unit is used for detecting the selection area range of the long-pressed interactive elements, identifying the number of the interactive elements in the selection area range, and presetting the selection area range of the interactive elements by a system and associating the selection area range with the instruction;
the type judging unit is used for judging the type of the interactive element and judging whether the selected type of the interactive element supports a special voice mode or not;
a special voice mode awakening unit for awakening a special voice mode;
the voice recognition module comprises a voice acquisition unit, a voice conversion unit and a processing unit;
the voice acquisition unit is used for acquiring voice information which is sent by a user and faces to the interactive elements;
the voice conversion unit is used for converting voice information into text information;
the processing unit is used for carrying out semantic understanding on the text information;
the instruction response module comprises an instruction configuration unit, an instruction pulling unit and an instruction executing unit;
the instruction configuration unit is used for respectively configuring corresponding instruction sets according to different interaction elements, and each interaction element is provided with at least 3 instructions to form an instruction set, and the instruction set is stored in a background database in advance;
the instruction pulling unit is used for pulling the instruction set by the client and judging whether the voice information hits or not;
the instruction execution unit is used for executing the response instruction;
the instruction response module further comprises an instruction recommendation unit, which is used to collect historical voice information and the historical response instructions for it, build a training model through machine learning, and display a recommended-instruction list on the UI interface the next time the special voice mode is woken up, improving the user's interaction experience and ensuring smooth voice interaction between the user and the intelligent terminal.
The exit module comprises an active exit unit for actively exiting the special voice mode and a passive exit unit for passively exiting the special voice mode.
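The module decomposition above can be wired together as plain classes. All class and method names here are illustrative, and the ASR/NLP units are trivial stand-ins for real engines:

```python
class SpeechRecognitionModule:
    """Collection unit -> conversion unit (ASR) -> processing unit (NLP)."""
    def recognize(self, audio):
        text = self.convert(audio)        # voice conversion unit (ASR stand-in)
        return self.understand(text)      # processing unit (NLP stand-in)
    def convert(self, audio):
        return audio.lower()              # placeholder for a real ASR engine
    def understand(self, text):
        return text.strip()               # placeholder for real NLP

class InstructionResponseModule:
    """Configuration unit -> pulling unit -> execution unit."""
    def __init__(self, instruction_sets):
        self.instruction_sets = instruction_sets   # instruction configuration unit
    def respond(self, element_type, command):
        # Pulling unit: fetch the set for the selected element type;
        # execution unit: run the command only on a hit.
        commands = self.instruction_sets.get(element_type, [])
        return command if command in commands else None
```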
A storage medium has stored thereon a computer program which, when executed by a processor, implements the interaction-element-oriented voice interaction method described above.
In summary, the present application adopts a special voice mode and, by hitting the target object of the voice interaction with a touch long-press, solves the problem that certain elements in the UI interface are difficult to select by voice; configuring a separate instruction set for each interaction element in the background improves the accuracy and convenience of voice interaction; and a training model built through machine learning displays a recommended-instruction list on the UI interface, improving the user's interaction experience and ensuring smooth voice interaction between the user and the intelligent terminal.
The foregoing is merely illustrative of the present application and does not limit it; any changes or substitutions readily conceivable by a person skilled in the art fall within the scope of the present application. The protection scope of the application is therefore subject to the protection scope of the claims.

Claims (3)

1. An interactive element-oriented voice interaction method, which is characterized by comprising the following steps:
the method comprises the steps that a user wakes up a special voice mode by long-pressing interactive elements of a UI (user interface), wherein the special voice mode is a voice interaction mode facing to the interactive elements, and the interactive elements comprise all elements capable of being touched in the UI of a client;
respectively configuring corresponding instruction sets according to different interactive elements, presetting at least 3 instructions for each interactive element to form an instruction set, and storing the instruction set in a background database in advance;
the method for waking up the special voice mode specifically comprises the following steps:
selecting a target object for voice interaction by long-time pressing of an interaction element of the UI interface, and performing selection feedback; the long press comprises a single-finger long press, a multi-finger long press or long press dragging, and the long time is not less than 500ms; the selected feedback includes at least one of an audible cue, a visual change, or vibration; the selection area range of the interaction element is preset by a system and is associated with the instruction;
detecting the selected area range of the long-pressed interactive elements, identifying the number of the interactive elements in the selected area range, and judging the types of the interactive elements if only one interactive element exists in the selected area range; if more than one is available, the selection is pressed again for a long time;
judging the interactive element type, judging whether the selected interactive element type supports a special voice mode, if so, waking up the special voice mode, otherwise, not executing waking up;
receiving interactive element-oriented voice information sent by a user in a special voice mode;
the voice information is identified, and the client responds to the instruction, and the specific method comprises the following steps:
recognizing voice information based on an automatic voice recognition technology, converting the voice information into text information, carrying out semantic understanding on the text information based on a natural language processing technology, and pulling an instruction set corresponding to interactive elements by a client to judge whether the voice information hits an instruction or not;
if hit, filling the instruction body and executing the corresponding instruction; if missed, not executing;
exiting the special voice mode;
the special voice mode of exiting is divided into active exiting and passive exiting; the active exit comprises stop long-press exit and voice exit, and the passive exit comprises interrupt exit by other processes and overtime exit;
the method further comprises the steps of: and collecting historical voice information and historical response instructions for the historical voice information, building a training model through machine learning, and displaying a recommendation instruction list on the UI interface after the special voice mode is awakened next time.
2. An interactive element-oriented voice interactive system is characterized by comprising a wake-up module, a voice recognition module, an instruction response module and an exit module;
the wake-up module comprises a selection unit, a selection area judging unit, a type judging unit and a special voice mode wake-up unit;
the selection unit is used to select the target object of the voice interaction by long-pressing an interaction element of the UI interface and to perform selection feedback, where the long press includes a single-finger long press, a multi-finger long press, or a long-press drag with a duration of not less than 500 ms, and the selection feedback includes at least one of an audible cue, a visual change, or vibration;
the selection area judging unit is used for detecting the selection area range of the long-pressed interaction element, identifying the number of the interaction elements in the selection area range, and the selection area range of the interaction element is preset by the system and is associated with the instruction;
the type judging unit is used for judging the type of the interactive element and judging whether the selected interactive element type supports a special voice mode or not;
the special voice mode awakening unit is used for awakening a special voice mode;
the voice recognition module comprises a voice acquisition unit, a voice conversion unit and a processing unit;
the voice acquisition unit is used for acquiring voice information which is sent by a user and faces to the interactive elements;
the voice conversion unit is used for converting the voice information into text information;
the processing unit is used for carrying out semantic understanding on the text information;
the instruction response module comprises an instruction configuration unit, an instruction pulling unit and an instruction executing unit;
the instruction configuration unit is used for respectively configuring corresponding instruction sets according to different interaction elements, wherein each interaction element is provided with at least 3 instructions to form an instruction set, and the instruction set is stored in a background database in advance;
the instruction pulling unit is used for pulling an instruction set by the client and judging whether the voice information hits or not;
the instruction execution unit is used for executing a response instruction;
the instruction response module further comprises an instruction recommendation unit, wherein the instruction recommendation unit is used for collecting historical voice information and historical response instructions for the historical voice information, building a training model through machine learning, and displaying a recommendation instruction list on the UI after the special voice mode is awakened next time;
the exit module comprises an active exit unit for actively exiting the special voice mode and a passive exit unit for passively exiting the special voice mode.
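The two exit paths can be sketched with a small state holder: an active exit triggered by the user, and a passive exit on a timeout. The timeout mechanism and its value are illustrative assumptions; the patent only distinguishes active from passive exit.

```python
import time


class SpecialVoiceMode:
    """Tracks whether the special voice mode is active and how it exits."""

    def __init__(self, timeout_s=10.0):
        self.active = True
        self._deadline = time.monotonic() + timeout_s

    def exit_actively(self):
        # Active exit unit: e.g. user says an exit phrase or taps away
        self.active = False
        return "active_exit"

    def check_passive_exit(self, now=None):
        # Passive exit unit: e.g. silence past the deadline ends the mode
        now = time.monotonic() if now is None else now
        if self.active and now >= self._deadline:
            self.active = False
            return "passive_exit"
        return None
```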
3. A storage medium having stored thereon a computer program which, when executed by a processor, implements the interactive element-oriented voice interaction method according to claim 1.
CN202211545577.1A 2022-12-01 2022-12-01 Voice interaction method, system and storage medium for interaction elements Active CN115877997B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211545577.1A CN115877997B (en) 2022-12-01 2022-12-01 Voice interaction method, system and storage medium for interaction elements

Publications (2)

Publication Number Publication Date
CN115877997A CN115877997A (en) 2023-03-31
CN115877997B true CN115877997B (en) 2023-11-10

Family

ID=85765751

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211545577.1A Active CN115877997B (en) 2022-12-01 2022-12-01 Voice interaction method, system and storage medium for interaction elements

Country Status (1)

Country Link
CN (1) CN115877997B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103092981A (en) * 2013-01-31 2013-05-08 华为终端有限公司 Method and electronic equipment for building speech marks
CN109817204A (en) * 2019-02-26 2019-05-28 深圳安泰创新科技股份有限公司 Voice interactive method and device, electronic equipment, readable storage medium storing program for executing
CN112685535A (en) * 2020-12-25 2021-04-20 广州橙行智动汽车科技有限公司 Voice interaction method, server, voice interaction system and storage medium

Family Cites Families (2)

Publication number Priority date Publication date Assignee Title
US10178293B2 (en) * 2016-06-22 2019-01-08 International Business Machines Corporation Controlling a camera using a voice command and image recognition
US11244677B2 (en) * 2019-04-03 2022-02-08 Loctek Inc. Voice control system and method for moving a column

Also Published As

Publication number Publication date
CN115877997A (en) 2023-03-31

Similar Documents

Publication Publication Date Title
US20200175890A1 (en) Device, method, and graphical user interface for a group reading environment
US11347801B2 (en) Multi-modal interaction between users, automated assistants, and other computing services
US11735182B2 (en) Multi-modal interaction between users, automated assistants, and other computing services
US9720644B2 (en) Information processing apparatus, information processing method, and computer program
KR102022318B1 (en) Method and apparatus for performing user function by voice recognition
US8150699B2 (en) Systems and methods of a structured grammar for a speech recognition command system
KR20200012933A (en) Shortened voice user interface for assistant applications
CN108763552B (en) Family education machine and learning method based on same
CN110275664A (en) For providing the equipment, method and graphic user interface of audiovisual feedback
US11200893B2 (en) Multi-modal interaction between users, automated assistants, and other computing services
CN104375702B (en) A kind of method and apparatus of touch control operation
AU2013287433A1 (en) User interface apparatus and method for user terminal
CN110457105B (en) Interface operation method, device, equipment and storage medium
WO2014151884A2 (en) Device, method, and graphical user interface for a group reading environment
JP2014203208A (en) Information processing unit, information processing method, and computer program
CN109086590B (en) Interface display method of electronic equipment and electronic equipment
EP3593346A1 (en) Graphical data selection and presentation of digital content
CN108766431A (en) It is a kind of that method and electronic equipment are automatically waken up based on speech recognition
CN111899576A (en) Control method and device for pronunciation test application, storage medium and electronic equipment
WO2011082053A1 (en) System and method of using a sense model for symbol assignment
CN110109730A (en) For providing the equipment, method and graphic user interface of audiovisual feedback
CN107132927A (en) Input recognition methods and device and the device for identified input character of character
CN115877997B (en) Voice interaction method, system and storage medium for interaction elements
WO2022213986A1 (en) Voice recognition method and apparatus, electronic device, and readable storage medium
CN113362802A (en) Voice generation method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant