CN112185374A - Method and device for determining voice intention - Google Patents
Method and device for determining voice intention
- Publication number
- CN112185374A (application number CN202010929640.6A)
- Authority
- CN
- China
- Prior art keywords
- voice
- intention
- context
- template
- module
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/211—Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/183—Speech classification or search using natural language modelling using context dependencies, e.g. language models
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Theoretical Computer Science (AREA)
- Artificial Intelligence (AREA)
- General Health & Medical Sciences (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Human Computer Interaction (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Machine Translation (AREA)
Abstract
The invention discloses a method and a device for determining a voice intention, which are used to obtain a more accurate voice intention and thereby facilitate accurate voice control. The method comprises the following steps: obtaining an input voice; obtaining a scene context associated with the voice; matching the voice content of the voice and the scene context against preset intention templates; and determining the voice intention according to the matching intention template.
Description
Technical Field
The present invention relates to the field of computer and communication technologies, and in particular, to a method and an apparatus for determining a voice intention.
Background
Artificial intelligence is one of the important technical fields of current research, and image technology and voice technology are two of its important basic technologies. Within voice technology, how to understand a speaker's intention more accurately is an important research direction. One common processing method is to convert the user's voice into text and then perform sentence-structure analysis on the text to determine the voice intention. The voice intention obtained in this way is not sufficiently accurate.
Disclosure of Invention
The invention provides a method and a device for determining a voice intention, which are used to obtain a more accurate voice intention and thereby facilitate accurate voice control.
The invention provides a method for determining a voice intention, which comprises the following steps:
obtaining an input voice;
obtaining a context associated with the speech;
matching the voice content of the voice and the scene context against preset intention templates;
and determining the voice intention according to the matching intention template.
The technical solution provided by this embodiment of the invention can have the following beneficial effect: by combining scene information beyond the voice content itself, the voice intention obtained by analysis is more accurate.
Optionally, the scene context includes at least one of: the time at which the voice is obtained, the position of the user providing the voice, environment information of the environment where the user is located, user portrait information of the user, and application state information of an application module.
The technical solution provided by this embodiment of the invention can have the following beneficial effect: various kinds of scene information about the situation of the user providing the voice are acquired, so that the user's voice intention can be analyzed from more dimensions.
Optionally, the method further comprises at least one of:
carrying out acoustic analysis on the voice to obtain acoustic characteristic information corresponding to voice content;
obtaining a dialog context for the speech;
matching the voice with historical voice of a user providing the voice to obtain historical information;
the matching of the voice content of the voice and the contextual context with a preset intention template comprises:
matching the voice content of the voice and the scene context with at least one of the following information and a preset intention template; wherein the following information includes: the acoustic feature information, the dialog context, and the history information.
The technical scheme provided by the embodiment of the invention can have the following beneficial effects: the embodiment further combines the acoustic characteristics of the voice, the context of the conversation and the historical habits of the user to more accurately analyze the voice intention of the user.
Optionally, when there are at least two matching intention templates, the method further includes:
scoring the at least two intention templates according to a plurality of preset scoring modes and the priority of each scoring mode;
the determining the voice intention according to the matching intention template includes:
when a single highest-scoring intention template is obtained, determining the voice intention according to the highest-scoring intention template.
The technical solution provided by this embodiment of the invention can have the following beneficial effect: when multiple intention templates match, multiple scoring modes can be used to select the better intention template, so as to determine the voice intention more accurately.
Optionally, the multiple scoring modes include, in order of priority from high to low: a template scoring mode, a lexical scoring mode and a syntactic scoring mode.
The technical solution provided by this embodiment of the invention can have the following beneficial effect: multiple, prioritized scoring modes are provided, so that templates are evaluated from several angles.
The invention provides a device for determining voice intention, which comprises:
the voice module is used for obtaining input voice;
a context module to obtain a context associated with the speech;
the matching module is used for matching the voice content of the voice and the scene context against preset intention templates;
and the intention module is used for determining the voice intention according to the matching intention template.
Optionally, the scene context includes at least one of: the time at which the voice is obtained, the position of the user providing the voice, environment information of the environment where the user is located, user portrait information of the user, and application state information of an application module.
Optionally, the apparatus further comprises at least one of:
the acoustic module is used for carrying out acoustic analysis on the voice to obtain acoustic characteristic information corresponding to voice content;
the dialogue module is used for obtaining dialogue context of the voice;
the history module is used for matching the voice with the history voice of the user providing the voice to obtain history information;
the matching module includes:
the matching sub-module is used for matching the voice content of the voice, the scene context, and at least one of the following information against preset intention templates; wherein the following information includes: the acoustic feature information, the dialog context, and the history information.
Optionally, when there are at least two matching intention templates, the apparatus further includes:
the scoring module is used for scoring the at least two intention templates according to a plurality of preset scoring modes and the priority of each scoring mode;
the intent module includes:
and the intention submodule is used for determining the voice intention according to the highest-scoring intention template when a single highest-scoring intention template is obtained.
Optionally, the multiple scoring modes include, in order of priority from high to low: a template scoring mode, a lexical scoring mode and a syntactic scoring mode.
The invention provides a device for determining voice intention, which comprises:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to:
obtaining an input voice;
obtaining a context associated with the speech;
matching the voice content of the voice and the scene context against preset intention templates;
and determining the voice intention according to the matching intention template.
The present invention provides a computer-readable storage medium having stored thereon computer instructions which, when executed by a processor, implement the steps of the above method.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings:
FIG. 1 is a flow chart of a method for determining a voice intention in an embodiment of the present invention;
FIG. 2 is a flow chart of a method for determining a voice intention in an embodiment of the present invention;
FIG. 3 is a flow chart of a method for determining a voice intention in an embodiment of the present invention;
FIG. 4 is a block diagram of an apparatus for determining a voice intention in an embodiment of the present invention;
FIG. 5 is a block diagram of an apparatus for determining a voice intention in an embodiment of the present invention;
FIG. 6 is a block diagram of a matching module in an embodiment of the invention;
FIG. 7 is a block diagram of an apparatus for determining a voice intention in an embodiment of the present invention;
FIG. 8 is a block diagram of an intent module in an embodiment of the invention.
Detailed Description
The preferred embodiments of the present invention are described below in conjunction with the accompanying drawings. It should be understood that they are presented here for illustration and explanation only and do not limit the invention.
In the related art, how to understand a speaker's intention more accurately is an important research direction of voice technology. One common processing method is to convert the user's voice into text and then perform sentence-structure analysis on the text to determine the voice intention. The voice intention obtained in this way is not sufficiently accurate.
To solve this problem, this embodiment combines scene context beyond the voice content itself, so as to analyze the user's voice intention more accurately.
Referring to fig. 1, the method for determining a voice intention in the present embodiment includes:
step 101: an input speech is obtained.
Step 102: a contextual context associated with the speech is obtained.
Step 103: matching the voice content of the voice and the scene context against preset intention templates.
Step 104: determining the voice intention according to the matching intention template.
If no matching intention template exists, the process ends, and a notification indicating that voice recognition failed may be fed back to the user.
The execution subject of this embodiment may be a central control device in a user's home. The received voice may be a wake-up voice that wakes an application module, or a command voice that inputs a control command to an application module. The voice-controlled application module may reside in the central control device itself, or in a smart device that has a network connection to the central control device.
On the basis of the voice content, this embodiment adds the scene context, i.e., information about the user's current situation, so that a more appropriate intention template is matched and the user's voice intention is determined more accurately.
Optionally, the scene context includes at least one of: the time at which the voice is obtained, the position of the user providing the voice, environment information of the environment where the user is located, user portrait information of the user, and application state information of an application module.
The scene information contained in the scene context complements the voice content. For example, suppose the voice content is "open the door" and the application module currently providing application state information is the video-call application of the building's entrance guard. The user's voice intention is then to open the building entrance door, not the door of the user's own home.
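The disambiguation above can be sketched in code. This is a minimal hypothetical illustration, not the patent's implementation; the function and application names (`resolve_door_intent`, `entrance_video_call`) are invented for the example.

```python
# Hypothetical sketch: the application state in the scene context
# disambiguates the same voice content ("open the door") between two
# candidate intentions. All names here are illustrative.
def resolve_door_intent(voice_content: str, active_app: str) -> str:
    """Pick a door-opening intention using the active application module."""
    if voice_content != "open the door":
        return "unknown"
    # If the building entrance's video-call app is active, the user most
    # likely means the building entrance door, not their home door.
    if active_app == "entrance_video_call":
        return "open_building_entrance_door"
    return "open_home_door"
```

The same pattern extends to the other scene dimensions (time, position, environment, user portrait) by adding further conditions or features.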
The position of the user includes a geographic position, a position within the home, and the like. The geographic position may be latitude and longitude coordinates, or a province, city, district, and street. A position within the home may be, for example, a bedroom or a living room. The position can thus be described from multiple angles.
The environmental information includes temperature information, weather information, and the like.
The user portrait information includes: age, gender, occupation, family role (e.g., father), etc.
The application module may be a local application module: when one of its functions is triggered, it sends its current application state information to the voice processing front end through the operating system. The application module may also reside in an external smart device: for example, the execution subject is a local central control device in a home, and the application module runs in a smart device such as an alarm clock, an entrance guard, or a speaker; when one of its functions is triggered, it sends its current application state information to the voice processing front end through the network. The application state information includes a sleep state, an active state, and the like.
Optionally, the method further comprises at least one of steps A1 to A3:
Step A1: and carrying out acoustic analysis on the voice to obtain acoustic characteristic information corresponding to the voice content.
Step A2: a dialog context for the speech is obtained.
Step A3: and matching the voice with the historical voice of the user providing the voice to obtain historical information.
The step 103 comprises: step A4.
Step A4: matching the voice content of the voice, the scene context, and at least one of the following information against preset intention templates; wherein the following information includes: the acoustic feature information, the dialog context, and the history information.
The acoustic feature information in this embodiment includes speech rate, stress, and the like. When speaking, people tend to emphasize key words and slow down over them. This characteristic helps in analyzing the user's intention.
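As a rough illustration of how stress might be recovered from speech rate, the sketch below flags words spoken noticeably more slowly than the utterance's average per-character rate. This is a hypothetical heuristic, assuming per-word durations are already available from the acoustic front end; the slowdown factor of 1.5 is arbitrary.

```python
# Hypothetical heuristic: a word is "stressed" if its per-character
# duration is well above the utterance's average per-character duration.
def stressed_words(words, durations, slowdown=1.5):
    """words: list of str; durations: seconds each word took to utter."""
    rates = [d / max(len(w), 1) for w, d in zip(words, durations)]
    avg = sum(rates) / len(rates)
    return [w for w, r in zip(words, rates) if r > slowdown * avg]
```

A speaker dwelling on "bedroom" in "turn on the bedroom light" would then surface that word as a likely keyword for intention analysis.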
The dialog context includes successive pieces of voice content from the same user. Within one round of dialog, a user's preceding and following utterances are often strongly correlated, so they can assist intention analysis.
The history information can reflect the user's language habits, living habits, and the like. The currently received voice can be matched against the user's historical voices; when a match is found, history information is obtained from the matching historical voice. The similarity threshold for this match may be set relatively low. In addition, the matching may be restricted to historical voices from the same time of day. For example, the current time is 7 a.m. and the received voice is "turn on the light"; the voice is matched against historical voices from around 7 a.m. over the past 5 days (a preset time period), a matching historical voice is "turn on the bathroom light", and the history information therefore includes "bathroom".
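The history lookup in this example can be sketched as follows. It is a hypothetical simplification: it assumes historical utterances are stored as (timestamp, text, extra detail) tuples, matches on exact text rather than a low similarity threshold, and uses an illustrative 30-minute time-of-day tolerance.

```python
from datetime import datetime, timedelta

# Hypothetical sketch: find past utterances of the same text made at
# roughly the same time of day within a recent window, and reuse their
# extra detail (e.g. "bathroom") as history information.
def history_info(now, text, history, days=5, tolerance_min=30):
    """history: list of (datetime, text, extra_detail) tuples."""
    def minutes(t):
        return t.hour * 60 + t.minute
    hits = []
    for when, past_text, detail in history:
        recent = (now - when) <= timedelta(days=days)
        same_slot = abs(minutes(now) - minutes(when)) <= tolerance_min
        if recent and same_slot and past_text == text:
            hits.append(detail)
    return hits
```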
On the basis of the voice content and the scene context, this embodiment further supplements the acoustic feature information, the dialog context, and the history information. This helps match a higher-quality intention template, so that the user's voice intention can be analyzed more accurately.
Optionally, when there are at least two matching intention templates, the method further includes: step B1.
Step B1: and scoring the at least two intention templates according to a plurality of preset scoring modes and the priority of each scoring mode.
The step 104 comprises: step B2.
Step B2: and when the highest-grade intention template is obtained, determining the voice intention according to the highest-grade intention template.
In this embodiment, an intention template includes slots and slot values, and there may be multiple slots. For example, an intention template takes the form { action: open, room: bedroom, device: desk lamp }, where action, room, and device are slots, and open, bedroom, and desk lamp are their corresponding values. Suppose the received voice is "turn on the bedroom light" and there are intention template 1 { action: open, device: lamp }, intention template 2 { action: open, device: desk lamp }, and intention template 3 { action: open, room: bedroom, device: desk lamp }. The voice may match all of templates 1 to 3. At this point, a higher-quality intention template needs to be selected to determine the voice intention more accurately.
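The slot-based matching just described can be sketched as a toy. This is hypothetical, not the patent's matcher: the lexicon and templates are invented, and a template matches when every slot/value pair it specifies is extracted from the utterance.

```python
# Hypothetical toy matcher: an intention template matches when every
# slot/value pair it specifies is found in the parsed utterance.
LEXICON = {  # illustrative phrase -> (slot, value) mapping
    "turn on": ("action", "open"),
    "bedroom": ("room", "bedroom"),
    "light": ("device", "lamp"),
}

def extract_slots(utterance):
    slots = {}
    for phrase, (slot, value) in LEXICON.items():
        if phrase in utterance:
            slots[slot] = value
    return slots

def matching_templates(utterance, templates):
    slots = extract_slots(utterance)
    return [t for t in templates
            if all(slots.get(s) == v for s, v in t.items())]

TEMPLATES = [
    {"action": "open", "device": "lamp"},
    {"action": "open", "room": "kitchen", "device": "lamp"},
    {"action": "open", "room": "bedroom", "device": "lamp"},
]
```

Here "turn on the bedroom light" matches the first and third templates, leaving exactly the kind of tie that the scoring modes are needed to break.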
This embodiment adopts multiple scoring modes, applied in order from high priority to low. When one scoring mode yields several intention templates tied for the highest score, scoring continues with the next scoring mode. When a scoring mode yields a single highest-scoring intention template, scoring ends and the subsequent scoring modes are not applied. In this way the embodiment obtains the highest-quality intention template, and thus a more accurate voice intention.
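The priority cascade just described can be sketched as follows — a minimal hypothetical version in which each scoring mode is a function, tried from highest to lowest priority until a unique winner emerges.

```python
# Hypothetical sketch of the cascaded scoring: scorers are applied in
# priority order; scoring stops as soon as one scorer yields a unique
# top-scoring template.
def pick_best(templates, scorers):
    """scorers: scoring functions ordered from highest to lowest priority."""
    candidates = list(templates)
    for score in scorers:
        best = max(score(t) for t in candidates)
        candidates = [t for t in candidates if score(t) == best]
        if len(candidates) == 1:
            return candidates[0]  # unique winner: skip remaining scorers
    return candidates[0]  # still tied after all scorers
```

With the patent's ordering, `scorers` would hold the template, lexical, and syntactic scoring functions, in that order.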
Optionally, the multiple scoring modes include, in order of priority from high to low: a template scoring mode, a lexical scoring mode and a syntactic scoring mode.
In this embodiment, the template scoring mode scores all intention templates in advance; once the matching intention templates are determined, their scores can be looked up directly.
The lexical scoring mode analyzes and scores the number and quality of the slots matched in each matching intention template. The more slots matched, the higher the score. The higher the quality of the matched slots, the higher the score and the better the user's real intention is reflected. Quality is reflected in the parts of speech (verbs, nouns, etc.) of the slots: the more diverse the parts of speech, the higher the quality and the better the user's real intention is captured.
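A hypothetical lexical score consistent with this description might combine the matched-slot count with part-of-speech diversity; the 0.5 weighting below is invented purely for illustration.

```python
# Hypothetical lexical score: more matched slots score higher, and a
# greater diversity of parts of speech among them scores higher still.
def lexical_score(matched_slots):
    """matched_slots: list of (slot_name, part_of_speech) pairs."""
    count = len(matched_slots)
    pos_diversity = len({pos for _, pos in matched_slots})
    return count + 0.5 * pos_diversity  # weight 0.5 is illustrative
```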
The syntactic scoring mode rewards matching intention templates whose syntactic structures are more complex, since these can express the user's intention more completely. For example, an intention template that can capture complex syntactic structures such as negation or rhetorical questions is of higher quality and scores higher.
The implementation is described in detail below through several embodiments.
Referring to fig. 2, the method for determining a voice intention in the present embodiment includes:
step 201: an input speech is obtained.
Step 202: a contextual context associated with the speech is obtained. The context includes at least one of: the application module is used for obtaining the time of the voice, the position of the user providing the voice, the environment information of the environment where the user is located, the user image information of the user and the application state information.
Step 203: and carrying out acoustic analysis on the voice to obtain acoustic characteristic information corresponding to the voice content.
Step 204: a dialog context for the speech is obtained.
Step 205: and matching the voice with the historical voice of the user providing the voice to obtain historical information.
The steps 202 to 205 are relatively independent steps, and the execution sequence may be interchanged or performed synchronously.
Step 206: matching the voice content of the voice and the scene context with at least one of the following information and a preset intention template; wherein the following information includes: the acoustic feature information, the dialog context, and the history information.
Step 207: and determining the voice intention according to the matched and consistent intention template.
Referring to fig. 3, the method for determining a voice intention in the present embodiment includes:
step 301: an input speech is obtained.
Step 302: a contextual context associated with the speech is obtained. The context includes at least one of: the application module is used for obtaining the time of the voice, the position of the user providing the voice, the environment information of the environment where the user is located, the user image information of the user and the application state information.
Step 303: and carrying out acoustic analysis on the voice to obtain acoustic characteristic information corresponding to the voice content.
Step 304: a dialog context for the speech is obtained.
Step 305: and matching the voice with the historical voice of the user providing the voice to obtain historical information.
The steps 302 to 305 are relatively independent steps, and the execution sequence can be interchanged or can be performed synchronously.
Step 306: matching the voice content of the voice and the scene context with at least one of the following information and a preset intention template; wherein the following information includes: the acoustic feature information, the dialog context, and the history information.
Step 307: when at least two matched intention templates exist, scoring is carried out on the at least two intention templates according to multiple preset scoring modes and the priority of each scoring mode.
Step 308: and when the highest-grade intention template is obtained, determining the voice intention according to the highest-grade intention template.
The above embodiments can be freely combined according to actual needs.
The above describes how determining a voice intention is implemented; this can be realized by a device, whose internal structure and functions are described below.
Referring to fig. 4, the apparatus for determining a voice intention in the present embodiment includes: a speech module 401, a context module 402, a matching module 403 and an intent module 404.
A voice module 401, configured to obtain an input voice.
A context module 402 configured to obtain a context associated with the speech.
A matching module 403, configured to match the voice content of the voice and the scene context against preset intention templates.
An intent module 404, configured to determine the voice intention according to the matching intention template.
Optionally, the scene context includes at least one of: the time at which the voice is obtained, the position of the user providing the voice, environment information of the environment where the user is located, user portrait information of the user, and application state information of an application module.
Optionally, as shown in fig. 5, the apparatus further includes at least one of the following: an acoustic module 501, a dialogue module 502, and a history module 503.
The acoustic module 501 is configured to perform acoustic analysis on the speech to obtain acoustic feature information corresponding to speech content.
A dialog module 502 for obtaining a dialog context for the speech.
A history module 503, configured to match the voice with a history voice of a user providing the voice, so as to obtain history information.
As shown in fig. 6, the matching module 403 includes: a matching sub-module 601.
A matching sub-module 601, configured to match the voice content of the voice, the scene context, and at least one of the following information against preset intention templates; wherein the following information includes: the acoustic feature information, the dialog context, and the history information.
Optionally, as shown in fig. 7, when there are at least two matching intention templates, the apparatus further includes: a scoring module 701.
The scoring module 701 is configured to score the at least two intention templates according to a plurality of preset scoring manners and priorities of the scoring manners.
As shown in fig. 8, the intent module 404 includes: intention submodule 801.
The intention submodule 801 is configured to, when a single highest-scoring intention template is obtained, determine the voice intention according to that highest-scoring intention template.
Optionally, the multiple scoring modes include, in order of priority from high to low: a template scoring mode, a lexical scoring mode and a syntactic scoring mode.
An apparatus to determine a speech intent, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to:
obtaining an input voice;
obtaining a context associated with the speech;
matching the voice content of the voice and the scene context against preset intention templates;
and determining the voice intention according to the matching intention template.
A computer-readable storage medium having stored thereon computer instructions which, when executed by a processor, implement the steps of the above method.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.
Claims (12)
1. A method of determining a voice intention, comprising:
obtaining an input voice;
obtaining a scenario context associated with the voice;
matching voice content of the voice and the scenario context against preset intention templates;
and determining the voice intention according to the intention template that is successfully matched.
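As a rough illustration of claim 1, the matching step can be sketched as below. This is a minimal sketch under assumed data shapes: the template format (`phrase`, `context`, `intent` fields) and all names are hypothetical, not defined by the patent.

```python
# Illustrative sketch of claim 1: determine a voice intention by matching
# recognized voice content and the scenario context against preset intention
# templates. The template structure here is hypothetical.

def determine_intent(voice_content, context, templates):
    """Return the intent of the first template whose phrase appears in the
    voice content and whose context constraints all hold, or None."""
    for tpl in templates:
        phrase_ok = tpl["phrase"] in voice_content
        context_ok = all(context.get(k) == v for k, v in tpl["context"].items())
        if phrase_ok and context_ok:
            return tpl["intent"]
    return None

TEMPLATES = [
    {"phrase": "turn on the light", "context": {"location": "home"}, "intent": "light_on"},
    {"phrase": "turn on the light", "context": {"location": "car"}, "intent": "headlight_on"},
]

# Same utterance, different scenario context, different intent:
print(determine_intent("please turn on the light", {"location": "car"}, TEMPLATES))
# -> headlight_on
```

The point of the context constraint is visible in the example: the same phrase resolves to different intents depending on where the user is.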
2. The method of claim 1, wherein the scenario context comprises at least one of: a time at which the voice is obtained, a location of the user providing the voice, environment information of the environment where the user is located, user profile information of the user, and application state information.
3. The method of claim 1, wherein the method further comprises at least one of:
performing acoustic analysis on the voice to obtain acoustic feature information corresponding to the voice content;
obtaining a dialog context of the voice;
matching the voice against historical voices of the user providing the voice to obtain history information;
wherein the matching voice content of the voice and the scenario context against preset intention templates comprises:
matching the voice content of the voice, the scenario context, and at least one of the following information against the preset intention templates, wherein the following information comprises: the acoustic feature information, the dialog context, and the history information.
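Claim 3 extends the matching input with optional signals. A hedged sketch of how those signals might be folded into one matching input is shown below; the field names and the dict-based representation are illustrative assumptions, not taken from the patent.

```python
# Hypothetical sketch of claim 3: merge the mandatory inputs (voice content,
# scenario context) with whichever optional signals are available (acoustic
# features, dialog context, history information) into a single structure
# that a template matcher can inspect uniformly.

def build_match_input(voice_content, context, acoustic=None, dialog=None, history=None):
    """Return a dict containing the mandatory fields plus only those
    optional signals that were actually provided."""
    match_input = {"content": voice_content, "context": dict(context)}
    optional = {"acoustic": acoustic, "dialog": dialog, "history": history}
    match_input.update({k: v for k, v in optional.items() if v is not None})
    return match_input

query = build_match_input(
    "turn it up",
    {"location": "car"},
    dialog=["play some jazz"],  # the previous turn helps resolve what "it" refers to
)
print(sorted(query))  # keys actually present in the match input
# -> ['content', 'context', 'dialog']
```

Absent signals simply do not appear, which matches the claim's "at least one of" phrasing: templates can condition on a signal only when it is present.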
4. The method of claim 1, wherein, when at least two intention templates are successfully matched, the method further comprises:
scoring the at least two intention templates according to a plurality of preset scoring modes and the priority of each scoring mode;
wherein the determining the voice intention according to the successfully matched intention template comprises:
when a highest-scoring intention template is obtained, determining the voice intention according to the highest-scoring intention template.
5. The method of claim 4, wherein the plurality of scoring modes comprises, in descending order of priority: a template scoring mode, a lexical scoring mode, and a syntactic scoring mode.
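The tie-breaking in claims 4 and 5 can be sketched with lexicographic comparison of score tuples: the template score dominates, and the lexical and syntactic scores only matter when higher-priority scores tie. The scoring values below are stand-ins; the patent does not define how each scoring mode computes its score.

```python
# Hypothetical sketch of claims 4-5: when several intention templates match,
# score each candidate under prioritized scoring modes
# (template > lexical > syntactic) and keep the best one.

PRIORITY = ("template_score", "lexical_score", "syntactic_score")

def pick_best(candidates):
    """Compare score tuples lexicographically: a higher template score wins
    outright, and lower-priority scores only break ties."""
    return max(candidates, key=lambda c: tuple(c[m] for m in PRIORITY))

candidates = [
    {"intent": "navigate", "template_score": 0.9, "lexical_score": 0.4, "syntactic_score": 0.8},
    {"intent": "search",   "template_score": 0.9, "lexical_score": 0.7, "syntactic_score": 0.2},
]

# Template scores tie at 0.9, so the lexical score decides:
print(pick_best(candidates)["intent"])
# -> search
```

Python's tuple comparison gives the prioritized behavior for free: `(0.9, 0.7, 0.2) > (0.9, 0.4, 0.8)` because the first differing element wins.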
6. An apparatus for determining a voice intention, comprising:
a voice module, configured to obtain an input voice;
a context module, configured to obtain a scenario context associated with the voice;
a matching module, configured to match voice content of the voice and the scenario context against preset intention templates;
and an intention module, configured to determine the voice intention according to the intention template that is successfully matched.
7. The apparatus of claim 6, wherein the scenario context comprises at least one of: a time at which the voice is obtained, a location of the user providing the voice, environment information of the environment where the user is located, user profile information of the user, and application state information.
8. The apparatus of claim 6, wherein the apparatus further comprises at least one of:
an acoustic module, configured to perform acoustic analysis on the voice to obtain acoustic feature information corresponding to the voice content;
a dialog module, configured to obtain a dialog context of the voice;
a history module, configured to match the voice against historical voices of the user providing the voice to obtain history information;
wherein the matching module comprises:
a matching sub-module, configured to match the voice content of the voice, the scenario context, and at least one of the following information against the preset intention templates, wherein the following information comprises: the acoustic feature information, the dialog context, and the history information.
9. The apparatus of claim 6, wherein, when at least two intention templates are successfully matched, the apparatus further comprises:
a scoring module, configured to score the at least two intention templates according to a plurality of preset scoring modes and the priority of each scoring mode;
wherein the intention module comprises:
an intention sub-module, configured to, when a highest-scoring intention template is obtained, determine the voice intention according to the highest-scoring intention template.
10. The apparatus of claim 9, wherein the plurality of scoring modes comprises, in descending order of priority: a template scoring mode, a lexical scoring mode, and a syntactic scoring mode.
11. An apparatus for determining a voice intention, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to:
obtain an input voice;
obtain a scenario context associated with the voice;
match voice content of the voice and the scenario context against preset intention templates;
and determine the voice intention according to the intention template that is successfully matched.
12. A computer-readable storage medium having stored thereon computer instructions, which when executed by a processor, implement the steps of the method of any one of claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010929640.6A CN112185374A (en) | 2020-09-07 | 2020-09-07 | Method and device for determining voice intention |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112185374A true CN112185374A (en) | 2021-01-05 |
Family
ID=73925647
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010929640.6A Pending CN112185374A (en) | 2020-09-07 | 2020-09-07 | Method and device for determining voice intention |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112185374A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115576216A (en) * | 2022-12-09 | 2023-01-06 | 深圳市人马互动科技有限公司 | Information filling method and device based on voice control intelligent household appliance |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106845624A (en) * | 2016-12-16 | 2017-06-13 | 北京光年无限科技有限公司 | The multi-modal exchange method relevant with the application program of intelligent robot and system |
CN107316635A (en) * | 2017-05-19 | 2017-11-03 | 科大讯飞股份有限公司 | Audio recognition method and device, storage medium, electronic equipment |
CN107633844A (en) * | 2017-10-10 | 2018-01-26 | 杭州嘉楠耘智信息科技股份有限公司 | Apparatus control method and device |
CN109326289A (en) * | 2018-11-30 | 2019-02-12 | 深圳创维数字技术有限公司 | Exempt to wake up voice interactive method, device, equipment and storage medium |
CN109918673A (en) * | 2019-03-14 | 2019-06-21 | 湖北亿咖通科技有限公司 | Semantic referee method, device, electronic equipment and computer readable storage medium |
CN110705267A (en) * | 2019-09-29 | 2020-01-17 | 百度在线网络技术(北京)有限公司 | Semantic parsing method, semantic parsing device and storage medium |
CN110990685A (en) * | 2019-10-12 | 2020-04-10 | 中国平安财产保险股份有限公司 | Voice search method, voice search device, voice search storage medium and voice search device based on voiceprint |
CN111261146A (en) * | 2020-01-16 | 2020-06-09 | 腾讯科技(深圳)有限公司 | Speech recognition and model training method, device and computer readable storage medium |
CN111508482A (en) * | 2019-01-11 | 2020-08-07 | 阿里巴巴集团控股有限公司 | Semantic understanding and voice interaction method, device, equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||