CN115881119A - Disambiguation method and system fusing prosodic features, refrigeration equipment and storage medium - Google Patents


Info

Publication number
CN115881119A
CN115881119A
Authority
CN
China
Prior art keywords
text
voice
features
prosodic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211406732.1A
Other languages
Chinese (zh)
Inventor
马坚
刘卫强
曾谁飞
李敏
孔令磊
张景瑞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qingdao Haier Refrigerator Co Ltd
Haier Smart Home Co Ltd
Original Assignee
Qingdao Haier Refrigerator Co Ltd
Haier Smart Home Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qingdao Haier Refrigerator Co Ltd, Haier Smart Home Co Ltd

Landscapes

  • Machine Translation (AREA)

Abstract

The invention relates to a disambiguation method fusing prosodic features, comprising the following steps: acquiring a voice dialog text and extracting its text features; if the voice dialog text is ambiguous, acquiring the corresponding voice information according to the voice dialog text; extracting prosodic features from the voice information; fusing the text features and the prosodic features to obtain speech-text fusion features; and performing disambiguation on the fusion features according to preset rules, so as to identify the real intent and slot information corresponding to the voice dialog text. Introducing prosodic features to assist disambiguation avoids the limitation of identifying intents and slots from text features alone, improves the ability to disambiguate ambiguous sentences, raises the efficiency of intelligent interactive dialog between the refrigerator and the user, and improves the user experience.

Description

Disambiguation method and system fusing prosodic features, refrigeration equipment and storage medium
Technical Field
The invention relates to the technical field of refrigerator voice assistants, and in particular to a disambiguation method and system fusing prosodic features, refrigeration equipment, and a storage medium.
Background
In the voice assistant field, intent recognition and slot information recognition are two core tasks, and the large number of ambiguous sentences in users' questions to a refrigerator voice assistant can cause disputes in intent recognition or slot recognition. Example 1: "put in several apples" is ambiguous in intent recognition. It can be understood as a declarative sentence, meaning the user has put some apples into the refrigerator without specifying the exact number; it can also be understood as an interrogative sentence, asking how many apples are currently in the refrigerator. Example 2: "add leek egg dumplings" is ambiguous in slot recognition, and may be understood as adding three foods, "leek, egg, and dumpling", or one food, "dumplings with leek-and-egg stuffing".
Ambiguity of this kind cannot be resolved from the text alone; prosodic information is needed to make a definite distinction. For Example 1, when the overall intonation of the user's utterance falls and the stress is on "apple", the intent can be definitely identified as adding food; when the overall intonation rises and the stress is on "several", the intent can be definitely identified as querying the food quantity. For Example 2, if there are clear speech pauses between the three words "leek", "egg", and "dumpling", it can be judged that three foods are being added; if there is no clear pause between them but there is a clear pause after the preceding "add", it can be judged that one food is being added.
At present, most natural language understanding models are encoded and trained on text alone. Because text carries no prosodic information, such models cannot make effective disambiguation judgments when they encounter ambiguous sentences, which leads to intent or slot recognition errors and degrades the user experience. How to make the refrigerator understand the user's language more accurately, eliminate ambiguity, and achieve more natural human-computer interaction has therefore become an urgent problem.
Disclosure of Invention
The invention aims to provide a disambiguation method and system fusing prosodic features, refrigeration equipment, and a storage medium, which solve the limitation of performing intent and slot recognition from text features alone.
To achieve the above object, the present invention provides a disambiguation method fusing prosodic features, comprising the following steps: acquiring a voice dialog text and extracting its text features; if the voice dialog text is ambiguous, acquiring the corresponding voice information according to the voice dialog text; obtaining the prosodic features of the voice dialog text from the voice information; fusing the text features and the prosodic features to obtain voice dialog text fusion features; and performing disambiguation on the fusion features according to preset rules, so as to identify the real intent and slot information corresponding to the voice dialog text.
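As an illustrative sketch, the claimed steps can be wired together as a small pipeline. Everything below, including the function names, the toy disambiguation rule, and the stubbed normal-NLU path, is an assumption for illustration, not the patent's actual implementation:

```python
def extract_text_features(text):
    # Placeholder: whitespace tokenization stands in for the segmentation,
    # POS tagging, and NER the patent describes as text features.
    return {"tokens": text.split()}

def extract_prosodic_features(audio):
    # Placeholder: assume the audio record already carries prosody labels.
    return {"stress": audio.get("stress"), "intonation": audio.get("intonation")}

def fuse(text_features, prosodic_features):
    # Stand-in for the matrix fusion of step S4.
    return {**text_features, **prosodic_features}

def apply_rules(fused):
    # Toy preset rule: stress on the quantifier plus rising intonation
    # marks a quantity query; otherwise treat it as adding food.
    if fused.get("stress") == "several" and fused.get("intonation") == "rise":
        return ("query_food_count", {"food": "apple"})
    return ("add_food", {"food": "apple"})

def disambiguate(dialog_text, audio, ambiguous_sentences):
    """Return (intent, slots) following the five claimed steps."""
    text_features = extract_text_features(dialog_text)      # step 1
    if dialog_text not in ambiguous_sentences:              # step 2 gate
        return ("nlu_direct", text_features)                # normal NLU path (stubbed)
    prosodic = extract_prosodic_features(audio)             # steps 2-3
    fused = fuse(text_features, prosodic)                   # step 4
    return apply_rules(fused)                               # step 5
```

Here the prosody decides the outcome only for sentences found in the ambiguous-sentence set, matching the gating described in the method.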
As a further improvement of the present invention: the text features comprise word segmentation, part-of-speech tagging, and named entity recognition; the prosodic features comprise intonation, stress, pauses, and rhythm in the voice information.
As a further improvement of the present invention, the step "if the voice dialog text is ambiguous, acquiring corresponding voice information according to the voice dialog text" specifically comprises: judging whether the voice dialog text exists in an ambiguous sentence library; and if so, sending a message to a session management module to acquire the voice information corresponding to the voice dialog text.
As a further improvement of the present invention, the step of "obtaining the prosodic features of the voice dialog text according to the voice information" specifically comprises: extracting audio features from the voice information; inputting the audio features into a pre-trained deep learning prosody recognition model to generate a speech text carrying prosody information; and extracting features from the prosody information in the speech text to obtain its prosodic features.
As a further improvement of the present invention, the step of "fusing the text features and the prosodic features to obtain speech-text fusion features" specifically comprises: performing linear transformation and normalization on the text features and the prosodic features to generate a text feature matrix and a prosodic feature matrix; and splicing and fusing the text feature matrix and the prosodic feature matrix to obtain the fusion matrix corresponding to the speech-text fusion features.
As a further improvement of the present invention, the preset rules comprise module declaration rules and general rules; the module declaration rules define the types of rule modules and the execution order among modules; each general rule comprises a condition statement and a result statement, and defines the operation of executing the result statement on any corpus that satisfies the condition statement.
As a further improvement of the present invention, the method further comprises: the intent identification and slot identification are independent of each other.
The invention also provides a disambiguation system fusing prosodic features, comprising: an input module, an ambiguity judging module, a prosodic feature extraction module, a feature fusion module, a rule judging module, and an output module.
The present invention also provides a refrigeration apparatus comprising a memory and a processor, the memory storing a computer program executable on the processor, the processor implementing the steps of the method for disambiguating a fused prosodic feature as described in any of the above when executing the program.
The present invention also provides a storage medium storing a computer program which, when executed by a processor, implements the steps in the method for disambiguating a fused prosodic feature as defined in any one of the above.
Compared with the prior art, the invention has the following beneficial effects: by introducing the prosodic features of speech, the method helps eliminate the ambiguity of the voice dialog text, avoids the limitation of identifying intents and slots from text features alone, improves the ability to disambiguate ambiguous sentences, raises the efficiency of intelligent interactive dialog between the refrigerator and the user, and improves the user experience. Meanwhile, conventional text features and prosodic features are fused, and the final disambiguation of ambiguous sentences is achieved by introducing linguistic rules and deep learning; since the linguistic rules can be customized according to user requirements, disambiguation proceeds within a controllable range and its accuracy is improved.
Drawings
FIG. 1 is a flow chart of a method for disambiguating a fused prosodic feature according to an embodiment of the present invention.
FIG. 2 is a flow chart of a method for disambiguating fused prosodic features in an embodiment of the invention.
FIG. 3 is a schematic flow chart of constructing a deep learning prosody recognition model according to an embodiment of the present invention.
Fig. 4 is a schematic flow chart of a network model for constructing fusion of text features and prosody features in the embodiment of the present invention.
Fig. 5 is a schematic flow chart of building a feature fusion network model in the embodiment of the present invention.
FIG. 6 is a schematic structural diagram of a disambiguation system incorporating prosodic features according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more clear, the technical solutions of the present application will be clearly and completely described below with reference to the detailed description of the present application and the accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present application, and not all of the embodiments. All other embodiments obtained by a person of ordinary skill in the art without any inventive work based on the embodiments in the present application are within the scope of protection of the present application.
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to the same or similar elements or elements having the same or similar functions throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention and are not to be construed as limiting the present invention.
The present application discloses an embodiment of a disambiguation method fusing prosodic features. Although the embodiment below and the flowchart of FIG. 1 present the method operations in a particular order, execution is not limited to that order: steps with no necessary logical or causal dependency may be performed in other orders by conventional, non-inventive effort.
As shown in fig. 1, an embodiment of the present invention provides a method for disambiguating a fused prosodic feature, where the method includes the following steps, and the method and each step are described below:
s1, acquiring a voice conversation text and extracting text features of the voice conversation text.
In one embodiment of the present invention, personal mobile phones, computers, and various smart home appliances, such as smart refrigerators and smart speakers, are often equipped with applications such as voice assistants. Such applications enable voice interaction between a person and a machine; the smart refrigerator is preferred in the present invention. A user's interactive message to the refrigerator may ask the voice assistant to do something, such as play music; to execute a command, such as adding a schedule entry; or to answer a question, such as what the capital of China is called. After receiving the voice information, the voice assistant first performs Automatic Speech Recognition (ASR), converting the speech into text; it then understands and processes the converted text through Natural Language Processing (NLP); finally, according to the understanding result, it produces a corresponding spoken answer through Text To Speech (TTS), completing the interaction with the user.
Understanding an acquired voice dialog text relies on Natural Language Processing (NLP), which can extract conventional text features through a variety of techniques, including text preprocessing, lexical analysis, syntactic analysis, semantic understanding, word segmentation, part-of-speech tagging, text similarity, and named entity recognition. The user's intent can usually be recognized from these conventional text features, but if the text is ambiguous, relying on them alone may lead to misrecognized intent.
And S2, if the voice dialog text has ambiguity, acquiring voice information corresponding to the voice dialog text.
When a corresponding voice dialog text is generated through speech recognition and natural language processing, one text may admit several readings, which easily produces ambiguity and misunderstanding. For example, "put in several apples" cannot be identified from the words alone as a declarative sentence or a question, so the smart refrigerator is easily confused: it cannot recognize the user's real intent or slot information, the answer it gives may not meet the user's actual need, and interaction efficiency and user experience both suffer.
In an embodiment of the present invention, whether the voice dialog text is ambiguous is determined by searching for it in an ambiguous sentence library; if it is present, the text is ambiguous, and a further judgment must be made using the voice information corresponding to it. The ambiguous sentence library records and stores information about various ambiguous sentences and is the basis for ambiguity judgment; its contents can be gradually enriched during the disambiguation process.
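A minimal stand-in for such an ambiguous sentence library might look as follows; the class name and its API are hypothetical, chosen only to mirror the lookup-then-enrich behavior described above:

```python
class AmbiguousSentenceLibrary:
    """Toy ambiguous-sentence library: records known ambiguous utterances
    and can be extended as new ones are discovered during disambiguation."""

    def __init__(self, seeds=()):
        self._sentences = set(seeds)

    def is_ambiguous(self, text):
        # The membership test corresponds to the 'search in the library' step.
        return text in self._sentences

    def record(self, text):
        # Gradual enrichment of the library over time.
        self._sentences.add(text)
```

A real system would likely match normalized or templated sentence patterns rather than exact strings; exact membership is used here only to keep the sketch short.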
And S3, acquiring prosodic features of the voice dialog text according to the voice information.
Prosodic features, also called "suprasegmental features" or "supra-sound-quality features", form the phonological structure of a language and are closely related to other linguistic structures such as syntax, discourse structure, and information structure. Specifically, prosodic features refer to variations in pitch, duration, intensity, and timbre in speech, apart from the segmental sound qualities. In spoken interaction, understanding semantics and intent requires the assistance of prosodic features, which are an important vehicle for expressing real intent during dialog.
In general, prosodic features cover three aspects: intonation, stress, and temporal distribution, where temporal distribution refers to pauses and continuations while speaking. Stress and intonation information is hard to express in an ambiguous voice dialog text, so in an embodiment of the present invention, if the voice dialog text is ambiguous, a message is sent to the session management module to obtain the corresponding voice information; the audio features in the voice information are extracted and input into a pre-trained deep learning model to generate a speech text carrying prosody information. The speech text is annotated with a specific text-label format, such as "put several[stress] apples", and the corresponding prosodic features, such as "stress", are extracted from the prosody annotations. This extraction prepares for the subsequent elimination of the ambiguity in the voice dialog text.
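The bracket-style annotation can be parsed with a short helper. The exact tag format `word[stress]` and the tag vocabulary are assumptions modeled on the patent's single example:

```python
import re

# Assumed tag set; the patent only exhibits 'stress' explicitly.
TAG = re.compile(r"(\S+?)\s*\[(stress|rise|fall|pause)\]")

def parse_prosody_tags(annotated):
    """Extract (word, tag) pairs from a bracket-annotated speech text,
    e.g. 'put several[stress] apples' -> [('several', 'stress')]."""
    return TAG.findall(annotated)
```

Each recovered pair would then feed the prosodic feature matrix built in the fusion step.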
And S4, performing feature fusion on the text features and the prosodic features to generate voice dialog text fusion features.
The text features and prosodic features are linearly transformed and normalized to generate a text feature matrix and a prosodic feature matrix, which are then spliced and fused to obtain the fusion matrix corresponding to the speech-text fusion features. Compared with the separate text and prosodic features before fusion, the fused features have a stronger ability to discriminate intents and slots.
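As a rough sketch of this fusion step, using plain lists in place of real feature matrices: the linear layer weights are arbitrary, and L2 normalization is one possible choice, since the patent does not specify which normalization is used.

```python
def l2_normalize(vec):
    # Scale a feature vector to unit length; zero vectors pass through.
    norm = sum(x * x for x in vec) ** 0.5
    return [x / norm for x in vec] if norm else list(vec)

def linear(vec, weights, bias):
    # A single linear transformation y = Wx + b, written out with lists.
    return [sum(w * x for w, x in zip(row, vec)) + b
            for row, b in zip(weights, bias)]

def fuse_features(text_vec, prosody_vec, w_t, b_t, w_p, b_p):
    """Transform each feature vector linearly, normalize, then
    concatenate: the 'splicing and fusing' of the patent text."""
    t = l2_normalize(linear(text_vec, w_t, b_t))
    p = l2_normalize(linear(prosody_vec, w_p, b_p))
    return t + p  # list concatenation = splicing
```

With identity weights and zero bias, a text vector [3, 4] and a prosody vector [1, 0] fuse into one four-dimensional vector.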
And S5, disambiguating the voice text fusion characteristics based on a predefined rule so as to identify the real intention and the slot position information corresponding to the voice text.
In one embodiment of the invention, based on the fusion matrix, the ambiguous parts of the voice dialog text are eliminated by judging against preset rules. The preset rules comprise module declaration rules and general rules. The module declaration rules define the types of rule modules and the execution order among them. For example, suppose the rules comprise a text prosody labeling module and an intent module, with the labeling module executed first and the intent module second: the labeling module first extracts and labels the information features in the corpus, tagging "smart refrigerator" as a device, and the intent module, combining this with analysis of the overall sentence pattern, then determines the corpus intent, such as a food query. Each general rule comprises a condition statement and a result statement, and defines the operation of executing the result statement on any corpus that satisfies the condition statement.
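A toy rendering of the two rule kinds might look like this; the module names, conditions, and actions are all illustrative, not taken from the patent's rule files:

```python
# Module declaration rule: the module types and their execution order.
MODULE_ORDER = ["prosody_labeling", "intent"]

# General rules: (condition statement, result statement) pairs per module.
GENERAL_RULES = {
    "prosody_labeling": [
        (lambda c: "refrigerator" in c["text"],
         lambda c: c["labels"].append("device")),
    ],
    "intent": [
        (lambda c: "stress:several" in c["labels"] and "rise" in c["labels"],
         lambda c: c.update(intent="query_food_count")),
    ],
}

def run_rules(corpus):
    """Apply every general rule, honoring the declared module order:
    result statements fire only on corpora satisfying the condition."""
    for module in MODULE_ORDER:
        for condition, result in GENERAL_RULES[module]:
            if condition(corpus):
                result(corpus)
    return corpus
```

Because labeling runs before intent, the intent rule can rely on labels produced in the earlier module, which is the point of the declared execution order.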
FIG. 2 is a flow chart of a disambiguation method fusing prosodic features. In the embodiment of the invention, the smart refrigerator converts the user's voice into corresponding text data. The ambiguity judging module first determines whether the text data is ambiguous: if it is not an ambiguous sentence, it is handed to the Natural Language Understanding (NLU) module for processing, which outputs the intent and slot recognition result; if it is an ambiguous sentence, the disambiguation process of the invention is entered.
As further shown in fig. 2, after the disambiguation process is entered, the prosodic feature extraction module extracts the selected prosody information to obtain the prosodic features. Which prosody information is selected is determined by the characteristics of the specific scene data: not all prosody information needs to be extracted. Some prosody information has no value for discriminating the ambiguity at hand, so it is neither recognized nor extracted, which saves time and improves the efficiency of prosodic feature extraction. The extracted prosodic feature result is passed to the feature fusion module, which adds the prosodic feature information to the feature matrix of the text information, such as the word segmentation results and part-of-speech tags, fusing them into a new feature matrix that is passed on to the rule judging module. The rule judging module then resolves the ambiguous sentence from the fused feature matrix according to the preset rules and outputs the correct intent and slot recognition result.
In a specific embodiment, the acquired voice dialog text is "put in several apples". The text undergoes Natural Language Processing (NLP), including word segmentation and part-of-speech tagging, while the prosody annotations for the text are obtained from the corresponding voice information, generating the following annotation record:
{"seg": ["put", "has", "several", "apple", "@@@@@@@@"],
 "pos": ["v", "ul", "mq", "n", ""],
 "prosodic": ["", "", "stress", "n", "rise"]}
Text features and prosodic features are extracted and, according to the annotation information, linearly transformed and normalized to generate the corresponding text feature matrix and prosodic feature matrix, which are then fused into a fusion feature matrix; a specific result is shown in fig. 5. In the fusion feature matrix, the word segmentation result comes from the "seg" label, the part-of-speech tags from the "pos" label, and the prosodic features from the "prosodic" label. From the fusion feature matrix it can be seen that "several" in the dialog text carries stress, so the sentence is to be understood as a question rather than a statement: the real query sent to the smart refrigerator asks how many apples have been put into it. Specifically, the sentence pattern of the dialog text is judged from syntactic features such as the parts of speech and their order, and the stress position then determines which intent disambiguation should select. The addition of prosodic features thus eliminates the ambiguity in the dialog text. Of course, other prosodic features also fall within the protection scope of the invention; which prosodic features to extract is analyzed case by case for the specific problem.
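Using a simplified version of the annotation record above (the placeholder token and stray tag dropped), the question-versus-statement decision described here could be sketched as follows; the decision rule is a guess at the patent's logic, namely stress on the quantifier plus a final rising intonation:

```python
def classify_sentence(annotation):
    """Decide question vs. statement from a seg/pos/prosodic annotation."""
    seg, pros = annotation["seg"], annotation["prosodic"]
    stressed = {word for word, tag in zip(seg, pros) if tag == "stress"}
    rising_end = pros[-1] == "rise"
    if "several" in stressed and rising_end:
        return "question"
    return "statement"

# Simplified form of the annotation shown in the embodiment.
annotation = {
    "seg": ["put", "has", "several", "apple", ""],
    "pos": ["v", "ul", "mq", "n", ""],
    "prosodic": ["", "", "stress", "", "rise"],
}
```

The same structure, with stress moved to "apple" and a falling final intonation, would classify as a statement, mirroring the food-adding reading.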
Fig. 3 is a schematic flow chart of constructing a deep learning prosody recognition model in the embodiment of the present invention, which specifically includes the following steps:
s31, collecting a plurality of dialogue voice information as a training data set.
And S32, extracting prosodic features in the training data set.
And S33, outputting prosody labels according to the prosody characteristics.
And S34, training the deep learning prosody recognition model according to the prosodic features and the prosody labels, continuously adjusting the model parameters to generate the deep learning prosody recognition model.
In the embodiment of the invention, a prosody recognition model to be trained is constructed based on a deep learning algorithm, and a number of dialog voice recordings are collected as the training sample set. The voice information is preprocessed: silence and noise at the head and tail ends are removed to reduce interference with subsequent steps, and the signal is processed in segments to generate multi-frame data. Features are then extracted from the multi-frame data using linear prediction cepstral coefficients and Mel cepstral coefficients, and the corresponding prosody information is extracted to obtain multi-dimensional prosodic features. These are input into an acoustic model, which outputs the phoneme information related to the corresponding prosody; with a dictionary and the mapping between characters or words and phonemes, the related voice dialog text information, including prosody labels, is obtained. The deep learning prosody recognition model is then trained from the prosodic features and prosody labels, with the model parameters continuously adjusted to generate the final model.
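The segmentation into multi-frame data can be illustrated with a simple framing helper. The frame length and hop below are placeholders; real front ends typically use around 25 ms frames with 10 ms hops before computing LPCC or MFCC features:

```python
def frame_signal(samples, frame_len, hop):
    """Split a 1-D sample sequence into overlapping fixed-length frames,
    as done before cepstral feature extraction; trailing samples that do
    not fill a whole frame are dropped in this simplified version."""
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frames.append(samples[start:start + frame_len])
    return frames
```

Each frame would then yield one feature vector, and the sequence of vectors forms the multi-frame data fed to the acoustic model.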
The model extracts the part of a voice dialog text to be recognized that is highly related to prosody information, namely the prosodic features, and automatically classifies and annotates the prosody information of the text data according to those features.
Fig. 4 is a schematic flow diagram of constructing a feature fusion network in the embodiment of the present invention, which specifically includes the following steps:
and S41, inputting the voice dialog text to be processed into the trained intention recognition model and the deep learning prosody recognition model to respectively obtain a text characteristic matrix and a prosody characteristic matrix.
And S42, inputting the text feature matrix and the prosody feature matrix into a feature fusion network to be trained to generate a fusion feature matrix.
S43, disambiguating according to the preset rule and the fusion feature matrix, and extracting the intention features and the slot features.
And S44, adjusting parameters of the fusion network according to the intention characteristics and the slot position characteristics to obtain a characteristic fusion network model.
The feature fusion network model can automatically output the correct intent and slot recognition results from the input text features and prosodic features. By integrating prosodic features into the text features, it eliminates the ambiguity arising in voice dialog texts, gives the fused features stronger intent and slot recognition ability and higher robustness, and keeps the intent recognition and slot information recognition processes independent of each other.
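A toy forward pass with two independent heads illustrates the claimed independence of intent recognition and slot recognition; the weights, dimensions, and argmax decision are invented for the example, not the patent's network:

```python
def fusion_forward(fused_vec, w_intent, w_slot):
    """Feed one fused feature vector to two independent linear heads and
    return (intent_index, slot_index); neither head's output affects the
    other, mirroring the independence of the two recognition processes."""
    def argmax_head(weights):
        scores = [sum(w * x for w, x in zip(row, fused_vec)) for row in weights]
        return max(range(len(scores)), key=scores.__getitem__)
    return argmax_head(w_intent), argmax_head(w_slot)
```

In a trained system the indices would map to intent classes and slot tags; here they simply pick the highest-scoring row of each head.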
An embodiment of the present invention provides a system for disambiguating fused prosodic features, as shown in fig. 6, including:
the input module 100 performs voice recognition and natural language processing on voice sent by a user to the intelligent refrigerator to obtain a corresponding voice conversation text.
The ambiguity judging module 200 is configured to compare the obtained speech dialog text with an ambiguous sentence recorded in the ambiguity sentence library, and judge whether the dialog text belongs to the ambiguous sentence.
The prosodic feature extracting module 300 is configured to label and extract prosodic features of the ambiguous voice dialog text according to the corresponding voice information, so as to generate prosodic features.
And a feature fusion module 400, configured to fuse the prosodic features and the text features into the same feature matrix for subsequent rule processing.
The rule discrimination module 500 is configured to eliminate ambiguity and obtain accurate and unique intention and slot position information through a preset rule definition based on the fusion feature matrix.
And an output module 600, configured to convert the identified intention information and slot position information into corresponding voice information and respond to the user.
The embodiment of the present invention further provides a refrigeration device, where the refrigeration device includes a memory and a processor, where the memory stores instructions, and the processor invokes the instructions in the memory, so that when executed, the refrigeration device implements the method for disambiguating the fused prosody characteristic as described in any of the above.
An embodiment of the present invention further provides a storage medium, where the storage medium stores a computer program, and when the computer program is executed by a processor, the computer program implements the method for disambiguating the fusion prosodic feature as described in any of the above.
In summary, the invention provides a disambiguation method, system, refrigeration equipment, and storage medium fusing prosodic features. Introducing prosodic features helps disambiguate voice dialog texts, avoiding the limitation of identifying intents and slots from text features alone, improving the ability to resolve ambiguous sentences, and improving the user experience. Fusing the text features with the prosodic features takes into account the two elements most strongly correlated with the user's intent and slot recognition, improving the robustness and accuracy of the disambiguation method.
In addition, the invention also introduces linguistic rules to realize final disambiguation discrimination work, thereby ensuring that the disambiguation is carried out in a controllable range.
It should be understood that although the present description refers to embodiments, not every embodiment contains only a single technical solution, and such description is for clarity only, and those skilled in the art should make the description as a whole, and the technical solutions in the embodiments can also be combined appropriately to form other embodiments understood by those skilled in the art.
The above-listed detailed description is only a specific description of a possible embodiment of the present invention and is not intended to limit the scope of the present invention, and equivalent embodiments or modifications made without departing from the technical spirit of the present invention are included in the scope of the present invention.

Claims (10)

1. A disambiguation method fusing prosodic features,
the method comprising:
acquiring a voice dialog text and extracting text features of the voice dialog text;
if the voice dialog text is ambiguous, acquiring corresponding voice information according to the voice dialog text;
obtaining prosodic features of the voice dialog text according to the voice information;
performing feature fusion on the text features and the prosodic features to obtain a voice dialog text fusion feature; and
performing disambiguation processing on the voice dialog text fusion feature based on a preset rule, so as to identify the real intent and slot information corresponding to the voice dialog text.
2. The disambiguation method fusing prosodic features of claim 1, wherein:
the text features comprise word segmentation, part-of-speech tagging, and named entity recognition; and
the prosodic features comprise intonation, stress, pauses, and rhythm in the voice information.
3. The disambiguation method fusing prosodic features of claim 1, wherein "if the voice dialog text is ambiguous, acquiring corresponding voice information according to the voice dialog text" specifically comprises:
determining whether the voice dialog text exists in an ambiguous-sentence library; and
if so, sending information to a session management module to acquire the voice information corresponding to the voice dialog text.
4. The disambiguation method fusing prosodic features of claim 1, wherein "obtaining the prosodic features of the voice dialog text according to the voice information" specifically comprises:
extracting audio features from the voice information;
inputting the audio features into a pre-trained deep-learning prosody recognition model to generate a voice text carrying prosody information; and
extracting features from the prosody information in the voice text to obtain the prosodic features of the voice text.
5. The disambiguation method fusing prosodic features of claim 1, wherein "performing feature fusion on the text features and the prosodic features to obtain the voice dialog text fusion feature" specifically comprises:
performing linear transformation and normalization on the text features and the prosodic features to generate a text feature matrix and a prosodic feature matrix; and
splicing and fusing the text feature matrix and the prosodic feature matrix to obtain a fusion matrix corresponding to the voice text fusion feature.
6. The disambiguation method fusing prosodic features of claim 1, wherein the preset rule is defined in a rule file, the rule file comprising module declaration rules and general rules;
the module declaration rules define the types of the rule modules and the execution order among the modules; and
the general rules comprise condition statements and result statements, and define the operation of executing a result statement on corpora that satisfy the corresponding condition statement.
7. The disambiguation method fusing prosodic features of claim 1, wherein the intent recognition and the slot recognition are performed independently of each other.
8. A disambiguation system fusing prosodic features, the system comprising: an input module, an ambiguity judgment module, a prosodic feature extraction module, a feature fusion module, a rule judgment module, and an output module.
9. A refrigeration device, comprising a processor and a memory storing a computer program operable on the processor, wherein the processor, when executing the computer program, implements the disambiguation method fusing prosodic features of any one of claims 1-7.
10. A storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the steps of the disambiguation method fusing prosodic features of any one of claims 1-7.
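Outside the claims themselves, the audio-to-prosody pipeline of claim 4 can be sketched as below. The frame sizes, the energy/zero-crossing cues, and the threshold-based stress detector standing in for the trained deep-learning prosody model are all illustrative assumptions, not details from the patent.

```python
import numpy as np

def frame_features(signal, sr=16000, frame_ms=25, hop_ms=10):
    """Split audio into overlapping frames and compute simple per-frame
    cues (energy and zero-crossing rate) that a prosody model could consume."""
    frame = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    feats = []
    for start in range(0, len(signal) - frame + 1, hop):
        f = signal[start:start + frame]
        energy = float(np.mean(f ** 2))
        zcr = float(np.mean(np.abs(np.diff(np.sign(f))) > 0))
        feats.append((energy, zcr))
    return np.array(feats)

def mark_stress(feats, ratio=2.0):
    """Stand-in for the pre-trained prosody recognition model: flag frames
    whose energy exceeds `ratio` times the utterance mean as stressed."""
    energy = feats[:, 0]
    return energy > ratio * energy.mean()

# 1 s of silence with a loud 100 ms burst in the middle.
sig = np.zeros(16000)
sig[8000:9600] = np.sin(np.linspace(0, 200 * np.pi, 1600))
feats = frame_features(sig)
print(mark_stress(feats).any())  # True
```

The stressed-frame mask plays the role of the prosody information that claim 4 attaches to the voice text before the prosodic features are extracted from it.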
CN202211406732.1A 2022-11-10 2022-11-10 Disambiguation method and system integrating rhythm characteristics, refrigeration equipment and storage medium Pending CN115881119A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211406732.1A CN115881119A (en) 2022-11-10 2022-11-10 Disambiguation method and system integrating rhythm characteristics, refrigeration equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115881119A true CN115881119A (en) 2023-03-31

Family

ID=85759663


Country Status (1)

Country Link
CN (1) CN115881119A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination