CN115641837A - Intelligent robot conversation intention recognition method and system - Google Patents

Intelligent robot conversation intention recognition method and system

Info

Publication number
CN115641837A
Authority
CN
China
Prior art keywords
voice
emotion
result
content
user
Prior art date
Legal status
Pending
Application number
CN202211652800.2A
Other languages
Chinese (zh)
Inventor
吴伟 (Wu Wei)
Current Assignee
Beijing Ifudata Information Technology Co., Ltd.
Original Assignee
Beijing Ifudata Information Technology Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Beijing Ifudata Information Technology Co., Ltd.
Priority to CN202211652800.2A
Publication of CN115641837A
Legal status: Pending (current)

Landscapes

  • Machine Translation (AREA)

Abstract

The invention relates to the field of voice recognition, and discloses a method and a system for recognizing the dialogue intention of an intelligent robot. A semantic prediction module, an object detection module, a voice emotion module and a semantic judgment module are provided. On the basis of the semantic analysis and recognition of the prior art, physiological states such as the user's facial micro-expressions are monitored to obtain the user's current emotional changes, and the user's voice emotion is analyzed to further confirm those changes; the user's emotional state is then combined with the emotional changes to judge the semantics of the voice content. This improves the accuracy of semantic recognition and effectively solves the prior-art problem that semantic meanings which are ambiguous across different scenes are difficult to recognize reliably.

Description

Intelligent robot dialogue intention recognition method and system
Technical Field
The invention relates to the field of voice recognition, and in particular to a method and a system for recognizing the dialogue intention of an intelligent robot.
Background
With the rapid development of computer technology and robotics, speech recognition systems and the intelligent robots equipped with them are becoming increasingly sophisticated. Robots equipped with speech recognition systems in the prior art can recognize the speech content spoken by a user and perform corresponding control or responses according to the recognition result.
In the prior art, most voice recognition methods collect and recognize the voice, convert the voice information into text information, extract and combine the key content of the text information to judge the semantic meaning, and then execute corresponding response steps or generate and output a feedback response according to the semantic judgment result.
In the prior art, when a semantic meaning is identified, the key content may carry several possible meanings. Which meaning applies is often reflected in the user's emotional context: similar wording expresses different content when the user is in different emotional states. The semantic recognition methods of the prior art cannot effectively distinguish these meanings, so semantic recognition errors occur.
Disclosure of Invention
The invention aims to provide a method and a system for recognizing dialog intentions of an intelligent robot, which aim to solve the problems in the background technology.
In order to achieve the purpose, the invention provides the following technical scheme:
an intelligent robotic dialog intent recognition system, comprising:
the semantic prediction module is used for acquiring voice content, recognizing and converting the voice content into text content, analyzing the text content through a language neural network algorithm and generating semantic prediction results, wherein the number of the semantic prediction results is multiple, and the semantic prediction results are used for representing the information content expressed by the voice content;
the object detection module is used for acquiring physiological information of an object, performing physiological feature analysis on the object through an emotion neural network algorithm and generating an emotion analysis result, wherein the emotion analysis result represents emotion changes of the object expressed through facial micro-expressions and physiological responses, and the emotion analysis result is used for assisting judgment of the semantic prediction result;
the voice emotion module is used for analyzing emotion characteristics of the voice content according to the voice emotion database to generate object voice emotion, the object voice emotion represents emotion change expressed by voice characteristics of an object, and the object voice emotion is used for assisting judgment of the semantic prediction result;
and the semantic judgment module is used for carrying out auxiliary content judgment on the semantic prediction result according to the object speech emotion and the emotion analysis result to generate a speech recognition result, and the speech recognition result is used for representing the information content expressed by the speech content.
As a further scheme of the invention: the voice emotion module comprises:
the resampling unit is used for resampling the voice content according to a preset sampling frequency to generate a voice sampling result, wherein the sampling frequency is determined by the processing efficiency requirement on the voice content and the accuracy requirement on emotional feature analysis, the preset sampling frequency is not greater than the inherent sampling frequency of the voice content, and the signal type of the voice sampling result is a discrete time signal;
the quantization coding unit is used for carrying out hierarchical quantization on the voice sampling result and carrying out binary coding to generate a voice digital signal, and the hierarchical quantization is used for converting a discrete time signal into a digital signal;
the feature analysis unit is used for establishing a spectrogram according to the voice digital signal, and performing feature selection on the voice digital signal according to the spectrogram to generate voice frequency features;
and the emotion judging unit is used for comparing and analyzing the voice frequency characteristics through a preset voice emotion database to generate an emotion analysis result.
As a further scheme of the invention: the voice emotion database comprises a standard emotion database and a personalized emotion database;
the standard emotion database is used for storing preset voice emotion data, and the preset voice emotion data comprises voice frequency features and emotion analysis results corresponding to the voice frequency features;
the personalized emotion database is used for storing user voice emotion data, and the user voice emotion data are used for representing the correspondence between the voice frequency characteristics of the user and the emotion analysis result.
As a further scheme of the invention: still include and judge the feedback module, it includes to judge the feedback module:
the judging request unit is used for outputting a feedback request and receiving a feedback signal, and the feedback signal is used for representing whether the voice recognition result is accurate or not;
and the feedback execution unit is used for responding to the feedback signal, and updating and expanding the personalized voice emotion database according to the voice characteristic frequency and the emotion analysis result if the feedback signal is judged to be accurate.
As a still further scheme of the invention: the object physiological information comprises user facial expression information and user physiological information, the user facial information is used for recording the facial micro-expression changes of a user, the user physiological information is used for representing the physiological state changes of the user, and the physiological state comprises heartbeat blood pressure and the like.
The embodiment of the invention aims to provide an intelligent robot conversation intention identification method, which is characterized by comprising the following steps:
acquiring voice content, identifying and converting the voice content into text content, analyzing the text content through a linguistic neural network algorithm, and generating semantic prediction results, wherein the number of the semantic prediction results is multiple, and the semantic prediction results are used for representing information content expressed by the voice content;
acquiring physiological information of a subject, and performing physiological feature analysis on the subject through an emotional neural network algorithm to generate an emotion analysis result, wherein the emotion analysis result represents emotion changes of the subject expressed through facial micro-expressions and physiological responses, and is used for assisting judgment of the semantic prediction result;
performing emotion feature analysis on the voice content according to a voice emotion database to generate a target voice emotion, wherein the target voice emotion represents emotion change expressed by voice features of a target, and the target voice emotion is used for assisting judgment of the semantic prediction result;
and judging the auxiliary content of the semantic prediction result according to the voice emotion of the object and the emotion analysis result to generate a voice recognition result, wherein the voice recognition result is used for representing the information content expressed by the voice content.
As a further scheme of the invention: the steps of acquiring the physiological information of the object, analyzing the physiological characteristics of the object through an emotional neural network algorithm and generating an emotional analysis result comprise:
resampling the voice content according to a preset sampling frequency to generate a voice sampling result, wherein the sampling frequency is determined by the processing efficiency requirement on the voice content and the accuracy requirement on emotional feature analysis, the preset sampling frequency is not greater than the inherent sampling frequency of the voice content, and the signal type of the voice sampling result is a discrete time signal;
carrying out hierarchical quantization on the voice sampling result, carrying out binary coding, and generating a voice digital signal, wherein the hierarchical quantization is used for converting a discrete time signal into a digital signal;
establishing a spectrogram according to the voice digital signal, and performing feature selection on the voice digital signal according to the spectrogram to generate voice frequency features;
and comparing and analyzing the voice frequency characteristics through a preset voice emotion database to generate an emotion analysis result.
As a further scheme of the invention: the voice emotion database comprises a standard emotion database and a personalized emotion database;
the standard emotion database is used for storing preset voice emotion data, and the preset voice emotion data comprises voice frequency features and emotion analysis results corresponding to the voice frequency features;
the personalized emotion database is used for storing user voice emotion data, and the user voice emotion data are used for representing the correspondence between the voice frequency characteristics of the user and the emotion analysis result.
Compared with the prior art, the invention has the following beneficial effects: by providing the semantic prediction module, the object detection module, the voice emotion module and the semantic judgment module, physiological states such as the user's facial micro-expressions are monitored, on the basis of the semantic analysis and recognition of the prior art, to obtain the user's current emotional changes, and the user's voice emotion is analyzed to further confirm those changes; the user's emotional state is then combined with the emotional changes to judge the semantics of the voice content. This improves the accuracy of semantic recognition and effectively solves the prior-art problem that semantic meanings which are ambiguous across different scenes are difficult to recognize reliably.
Drawings
Fig. 1 is a block diagram showing the structure of an intelligent robot dialogue intention recognition system.
Fig. 2 is a block diagram of a speech emotion module in an intelligent robot dialogue intention recognition system.
Fig. 3 is a flow chart of a method for recognizing dialog intentions of an intelligent robot.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Specific embodiments of the present invention are described in detail below with reference to specific examples.
As shown in fig. 1, an intelligent robot dialog intention recognition system according to an embodiment of the present invention includes:
the semantic prediction module 100 is configured to acquire voice content, recognize and convert the voice content into text content, analyze the text content through a linguistic neural network algorithm, and generate a semantic prediction result, where the number of the semantic prediction results is multiple, and the semantic prediction result is used to represent information content expressed by the voice content.
In this embodiment, the semantic prediction module 100 is basically the same as the semantic analysis method adopted by intelligent robots in the prior art, except in one respect. In the prior art, after the speech is acquired and the semantics are recognized through a speech-to-text method, a single semantic prediction result is usually determined and output through analysis to represent the information expressed by the voice content (that information may be used to generate a dialogue feedback response, that is, for the robot to understand the voice content). The present application instead generates a plurality of higher-probability results and then further determines the semantics through analysis from other angles. The reason is that understanding based only on the text content ignores the changes in human emotion that occur in a dialogue, and such changes often change the semantic content, so the prior art carries an unavoidable source of recognition errors.
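The candidate-generation idea described above can be illustrated with a minimal sketch. The patent does not specify the language neural network algorithm, so a generic intent classifier, the label set and the logits below are illustrative assumptions only:

```python
# Minimal sketch, assuming a generic text-intent classifier: instead of keeping
# only the single best intent, the top-k candidates are retained for the later
# emotion-assisted disambiguation described in this application.
import numpy as np

def top_k_semantic_predictions(intent_logits, intent_labels, k=3):
    """Return the k most probable intents with their softmax probabilities."""
    probs = np.exp(intent_logits - intent_logits.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1][:k]
    return [(intent_labels[i], float(probs[i])) for i in order]

# Example with invented logits for one utterance:
labels = ["complaint", "inquiry", "small_talk"]
print(top_k_semantic_predictions(np.array([2.1, 1.9, -0.5]), labels, k=2))
```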
The object detection module 300 is configured to obtain physiological information of an object, perform physiological feature analysis on the object through an emotional neural network algorithm, and generate an emotion analysis result, where the emotion analysis result represents emotion changes expressed by the object through facial microexpression and physiological responses, and the emotion analysis result is used to assist in determining the semantic prediction result.
In this embodiment, the object detection module 300 is used to monitor the user and may be implemented by sensors, cameras or other devices. In use, the user's emotional state is inferred from feature changes such as the user's facial micro-expressions and from some of the user's physiological features (for example, the heartbeat).
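A hedged sketch of this idea follows. The patent does not define the emotional neural network algorithm, so a simple late fusion over assumed per-class scores from the facial channel and the physiological channel stands in for it; the emotion label set and weighting are illustrative assumptions:

```python
# Illustrative late-fusion stand-in for the unspecified emotional neural
# network: per-class scores from the facial micro-expression channel and the
# physiological channel (e.g. heart rate) are combined with an assumed weight.
import numpy as np

EMOTION_CLASSES = ["neutral", "angry", "happy", "anxious"]  # assumed label set

def fuse_emotion_scores(face_scores, physio_scores, face_weight=0.6):
    """Weighted fusion of per-class scores from the two monitoring channels."""
    fused = face_weight * np.asarray(face_scores) + (1.0 - face_weight) * np.asarray(physio_scores)
    fused /= fused.sum()
    return dict(zip(EMOTION_CLASSES, fused.round(3).tolist()))

# Example with made-up per-class scores from each channel:
print(fuse_emotion_scores([0.2, 0.5, 0.1, 0.2], [0.3, 0.3, 0.1, 0.3]))
```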
And the voice emotion module 500 is used for performing emotion feature analysis on the voice content according to the voice emotion database to generate object voice emotion, wherein the object voice emotion represents emotion change expressed by voice features of an object, and the object voice emotion is used for assisting judgment of the semantic prediction result.
In this embodiment, the speech emotion module 500 performs further emotion analysis on the voice content. The speech a person utters differs in frequency, pitch and other characteristics under different emotional states, so analyzing these characteristics allows the user's emotional state to be further inferred to some extent, assisting the object detection module 300 in the further judgment of the voice content.
And the semantic judgment module 700 is configured to perform auxiliary content judgment on the semantic prediction result according to the object speech emotion and the emotion analysis result, and generate a speech recognition result, where the speech recognition result is used to represent the information content expressed by the speech content.
In this embodiment, the semantic judgment module 700 determines the specific semantic result carried by the voice content by analyzing the user's physiological responses (mainly the changes in facial features that reflect emotion) and the features of the voice (tone, frequency and so on) on the basis of the foregoing emotional state judgment. Specifically, the several semantic prediction results express their meanings in different semantic environments, and the user's emotion is one of the variables that characterizes this environment.
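The following sketch illustrates this disambiguation idea in simplified form; the compatibility table relating candidate meanings to emotions is invented for illustration, since the patent does not specify how that association is stored or learned:

```python
# Sketch of emotion-assisted disambiguation: each candidate meaning is
# re-weighted by how compatible it is with the detected emotion, then the
# weights are renormalized. The compatibility table below is illustrative.
def rescore_candidates(candidates, emotion, compatibility):
    """candidates: list of (meaning, probability); emotion: detected label."""
    rescored = [(m, p * compatibility.get((m, emotion), 1.0)) for m, p in candidates]
    total = sum(p for _, p in rescored) or 1.0
    return sorted(((m, p / total) for m, p in rescored), key=lambda x: x[1], reverse=True)

compat = {("complaint", "angry"): 2.0, ("small_talk", "angry"): 0.3}
candidates = [("complaint", 0.45), ("small_talk", 0.55)]
print(rescore_candidates(candidates, "angry", compat))
# An "angry" user tips the balance toward the complaint reading.
```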
As shown in fig. 2, as another preferred embodiment of the present invention, the speech emotion module 500 includes:
a resampling unit 501, configured to resample the voice content according to a preset sampling frequency, and generate a voice sampling result, where the sampling frequency is determined by a requirement for processing efficiency of the voice content and a requirement for accuracy of emotion feature analysis, the preset sampling frequency is not greater than a natural sampling frequency of the voice content, and a signal type of the voice sampling result is a discrete time signal.
A quantization coding unit 502, configured to perform hierarchical quantization on the voice sampling result and perform binary coding to generate a voice digital signal, where the hierarchical quantization is used to convert a discrete-time signal into a digital signal.
The feature analysis unit 503 is configured to establish a spectrogram according to the voice digital signal, and perform feature selection on the voice digital signal according to the spectrogram to generate a voice frequency feature.
And an emotion judging unit 504, configured to compare and analyze the voice frequency features through a preset voice emotion database, and generate an emotion analysis result.
In this embodiment, the speech emotion module 500 is divided simply by function and mainly covers the steps of sampling, quantization, encoding, feature extraction and emotion judgment. Sampling refers to sampling the sound at a certain sampling frequency according to the requirements: the sound is continuous, while only a finite amount of data can be processed, so the sound must be sampled at a certain frequency (the voice content is produced by collection and therefore already has a certain code rate; since this step is a resampling, the voice content is required to have a code rate high enough to meet the sampling frequency used for resampling). Hierarchical quantization divides the amplitude range of the signal samples into several intervals, assigns the sample values falling into each interval to one class and gives the corresponding quantized value; the quantized values are then binary coded, and features are selected from them for judging and analyzing emotion.
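The sample, quantize, encode and feature-extraction chain described above can be sketched with plain NumPy as follows; the target rate, bit depth and frame length are assumed values, not requirements taken from the patent:

```python
# Sketch of the sample -> quantize -> encode -> spectrogram chain, using plain
# NumPy; the target rate, bit depth and frame length are assumed values.
import numpy as np

def resample(signal, orig_rate, target_rate):
    """Naive resampling by linear interpolation (target_rate <= orig_rate)."""
    n_out = int(len(signal) * target_rate / orig_rate)
    t_out = np.linspace(0, len(signal) - 1, n_out)
    return np.interp(t_out, np.arange(len(signal)), signal)

def quantize_and_encode(samples, bits=8):
    """Uniform (hierarchical) quantization followed by binary integer codes."""
    levels = 2 ** bits
    scaled = np.clip((np.asarray(samples) + 1.0) / 2.0, 0.0, 1.0)  # assumes samples in [-1, 1]
    return np.round(scaled * (levels - 1)).astype(np.uint8)

def spectrogram(codes, frame_len=256):
    """Magnitude spectrogram from framed FFTs of the digital signal."""
    n_frames = len(codes) // frame_len
    frames = codes[: n_frames * frame_len].reshape(n_frames, frame_len).astype(float)
    return np.abs(np.fft.rfft(frames * np.hanning(frame_len), axis=1))

# Example: a 1-second, 16 kHz test tone resampled to an assumed 8 kHz.
tone = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
features = spectrogram(quantize_and_encode(resample(tone, 16000, 8000)))
print(features.shape)  # (number of frames, frame_len // 2 + 1)
```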
As another preferred embodiment of the invention, the voice emotion database comprises a standard emotion database and a personalized emotion database;
the standard emotion database is used for storing preset voice emotion data, and the preset voice emotion data comprises voice frequency features and emotion analysis results corresponding to the voice frequency features.
The personalized emotion database is used for storing user voice emotion data, and the user voice emotion data are used for representing the correspondence between the voice frequency characteristics of the user and the emotion analysis result.
In this embodiment, the standard emotion database is a standard comparison database obtained after learning from a large amount of data; it is built on a relatively balanced standard and can adapt to most users. The personalized emotion database learns and is updated from the user's own voice characteristics and related data as the user continues to use the system, so that the system gains a more personalized emotion database that better adapts to that user.
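A minimal sketch of the two-tier lookup implied here is shown below; both databases are modeled as lists of (feature vector, emotion) pairs and the distance threshold is an assumption, since the patent does not prescribe a storage format or matching rule:

```python
# Two-tier lookup sketch: the per-user (personalized) database is consulted
# first and the standard database is the fallback; both are plain lists of
# (feature_vector, emotion) pairs and the distance threshold is an assumption.
import numpy as np

def match_emotion(features, personalized, standard, max_distance=1.0):
    """Return the emotion of the closest stored feature vector within range."""
    for database in (personalized, standard):        # personalized takes priority
        if not database:
            continue
        dists = [np.linalg.norm(np.asarray(features) - np.asarray(vec)) for vec, _ in database]
        best = int(np.argmin(dists))
        if dists[best] <= max_distance:
            return database[best][1]
    return "unknown"

# Example: an empty personalized database falls back to the standard one.
standard_db = [(np.array([0.1, 0.9]), "angry"), (np.array([0.8, 0.2]), "neutral")]
print(match_emotion([0.15, 0.85], personalized=[], standard=standard_db))
```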
As another preferred embodiment of the present invention, the system further comprises a judgment feedback module, wherein the judgment feedback module comprises:
and the judgment request unit is used for outputting a feedback request and receiving a feedback signal, and the feedback signal is used for representing whether the voice recognition result is accurate or not.
And the feedback execution unit is used for responding to the feedback signal, and updating and expanding the personalized voice emotion database according to the voice characteristic frequency and the emotion analysis result if the feedback signal is judged to be accurate.
In this embodiment, the judgment feedback module can be understood as a mechanism by which the user evaluates the robot's recognition result, with the related units and steps matched inside the system. The judgment request unit outputs the voice recognition result (after a feedback request connection is established with the user-side interaction device) and then receives the user's feedback signal for that result; the feedback signal is controlled and produced by the user. For example, if the user establishes the feedback request connection with a tablet device, the tablet can bring up a feedback program through which the user indicates whether the voice recognition result is accurate. If it is accurate, the feedback execution unit uses the voice characteristic frequency and the emotion analysis result associated with that recognition result to further expand the personalized voice emotion database. The personalized emotion database is established per recognition object (user), so when the system later monitors the same type of voice from the same recognition object, the database can be used for more accurate auxiliary judgment.
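The confirm-then-store loop can be sketched as follows; the interaction with the user-side device is abstracted into a boolean flag, and all names are illustrative rather than taken from the patent:

```python
# Sketch of the confirm-then-store loop: the per-user database grows only when
# the user confirms the recognition result; all names are illustrative.
def apply_feedback(personalized_db, frequency_features, emotion_result, user_confirmed_accurate):
    """Expand the per-user voice emotion database on a confirmed result."""
    if user_confirmed_accurate:
        personalized_db.append((frequency_features, emotion_result))
    return personalized_db

# Usage after the result has been shown on the user's tablet:
db = []
apply_feedback(db, [0.1, 0.9], "angry", user_confirmed_accurate=True)
print(len(db))  # 1
```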
As another preferred embodiment of the present invention, the physiological information of the subject includes user facial expression information and user physiological information; the user facial expression information is used for recording the user's facial micro-expression changes, the user physiological information is used for representing changes in the user's physiological state, and the physiological state includes heartbeat, blood pressure and the like.
In this embodiment, the physiological information of the subject, that is, the physiological characteristics used for emotion judgment, is briefly explained. Its main content is the change of facial micro-expressions, which can reflect changes in human emotion, together with changes in physiological states such as heartbeat and blood pressure. In the prior art there is already considerable research on human emotion, on micro-expressions and physiological changes, and on intelligent neural networks for interpreting them.
As shown in fig. 3, the present invention also provides an intelligent robot dialog intention recognition method, which comprises the following steps:
s200, acquiring voice contents, recognizing and converting the voice contents into character contents, analyzing the character contents through a language neural network algorithm, and generating a plurality of semantic prediction results which are used for representing information contents expressed by the voice contents.
S400, acquiring physiological information of the object, analyzing the physiological characteristics of the object through an emotional neural network algorithm, and generating an emotion analysis result, wherein the emotion analysis result represents emotion changes of the object expressed through facial micro-expressions and physiological reactions, and is used for assisting judgment of the semantic prediction result.
S600, performing emotion feature analysis on the voice content according to the voice emotion database to generate object voice emotion, wherein the object voice emotion represents emotion change expressed by the voice feature of an object, and the object voice emotion is used for assisting judgment of the semantic prediction result.
And S800, judging auxiliary content of the semantic prediction result according to the voice emotion of the object and the emotion analysis result, and generating a voice recognition result, wherein the voice recognition result is used for representing the information content expressed by the voice content.
As another preferred embodiment of the present invention, the step of performing emotion feature analysis on the voice content according to the voice emotion database comprises:
the method comprises the steps of resampling the voice content according to a preset sampling frequency to generate a voice sampling result, wherein the sampling frequency is determined by the processing efficiency requirement on the voice content and the accuracy requirement on emotional feature analysis, the preset sampling frequency is not greater than the inherent sampling frequency of the voice content, and the signal type of the voice sampling result is a discrete time signal.
And carrying out hierarchical quantization on the voice sampling result, carrying out binary coding, and generating a voice digital signal, wherein the hierarchical quantization is used for converting the discrete time signal into the digital signal.
And establishing a spectrogram according to the voice digital signal, and performing feature selection on the voice digital signal according to the spectrogram to generate voice frequency features.
And comparing and analyzing the voice frequency characteristics through a preset voice emotion database to generate an emotion analysis result.
As another preferred embodiment of the invention, the speech emotion database comprises a standard emotion database and a personalized emotion database;
the standard emotion database is used for storing preset voice emotion data, and the preset voice emotion data comprises voice frequency features and emotion analysis results corresponding to the voice frequency features.
The personalized emotion database is used for storing user voice emotion data, and the user voice emotion data are used for representing the correspondence between the voice frequency characteristics of the user and the emotion analysis result.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a non-volatile computer-readable storage medium, and which, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, a database or another medium used in the embodiments provided herein can include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), among others.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (8)

1. An intelligent robotic dialog intent recognition system, comprising:
the semantic prediction module is used for acquiring voice contents, identifying and converting the voice contents into character contents, analyzing the character contents through a language neural network algorithm and generating semantic prediction results, wherein the number of the semantic prediction results is multiple, and the semantic prediction results are used for representing information contents expressed by the voice contents;
the object detection module is used for acquiring physiological information of an object, analyzing physiological characteristics of the object through an emotional neural network algorithm and generating an emotion analysis result, wherein the emotion analysis result represents emotion changes of the object expressed through facial micro-expressions and physiological reactions, and is used for assisting judgment of the semantic prediction result;
the voice emotion module is used for carrying out emotion feature analysis on the voice content according to a voice emotion database to generate object voice emotion, the object voice emotion represents emotion change expressed by voice features of an object, and the object voice emotion is used for assisting judgment of the semantic prediction result;
and the semantic judgment module is used for carrying out auxiliary content judgment on the semantic prediction result according to the object speech emotion and the emotion analysis result to generate a speech recognition result, and the speech recognition result is used for representing the information content expressed by the speech content.
2. The intelligent robotic dialog intent recognition system of claim 1 wherein the speech emotion module comprises:
the resampling unit is used for resampling the voice content according to a preset sampling frequency to generate a voice sampling result, wherein the sampling frequency is determined by the processing efficiency requirement on the voice content and the accuracy requirement on emotional feature analysis, the preset sampling frequency is not greater than the inherent sampling frequency of the voice content, and the signal type of the voice sampling result is a discrete time signal;
the quantization coding unit is used for carrying out hierarchical quantization on the voice sampling result and carrying out binary coding to generate a voice digital signal, and the hierarchical quantization is used for converting a discrete time signal into a digital signal;
the characteristic analysis unit is used for establishing a spectrogram according to the voice digital signal, and performing characteristic selection on the voice digital signal according to the spectrogram to generate voice frequency characteristics;
and the emotion judging unit is used for comparing and analyzing the voice frequency characteristics through a preset voice emotion database to generate an emotion analysis result.
3. The intelligent robot dialogue intention recognition system of claim 2, wherein the voice emotion database comprises a standard emotion database and a personalized emotion database;
the standard emotion database is used for storing preset voice emotion data, and the preset voice emotion data comprises voice frequency features and emotion analysis results corresponding to the voice frequency features;
the personalized emotion database is used for storing user voice emotion data, and the user voice emotion data are used for representing the correspondence between the voice frequency characteristics of the user and the emotion analysis result.
4. The intelligent robotic dialog intent recognition system of claim 3, further comprising a decision feedback module comprising:
the judging request unit is used for outputting a feedback request and receiving a feedback signal, and the feedback signal is used for representing whether the voice recognition result is accurate or not;
and the feedback execution unit is used for responding to the feedback signal, and updating and expanding the personalized voice emotion database according to the voice characteristic frequency and the emotion analysis result if the feedback signal is judged to be accurate.
5. The intelligent robot dialogue intention recognition system of claim 1, wherein the object physiological information comprises user facial expression information and user physiological information, the user facial expression information is used for recording facial micro-expression changes of a user, the user physiological information is used for representing changes of physiological states of the user, and the physiological states comprise heartbeat, blood pressure and the like.
6. An intelligent robot dialogue intention recognition method is characterized by comprising the following steps:
acquiring voice content, identifying and converting the voice content into text content, analyzing the text content through a linguistic neural network algorithm, and generating semantic prediction results, wherein the number of the semantic prediction results is multiple, and the semantic prediction results are used for representing information content expressed by the voice content;
acquiring physiological information of a subject, and performing physiological feature analysis on the subject through an emotional neural network algorithm to generate an emotion analysis result, wherein the emotion analysis result represents emotion changes of the subject expressed through facial micro-expressions and physiological reactions, and is used for assisting judgment of the semantic prediction result;
performing emotion feature analysis on the voice content according to a voice emotion database to generate object voice emotion, wherein the object voice emotion represents emotion change expressed by voice features of an object, and the object voice emotion is used for assisting judgment of the semantic prediction result;
and judging the auxiliary content of the semantic prediction result according to the voice emotion of the object and the emotion analysis result to generate a voice recognition result, wherein the voice recognition result is used for representing the information content expressed by the voice content.
7. The intelligent robot dialogue intention recognition method according to claim 6, wherein the step of performing emotion feature analysis on the voice content according to the voice emotion database comprises:
resampling the voice content according to a preset sampling frequency to generate a voice sampling result, wherein the sampling frequency is determined by the processing efficiency requirement on the voice content and the accuracy requirement on emotional feature analysis, the preset sampling frequency is not greater than the inherent sampling frequency of the voice content, and the signal type of the voice sampling result is a discrete time signal;
carrying out hierarchical quantization on the voice sampling result, carrying out binary coding, and generating a voice digital signal, wherein the hierarchical quantization is used for converting a discrete time signal into a digital signal;
establishing a spectrogram according to the voice digital signal, and performing feature selection on the voice digital signal according to the spectrogram to generate voice frequency features;
and comparing and analyzing the voice frequency characteristics through a preset voice emotion database to generate an emotion analysis result.
8. The intelligent robot dialogue intention recognition method of claim 7, wherein the voice emotion database comprises a standard emotion database and a personalized emotion database;
the standard emotion database is used for storing preset voice emotion data, and the preset voice emotion data comprises voice frequency features and emotion analysis results corresponding to the voice frequency features;
the personalized emotion database is used for storing user voice emotion data, and the user voice emotion data are used for representing the correspondence between the voice frequency characteristics of the user and the emotion analysis result.
CN202211652800.2A (filed 2022-12-22, priority date 2022-12-22) - Intelligent robot conversation intention recognition method and system - Pending - published as CN115641837A

Priority Applications (1)

Application Number: CN202211652800.2A - Priority date: 2022-12-22 - Filing date: 2022-12-22 - Title: Intelligent robot conversation intention recognition method and system

Publications (1)

Publication number: CN115641837A - Publication date: 2023-01-24

Family

ID=84949386

Family Applications (1)

Application: CN202211652800.2A (Pending; published as CN115641837A) - Priority date: 2022-12-22 - Filing date: 2022-12-22 - Title: Intelligent robot conversation intention recognition method and system

Country Status (1)

Country Link
CN (1) CN115641837A (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104036776A (en) * 2014-05-22 2014-09-10 毛峡 Speech emotion identification method applied to mobile terminal
US20160322065A1 (en) * 2015-05-01 2016-11-03 Smartmedical Corp. Personalized instant mood identification method and system
CN107295201A (en) * 2017-08-09 2017-10-24 袁建峰 A kind of intelligent lexical analysis system
EP3291224A1 (en) * 2016-08-30 2018-03-07 Beijing Baidu Netcom Science and Technology Co., Ltd Method and apparatus for inputting information
CN111627444A (en) * 2020-05-22 2020-09-04 江洪华 Chat system based on artificial intelligence
WO2021073646A1 (en) * 2019-10-18 2021-04-22 四川大学华西医院 Method for evaluating emotional characteristics based on language guidance and heart rate response
CN114242045A (en) * 2021-12-20 2022-03-25 山东科技大学 Deep learning method for natural language dialogue system intention
CN114490947A (en) * 2022-02-16 2022-05-13 平安国际智慧城市科技股份有限公司 Dialog service method, device, server and medium based on artificial intelligence
CN114724222A (en) * 2022-04-14 2022-07-08 浙江康旭科技有限公司 AI digital human emotion analysis method based on multiple modes
CN115440196A (en) * 2022-09-23 2022-12-06 深圳通联金融网络科技服务有限公司 Voice recognition method, device, medium and equipment based on user facial expression

Similar Documents

Publication Publication Date Title
CN111028827B (en) Interaction processing method, device, equipment and storage medium based on emotion recognition
US10008209B1 (en) Computer-implemented systems and methods for speaker recognition using a neural network
JP6731326B2 (en) Voice interaction device and voice interaction method
CN112289323B (en) Voice data processing method and device, computer equipment and storage medium
CN112037799B (en) Voice interrupt processing method and device, computer equipment and storage medium
CN110930989B (en) Speech intention recognition method and device, computer equipment and storage medium
CN111145782B (en) Overlapped speech recognition method, device, computer equipment and storage medium
KR20080086791A (en) Feeling recognition system based on voice
CN113192516A (en) Voice role segmentation method and device, computer equipment and storage medium
US6230129B1 (en) Segment-based similarity method for low complexity speech recognizer
CN114120978A (en) Emotion recognition model training and voice interaction method, device, equipment and medium
US20220108698A1 (en) System and Method for Producing Metadata of an Audio Signal
CN110689881A (en) Speech recognition method, speech recognition device, computer equipment and storage medium
CN110265008A (en) Intelligence pays a return visit method, apparatus, computer equipment and storage medium
CN111209380A (en) Control method and device for conversation robot, computer device and storage medium
CN111223476A (en) Method and device for extracting voice feature vector, computer equipment and storage medium
CN114678014A (en) Intention recognition method, device, computer equipment and computer readable storage medium
CN113571096B (en) Speech emotion classification model training method and device, computer equipment and medium
CN112951215B (en) Voice intelligent customer service answering method and device and computer equipment
US5704004A (en) Apparatus and method for normalizing and categorizing linear prediction code vectors using Bayesian categorization technique
CN110931002B (en) Man-machine interaction method, device, computer equipment and storage medium
CN115641837A (en) Intelligent robot conversation intention recognition method and system
CN115240684A (en) Role recognition method and system for double-person conversation voice information
Dogaru et al. Compact isolated speech recognition on raspberry-pi based on reaction diffusion transform
WO2006003542A1 (en) Interactive dialogue system

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
RJ01: Rejection of invention patent application after publication (application publication date: 2023-01-24)