CN112434139A - Information interaction method and device, electronic equipment and storage medium


Info

Publication number
CN112434139A
Authority
CN
China
Prior art keywords
voice information
information
emotion
user
preset
Prior art date
Legal status
Pending
Application number
CN202011147857.8A
Other languages
Chinese (zh)
Inventor
廖加威
钟鹏飞
车炜春
任晓华
黄晓琳
赵慧斌
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202011147857.8A
Publication of CN112434139A

Classifications

    • G06F16/3329 Natural language query formulation or dialogue systems
    • G06F16/3343 Query execution using phonetics
    • G06F16/3344 Query execution using natural language analysis
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06N3/045 Neural networks; combinations of networks
    • G06N3/08 Neural networks; learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Molecular Biology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The application discloses an information interaction method and apparatus, an electronic device, and a storage medium, relating to artificial intelligence technologies such as speech recognition, computer vision, and deep learning. The specific implementation scheme is as follows: voice information input by a user is acquired; the emotion category of the user is recognized according to the voice information; and, in response to the input voice information, corresponding output voice information is broadcast and a preset interface expression corresponding to the emotion category is displayed. This solves the technical problem that the human-computer interaction mode is mechanical and does not match the persona of an intelligent agent: by recognizing the emotion category corresponding to the voice information and displaying the interface expression corresponding to that category while the output voice information is broadcast, different emotions correspond to different interface expressions, the natural feel of human-to-human conversation is reproduced, the persona of the intelligent agent is better matched, and the conversation experience is more engaging.

Description

Information interaction method and device, electronic equipment and storage medium
Technical Field
The application relates to artificial intelligence technologies such as speech recognition, computer vision, and deep learning, and in particular to an information interaction method, an information interaction apparatus, an electronic device, and a storage medium.
Background
Artificial intelligence is the discipline that studies how to make computers simulate certain human thought processes and intelligent behaviors (such as learning, reasoning, thinking, and planning), and it involves both hardware-level and software-level technologies. Artificial intelligence hardware technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, and big data processing; artificial intelligence software technologies mainly include computer vision, speech recognition, natural language processing, machine learning/deep learning, big data processing, and knowledge graph technologies.
With the rapid development of artificial intelligence technology, voice-based human-computer interaction devices are increasingly favored by users; for example, smart speakers, smart televisions, and in-vehicle voice systems have become human-computer interaction devices frequently used in daily life.
In the related art, a robot's conversational interaction lacks perception of the user's emotional state: when the user inputs a voice command, the robot usually replies to it directly. This interaction is mechanical, conveys no emotion, and does not match the persona of an intelligent agent.
Disclosure of Invention
To solve the above technical problem, the application provides an information interaction method, an information interaction apparatus, an electronic device, and a storage medium.
According to a first aspect, an information interaction method is provided, which includes:
acquiring voice information input by a user;
recognizing the emotion category of the user according to the voice information;
and responding to the input voice information by broadcasting corresponding output voice information and displaying a preset interface expression corresponding to the emotion category.
According to a second aspect, there is provided an information interaction apparatus, comprising:
the acquisition module is used for acquiring voice information input by a user;
the recognition module is used for recognizing the emotion category of the user according to the voice information;
and the processing module is used for responding to the input voice information by broadcasting corresponding output voice information and displaying a preset interface expression corresponding to the emotion category.
According to a third aspect, there is provided an electronic device comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to execute the information interaction method described in the above embodiments.
According to a fourth aspect, a non-transitory computer-readable storage medium storing computer instructions is provided, the computer instructions being configured to cause a computer to execute the information interaction method described in the above embodiments.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. Wherein:
FIG. 1 is a flow diagram of an information interaction method according to one embodiment of the present application;
FIG. 2 is a flow diagram of a method of information interaction according to one embodiment of the present application;
FIG. 3 is a flowchart of an information interaction method according to an embodiment of the present application;
FIG. 4 is a scene schematic diagram of an information interaction method according to an embodiment of the present application;
FIG. 5 is a schematic structural diagram of an information interaction device according to an embodiment of the present application;
FIG. 6 is a schematic structural diagram of an information interaction device according to an embodiment of the present application;
FIG. 7 is a block diagram of an electronic device for implementing a method of information interaction according to an embodiment of the present application.
Detailed Description
The following description of exemplary embodiments of the present application, taken in conjunction with the accompanying drawings, includes various details of the embodiments to aid understanding, and these details are to be considered merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present application. Likewise, descriptions of well-known functions and constructions are omitted from the following description for clarity and conciseness.
An information interaction method, an apparatus, an electronic device, and a storage medium according to embodiments of the present application are described below with reference to the accompanying drawings.
To solve the prior-art technical problem that the human-computer interaction mode is mechanical and does not match the persona of an intelligent agent, the application proposes recognizing the emotion category corresponding to the input voice information and displaying the interface expression corresponding to that emotion category while the output voice information is broadcast. Different emotions thus correspond to different interface expressions, the natural feel of human-to-human conversation is reproduced, the persona of the intelligent agent is better matched, and the conversation experience is more engaging.
Specifically, fig. 1 is a flowchart of an information interaction method according to an embodiment of the present application, and as shown in fig. 1, the information interaction method is used in an electronic device, where the electronic device may be any device with computing capability, for example, a mobile phone, a tablet computer, a personal digital assistant, a wearable device, an in-vehicle device, a smart speaker, a smart television, and other hardware devices with various operating systems, touch screens, and/or display screens. The method comprises the following steps:
step 101, acquiring voice information input by a user.
Step 102, recognizing the emotion category of the user according to the voice information.
In the embodiment of the application, human-computer interaction devices such as smart speakers and smart televisions collect the voice information input by the user through sound collection devices such as one or more microphones or microphone arrays.
In the embodiment of the present application, after the voice information input by the user is obtained, there are various ways of recognizing the emotion category of the user according to the voice information, which may be selected according to the specific application scenario; examples are as follows:
the first example is that voice information is converted to obtain corresponding text information, word segmentation is performed on the text information to generate a plurality of participles, the participles are matched with preset candidate keywords to obtain target keywords which are successfully matched with one or more participles, preset emotion classification information corresponding to the target keywords is inquired, and emotion categories of users are identified.
In a second example, the voice information is converted into corresponding text information, the text information is processed by a semantic understanding algorithm that performs syntactic and semantic analysis, and the emotion category of the user is identified from the resulting syntactic and semantic information.
In a third example, an emotion classification model is built in advance from different text and voice samples using a neural network (such as a convolutional neural network), a classifier, or the like; the voice information is input into this pre-built emotion classification model to obtain the emotion category of the user corresponding to the voice information.
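By way of illustration of the third example only, the following sketch trains a small classifier on toy data; the application does not prescribe a particular library, feature layout, or label set, so the scikit-learn usage, the two-dimensional features, and the label names below are assumptions.

    import numpy as np
    from sklearn.neural_network import MLPClassifier

    EMOTIONS = ["positive", "neutral", "negative"]  # hypothetical label set

    # Toy training data: each row is a feature vector extracted from a voice
    # sample (e.g. energy and pitch statistics); labels index into EMOTIONS.
    X_train = np.array([[0.9, 0.8], [0.5, 0.5], [0.1, 0.2], [0.8, 0.7], [0.2, 0.1]])
    y_train = np.array([0, 1, 2, 0, 2])

    model = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
    model.fit(X_train, y_train)

    def recognize_emotion(features):
        """Return the emotion label predicted for one feature vector."""
        return EMOTIONS[int(model.predict([features])[0])]

    print(recognize_emotion([0.85, 0.75]))  # likely "positive" on this toy data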
Step 103, responding to the input voice information by broadcasting corresponding output voice information and displaying a preset interface expression corresponding to the emotion category.
In the embodiment of the application, the voice information input by the user is processed to generate output voice information, and the output voice information is broadcast. For example, if the user says "What is the weather today?", the corresponding output voice information "Today Beijing is sunny, the temperature is 16-20 degrees, with a northwest wind" is broadcast.
In the embodiment of the present application, there are many ways to process the voice information input by the user and generate the output voice information, which may be selected according to the specific application scenario; examples are as follows:
in a first example, a voice reply model is pre-established based on a neural network (such as a convolutional neural network) and an input voice sample, and output voice information is generated by directly processing voice information input by a user through the pre-established voice reply model.
In a second example, voice information input by a user is converted into text information, the text information is matched in a preset knowledge base to obtain output text information, and the output text information is converted into output voice information.
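A minimal sketch of the second example follows; the knowledge-base contents, the exact-match lookup, and the fallback reply are illustrative assumptions, and a real system would wrap an ASR front end and a TTS back end around this text-to-text step.

    # Illustrative preset knowledge base mapping recognized queries to reply text.
    KNOWLEDGE_BASE = {
        "what is the weather today": "Today Beijing is sunny, 16-20 degrees, with a northwest wind.",
        "play some music": "Sure, playing your favorite playlist.",
    }

    def generate_reply_text(query_text: str) -> str:
        """Match the recognized text against the preset knowledge base."""
        key = query_text.strip().lower().rstrip("?").strip()
        return KNOWLEDGE_BASE.get(key, "Sorry, I did not catch that.")

    print(generate_reply_text("What is the weather today?"))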
In the embodiment of the application, one or more interface expressions corresponding to the emotion category of the user identified from the voice information may be selected and output according to the specific scene, or may be selected in combination with the voice duration, voice characteristics, user preferences, and the like, as described below.
As one scenario example, when the emotion category corresponds to a single interface expression, it is sufficient to broadcast the corresponding output voice information and display that preset interface expression.
As another scenario example, when the emotion category includes a plurality of candidate interface expressions, the broadcast duration of the output voice information is acquired; one or more target interface expressions to be displayed, and the display duration corresponding to each target interface expression, are determined according to the preset display probability corresponding to each candidate interface expression and the broadcast duration; and, in the process of broadcasting the corresponding output voice information, the one or more target interface expressions are switched and displayed according to the display duration corresponding to each target interface expression.
As yet another scenario example, when the emotion category includes a plurality of candidate interface expressions, audio features of the voice information are extracted; the energy of the voice information is acquired from the audio features; the energy of the voice information is input into a pre-trained deep learning model to acquire the occurrence probability corresponding to each candidate interface expression; and the candidate interface expression with the maximum occurrence probability is displayed in the process of broadcasting the corresponding output voice information.
In summary, the information interaction method of the embodiment of the application acquires voice information input by a user, recognizes the emotion category of the user according to the voice information, and, in response to the input voice information, broadcasts corresponding output voice information and displays a preset interface expression corresponding to the emotion category. This solves the technical problem that the human-computer interaction mode is mechanical and does not match the persona of an intelligent agent: by recognizing the emotion category corresponding to the voice information and displaying the interface expression corresponding to that category while the output voice information is broadcast, different emotions correspond to different interface expressions, the natural feel of human-to-human conversation is reproduced, the persona of the intelligent agent is better matched, and the conversation experience is more engaging.
Based on the description of the above embodiment, it can be understood that, when the emotion category includes a plurality of candidate interface expressions, the displayed interface expression may be determined based on the broadcast duration of the output voice information, the energy of the voice information, and the like, as described in detail below with reference to fig. 2 and fig. 3.
Fig. 2 is a flowchart of an information interaction method according to an embodiment of the present application, and as shown in fig. 2, the method includes:
Step 201, acquiring voice information input by a user, converting the voice information to obtain corresponding text information, and performing word segmentation processing on the text information to generate a plurality of word segments.
In the embodiment of the application, human-computer interaction devices such as smart speakers and smart televisions collect the voice information input by the user through sound collection devices such as one or more microphones or microphone arrays.
In the embodiment of the application, after the voice information input by the user is acquired, the voice information is converted to acquire the corresponding text information, and the text information is subjected to word segmentation to generate a plurality of word segments.
In the embodiment of the present application, there are many ways to convert the voice information to obtain corresponding text information, and the setting may be selected according to a specific application scenario, for example, as follows:
in a first example, a voice text converter performs conversion processing on voice information to obtain corresponding text information.
As a second example, the corresponding text information may be obtained by converting the voice information through a speech recognition (speech-to-text) engine.
In the embodiment of the present application, there are many ways to generate multiple word segments by performing word segmentation processing on text information, and the setting may be selected according to a specific application scenario, for example, as follows:
in a first example, a word segmentation method based on character string matching processes text information, matches the text information with words in a machine dictionary, and if a certain character string is found in the machine dictionary, a word segmentation is recognized, so as to generate a plurality of words.
In a second example, word segmentation is performed on the text information based on a word segmentation model, which may be an existing word segmentation model.
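The dictionary-matching approach of the first example can be sketched as forward maximum matching; the strategy and the tiny dictionary below are illustrative assumptions, and the second example would instead call a trained word segmentation model.

    # Forward maximum matching: scan left to right and, at each position, take the
    # longest dictionary entry that matches; unmatched single characters become
    # their own segments. The tiny dictionary is an illustrative placeholder.
    MACHINE_DICTIONARY = {"今天", "天气", "怎么样"}
    MAX_WORD_LEN = max(len(word) for word in MACHINE_DICTIONARY)

    def segment(text: str):
        segments, i = [], 0
        while i < len(text):
            for length in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
                candidate = text[i:i + length]
                if length == 1 or candidate in MACHINE_DICTIONARY:
                    segments.append(candidate)
                    i += length
                    break
        return segments

    print(segment("今天天气怎么样"))  # ['今天', '天气', '怎么样']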
Step 202, matching the multiple word segments with preset candidate keywords, obtaining a target keyword successfully matched with one or more of the word segments, querying preset emotion classification information corresponding to the target keyword, and identifying the emotion category of the user.
In the embodiment of the application, mapping relationships between a plurality of keywords and emotion categories are established in advance; for example, keywords A, B, and C correspond to the positive emotion category, and keywords D, E, and F correspond to the negative emotion category, so that keywords A, B, C, D, E, and F can be used as the preset candidate keywords.
In the embodiment of the present application, matching the multiple word segments with the preset candidate keywords may be understood as matching each word segment against each keyword among the preset candidate keywords, where the matching may be lexical, semantic, and the like.
In the embodiment of the present application, after a target keyword successfully matched with one or more word segments is obtained, preset emotion classification information corresponding to the target keyword is queried; there are various ways of identifying the emotion category of the user, which can be selected according to the application scenario, for example as follows:
as a possible implementation manner, under the condition that the target keyword belongs to the preset first emotion classification information through inquiry, the user is identified as a forward emotion category; under the condition that the target keyword belongs to preset second emotion classification information, identifying the user as a negative emotion category; and under the condition that the target keyword does not belong to the preset first emotion classification information and second emotion classification information, identifying the user as a neutral emotion classification.
Continuing with the above example: if the target keyword is A and the query shows that A belongs to the preset positive emotion classification information, the user is identified as having a positive emotion category; if the target keyword is D and D belongs to the preset negative emotion classification information, the user is identified as having a negative emotion category; and if the target keyword is H and the query shows that H belongs to neither the preset positive nor the preset negative emotion classification information, the user is identified as having a neutral emotion category.
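The keyword lookup above can be sketched as follows; the keyword sets merely stand in for the preset first and second emotion classification information and are purely illustrative.

    # Illustrative stand-ins for the preset emotion classification information.
    FIRST_EMOTION_KEYWORDS = {"A", "B", "C"}   # positive
    SECOND_EMOTION_KEYWORDS = {"D", "E", "F"}  # negative

    def classify_emotion(target_keywords):
        """Map matched target keywords to positive / negative / neutral."""
        for keyword in target_keywords:
            if keyword in FIRST_EMOTION_KEYWORDS:
                return "positive"
            if keyword in SECOND_EMOTION_KEYWORDS:
                return "negative"
        return "neutral"

    print(classify_emotion(["A"]))  # positive
    print(classify_emotion(["D"]))  # negative
    print(classify_emotion(["H"]))  # neutral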
Step 203, in response to the input voice information, and when the emotion category includes a plurality of candidate interface expressions, acquiring the broadcast duration of the output voice information.
Step 204, determining one or more target interface expressions to be displayed, and the display duration corresponding to each target interface expression, according to the preset display probability corresponding to each candidate interface expression and the broadcast duration.
Step 205, in the process of broadcasting the corresponding output voice information, switching and displaying the one or more target interface expressions according to the display duration corresponding to each target interface expression.
In this embodiment of the application, one emotion category may include a plurality of candidate interface expressions; when it does, the broadcast duration of the output voice information is acquired, for example 10 seconds, 20 seconds, and so on.
In the embodiment of the application, the display probability and the display duration corresponding to each candidate interface expression can be preset; for example, candidate interface expression 1 corresponds to display probability x1 and display duration y1, candidate interface expression 2 corresponds to display probability x2 and display duration y2, and candidate interface expression 3 corresponds to display probability x3 and display duration y3.
Thus, when the broadcast duration of the output voice information is determined to be 10 seconds, for example, target interface expression 1 is matched, with a corresponding display probability x1 of ninety percent and a display duration of 10 seconds; when the broadcast duration of the output voice information is determined to be 20 seconds, target interface expression 1 and target interface expression 2 are matched, where target interface expression 1 corresponds to a display probability x1 of ninety percent and a display duration of 10 seconds, and target interface expression 2 corresponds to a display probability x2 of eighty-five percent and a display duration of 10 seconds.
In the embodiment of the application, the broadcast duration of the output voice information is matched against the preset display probability and display duration corresponding to each candidate interface expression; after one or more interface expressions matching the broadcast duration are obtained, a single target interface expression, or a combination of several target interface expressions, is selected for display according to the display probability and display duration corresponding to each target interface expression; finally, in the process of broadcasting the corresponding output voice information, the one or more target interface expressions are switched and displayed according to the display duration corresponding to each target interface expression.
In the embodiment of the application, it can be understood that the one or more target interface expressions may be switched and displayed either sequentially or in random order according to the display duration corresponding to each target interface expression, which further improves the flexibility and interest of displaying interface expressions.
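The selection and switching described in steps 203 to 205 can be sketched as follows; the candidate list, the probability-weighted greedy fill, and the switching loop show one possible implementation under stated assumptions, not the only matching rule covered by the application.

    import random

    # Illustrative candidates for one emotion category:
    # (expression name, preset display probability, preset display duration in seconds).
    CANDIDATES = [("smile", 0.90, 10), ("wink", 0.85, 10), ("laugh", 0.60, 5)]

    def pick_target_expressions(broadcast_seconds: float):
        """Fill the broadcast duration with expressions, preferring higher display probability."""
        chosen, remaining = [], broadcast_seconds
        for name, probability, duration in sorted(CANDIDATES, key=lambda c: -c[1]):
            if remaining <= 0:
                break
            if random.random() <= probability:  # sample by preset display probability
                shown = min(duration, remaining)
                chosen.append((name, shown))
                remaining -= shown
        return chosen

    def display_during_broadcast(broadcast_seconds: float):
        for expression, seconds in pick_target_expressions(broadcast_seconds):
            # A real system would keep this expression on screen for `seconds`
            # while the output voice continues to play.
            print(f"showing '{expression}' for {seconds} s")

    display_during_broadcast(20)  # e.g. smile for 10 s, then wink for 10 s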
It should be noted that, to further meet usage requirements, the user may select interface expressions according to the actual application scenario, preset and store the display probabilities and display durations corresponding to these interface expressions, and establish associations between the interface expressions and the corresponding emotion categories; for example, an animal expression that the user likes may be used as the interface expression corresponding to the positive emotion category, so as to meet personalized requirements while keeping the interaction engaging.
To sum up, in the information interaction method of the embodiment of the application, voice information input by a user is acquired and converted into corresponding text information; word segmentation is performed on the text information to generate a plurality of word segments; the word segments are matched with preset candidate keywords to obtain a target keyword successfully matched with one or more of the word segments; preset emotion classification information corresponding to the target keyword is queried to identify the emotion category of the user; in response to the input voice information, when the emotion category includes a plurality of candidate interface expressions, the broadcast duration of the output voice information is acquired; one or more target interface expressions to be displayed, and the display duration corresponding to each, are determined according to the preset display probability corresponding to each candidate interface expression and the broadcast duration; and, in the process of broadcasting the corresponding output voice information, the one or more target interface expressions are switched and displayed according to their display durations. Therefore, by recognizing the emotion category corresponding to the voice information and displaying the corresponding interface expression while the output voice information is broadcast, different emotions correspond to different interface expressions, and the same emotion can even be presented with different interface expressions, making the human-computer conversation experience more vivid and engaging.
Fig. 3 is a flowchart of an information interaction method according to an embodiment of the present application; as shown in fig. 3, the method includes:
step 301, acquiring voice information input by a user.
Step 302, recognizing the emotion category of the user according to the voice information.
In the embodiment of the application, human-computer interaction devices such as smart speakers and smart televisions collect the voice information input by the user through sound collection devices such as one or more microphones or microphone arrays.
In the embodiment of the present application, after the voice information input by the user is obtained, there are various ways of recognizing the emotion category of the user according to the voice information, which may be selected according to the specific application scenario; examples are as follows:
the first example is that voice information is converted to obtain corresponding text information, word segmentation is performed on the text information to generate a plurality of participles, the participles are matched with preset candidate keywords to obtain target keywords which are successfully matched with one or more participles, preset emotion classification information corresponding to the target keywords is inquired, and emotion categories of users are identified.
In a second example, the voice information is converted into corresponding text information, the text information is processed by a semantic understanding algorithm that performs syntactic and semantic analysis, and the emotion category of the user is identified from the resulting syntactic and semantic information.
In a third example, an emotion classification model is built in advance from different text and voice samples using a neural network (such as a convolutional neural network), a classifier, or the like; the voice information is input into this pre-built emotion classification model to obtain the emotion category of the user corresponding to the voice information.
Step 303, when the emotion category includes a plurality of candidate interface expressions, extracting audio features of the voice information and acquiring the energy of the voice information according to the audio features.
Step 304, inputting the energy of the voice information into a pre-trained deep learning model, and acquiring the occurrence probability corresponding to each candidate interface expression.
Step 305, displaying the candidate interface expression with the maximum occurrence probability in the process of broadcasting the corresponding output voice information.
In the embodiment of the application, when the emotion category includes a plurality of candidate interface expressions, audio features of the voice information, such as duration-related features, fundamental-frequency-related features, and energy-related features, are extracted through a speech feature extraction model or algorithm; the energy of the voice information, namely positive energy, medium energy, or negative energy, is then obtained from these audio features; and the energy of the voice information is input into a pre-trained deep learning model to obtain the occurrence probability corresponding to each candidate interface expression.
In the embodiment of the application, it can be understood that the deep learning model is generated by training a neural network in advance on energy samples of voice information and the corresponding occurrence-probability samples for each candidate interface expression, so that inputting the energy of the voice information into the pre-trained deep learning model yields the occurrence probability corresponding to each candidate interface expression.
Finally, the candidate interface expression with the maximum occurrence probability is displayed in the process of broadcasting the corresponding output voice information.
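A minimal sketch of steps 303 to 305 follows; the frame-energy computation is a standard audio feature, while the trained deep learning model is replaced by a hypothetical softmax stub because the application does not specify its architecture.

    import numpy as np

    CANDIDATE_EXPRESSIONS = ["smile", "grin", "nod"]  # illustrative candidates

    def speech_energy(waveform: np.ndarray, frame_len: int = 400) -> float:
        """Average short-time energy of the waveform (one simple audio feature)."""
        frames = [waveform[i:i + frame_len] for i in range(0, len(waveform), frame_len)]
        return float(np.mean([np.sum(frame.astype(float) ** 2) for frame in frames]))

    def expression_probabilities(energy: float) -> np.ndarray:
        # Stand-in for the pre-trained deep learning model: maps the energy value to
        # one occurrence probability per candidate expression via a softmax.
        logits = np.array([energy, energy * 0.5, 1.0])
        exp = np.exp(logits - logits.max())
        return exp / exp.sum()

    waveform = np.random.default_rng(0).normal(size=16000)  # 1 s of synthetic audio
    probabilities = expression_probabilities(speech_energy(waveform))
    print(CANDIDATE_EXPRESSIONS[int(np.argmax(probabilities))])  # expression shown during broadcast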
In summary, in the information interaction method of the embodiment of the application, voice information input by a user is acquired and the emotion category of the user is recognized from it; when the emotion category includes a plurality of candidate interface expressions, audio features of the voice information are extracted, the energy of the voice information is obtained from the audio features, the energy is input into a pre-trained deep learning model to obtain the occurrence probability corresponding to each candidate interface expression, and the candidate interface expression with the maximum occurrence probability is displayed in the process of broadcasting the corresponding output voice information. Therefore, by recognizing the emotion category corresponding to the voice information and displaying the corresponding interface expression while the output voice information is broadcast, different emotions correspond to different interface expressions, the same emotion can be presented with different interface expressions, and the human-computer conversation experience is more vivid and engaging.
Based on the above description, the method mainly converts the user's voice information into text information, performs emotion recognition on the text information to recognize the user's emotion behind it, and broadcasts the corresponding output voice information, with different emotions corresponding to different interface expressions. This interaction mode restores the natural feel of conversing with another person, better matches the persona of an intelligent agent, makes the conversation experience more engaging, and stimulates the user's desire to explore.
For example, the voice information input by the user is acquired and emotion recognition is performed on it. It can be understood that, to further meet user requirements, a robot interface expression system can be designed around human emotion, for example a system containing 32 types of dynamic interface expressions; based on actual conversation situations, interface expressions capable of expressing "positive and neutral emotions", for example 9 types of positive interface expressions, are extracted from this expression system, and the display probabilities of these 9 types of interface expressions are determined according to the degree of positivity of the emotion.
Specifically, the user inputs voice information, the robot recognizes the voice information as text in real time, and the emotional characteristics contained in the dialog text are automatically detected. For example, in fig. 4, when the user's emotion is recognized as positive or neutral, a positive interface expression is displayed on the robot interface; when positive or neutral emotion continues to be recognized, the interface expression is switched according to a certain occurrence probability; and when a negative emotion is recognized, the interface displays the interface expression of the neutral emotion. The interface thus displays different interface expressions and, combined with the corresponding voice feedback, makes the human-computer conversation experience more vivid and engaging.
In order to implement the above embodiments, the present application further provides an information interaction apparatus. Fig. 5 is a schematic structural diagram of an information interaction device according to an embodiment of the present application, and as shown in fig. 5, the information interaction device includes: an acquisition module 510, a recognition module 520, and a processing module 530.
An obtaining module 510, configured to obtain voice information input by a user.
The recognition module 520 is configured to recognize the emotion category of the user according to the voice information.
The processing module 530 is configured to, in response to the input voice information, broadcast corresponding output voice information and display a preset interface expression corresponding to the emotion category.
It should be noted that the foregoing explanation of the information interaction method is also applicable to the information interaction apparatus in the embodiment of the present invention, and the implementation principle is similar, and is not repeated here.
In summary, the information interaction apparatus of the embodiment of the application acquires voice information input by a user, recognizes the emotion category of the user according to the voice information, and, in response to the input voice information, broadcasts corresponding output voice information and displays a preset interface expression corresponding to the emotion category. This solves the technical problem that the human-computer interaction mode is mechanical and does not match the persona of an intelligent agent: by recognizing the emotion category corresponding to the voice information and displaying the interface expression corresponding to that category while the output voice information is broadcast, different emotions correspond to different interface expressions, the natural feel of human-to-human conversation is reproduced, the persona of the intelligent agent is better matched, and the conversation experience is more engaging.
In one embodiment of the present application, as shown in fig. 6 and on the basis of fig. 5, the recognition module 520 includes: an obtaining unit 521, a word segmentation unit 522, a matching unit 523, and a query unit 524.
The obtaining unit 521 is configured to perform conversion processing on the voice information to obtain corresponding text information.
And a word segmentation unit 522, configured to perform word segmentation processing on the text information to generate a plurality of segmented words.
The matching unit 523 is configured to match the multiple word segments with preset candidate keywords, and obtain a target keyword successfully matched with one or more of the word segments.
The query unit 524 is configured to query preset emotion classification information corresponding to the target keyword, and identify an emotion category of the user.
In an embodiment of the present application, the query unit 524 is specifically configured to: identify the user as having a positive emotion category when the target keyword belongs to preset first emotion classification information; identify the user as having a negative emotion category when the target keyword belongs to preset second emotion classification information; and identify the user as having a neutral emotion category when the target keyword belongs to neither the preset first emotion classification information nor the preset second emotion classification information.
In an embodiment of the application, when the emotion category includes a plurality of candidate interface expressions, the processing module 530 is specifically configured to: acquire the broadcast duration of the output voice information; determine one or more target interface expressions to be displayed, and the display duration corresponding to each target interface expression, according to the preset display probability corresponding to each candidate interface expression and the broadcast duration; and, in the process of broadcasting the corresponding output voice information, switch and display the one or more target interface expressions according to the display duration corresponding to each target interface expression.
In an embodiment of the application, when the emotion category includes a plurality of candidate interface expressions, the processing module 530 is specifically configured to: extract audio features of the voice information; acquire the energy of the voice information according to the audio features; input the energy of the voice information into a pre-trained deep learning model to acquire the occurrence probability corresponding to each candidate interface expression; and display the candidate interface expression with the maximum occurrence probability in the process of broadcasting the corresponding output voice information.
It should be noted that the foregoing explanation of the information interaction method is also applicable to the information interaction apparatus in the embodiment of the present invention, and the implementation principle is similar, and is not repeated here.
In summary, the information interaction apparatus of the embodiment of the application acquires voice information input by a user, recognizes the emotion category of the user according to the voice information, and, in response to the input voice information, broadcasts corresponding output voice information and displays a preset interface expression corresponding to the emotion category. This solves the technical problem that the human-computer interaction mode is mechanical and does not match the persona of an intelligent agent: different emotions correspond to different interface expressions, the natural feel of human-to-human conversation is reproduced, the persona of the intelligent agent is better matched, and the conversation experience is more engaging. Furthermore, different interface expressions can be displayed for the same emotion, making the human-computer conversation experience more vivid and engaging.
According to an embodiment of the present application, an electronic device and a readable storage medium are also provided.
Fig. 7 is a block diagram of an electronic device according to an embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the present application that are described and/or claimed herein.
As shown in fig. 7, the electronic device includes: one or more processors 701, a memory 702, and interfaces for connecting the various components, including a high-speed interface and a low-speed interface. The various components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions for execution within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output apparatus (such as a display device coupled to the interface). In other embodiments, multiple processors and/or multiple buses may be used, along with multiple memories, as desired. Also, multiple electronic devices may be connected, with each device providing portions of the necessary operations (e.g., as a server array, a group of blade servers, or a multi-processor system). In fig. 7, one processor 701 is taken as an example.
The memory 702 is a non-transitory computer readable storage medium as provided herein. Wherein the memory stores instructions executable by at least one processor to cause the at least one processor to perform the method of information interaction provided herein. The non-transitory computer readable storage medium of the present application stores computer instructions for causing a computer to perform the method of information interaction provided herein.
Memory 702, which is a non-transitory computer-readable storage medium, may be used to store non-transitory software programs, non-transitory computer-executable programs, and modules, such as program instructions/modules (e.g., acquisition module 510, recognition module 520, and processing module 530 shown in fig. 5) corresponding to the method of information interaction in the embodiments of the present application. The processor 701 executes various functional applications of the server and data processing, i.e., a method for information interaction in the above-described method embodiments, by executing non-transitory software programs, instructions, and modules stored in the memory 702.
The memory 702 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of the electronic device for information interaction, and the like. Further, the memory 702 may include high speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory 702 may optionally include memory located remotely from the processor 701, which may be connected to an information-interacting electronic device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device of the information interaction method may further include: an input device 703 and an output device 704. The processor 701, the memory 702, the input device 703 and the output device 704 may be connected by a bus or other means, and fig. 7 illustrates an example of a connection by a bus.
The input device 703 may receive input numeric or character information and generate key signal inputs related to user settings and function controls of the electronic device with which the information is interacted, such as a touch screen, a keypad, a mouse, a track pad, a touch pad, a pointer, one or more mouse buttons, a track ball, a joystick, or other input device. The output devices 704 may include a display device, auxiliary lighting devices (e.g., LEDs), and tactile feedback devices (e.g., vibrating motors), among others. The display device may include, but is not limited to, a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) display, and a plasma display. In some implementations, the display device can be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, application specific ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device. These computer programs (also known as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The Server may be a cloud Server, which is also called a cloud computing Server or a cloud host, and is a host product in a cloud computing service system, so as to solve the defects of high management difficulty and weak service extensibility in the traditional physical host and VPS (Virtual Private Server) service.
According to the technical solution of the embodiment of the application, voice information input by a user is acquired; the emotion category of the user is recognized according to the voice information; and, in response to the input voice information, corresponding output voice information is broadcast and a preset interface expression corresponding to the emotion category is displayed. This solves the technical problem that the human-computer interaction mode is mechanical and does not match the persona of an intelligent agent: by recognizing the emotion category corresponding to the voice information and displaying the interface expression corresponding to that category while the output voice information is broadcast, different emotions correspond to different interface expressions, the natural feel of human-to-human conversation is reproduced, the persona of the intelligent agent is better matched, and the conversation experience is more engaging.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in different orders, and the present invention is not limited thereto as long as the desired results of the technical solutions disclosed in the present application can be achieved.
The above-described embodiments should not be construed as limiting the scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (12)

1. An information interaction method, comprising:
acquiring voice information input by a user;
recognizing the emotion category of the user according to the voice information;
and responding to the input voice information by broadcasting corresponding output voice information and displaying a preset interface expression corresponding to the emotion category.
2. The method of claim 1, wherein the recognizing the emotion category of the user according to the voice information comprises:
converting the voice information to obtain corresponding text information;
performing word segmentation processing on the text information to generate a plurality of word segments;
matching the multiple word segments with preset candidate keywords to obtain a target keyword successfully matched with one or more of the word segments;
and querying preset emotion classification information corresponding to the target keyword to identify the emotion category of the user.
3. The method of claim 2, wherein the querying preset emotion classification information corresponding to the target keyword to identify the emotion category of the user comprises:
under the condition that the target keyword belongs to preset first emotion classification information, identifying the user as having a positive emotion category;
under the condition that the target keyword belongs to preset second emotion classification information, identifying the user as having a negative emotion category;
and under the condition that the target keyword belongs to neither the preset first emotion classification information nor the preset second emotion classification information, identifying the user as having a neutral emotion category.
4. The method of claim 1, wherein, when the emotion category corresponds to a plurality of candidate interface expressions, the broadcasting corresponding output voice information and displaying the preset interface expression corresponding to the emotion category comprises:
acquiring a broadcast duration of the output voice information;
determining, according to a preset display probability corresponding to each candidate interface expression and the broadcast duration, one or more target interface expressions to be displayed and a display duration corresponding to each target interface expression;
and, in the process of broadcasting the corresponding output voice information, switching among and displaying the one or more target interface expressions according to the display duration corresponding to each target interface expression.
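As a non-limiting sketch of claim 4, the fragment below draws one target interface expression per fixed-length slot of the broadcast, weighted by a preset display probability; the candidate expressions, their probabilities, the slot length and the print-based display are all assumptions made for the example.

```python
# Sketch of claim 4 only; probabilities, slot length and candidates are assumptions.
import random

CANDIDATES = {"smile": 0.5, "laugh": 0.3, "nod": 0.2}   # expression -> preset display probability

def schedule_expressions(broadcast_seconds: float, slot_seconds: float = 2.0):
    """Split the broadcast duration into slots and draw one expression per slot."""
    slots = max(1, round(broadcast_seconds / slot_seconds))
    names, weights = list(CANDIDATES), list(CANDIDATES.values())
    return [(random.choices(names, weights=weights, k=1)[0], slot_seconds)
            for _ in range(slots)]                       # (target expression, display duration)

for expression, duration in schedule_expressions(broadcast_seconds=6.0):
    print(f"display '{expression}' for {duration:.1f}s while the output voice is broadcast")
```

Here the display durations are simply equal slots; other splits, such as durations proportional to the display probabilities, would also fit the described scheme.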
5. The method of claim 1, wherein, when the emotion category corresponds to a plurality of candidate interface expressions, the broadcasting corresponding output voice information and displaying the preset interface expression corresponding to the emotion category comprises:
extracting audio features of the voice information;
acquiring the energy of the voice information according to the audio features;
inputting the energy of the voice information into a pre-trained deep learning model, and acquiring the occurrence probability corresponding to each candidate interface expression;
and displaying the candidate interface expression with the maximum probability in the process of broadcasting the corresponding output voice information.
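As a non-limiting sketch of claim 5, the fragment below computes short-time energy as the audio feature and replaces the pre-trained deep learning model with a hand-written softmax over linear scores, purely so the example stays self-contained; the candidate expressions, the score weights and the synthetic samples are assumptions.

```python
# Sketch of claim 5 only; the scoring function stands in for a pre-trained model.
import math

CANDIDATES = ["smile", "laugh", "nod"]

def frame_energy(samples, frame_size: int = 160):
    """Short-time energy per frame: sum of squared samples (a common audio feature)."""
    return [sum(s * s for s in samples[i:i + frame_size])
            for i in range(0, len(samples), frame_size)]

def expression_probabilities(energy):
    """Stand-in for the pre-trained model: softmax over hand-picked linear scores."""
    mean_energy = sum(energy) / len(energy)
    scores = [mean_energy * w for w in (1.0, 0.8, 0.5)]          # one score per candidate
    exps = [math.exp(s - max(scores)) for s in scores]
    return {name: e / sum(exps) for name, e in zip(CANDIDATES, exps)}

samples = [math.sin(i / 10) for i in range(1600)]                # synthetic audio samples
probs = expression_probabilities(frame_energy(samples))
best = max(probs, key=probs.get)
print(f"display '{best}' while broadcasting the output voice information")
```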
6. An information interaction apparatus, comprising:
the acquisition module is used for acquiring voice information input by a user;
the recognition module is used for recognizing the emotion category of the user according to the voice information;
and the processing module is used for, in response to the input voice information, broadcasting corresponding output voice information and displaying a preset interface expression corresponding to the emotion category.
7. The apparatus of claim 6, wherein the identification module comprises:
the acquisition unit is used for converting the voice information to acquire corresponding text information;
the word segmentation unit is used for carrying out word segmentation processing on the text information to generate a plurality of word segments;
the matching unit is used for matching the plurality of word segments against preset candidate keywords to obtain a target keyword successfully matched by one or more of the word segments;
and the query unit is used for querying preset emotion classification information corresponding to the target keyword to identify the emotion category of the user.
8. The apparatus of claim 7, wherein the query unit is specifically configured to:
identifying the emotion category of the user as a positive emotion category in a case where the target keyword belongs to preset first emotion classification information;
identifying the emotion category of the user as a negative emotion category in a case where the target keyword belongs to preset second emotion classification information;
and identifying the emotion category of the user as a neutral emotion category in a case where the target keyword belongs to neither the preset first emotion classification information nor the preset second emotion classification information.
9. The apparatus of claim 6, wherein, when the emotion category corresponds to a plurality of candidate interface expressions,
the processing module is specifically configured to:
acquiring a broadcast duration of the output voice information;
determining, according to a preset display probability corresponding to each candidate interface expression and the broadcast duration, one or more target interface expressions to be displayed and a display duration corresponding to each target interface expression;
and, in the process of broadcasting the corresponding output voice information, switching among and displaying the one or more target interface expressions according to the display duration corresponding to each target interface expression.
10. The apparatus of claim 6, wherein, when the emotion category corresponds to a plurality of candidate interface expressions,
the processing module is specifically configured to:
extracting audio features of the voice information;
acquiring the energy of the voice information according to the audio features;
inputting the energy of the voice information into a pre-trained deep learning model, and acquiring the occurrence probability corresponding to each candidate interface expression;
and displaying the candidate interface expression with the maximum probability in the process of broadcasting the corresponding output voice information.
11. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the information interaction method of any one of claims 1-5.
12. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the information interaction method according to any one of claims 1 to 5.
CN202011147857.8A 2020-10-23 2020-10-23 Information interaction method and device, electronic equipment and storage medium Pending CN112434139A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011147857.8A CN112434139A (en) 2020-10-23 2020-10-23 Information interaction method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN112434139A (en) 2021-03-02

Family

ID=74695969

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011147857.8A Pending CN112434139A (en) 2020-10-23 2020-10-23 Information interaction method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112434139A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20160063148A (en) * 2014-11-26 2016-06-03 현대자동차주식회사 Apparatus and method of analysis of the situation for vehicle voice recognition system
CN106024014A (en) * 2016-05-24 2016-10-12 努比亚技术有限公司 Voice conversion method and device and mobile terminal
CN110446000A (en) * 2019-08-07 2019-11-12 三星电子(中国)研发中心 A kind of figural method and apparatus of generation dialogue
CN111106995A (en) * 2019-12-26 2020-05-05 腾讯科技(深圳)有限公司 Message display method, device, terminal and computer readable storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
王廷银; 林明贵; 陈达; 吴允平: "Emergency communication method for nuclear radiation monitoring based on BeiDou RDSS", 计算机系统应用 (Computer Systems & Applications), no. 12, 15 December 2019 (2019-12-15) *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112860213A (en) * 2021-03-09 2021-05-28 腾讯科技(深圳)有限公司 Audio processing method, storage medium and electronic equipment
CN113053388A (en) * 2021-03-09 2021-06-29 北京百度网讯科技有限公司 Voice interaction method, device, equipment and storage medium
CN112860213B (en) * 2021-03-09 2023-08-25 腾讯科技(深圳)有限公司 Audio processing method and device, storage medium and electronic equipment
CN113569031A (en) * 2021-07-30 2021-10-29 北京达佳互联信息技术有限公司 Information interaction method and device, electronic equipment and storage medium
CN114237395A (en) * 2021-12-14 2022-03-25 北京百度网讯科技有限公司 Information processing method, information processing device, electronic equipment and storage medium
CN114360535A (en) * 2021-12-24 2022-04-15 北京百度网讯科技有限公司 Voice conversation generation method and device, electronic equipment and storage medium
CN114360535B (en) * 2021-12-24 2023-01-31 北京百度网讯科技有限公司 Voice conversation generation method and device, electronic equipment and storage medium
WO2023246076A1 (en) * 2022-06-24 2023-12-28 上海哔哩哔哩科技有限公司 Emotion category recognition method, apparatus, storage medium and electronic device

Similar Documents

Publication Publication Date Title
CN112434139A (en) Information interaction method and device, electronic equipment and storage medium
CN108962255B (en) Emotion recognition method, emotion recognition device, server and storage medium for voice conversation
CN110991427B (en) Emotion recognition method and device for video and computer equipment
CN111625635A (en) Question-answer processing method, language model training method, device, equipment and storage medium
CN111177355B (en) Man-machine conversation interaction method and device based on search data and electronic equipment
CN112466302B (en) Voice interaction method and device, electronic equipment and storage medium
CN111666380A (en) Intelligent calling method, device, equipment and medium
CN110719525A (en) Bullet screen expression package generation method, electronic equipment and readable storage medium
CN111324727A (en) User intention recognition method, device, equipment and readable storage medium
CN111177462B (en) Video distribution timeliness determination method and device
CN111966212A (en) Multi-mode-based interaction method and device, storage medium and smart screen device
CN112509552A (en) Speech synthesis method, speech synthesis device, electronic equipment and storage medium
CN112382294B (en) Speech recognition method, device, electronic equipment and storage medium
CN111105800A (en) Voice interaction processing method, device, equipment and medium
CN111680517A (en) Method, apparatus, device and storage medium for training a model
CN112382287A (en) Voice interaction method and device, electronic equipment and storage medium
CN115410572A (en) Voice interaction method, device, terminal, storage medium and program product
CN112559715B (en) Attitude identification method, device, equipment and storage medium
CN114064943A (en) Conference management method, conference management device, storage medium and electronic equipment
CN113743267A (en) Multi-mode video emotion visualization method and device based on spiral and text
CN112382292A (en) Voice-based control method and device
CN112328776A (en) Dialog generation method and device, electronic equipment and storage medium
CN110633357A (en) Voice interaction method, device, equipment and medium
CN116303951A (en) Dialogue processing method, device, electronic equipment and storage medium
CN114490967B (en) Training method of dialogue model, dialogue method and device of dialogue robot and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination