CN111327772A - Method, device, equipment and storage medium for automatic voice response processing - Google Patents

Method, device, equipment and storage medium for automatic voice response processing

Info

Publication number
CN111327772A
Authority
CN
China
Prior art keywords
user
target
information
style information
style
Prior art date
Legal status
Granted
Application number
CN202010114987.5A
Other languages
Chinese (zh)
Other versions
CN111327772B (en)
Inventor
原俊
郭润增
黄家宇
吴志伟
张颖
耿志军
Current Assignee
Guangzhou Tencent Technology Co Ltd
Original Assignee
Guangzhou Tencent Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Guangzhou Tencent Technology Co Ltd
Priority to CN202010114987.5A
Publication of CN111327772A
Application granted
Publication of CN111327772B
Legal status: Active
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M 3/00 Automatic or semi-automatic exchanges
    • H04M 3/42 Systems providing special services or facilities to subscribers
    • H04M 3/487 Arrangements for providing information services, e.g. recorded voice services or time announcements
    • H04M 3/493 Interactive information services, e.g. directory enquiries; Arrangements therefor, e.g. interactive voice response [IVR] systems or voice portals
    • H04M 3/4936 Speech interaction details
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/332 Query formulation
    • G06F 16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/335 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/60 Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F 16/63 Querying
    • G06F 16/635 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/60 Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F 16/64 Browsing; Visualisation therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/60 Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F 16/68 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/683 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F 16/685 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using automatically derived transcript of audio data, e.g. lyrics
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/172 Classification, e.g. identification
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/24 Speech recognition using non-acoustical features
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 Speaker identification or verification techniques
    • G10L 17/22 Interactive procedures; Man-machine interfaces

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Library & Information Science (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Signal Processing (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The application discloses a method, an apparatus, a device, and a storage medium for performing automatic voice response processing, belonging to the field of internet technologies. The method comprises the following steps: acquiring image data of a user; determining attribute state information of the user based on the image data and a pre-trained user attribute state analysis model; determining target interaction style information for performing automatic voice response on the user based on the attribute state information and a pre-trained interaction style analysis model; and performing automatic voice response processing based on the target interaction style information. By acquiring the attribute state information of the user to determine the corresponding interaction style information, and then conducting the automatic voice response with the user according to the determined interaction style information, the method and device improve the flexibility of automatic voice response.

Description

Method, device, equipment and storage medium for automatic voice response processing
Technical Field
The present application relates to the field of internet technologies, and in particular, to a method, an apparatus, a device, and a storage medium for performing automatic voice response processing.
Background
With the development of artificial intelligence, more and more devices can perform voice interaction with users; for example, an intelligent robot can carry on a dialogue with a user.
In the prior art, various devices can recognize a user's speech through speech recognition technology, determine the dialogue content for the user according to a pre-trained voice dialogue model, and finally play the audio of the dialogue content through a terminal, thereby completing the voice interaction with the user.
In the process of implementing the present application, the inventors found that the prior art has at least the following problem: when a terminal performs voice interaction with users, the voice style of the played audio is fixed, and the terminal converses with all users in the same voice style, so the flexibility of automatic voice response is poor.
Disclosure of Invention
The embodiments of the present application provide a method, an apparatus, a device, and a storage medium for performing automatic voice response processing, which can increase the diversity of the voice styles of the audio played when a terminal performs voice interaction with a user. The technical solution is as follows:
in one aspect, a method for performing automatic voice response processing is provided, the method comprising:
acquiring image data of a user;
determining attribute state information of the user based on the image data and a pre-trained user attribute state analysis model;
determining target interaction style information for performing automatic voice response on the user based on the attribute state information and a pre-trained interaction style analysis model;
and performing automatic voice response processing based on the target interaction style information.
Optionally, after acquiring the image data of the user, the method further includes:
carrying out face recognition on the image data of the user;
determining an account of the user based on the image data of the user, and acquiring historical operation information of the account;
the determining target interaction style information for performing automatic voice response on the user based on the attribute state information and a pre-trained interaction style analysis model comprises:
and determining target interaction style information for performing automatic voice response on the user based on the attribute state information, the historical operation information and a pre-trained interaction style analysis model.
Optionally, the target interaction style information includes target voice style information;
the automatic voice response processing based on the target voice style information comprises the following steps:
acquiring a user voice audio;
recognizing the user voice audio to generate corresponding text;
determining target interactive text based on the text and a pre-trained dialogue model;
converting the target interactive text into target response voice audio corresponding to the target voice style information based on a speech synthesis algorithm and the adjustment parameters corresponding to the target voice style information;
and playing the target response voice audio.
Optionally, the target interaction style information further includes target background music style information;
the method further comprises the following steps:
and playing the background music corresponding to the target background music style information.
Optionally, the target interaction style information further includes target display screen style information;
the method further comprises the following steps:
and displaying the picture corresponding to the target picture style information.
Optionally, after acquiring the image data of the user, the method further includes:
determining an account of the user based on the image data of the user, and acquiring target picture style information corresponding to the account;
and displaying the picture corresponding to the target picture style information.
Optionally, before acquiring the image data of the user, the method further includes:
randomly displaying pictures corresponding to the various picture style information;
acquiring image data of the user when displaying the pictures corresponding to the various picture style information;
respectively inputting the user image data corresponding to the various picture style information into the user attribute state analysis model to obtain attribute state information corresponding to the various picture style information, wherein the attribute state information comprises expression information;
selecting target picture style information from the multiple kinds of picture style information based on expression information corresponding to the multiple kinds of picture style information;
and correspondingly storing the target picture style information and the currently logged account.
In another aspect, there is provided an apparatus for performing automatic voice response processing, the apparatus including:
an acquisition module configured to acquire image data of a user;
a first determination module configured to determine attribute state information of the user based on the image data and a pre-trained user attribute state analysis model;
a second determination module configured to determine target interaction style information for performing an automatic voice response for the user based on the attribute state information and a pre-trained interaction style analysis model;
and the processing module is configured to perform automatic voice response processing based on the target interaction style information.
Optionally, the apparatus further comprises an identification module configured to:
carrying out face recognition on the image data of the user;
determining an account of the user based on the image data of the user, and acquiring historical operation information of the account;
the second determination module configured to:
and determining target interaction style information for performing automatic voice response on the user based on the attribute state information, the historical operation information and a pre-trained interaction style analysis model.
Optionally, the target interaction style information includes target voice style information;
the processing module is configured to:
acquiring a user voice audio;
recognizing the user voice audio to generate corresponding text;
determining target interactive text based on the text and a pre-trained dialogue model;
converting the target interactive text into target response voice audio corresponding to the target voice style information based on a speech synthesis algorithm and the adjustment parameters corresponding to the target voice style information;
and playing the target response voice audio.
Optionally, the target interaction style information further includes target background music style information;
the apparatus also includes a play module configured to:
and playing the background music corresponding to the target background music style information.
Optionally, the target interaction style information further includes target display screen style information;
the apparatus also includes a display module configured to:
and displaying the picture corresponding to the target picture style information.
Optionally, the apparatus further includes a second obtaining module configured to:
determining an account of the user based on the image data of the user, and acquiring target picture style information corresponding to the account;
and displaying the picture corresponding to the target picture style information.
Optionally, the apparatus further comprises an analysis module configured to:
randomly displaying pictures corresponding to the various picture style information;
acquiring image data of the user when displaying the pictures corresponding to the various picture style information;
respectively inputting the user image data corresponding to the various picture style information into the user attribute state analysis model to obtain attribute state information corresponding to the various picture style information, wherein the attribute state information comprises expression information;
selecting target picture style information from the multiple kinds of picture style information based on expression information corresponding to the multiple kinds of picture style information;
and correspondingly storing the target picture style information and the currently logged account.
In yet another aspect, a computer device is provided, which includes a processor and a memory, where at least one instruction is stored in the memory, and the at least one instruction is loaded and executed by the processor to implement the operations performed by the method for processing automatic voice response as described above.
In yet another aspect, a computer-readable storage medium having at least one instruction stored therein is provided, the at least one instruction being loaded and executed by a processor to implement the operations performed by the method for automatic voice response processing as described above.
The technical scheme provided by the embodiment of the application has the following beneficial effects:
With the method, the robot can interact with different users in different voice styles when performing voice interaction with users, thereby improving the flexibility of automatic voice response.
Drawings
In order to illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below are obviously only some embodiments of the present application, and those skilled in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of a method for performing automatic voice response processing according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a method for performing automatic voice response processing according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a method for performing automatic voice response processing according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of an apparatus for performing automatic voice response processing according to an embodiment of the present application;
FIG. 5 is a schematic structural diagram of an apparatus provided in an embodiment of the present application;
fig. 6 is a schematic structural diagram of a server according to an embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
Artificial Intelligence (AI) is a theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use the knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision-making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
Among other things, the present application relates to the following several directions of artificial intelligence software technology:
computer Vision technology (CV) Computer Vision is a science for researching how to make a machine "see", and further refers to that a camera and a Computer are used to replace human eyes to perform machine Vision such as identification, tracking and measurement on a target, and further image processing is performed, so that the Computer processing becomes an image more suitable for human eyes to observe or transmitted to an instrument to detect. As a scientific discipline, computer vision research-related theories and techniques attempt to build artificial intelligence systems that can capture information from images or multidimensional data. Computer vision technologies generally include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technologies, virtual reality, augmented reality, synchronous positioning, map construction, and other technologies, and also include common biometric technologies such as face recognition and fingerprint recognition.
Key technologies for Speech Technology (Speech Technology) are automatic Speech recognition Technology (ASR) and Speech synthesis Technology (TTS), as well as voiceprint recognition Technology. The computer can listen, see, speak and feel, and the development direction of the future human-computer interaction is provided, wherein the voice becomes one of the best viewed human-computer interaction modes in the future.
Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies various theories and methods that enable efficient communication between humans and computers using natural language. Natural language processing is a science integrating linguistics, computer science and mathematics. Therefore, the research in this field will involve natural language, i.e. the language that people use everyday, so it is closely related to the research of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, robotic question and answer, knowledge mapping, and the like.
Machine Learning (ML) is a multi-domain cross discipline, and relates to a plurality of disciplines such as probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and the like. The special research on how a computer simulates or realizes the learning behavior of human beings so as to acquire new knowledge or skills and reorganize the existing knowledge structure to continuously improve the performance of the computer. Machine learning is the core of artificial intelligence, is the fundamental approach for computers to have intelligence, and is applied to all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and formal education learning.
The scheme provided by the embodiment of the application relates to the technologies of artificial intelligence, such as computer vision, voice technology, natural language processing, machine learning and the like, and is specifically explained by the following embodiments:
the embodiment of the application can be realized by a terminal or by the terminal and a server together. The terminal can be provided with components such as a camera, a microphone, a loudspeaker, a display screen and the like, and can be a mobile phone, a notebook computer, a desktop computer and other various intelligent devices. The terminal has a communication function and can access the Internet. When the scheme is implemented by the terminal and the server together, the server can establish network connection with the terminal to transmit data. The server may be a single server or a server group.
With the development of artificial intelligence, various robots capable of performing voice interaction with people are placed in more and more markets and banks. The robot is provided with a camera, a microphone, a loudspeaker, a display screen and other components, can identify objects, pedestrians and the like, and can communicate with a user. For example, the user may ask the location of the robot toilet in the mall, the location of each merchant, or ask a playful place in the robot mall, etc., and the robot may answer the user's question by voice after recognizing the user's question, or guide the user through the display screen. According to the method for carrying out automatic voice response processing, a proper voice style can be matched for the user according to the attribute information of the external image of the user, for example, the voice style of a cartoon image is matched for children, so that the robot interacts with the user through the proper voice style. In the embodiment of the application, the terminal is taken as an example of a robot in a market, so that the scheme is described in detail, and other situations are similar and are not described in detail.
Fig. 1 is a flowchart of a method for performing automatic voice response processing according to an embodiment of the present application. Referring to fig. 1, the embodiment includes:
step 101, acquiring image data of a user.
In practice, robots may be placed at the doorways of some stores, or robots capable of moving around and avoiding obstacles may be placed inside stores. These robots can interact with consumers in the mall to answer their various questions. Such a robot is provided with a camera, through which it can photograph the environment in front of it to obtain an image of that environment. The robot may apply human-body or face detection techniques to the image to determine whether a pedestrian (i.e., a user) is present in front of it. When a pedestrian is present, the robot can capture an image of the user through the camera, which is used to determine the attribute state information corresponding to the user.
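For illustration, the following is a minimal sketch of this detection trigger, assuming OpenCV and its bundled Haar-cascade face detector; the patent does not name a specific detection algorithm, so these choices are illustrative only:

```python
# Minimal sketch: capture an image of the user once a face is detected in
# front of the robot. OpenCV's Haar cascade is an illustrative detector;
# the patent only requires "human or face detection", not a specific model.
import cv2

detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def capture_user_image(camera_index: int = 0):
    """Return a frame containing a user, or None if nobody is in front."""
    camera = cv2.VideoCapture(camera_index)
    ok, frame = camera.read()
    camera.release()
    if not ok:
        return None
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    # A pedestrian (i.e., user) is considered present if at least one face is found.
    return frame if len(faces) > 0 else None
```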
Step 102, determining attribute state information of a user based on image data and a pre-trained user attribute state analysis model.
The attribute state information of the user may be a user portrait corresponding to the user, where the attribute state information may include the user's gender, age, skin tone, expression, hair style, clothing style, and the like.
In implementation, the robot acquires image data of the user through the camera, inputs the image data into the user attribute state analysis model, and obtains the attribute state information of the user, where the output attribute state information may be a vector composed of a plurality of numerical values. For example, if the output vector is [1, 12, 1, 5, 4, 3], the first value 1 may represent a male gender, the second value 12 may represent an age of 12 years, the third value 1 may represent a yellow skin tone, the fourth value 5 may represent a happy expression, the fifth value 4 may represent a short hair style, and the sixth value 3 may represent a cartoon clothing style.
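As a minimal sketch, such an output vector can be decoded as follows; the field order and the example code values come from the paragraph above, while the remaining value-to-label table entries are illustrative assumptions:

```python
# Decode the attribute state vector [gender, age, skin tone, expression,
# hair style, clothing style]. Only the example codes from the text
# (1 = male, 1 = yellow, 5 = happy, 4 = short, 3 = cartoon) are from the
# patent; all other table entries are assumptions.
GENDER = {1: "male", 2: "female"}
SKIN_TONE = {1: "yellow", 2: "white", 3: "black"}
EXPRESSION = {5: "happy"}
HAIR_STYLE = {4: "short"}
CLOTHING = {3: "cartoon"}

def decode_attributes(vector):
    gender, age, skin, expr, hair, clothes = vector
    return {
        "gender": GENDER.get(gender, "unknown"),
        "age": age,
        "skin_tone": SKIN_TONE.get(skin, "unknown"),
        "expression": EXPRESSION.get(expr, "unknown"),
        "hair_style": HAIR_STYLE.get(hair, "unknown"),
        "clothing_style": CLOTHING.get(clothes, "unknown"),
    }

print(decode_attributes([1, 12, 1, 5, 4, 3]))
# {'gender': 'male', 'age': 12, 'skin_tone': 'yellow', 'expression': 'happy',
#  'hair_style': 'short', 'clothing_style': 'cartoon'}
```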
Before the user attribute state analysis model is used, it may be trained as follows: first, a plurality of sample image data are obtained, and the attribute state information corresponding to the persons in the sample image data is calibrated. The sample image data are then input into the user attribute state analysis model to obtain an output result, the parameters in the model are adjusted according to the difference between the output result and the calibrated attribute state information corresponding to the sample image data, and the trained user attribute state analysis model is obtained after repeated training on a large amount of sample image data.
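A minimal PyTorch-style training sketch is given below, under the assumptions that the model is a small convolutional network and the "difference" is measured with a mean-squared-error loss; the patent fixes neither the architecture nor the loss:

```python
import torch
import torch.nn as nn

# Illustrative architecture; the patent does not specify one.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, stride=2), nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=3, stride=2), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(32, 6),  # six attribute fields, matching the example vector
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()  # the "difference" between output and calibration

def train_step(sample_images: torch.Tensor, calibrated: torch.Tensor) -> float:
    # sample_images: (N, 3, H, W); calibrated: (N, 6) attribute state vectors
    optimizer.zero_grad()
    predicted = model(sample_images.float())
    loss = loss_fn(predicted, calibrated.float())
    loss.backward()      # adjust model parameters from the difference
    optimizer.step()
    return loss.item()
```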
It should be noted that the user attribute state analysis model may be disposed in a terminal, that is, a robot, or may be disposed in a server, and when the user attribute state analysis model is disposed in the server, the terminal may send image data of the user to the server, and the user attribute state analysis model in the server determines attribute state information of the user.
And 103, determining target interaction style information for performing automatic voice response on the user based on the attribute state information and the pre-trained interaction style analysis model.
In implementation, after obtaining the attribute state information of the user, the vector corresponding to the attribute state information of the user may be input into the interaction style analysis model, so as to obtain the target interaction style information corresponding to the attribute state information. And determining an interaction style matched with the attribute state information of the current user according to the target interaction style information.
The interaction style information may include voice style information, background music style information, display picture style information, and the like.
A voice style library may be established in advance in the server, or stored in the terminal. A plurality of voice styles are stored in the voice style library, such as the voice styles of various cartoon characters, the voice styles of various celebrities, and the voice styles of various dialects, together with the setting parameters corresponding to each voice style, such as speech rate and pitch. The terminal can obtain voice audio of different styles according to the different parameters, which is not described in detail here.
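As a minimal sketch, such a voice style library can be represented as a mapping from style names to adjustment parameters; the style names and parameter values below are illustrative assumptions, since the patent names only the kinds of parameters (e.g., speech rate and pitch):

```python
# Illustrative voice style library: each style maps to the adjustment
# parameters a speech synthesizer would consume. All entries are assumptions.
VOICE_STYLE_LIBRARY = {
    "cartoon_cat":  {"speed": 1.1,  "pitch": 1.4, "tone": "lively"},
    "celebrity_a":  {"speed": 1.0,  "pitch": 1.0, "tone": "warm"},
    "dialect_yue":  {"speed": 0.95, "pitch": 1.0, "tone": "neutral"},
}

def get_adjustment_params(style_name: str) -> dict:
    """Look up the synthesis adjustment parameters for a voice style."""
    return VOICE_STYLE_LIBRARY[style_name]
```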
In addition, the target interaction style information for performing automatic voice response on the user can be determined from the historical operation information of the account corresponding to the user together with the attribute state information of the user. The corresponding processing may be as follows: performing face recognition on the image data of the user; determining the account of the user based on the image data of the user, and acquiring historical operation information of the account; and determining the target interaction style information for performing automatic voice response on the user based on the attribute state information, the historical operation information, and a pre-trained interaction style analysis model.
In implementation, the user may register an account in advance and upload his or her face image to the server. The account can record the user's consumption information in the mall, commodities of interest, and the like. As shown in fig. 2, before the user interacts with the robot, the robot may acquire image data of the user and upload it to the server; the server inputs the image data into the face recognition model to obtain the account information corresponding to the user and obtains the historical operation information of the account, such as purchase records, purchase times, and items of interest. The historical operation information of the account and the attribute state information (i.e., the user portrait) acquired by the robot are then input together into the pre-trained interaction style analysis model to obtain the target voice style for the user; the adjustment parameters corresponding to the target voice style are then obtained from the voice style library, and voice interaction with the user is performed based on the target voice style.
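A minimal sketch of this combined flow is given below; the face recognizer, history store, and style model are assumed to be available as callables, and all names are hypothetical:

```python
import numpy as np

def determine_target_style(user_image,
                           attribute_model,   # image -> user portrait vector
                           face_recognizer,   # image -> account id
                           history_store,     # account id -> history feature vector
                           style_model):      # joint features -> target style info
    portrait = attribute_model(user_image)           # attribute state information
    account = face_recognizer(user_image)            # account via face recognition
    history = history_store[account]                 # purchase records, interests...
    features = np.concatenate([portrait, history])   # joint model input
    return style_model(features)                     # target interaction style info
```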
Optionally, the target interaction style information may further include target background music style information. Background music style information indicates music of different styles, and the robot can play background music while performing voice interaction with the user. The background music styles may include cartoon, pop, classical, hip-hop, and the like. When the interaction style includes a background music style, the corresponding processing may be as follows: playing the background music corresponding to the target background music style information.
In implementation, after the robot acquires the attribute state information corresponding to the user, the attribute state information may be input into the pre-trained interaction style analysis model, and the interaction style analysis model may output a two-dimensional array, for example a 2 × 50 matrix, where the values in the first row may represent the matching degree of the user's current attribute state information with each voice style in the voice style library, and the values in the second row may represent its matching degree with each background music style.
Optionally, the target interaction style information further includes target display picture style information. The display picture style, i.e., the style of what is shown on the robot's display screen, may correspond to pictures of different styles or different videos. The display picture styles may include various cartoon styles, classical styles, and the like. When the interaction style includes a display picture style, the corresponding processing may be as follows: displaying the picture corresponding to the target picture style information.
In implementation, after the robot acquires the attribute state information corresponding to the user, the attribute state information may be input into the pre-trained interaction style analysis model, and the interaction style analysis model may output a two-dimensional array, for example a 3 × 50 matrix, where the values in the first row may represent the matching degree of the user's current attribute state information with each voice style in the voice style library, the values in the second row may represent its matching degree with each background music style, and the values in the third row may represent its matching degree with each display picture style. The target display picture style corresponding to the user's current attribute state information can then be determined according to the matching degrees of the various display picture styles.
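A minimal sketch of selecting the target styles from such a 3 × 50 matching matrix follows; the row order mirrors the paragraph above, and the style lists are placeholders:

```python
import numpy as np

def select_styles(match_matrix, voice_styles, music_styles, screen_styles):
    """match_matrix: 3 x N array; row 0 = voice styles, row 1 = background
    music styles, row 2 = display picture styles. Picks the best match per row."""
    assert match_matrix.shape[0] == 3
    return {
        "voice":  voice_styles[int(np.argmax(match_matrix[0]))],
        "music":  music_styles[int(np.argmax(match_matrix[1]))],
        "screen": screen_styles[int(np.argmax(match_matrix[2]))],
    }
```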
For example, suppose a 12-year-old boy wearing clothes with a robotic-cat pattern stands in front of the robot to interact with it. The robot can acquire image data of the boy through the camera, input the image data into the user attribute state analysis model to determine the boy's attribute state information, and then input the attribute state information into the interaction style analysis model, which outputs the target interaction style information as a 3 × 50 matrix. Based on this matrix, the voice style information, background music style information, and display picture style information for interacting with the boy are determined; for example, the robot may talk with the boy in a robotic-cat voice, play the robotic cat's theme music, and display a picture of the robotic cat on its display screen.
In another possible implementation, the server may store in advance the picture style of interest corresponding to the account, and the corresponding processing may be as follows: after the image data of the user is acquired, determining the account of the user based on the image data, and acquiring the target picture style information corresponding to the account; and displaying the picture corresponding to the target picture style information.
In implementation, the picture style information of interest for each account may be stored in the server. After the robot acquires the image data of the user, it can perform face recognition on the user to determine the corresponding account. For example, feature extraction is performed on the acquired image data to obtain feature information of the user's face; the feature information is then uploaded to the server, and the server determines the corresponding account through feature comparison, acquires the picture style information of interest corresponding to that account, and sends the corresponding picture to the robot, whose display screen then displays it.
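A minimal sketch of the server-side feature comparison is given below, assuming face features are fixed-length embedding vectors compared by cosine similarity; the patent says only "feature comparison", so the metric and threshold are assumptions:

```python
import numpy as np

def find_account(query_feature, account_features, threshold=0.6):
    """Return the account whose stored face feature best matches, or None.

    account_features: dict mapping account id -> embedding vector.
    Cosine similarity and the 0.6 threshold are illustrative assumptions.
    """
    best_account, best_score = None, threshold
    q = query_feature / np.linalg.norm(query_feature)
    for account, feature in account_features.items():
        score = float(q @ (feature / np.linalg.norm(feature)))
        if score > best_score:
            best_account, best_score = account, score
    return best_account
```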
In addition, the background music style or display picture style that may interest the user can be determined by observing the user's reaction to the background music currently played or the picture currently displayed by the robot, for example the user's expression after seeing the robot's display screen. The corresponding processing is as follows: randomly displaying pictures corresponding to multiple kinds of picture style information; acquiring image data of the user while the pictures corresponding to the various kinds of picture style information are displayed; inputting the user image data corresponding to each kind of picture style information into the user attribute state analysis model to obtain the attribute state information corresponding to each kind of picture style information, where the attribute state information includes expression information; selecting target picture style information from the multiple kinds of picture style information based on the expression information corresponding to each; and storing the target picture style information in correspondence with the currently logged-in account.
In implementation, the server may not yet store the pictures in which the user is interested, or the display picture style information corresponding to the attribute state information may not be obtainable through the interaction style analysis model. Pictures of different styles can then be randomly displayed on the robot's display screen. While a picture is displayed, image data of the user can be acquired through the camera and input into the user attribute state analysis model to obtain the user's attribute state information. The attribute state information includes expression information, that is, the state of the user's current expression, for example sad, neutral, happy, or surprised. The robot can determine from the expression information that the user is interested in the currently displayed picture, namely when the user's current expression is happy or surprised. For example, when the user is a 10-year-old child having a conversation with the robot, and an animation of the robotic cat is displayed on the robot's screen, if the robot acquires image data of the user and the analysis shows the user's expression to be surprised, a robotic-cat tag may be added to the account corresponding to the user.
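For illustration, a minimal sketch of this preference probe follows; the display, capture, and analysis steps are hypothetical callables standing in for the robot's actual components:

```python
import random

POSITIVE = {"happy", "surprised"}  # expressions treated as signs of interest

def probe_picture_style(styles, show_picture, capture_image, analyze_state):
    """Show a picture of each style in random order and keep the first style
    that draws a positive expression. All three callables are hypothetical:
    show_picture(style) displays a picture of that style, capture_image()
    grabs a camera frame, and analyze_state(image) runs the user attribute
    state analysis model and returns a dict with an "expression" key."""
    candidates = list(styles)
    random.shuffle(candidates)
    for style in candidates:
        show_picture(style)
        state = analyze_state(capture_image())
        if state["expression"] in POSITIVE:
            return style  # to be stored with the currently logged-in account
    return None
```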
In addition, the server may also store in advance the background music style of interest corresponding to the account, with the corresponding processing as follows: determining the account of the user based on the image data of the user, and acquiring the target background music style information corresponding to the account; and playing the background music corresponding to the background music style information. Similarly, the background music style that may interest the user can be determined by observing the user's reaction to the background music currently played by the robot. The corresponding processing is as follows: randomly playing music corresponding to multiple kinds of background music style information; acquiring image data of the user while the music corresponding to the various kinds of background music style information is played; inputting the user image data corresponding to each kind of background music style information into the user attribute state analysis model to obtain the attribute state information corresponding to each kind of background music style information, where the attribute state information includes expression information; selecting target background music style information from the multiple kinds of background music style information based on the expression information corresponding to each; and storing the target background music style information in correspondence with the currently logged-in account.
And step 104, performing automatic voice response processing based on the target interaction style information.
In implementation, after the target interaction style information is determined, a voice response can be made to the user according to it. The interaction style information includes voice style information, and may further include at least one of background music style information and display picture style information. When the interaction style information includes background music style information, the robot plays the background music corresponding to the target background music style information while performing voice interaction with the user. When the interaction style information includes display picture style information, the robot displays the picture corresponding to the target display picture style information on the display screen while performing voice interaction with the user.
Based on the target voice style information, the automatic voice response processing steps may be as follows: acquiring user voice audio; recognizing the user voice audio to generate corresponding text; determining target interactive text based on the text and a pre-trained dialogue model; converting the target interactive text into target response voice audio corresponding to the target voice style information based on a speech synthesis algorithm and the adjustment parameters corresponding to the target voice style information; and playing the target response voice audio.
In implementation, when the user starts to talk with the robot, the robot can acquire the voice audio of the user's speech and input it into the speech recognition model to obtain the text of what the user said. The semantics of the text are then extracted, the corresponding semantics are input into the pre-trained dialogue model, and the text for answering the user is output. For example, if the user asks for the location of the toilet in the mall, the robot can recognize the user's speech and then determine the text to say to the user based on the dialogue model, e.g., "at the southeast corner of each floor". When the robot obtains the text for the dialogue with the user, it can convert the text into voice audio according to a speech synthesis algorithm and then play the voice audio to complete the dialogue. The speech synthesis algorithm includes parameters for adjusting the voice audio, such as parameters for adjusting pitch, speech rate, and intonation. Different voice styles are stored in the voice style library in correspondence with different adjustment parameters. After the robot determines the target voice style information for interacting with the user, it can obtain the corresponding adjustment parameters from the voice style library, input them into the adjustment algorithm, and finally obtain the audio corresponding to the target voice style through the adjustment algorithm.
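A minimal sketch of this full answer path follows; the recognition, dialogue, synthesis, and playback components are hypothetical callables, and the style library is the mapping sketched earlier:

```python
def respond(user_audio, target_style, asr, dialogue_model, synthesize,
            play_audio, style_library):
    """Answer one user utterance in the target voice style.

    Hypothetical callables: asr(audio) -> text, dialogue_model(text) ->
    reply text, synthesize(text, params) -> audio, play_audio(audio);
    style_library maps a style name to its adjustment parameters."""
    user_text = asr(user_audio)                   # e.g. "Where is the toilet?"
    reply_text = dialogue_model(user_text)        # e.g. "At the southeast corner..."
    params = style_library[target_style]          # pitch / speed / tone settings
    reply_audio = synthesize(reply_text, params)  # styled response voice audio
    play_audio(reply_audio)
```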
In addition, a technician can update the voice styles in the established voice style library; the process can be as shown in fig. 3: first, corpora, i.e., voice audio of various styles, such as the voices of various cartoon characters and the dialects of various regions, are collected; then corresponding labels are added to the corpora through an analysis algorithm, the voice style of each corpus is determined to obtain the adjustment parameters corresponding to that voice style, and the adjustment parameters are added to the voice style library, thereby updating the voice styles.
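A minimal sketch of this library update step follows; the corpus analysis is stubbed out as a hypothetical callable, since the patent does not specify the analysis algorithm:

```python
def update_style_library(corpus_files, analyze_corpus, style_library):
    """analyze_corpus is a hypothetical callable that labels one audio file
    and derives its adjustment parameters (speed, pitch, tone, ...)."""
    for audio_file in corpus_files:
        label, params = analyze_corpus(audio_file)
        style_library[label] = params  # add or overwrite the style entry
    return style_library
```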
According to the method and the device, the attribute state information of the user is obtained through the attribute state analysis model, and the target interaction style information for conducting the automatic voice response with the user is determined according to the attribute state information and the interaction style analysis model, so that the robot can interact with different users in different interaction styles, thereby improving the flexibility of automatic voice response.
All the above optional technical solutions may be combined arbitrarily to form the optional embodiments of the present disclosure, and are not described herein again.
Fig. 4 is a schematic structural diagram of an apparatus for performing automatic voice response processing according to an embodiment of the present application, where the structure may be a terminal in the foregoing embodiment, as shown in fig. 4, the apparatus includes:
an acquisition module 410 configured to acquire image data of a user;
a first determination module 420 configured to determine attribute state information of the user based on the image data and a pre-trained user attribute state analysis model;
a second determining module 430 configured to determine target interaction style information for performing an automatic voice response for the user based on the attribute state information and a pre-trained interaction style analysis model;
and the processing module 440 is configured to perform automatic voice response processing based on the target interaction style information.
Optionally, the apparatus further comprises an identification module configured to:
carrying out face recognition on the image data of the user;
determining an account of the user based on the image data of the user, and acquiring historical operation information of the account;
the second determining module 430 is configured to:
and determining target interaction style information for performing automatic voice response on the user based on the attribute state information, the historical operation information and a pre-trained interaction style analysis model.
Optionally, the target interaction style information includes target voice style information;
the processing module 440 is configured to:
acquiring a user voice audio;
recognizing the user voice audio to generate corresponding text;
determining target interactive text based on the text and a pre-trained dialogue model;
converting the target interactive text into target response voice audio corresponding to the target voice style information based on a speech synthesis algorithm and the adjustment parameters corresponding to the target voice style information;
and playing the target response voice audio.
Optionally, the target interaction style information further includes target background music style information;
the apparatus also includes a play module configured to:
and playing the background music corresponding to the target background music style information.
Optionally, the target interaction style information further includes target display screen style information;
the apparatus also includes a display module configured to:
and displaying the picture corresponding to the target picture style information.
Optionally, the apparatus further includes a second obtaining module configured to:
determining an account of the user based on the image data of the user, and acquiring target picture style information corresponding to the account;
and displaying the picture corresponding to the target picture style information.
Optionally, the apparatus further comprises an analysis module configured to:
randomly displaying pictures corresponding to the various picture style information;
acquiring image data of the user when displaying the pictures corresponding to the various picture style information;
respectively inputting the user image data corresponding to the various picture style information into the user attribute state analysis model to obtain attribute state information corresponding to the various picture style information, wherein the attribute state information comprises expression information;
selecting target picture style information from the multiple kinds of picture style information based on expression information corresponding to the multiple kinds of picture style information;
and correspondingly storing the target picture style information and the currently logged account.
It should be noted that: in the apparatus for performing automatic voice response processing according to the foregoing embodiment, when performing automatic voice response processing, only the division of the functional modules is illustrated, and in practical applications, the above function distribution may be completed by different functional modules according to needs, that is, the internal structure of the device is divided into different functional modules, so as to complete all or part of the above described functions. In addition, the apparatus for performing automatic voice response processing and the method embodiment for performing automatic voice response processing provided by the above embodiments belong to the same concept, and specific implementation processes thereof are detailed in the method embodiment and are not described herein again.
Fig. 5 shows a block diagram of a device provided in an exemplary embodiment of the present application. The device may be the terminal 500, which may be a robot, a smartphone, a tablet, a laptop or desktop computer, or another form of smart device.
In general, the terminal 500 includes: a processor 501 and a memory 502.
The processor 501 may include one or more processing cores, such as a 4-core processor or an 8-core processor. The processor 501 may be implemented in at least one hardware form of a DSP (Digital Signal Processor), an FPGA (Field-Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 501 may also include a main processor and a coprocessor: the main processor is a processor for processing data in an awake state, also called a CPU (Central Processing Unit); the coprocessor is a low-power processor for processing data in a standby state. In some embodiments, the processor 501 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content to be displayed on the display screen. In some embodiments, the processor 501 may also include an AI (Artificial Intelligence) processor for processing computational operations related to machine learning.
Memory 502 may include one or more computer-readable storage media, which may be non-transitory. Memory 502 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 502 is used to store at least one instruction for execution by processor 501 to implement the method for automatic voice response processing provided by method embodiments herein.
In some embodiments, the terminal 500 may further optionally include: a peripheral interface 503 and at least one peripheral. The processor 501, memory 502 and peripheral interface 503 may be connected by a bus or signal lines. Each peripheral may be connected to the peripheral interface 503 by a bus, signal line, or circuit board. Specifically, the peripheral device includes: at least one of a touch screen display 505, a camera 506, audio circuitry 507, a positioning component 508, and a power supply 509.
The display screen 505 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display screen 505 is a touch display screen, it also has the ability to capture touch signals on or over its surface. The touch signal may be input to the processor 501 as a control signal for processing. The display screen 505 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, there may be one display screen 505, providing the front panel of the terminal 500; in other embodiments, there may be at least two display screens 505, respectively disposed on different surfaces of the terminal 500 or in a folded design; in still other embodiments, the display screen 505 may be a flexible display disposed on a curved surface or on a folded surface of the terminal 500. The display screen 505 can even be arranged as a non-rectangular irregular figure, i.e., a shaped screen. The display screen 505 may be made of materials such as an LCD (Liquid Crystal Display) or an OLED (Organic Light-Emitting Diode).
The camera assembly 506 is used to capture images or video. Optionally, camera assembly 506 includes a front camera and a rear camera. Generally, a front camera is disposed at a front panel of the terminal, and a rear camera is disposed at a rear surface of the terminal. In some embodiments, the number of the rear cameras is at least two, and each rear camera is any one of a main camera, a depth-of-field camera, a wide-angle camera and a telephoto camera, so that the main camera and the depth-of-field camera are fused to realize a background blurring function, and the main camera and the wide-angle camera are fused to realize panoramic shooting and VR (Virtual Reality) shooting functions or other fusion shooting functions.
Audio circuitry 507 may include a microphone and a speaker. The microphone is used to collect sound waves of the user and the environment, convert the sound waves into electric signals, and input them to the processor 501 for processing, or input them to the radio frequency circuit 504 to realize voice communication. For stereo sound collection or noise reduction, a plurality of microphones may be provided at different portions of the terminal 500. The microphone may also be an array microphone or an omnidirectional pickup microphone. The speaker is used to convert electric signals from the processor 501 or the radio frequency circuit 504 into sound waves. The speaker can be a traditional diaphragm speaker or a piezoelectric ceramic speaker. When the speaker is a piezoelectric ceramic speaker, it can be used for purposes such as converting an electric signal into sound waves audible to humans, or converting an electric signal into sound waves inaudible to humans to measure distance.
The positioning component 508 is used to locate the current geographic position of the terminal 500 for navigation or LBS (Location Based Service). The positioning component 508 may be a positioning component based on the GPS (Global Positioning System) of the United States, the BeiDou system of China, the GLONASS system of Russia, or the Galileo system of the European Union.
Power supply 509 is used to power the various components in terminal 500. The power source 509 may be alternating current, direct current, disposable or rechargeable. When power supply 509 includes a rechargeable battery, the rechargeable battery may support wired or wireless charging. The rechargeable battery may also be used to support fast charge technology.
Those skilled in the art will appreciate that the configuration shown in fig. 5 is not intended to be limiting of terminal 500 and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components may be used.
Fig. 6 is a schematic structural diagram of a server according to an embodiment of the present application. The server 600 may vary considerably in configuration and performance, and may include one or more processors (CPUs) 601 and one or more memories 602. The memory 602 stores at least one instruction, which is loaded and executed by the processor 601 to implement the methods provided by the foregoing method embodiments. Of course, the server may also have components such as a wired or wireless network interface, a keyboard, and an input/output interface for input and output, and may include other components for implementing device functions, which are not described herein again.
In an exemplary embodiment, a computer-readable storage medium is also provided, such as a memory including instructions executable by a processor in a terminal to perform the method of automatic voice response processing in the foregoing embodiments. For example, the computer-readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
It will be understood by those skilled in the art that all or part of the steps of the above embodiments may be implemented by hardware, or by a program instructing relevant hardware. The program may be stored in a computer-readable storage medium, which may be a read-only memory, a magnetic disk, an optical disk, or the like.
The above description is only exemplary of the present application and should not be taken as limiting the present application, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (10)

1. A method for automatic voice response processing, the method comprising:
acquiring image data of a user;
determining attribute state information of the user based on the image data and a pre-trained user attribute state analysis model;
determining target interaction style information for performing automatic voice response on the user based on the attribute state information and a pre-trained interaction style analysis model;
and performing automatic voice response processing based on the target interaction style information.
2. The method of claim 1, wherein after the acquiring of the image data of the user, the method further comprises:
carrying out face recognition on the image data of the user;
determining an account of the user based on the image data of the user, and acquiring historical operation information of the account;
the determining target interaction style information for performing automatic voice response on the user based on the attribute state information and a pre-trained interaction style analysis model comprises:
and determining target interaction style information for performing automatic voice response on the user based on the attribute state information, the historical operation information and a pre-trained interaction style analysis model.
3. The method of claim 1, wherein the target interaction style information comprises target speech style information;
the performing automatic voice response processing based on the target interaction style information comprises:
acquiring voice audio of the user;
recognizing the voice audio to generate corresponding text;
determining target interactive text based on the text and a pre-trained dialogue model;
converting the target interactive text into target response voice audio corresponding to the target voice style information, based on a speech synthesis algorithm and adjustment parameters corresponding to the target voice style information;
and playing the target response voice audio.
4. The method of claim 3, wherein the target interaction style information further comprises target background music style information;
the method further comprises:
and playing the background music corresponding to the target background music style information.
5. The method of claim 3, wherein the target interaction style information further comprises target picture style information;
the method further comprises:
and displaying the picture corresponding to the target picture style information.
6. The method of claim 1, wherein after the acquiring of the image data of the user, the method further comprises:
determining an account of the user based on the image data of the user, and acquiring target picture style information corresponding to the account;
and displaying the picture corresponding to the target picture style information.
7. The method of claim 1, wherein before the acquiring of the image data of the user, the method further comprises:
randomly displaying pictures corresponding to multiple kinds of picture style information;
acquiring image data of the user while the pictures corresponding to the multiple kinds of picture style information are displayed;
respectively inputting the user image data corresponding to the multiple kinds of picture style information into the user attribute state analysis model to obtain attribute state information corresponding to the multiple kinds of picture style information, wherein the attribute state information comprises expression information;
selecting target picture style information from the multiple kinds of picture style information based on the expression information corresponding to the multiple kinds of picture style information;
and correspondingly storing the target picture style information and the currently logged account.
8. An apparatus for performing automatic voice response processing, the apparatus comprising:
an acquisition module configured to acquire image data of a user;
a first determination module configured to determine attribute state information of the user based on the image data and a pre-trained user attribute state analysis model;
a second determination module configured to determine target interaction style information for performing an automatic voice response for the user based on the attribute state information and a pre-trained interaction style analysis model;
and the processing module is configured to perform automatic voice response processing based on the target interaction style information.
9. A computer device comprising a processor and a memory, the memory storing at least one instruction, which is loaded and executed by the processor to implement the operations performed in the method for automatic voice response processing according to any one of claims 1 to 7.
10. A computer-readable storage medium storing at least one instruction, which is loaded and executed by a processor to implement the operations performed in the method for automatic voice response processing according to any one of claims 1 to 7.
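
To make the claimed flow concrete for readers approaching it from an implementation angle, the following is a minimal Python sketch of one response round as recited in claims 1 and 3. Every name in it (the model wrappers and their predict() methods, asr.transcribe, dialogue_model.reply, tts.synthesize, the InteractionStyle fields) is an illustrative assumption for exposition, not an API disclosed by this application.

# Minimal sketch, assuming hypothetical pre-trained model wrappers;
# nothing here is prescribed by the specification.
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class InteractionStyle:
    voice_style: str                              # target voice style information (claim 3)
    tts_params: Dict[str, float] = field(default_factory=dict)  # adjustment parameters for synthesis
    background_music: Optional[str] = None        # target background music style (claim 4)
    picture_style: Optional[str] = None           # target picture style (claims 5-6)

def respond_once(image_data, user_audio,
                 attribute_model, style_model, dialogue_model, asr, tts):
    attributes = attribute_model.predict(image_data)  # claim 1: image data -> attribute state
    style = style_model.predict(attributes)           # claim 1: attribute state -> interaction style
    text = asr.transcribe(user_audio)                 # claim 3: voice audio -> text
    reply_text = dialogue_model.reply(text)           # claim 3: dialogue model -> target interactive text
    reply_audio = tts.synthesize(reply_text, **style.tts_params)  # claim 3: styled synthesis
    return reply_audio, style                         # caller plays audio, music, picture per claims 3-5

The variant of claim 2 would simply pass the account's historical operation information as a second argument to style_model.predict after the face recognition step.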
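
Claim 7 adds a calibration pass that runs before normal operation: pictures of each candidate style are shown, the user's expression is read back through the attribute state analysis model, and the best-received style is stored against the logged-in account. Below is a minimal sketch under the same assumptions; the positivity score and the dict-based account store are invented here purely for illustration.

def calibrate_picture_style(styles, show_picture, capture_image,
                            attribute_model, account_store, account_id):
    # Assumed convention: expression information is a dict of per-emotion
    # scores, and the style drawing the happiest expression wins.
    def positivity(expression):
        return expression.get("happy", 0.0) - expression.get("sad", 0.0)

    scores = {}
    for style in styles:
        show_picture(style)                     # display a picture of this style
        image = capture_image()                 # image data of the user while it is shown
        state = attribute_model.predict(image)  # includes expression information
        scores[style] = positivity(state["expression"])

    target = max(scores, key=scores.get)        # best-received picture style
    account_store[account_id] = target          # store against the currently logged-in account
    return target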
CN202010114987.5A 2020-02-25 2020-02-25 Method, device, equipment and storage medium for automatic voice response processing Active CN111327772B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010114987.5A CN111327772B (en) 2020-02-25 2020-02-25 Method, device, equipment and storage medium for automatic voice response processing

Publications (2)

Publication Number Publication Date
CN111327772A 2020-06-23
CN111327772B 2021-09-17

Family

ID=71171178

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010114987.5A Active CN111327772B (en) 2020-02-25 2020-02-25 Method, device, equipment and storage medium for automatic voice response processing

Country Status (1)

Country Link
CN (1) CN111327772B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112148849A (en) * 2020-09-08 2020-12-29 北京百度网讯科技有限公司 Dynamic interaction method, server, electronic device and storage medium
CN112365877A (en) * 2020-11-27 2021-02-12 北京百度网讯科技有限公司 Speech synthesis method, speech synthesis device, electronic equipment and storage medium
CN112528004A (en) * 2020-12-24 2021-03-19 北京百度网讯科技有限公司 Voice interaction method, voice interaction device, electronic equipment, medium and computer program product
CN112927033A (en) * 2021-01-27 2021-06-08 上海商汤智能科技有限公司 Data processing method and device, electronic equipment and storage medium
CN113208592A (en) * 2021-03-29 2021-08-06 济南大学 Psychological test system with multiple answering modes

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104091153A (en) * 2014-07-03 2014-10-08 苏州工业职业技术学院 Emotion judgment method applied to chatting robot
US20160162807A1 (en) * 2014-12-04 2016-06-09 Carnegie Mellon University, A Pennsylvania Non-Profit Corporation Emotion Recognition System and Method for Modulating the Behavior of Intelligent Systems
CN106663127A (en) * 2016-07-07 2017-05-10 深圳狗尾草智能科技有限公司 An interaction method and system for virtual robots and a robot
CN108187332A (en) * 2018-01-08 2018-06-22 杭州赛鲁班网络科技有限公司 A kind of intelligent body-building interaction systems based on face recognition technology
CN108363492A (en) * 2018-03-09 2018-08-03 南京阿凡达机器人科技有限公司 A kind of man-machine interaction method and interactive robot
CN108733209A (en) * 2018-03-21 2018-11-02 北京猎户星空科技有限公司 Man-machine interaction method, device, robot and storage medium
CN109727091A (en) * 2018-12-14 2019-05-07 平安科技(深圳)有限公司 Products Show method, apparatus, medium and server based on dialogue robot
CN110189754A (en) * 2019-05-29 2019-08-30 腾讯科技(深圳)有限公司 Voice interactive method, device, electronic equipment and storage medium
CN110197659A (en) * 2019-04-29 2019-09-03 华为技术有限公司 Feedback method, apparatus and system based on user's portrait
CN110265021A (en) * 2019-07-22 2019-09-20 深圳前海微众银行股份有限公司 Personalized speech exchange method, robot terminal, device and readable storage medium storing program for executing
CN110569726A (en) * 2019-08-05 2019-12-13 北京云迹科技有限公司 interaction method and system for service robot
CN110610703A (en) * 2019-07-26 2019-12-24 深圳壹账通智能科技有限公司 Speech output method, device, robot and medium based on robot recognition
US20200005781A1 (en) * 2018-06-29 2020-01-02 Beijing Baidu Netcom Science Technology Co., Ltd. Human-machine interaction processing method and apparatus thereof

Also Published As

Publication number Publication date
CN111327772B (en) 2021-09-17

Similar Documents

Publication Publication Date Title
CN111327772B (en) Method, device, equipment and storage medium for automatic voice response processing
JP7130057B2 (en) Hand Keypoint Recognition Model Training Method and Device, Hand Keypoint Recognition Method and Device, and Computer Program
WO2020233464A1 (en) Model training method and apparatus, storage medium, and device
EP3929703A1 (en) Animation image driving method based on artificial intelligence, and related device
CN110379430B (en) Animation display method and device based on voice, computer equipment and storage medium
CN112379812B (en) Simulation 3D digital human interaction method and device, electronic equipment and storage medium
EP3882860A2 (en) Method, apparatus, device, storage medium and program for animation interaction
EP4006901A1 (en) Audio signal processing method and apparatus, electronic device, and storage medium
US20230042654A1 (en) Action synchronization for target object
US20220172737A1 (en) Speech signal processing method and speech separation method
CN111063342B (en) Speech recognition method, speech recognition device, computer equipment and storage medium
CN110163054A (en) A kind of face three-dimensional image generating method and device
CN113763532B (en) Man-machine interaction method, device, equipment and medium based on three-dimensional virtual object
WO2022170848A1 (en) Human-computer interaction method, apparatus and system, electronic device and computer medium
CN111541951B (en) Video-based interactive processing method and device, terminal and readable storage medium
CN113750523A (en) Motion generation method, device, equipment and storage medium for three-dimensional virtual object
CN113705316A (en) Method, device and equipment for acquiring virtual image and storage medium
CN111091166A (en) Image processing model training method, image processing device, and storage medium
CN110322760A (en) Voice data generation method, device, terminal and storage medium
CN109343695A (en) Exchange method and system based on visual human's behavioral standard
CN113392687A (en) Video title generation method and device, computer equipment and storage medium
CN113705302A (en) Training method and device for image generation model, computer equipment and storage medium
CN113744286A (en) Virtual hair generation method and device, computer readable medium and electronic equipment
CN113205569A (en) Image drawing method and device, computer readable medium and electronic device
CN113821658A (en) Method, device and equipment for training encoder and storage medium

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code (Ref country code: HK; Ref legal event code: DE; Ref document number: 40023565)
GR01 Patent grant