CN113436602A - Virtual image voice interaction method and device, projection equipment and computer medium

Info

Publication number: CN113436602A
Application number: CN202110680196.3A
Authority: CN (China)
Legal status: Pending
Prior art keywords: information, voice, user, avatar, answering
Other languages: Chinese (zh)
Inventors: 李禹�, 曹琦, 王骁逸, 张聪, 胡震宇
Current Assignee: Shenzhen Huole Science and Technology Development Co Ltd
Original Assignee: Shenzhen Huole Science and Technology Development Co Ltd
Application filed by Shenzhen Huole Science and Technology Development Co Ltd.

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/02: Methods for producing synthetic speech; speech synthesisers
    • G10L 15/02: Feature extraction for speech recognition; selection of recognition unit
    • G10L 15/25: Speech recognition using non-acoustical features: position of the lips, movement of the lips or face analysis
    • G10L 15/26: Speech-to-text systems
    • G10L 17/04: Speaker identification or verification: training, enrolment or model building
    • G10L 17/22: Speaker identification or verification: interactive procedures; man-machine interfaces
    • G10L 21/10: Transforming speech into visible information
    • G10L 2015/025: Phonemes, fenemes or fenones being the recognition units
    • H04N 9/31: Projection devices for colour picture display, e.g. using electronic spatial light modulators [ESLM]

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Quality & Reliability (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The present disclosure provides an avatar voice interaction method and apparatus, a projection device, and a computer medium. The avatar voice interaction method includes: picking up user voice information through a microphone device; analyzing the user voice information to obtain answering text information, and performing voice conversion on the answering text information according to voiceprint feature information of the user voice information to generate answering voice information; and acquiring an avatar corresponding to the user voice information, fusing the answering voice information with the avatar, and outputting the result through a projection device for avatar voice interaction. Because the answering voice information is generated from the user voice information and fused with the avatar, the method gives human-computer interaction a strong sense of realism.

Description

Virtual image voice interaction method and device, projection equipment and computer medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a method and an apparatus for avatar speech interaction, a projection device, and a computer medium.
Background
In recent years, with the continuous progress of computer voice synthesis and video synthesis technologies, various avatar synthesis technologies have been developed. The avatar may perform tasks such as news broadcasting, weather forecasting, game commentary, providing ordering services, etc.
Visualized voice interaction with an avatar has become a focus of attention. Visual voice interaction is a human-computer interaction mode in which response voice is played through an avatar. At present, although visual voice interaction can closely connect ordinary users and computers through natural language recognition, understanding, and synthesis, when an avatar imitates a real person in voice interaction, most systems only synthesize a mouth shape matching the output voice, while the avatar either always keeps a neutral expression or switches among a few preset basic expressions, which reduces the realism of human-computer interaction.
Disclosure of Invention
The present disclosure provides an avatar voice interaction method and apparatus, a projection device, and a computer medium, aiming to solve the technical problem that existing projection devices offer poor realism of human-computer interaction during avatar voice interaction.
In one aspect, the present disclosure provides an avatar voice interaction method applied to a projection device, the avatar voice interaction method including:
picking up user voice information through a microphone device;
analyzing the user voice information to obtain answering text information, and carrying out voice conversion on the answering text information according to the voiceprint feature information of the user voice information to generate answering voice information;
acquiring an avatar corresponding to the user voice information;
and fusing the answering voice information and the virtual image and outputting the information through projection equipment to perform virtual image voice interaction.
Optionally, the analyzing the user voice information to obtain answer text information, and performing voice conversion on the answer text information according to the voiceprint feature information of the user voice information to generate answer voice information, including:
inputting the user voice information into a preset voice recognition model to obtain voice text information;
recognizing the voice text information to obtain answering text information;
and extracting voiceprint characteristic information in the user voice information, and carrying out voice conversion on the answering text information according to the voiceprint characteristic information to obtain answering voice information.
Optionally, the fusing the answering voice information and the avatar and outputting the fused answering voice information and the avatar through a projection device to perform avatar voice interaction, including:
sequentially extracting each answering voice phoneme in the answering voice information, and querying a preset phoneme and lip mapping relation to obtain a target lip shape matched with each answering voice phoneme;
and adjusting the mouth shape of the virtual image according to the target lip shape, and outputting the answering voice information and the adjusted virtual image synchronously through projection equipment to perform virtual image voice interaction.
Optionally, the obtaining an avatar corresponding to the user voice information includes:
inputting the user voice information into a preset voice recognition model to obtain voiceprint characteristic information;
and inquiring a preset database to obtain the virtual image corresponding to the voiceprint characteristic information.
Optionally, before querying a preset database and obtaining an avatar corresponding to the voiceprint feature information, the method includes:
receiving an account registration request, acquiring an account identification input by a user, and acquiring user voice information and user image information corresponding to the account identification;
analyzing the user image information to obtain face feature information and stature proportion information;
inputting the face feature information and the stature proportion information into a preset three-dimensional image model to obtain three-dimensional animation information;
rendering the three-dimensional animation information to generate a virtual image;
and extracting voiceprint characteristic information of the user voice information, and storing the account identification, the voiceprint characteristic information, the face characteristic information and the virtual image in a preset database in a correlated manner.
Optionally, the fusing the answering voice information and the avatar and outputting the fused answering voice information and the avatar through a projection device to perform avatar voice interaction, including:
obtaining the sound source position of the user voice information according to the positions of at least two microphone devices and the acquisition time at which each microphone device acquires the user voice information;
Adjusting the virtual image to face the sound source position, and rendering the adjusted virtual image;
and fusing the answering voice information and the adjusted virtual image and outputting the fused information through projection equipment to perform virtual image voice interaction.
Optionally, the fusing the answering voice information and the avatar and outputting the fused answering voice information and the avatar through a projection device to perform avatar voice interaction, including:
determining a sound source position from the position of the microphone arrangement;
collecting user image information of a target user at a position corresponding to the sound source position through a camera device;
inputting the user image information into a preset face recognition model to obtain face feature information;
and updating the virtual image according to the face feature information, fusing the answering voice information and the updated virtual image and outputting the fused answering voice information and the updated virtual image through projection equipment so as to perform virtual image voice interaction.
In another aspect, the present disclosure further provides an avatar voice interaction apparatus, including:
the information acquisition module is used for picking up voice information of a user through the microphone device;
the information adjusting module is used for analyzing the user voice information to obtain answering text information, and carrying out voice conversion on the answering text information according to the voiceprint feature information of the user voice information to generate answering voice information;
the image acquisition module is used for acquiring a virtual image corresponding to the user voice information;
and the fusion output module is used for fusing the answering voice information and the virtual image and outputting the fused information and the virtual image through projection equipment so as to perform virtual image voice interaction.
In another aspect, the present disclosure further provides a projection apparatus, where:
one or more processors;
a memory; and
one or more applications, wherein the one or more applications are stored in the memory and configured to be executed by the processor to implement the avatar voice interaction method.
In another aspect, the present disclosure also provides a computer medium having a computer program stored thereon, where the computer program is loaded by a processor to execute the steps of the avatar voice interaction method.
The virtual image voice interaction method provided by the present disclosure comprises: picking up user voice information through a microphone device; analyzing the user voice information to obtain answering text information, and carrying out voice conversion on the answering text information according to the voiceprint feature information of the user voice information to generate answering voice information; acquiring an avatar corresponding to the user voice information, fusing the answering voice information and the avatar and outputting the fused answering voice information and the avatar through projection equipment to perform avatar voice interaction; in the embodiment of the disclosure, the answering voice information is generated according to the user voice information, and the virtual image and the answering voice information are fused, so that the voice is matched with the expression of the virtual image, the reality sense of human-computer interaction is enhanced, and the user experience is improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present disclosure, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present disclosure, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a scene diagram of an avatar voice interaction method provided by an embodiment of the present disclosure;
FIG. 2 is a flowchart illustrating an embodiment of an avatar voice interaction method in accordance with an embodiment of the present disclosure;
fig. 3 is a schematic flow chart illustrating an embodiment of a method for performing voice interaction by querying a preset database with a projection device in an avatar voice interaction method according to the embodiment of the present disclosure;
FIG. 4 is a flowchart illustrating an embodiment of avatar update in a default database in the avatar voice interaction method provided in embodiments of the present disclosure;
FIG. 5 is a schematic flow chart diagram illustrating an embodiment of a projection device interacting with different avatars in a method for interacting with avatars by voice according to the embodiment of the present disclosure;
FIG. 6 is a flowchart illustrating one embodiment of user registration and avatar generation in the avatar voice interaction method provided in embodiments of the present disclosure;
FIG. 7 is a diagram illustrating a specific scenario of an embodiment of avatar generation in an avatar voice interaction method provided in an embodiment of the present disclosure;
FIG. 8 is a schematic flow chart diagram illustrating one embodiment of a projection device for acoustic addressing for targeted user voice interaction in a method for avatar voice interaction provided in embodiments of the present disclosure;
FIG. 9 is a schematic flowchart of an embodiment in which the projection device implements a combination of sound addressing and avatar update in the avatar voice interaction method provided in the embodiments of the present disclosure;
FIG. 10 is a schematic structural diagram illustrating one embodiment of an avatar voice interaction apparatus provided in embodiments of the present disclosure;
fig. 11 is a schematic structural diagram of an embodiment of a projection device provided in an embodiment of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are only a part of the embodiments of the present disclosure, and not all of the embodiments. All other embodiments, which can be derived by one skilled in the art from the embodiments disclosed herein without making any creative effort, shall fall within the scope of the present disclosure.
In the description of the present disclosure, it is to be understood that the terms "center", "longitudinal", "lateral", "length", "width", "thickness", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", and the like indicate orientations or positional relationships based on those shown in the drawings, are used only for convenience and simplicity of description, and do not indicate or imply that the device or element referred to must have a particular orientation or be constructed and operated in a particular orientation; they should therefore not be construed as limiting the present disclosure. Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, features defined as "first" or "second" may explicitly or implicitly include one or more of the described features. In the description of the present disclosure, "a plurality" means two or more unless specifically limited otherwise.
In the present disclosure, the word "exemplary" is used to mean "serving as an example, instance, or illustration". Any embodiment described in this disclosure as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments. The following description is presented to enable any person skilled in the art to make and use the disclosure. In the following description, details are set forth for the purpose of explanation. It will be apparent to one of ordinary skill in the art that the present disclosure may be practiced without these specific details. In other instances, well-known structures and processes are not set forth in detail in order to avoid obscuring the description of the present disclosure with unnecessary detail. Thus, the present disclosure is not intended to be limited to the embodiments shown, but is to be accorded the widest scope consistent with the principles and features disclosed herein.
The embodiments of the present disclosure provide a method and an apparatus for avatar voice interaction, a projection device, and a computer medium, which are described in detail below.
The avatar voice interaction method in the disclosed embodiment is applied to an avatar voice interaction apparatus, which is disposed in a projection device. The projection device is provided with one or more processors, a memory, and one or more application programs, where the one or more application programs are stored in the memory and configured to be executed by the processor to implement the avatar voice interaction method; the projection device may be a terminal, such as a cell phone or a tablet computer.
As shown in fig. 1, fig. 1 is a scene schematic diagram of an avatar voice interaction method according to an embodiment of the present disclosure, where the avatar voice interaction scene includes a projection device 100 (an avatar voice interaction apparatus is integrated in the projection device 100), and a computer medium corresponding to avatar voice interaction is run in the projection device 100 to perform a step of avatar voice interaction.
It should be understood that the projection device in the scene of the avatar voice interaction method shown in fig. 1, and the apparatuses included in it, do not limit the embodiments of the present disclosure; that is, the number and type of projection devices included in the scene, and the number and type of apparatuses included in each device, do not affect the overall implementation of the technical solution in the embodiments of the present disclosure, and equivalent replacements or derivatives of the claimed technical solutions remain within their scope.
The projection device 100 in the embodiment of the present disclosure is mainly used for: picking up voice information of a user through a microphone device, and acquiring image information of the user through a preset camera device; analyzing the user voice information to obtain answering text information, and carrying out voice conversion on the answering text information according to the voiceprint feature information of the user voice information to generate answering voice information; and generating an avatar according to the user image information, fusing the answering voice information and the avatar and outputting the fused answering voice information and the avatar through projection equipment so as to perform avatar voice interaction.
The projection device 100 in the embodiment of the present disclosure may be an independent projection device, or a projection device network or projection device cluster; for example, the projection device 100 described in the embodiment of the present disclosure includes, but is not limited to, a computer, a network host, a single network projection device, a set of multiple network projection devices, or a cloud projection device composed of multiple projection devices, where a cloud projection device is constituted by a large number of computers or network projection devices based on cloud computing.
Those skilled in the art can understand that the application environment shown in fig. 1 is only one application scenario related to the present disclosure and does not limit the application scenarios of the present disclosure; other application environments may include more or fewer projection devices than shown in fig. 1, or different network connection relationships among them. For example, only one projection device is shown in fig. 1, but the scenario of the avatar voice interaction method may include one or more other projection devices, which is not limited here; the projection device 100 may also include a memory for storing data.
In addition, in the scene of the avatar voice interaction method of the present disclosure, the projection apparatus 100 may be provided with a display device, or may instead be communicatively connected with an external display device 200, where the display device 200 is configured to output the results of the avatar voice interaction method executed in the projection apparatus. For example, the display device 200 may be a display or a projection screen. The projection apparatus 100 may access a background database 300, which may be a local memory of the projection apparatus or be disposed in the cloud, and which stores information related to avatar voice interaction.
It should be noted that the scene diagram of the avatar voice interaction method shown in fig. 1 is only an example, and the scene of the avatar voice interaction method described in the embodiment of the present disclosure is for more clearly explaining the technical solution of the embodiment of the present disclosure, and does not constitute a limitation to the technical solution provided by the embodiment of the present disclosure.
Based on the scene of the virtual image voice interaction method, the embodiment of the virtual image voice interaction method is provided.
As shown in fig. 2, fig. 2 is a flowchart illustrating an embodiment of an avatar voice interaction method in an embodiment of the present disclosure.
The virtual image voice interaction method in this embodiment comprises the following steps 201-204:
the user speech information is picked up by a microphone device 201.
The avatar voice interaction method in this embodiment is applied to a projection device, also called a projector, which is a device capable of projecting an image or video onto a screen and can be connected through different interfaces with a computer, a VCD (Video Compact Disc) player, a DVD (Digital Video Disc) player, a game console, and the like to play the corresponding video signals.
The projection equipment receives a voice interaction instruction; the triggering mode of the voice interaction instruction is not particularly limited. That is, the voice interaction instruction may be actively triggered by the user, for example, by the user speaking a wake word such as "XX"; it may also be automatically triggered by the projection device, for example, the projection device may be preset to trigger the instruction automatically when human-shaped image information or user voice information is detected.
After the projection equipment receives the voice interaction instruction, it picks up the user voice information through a microphone device; the microphone may be built into the projector or be an independent device communicatively connected with the projector. In addition, the projection equipment may collect user image information through a camera device, which may be built into the projection equipment or communicatively connected with it; the camera device may be an infrared camera, an ordinary camera, or another sensing detection device. The user image information may be collected directly by the preset camera device, or the camera device may collect user video information, which the projection equipment then divides into frames to obtain the user image information.
In the embodiment, the projection equipment picks up the voice information of the user through the microphone device, and acquires the image information of the user through the preset camera device so as to generate the virtual image according to the image information of the user and perform voice interaction.
And 202, analyzing the user voice information to obtain answering text information, and carrying out voice conversion on the answering text information according to the voiceprint feature information of the user voice information to generate answering voice information.
The projection equipment analyzes the user voice information to obtain its text type and determines from the text type whether the user voice information is an operation instruction for the projection equipment. If the user voice information is an operation instruction of the projection equipment, the projection equipment responds to the operation instruction; if it is not, the projection equipment obtains the semantics corresponding to the user voice information and determines the answering text information from those semantics. For example, if the user voice information is "Is the weather good today?", the answering text is: "The weather is sunny today, and the air temperature is 25 ℃." The projection equipment then extracts the voiceprint feature information of the user voice information; voiceprint feature information refers to characteristics of the voice such as pitch and timbre and can be determined from the frequency spectrum or phase spectrum of the user voice information. After obtaining the voiceprint feature information corresponding to the user voice information, the projection equipment performs voice conversion on the answering text information according to it to generate the answering voice information. Specifically, the method comprises the following steps:
(1) inputting the user voice information into a preset voice recognition model to obtain voice text information;
(2) identifying the voice text information to acquire answering text information;
(3) and extracting voiceprint characteristic information in the user voice information, and carrying out voice conversion on the answering text information according to the voiceprint characteristic information to obtain answering voice information.
The projection equipment inputs the user voice information into the preset voice recognition model, which first divides the user voice information into windows and then performs voice recognition on the windowed waveform to obtain the voice text information corresponding to the user voice information.
The projection equipment recognizes the voice text information through a pre-trained neural network model to obtain the answering text information, or queries a preset question-answer mapping relation to obtain the answering text information corresponding to the voice text information; the projection equipment then extracts the voiceprint feature information in the user voice information and performs voice conversion on the answering text information according to it to obtain the answering voice information.
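For illustration, the following Python sketch wires these steps together with stand-in components; the model stubs and the phrase-level question-answer mapping are hypothetical placeholders, not the models described in this disclosure.

```python
# Minimal sketch of the recognize -> answer -> convert pipeline; every
# stub below is an assumed placeholder for the corresponding model.
from dataclasses import dataclass

@dataclass
class Voiceprint:
    pitch_hz: float   # fundamental-frequency estimate
    timbre: tuple     # e.g. averaged spectral-envelope coefficients

# Preset question-answer mapping relation (illustrative entries only).
QA_MAPPING = {
    "is the weather good today": "The weather is sunny today, and the air temperature is 25 C.",
}

def recognize_speech(audio: bytes) -> str:
    """Stand-in for the preset voice recognition model (audio -> text)."""
    raise NotImplementedError("plug in an ASR model here")

def extract_voiceprint(audio: bytes) -> Voiceprint:
    """Stand-in for voiceprint feature extraction (pitch/timbre)."""
    raise NotImplementedError("plug in a speaker-feature extractor here")

def synthesize(text: str, voiceprint: Voiceprint) -> bytes:
    """Stand-in for voice conversion conditioned on the voiceprint."""
    raise NotImplementedError("plug in a TTS / voice-conversion model here")

def answering_voice(audio: bytes) -> bytes:
    question = recognize_speech(audio).strip().lower().rstrip("?")
    answer = QA_MAPPING.get(question, "Sorry, I did not catch that.")
    return synthesize(answer, extract_voiceprint(audio))
```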
And 203, acquiring an avatar corresponding to the voice information of the user.
The projection equipment acquires an avatar corresponding to the user voice information; that is, in this embodiment the avatar is pre-constructed and generated from user image information. The avatar may be two-dimensional (for example, a two-dimensional animation of the user) or three-dimensional (a three-dimensional animation). In this embodiment the projection equipment generates the avatar from the user image information: for example, it constructs an initial three-dimensional face model of the target user according to the user image information and a three-dimensional face base model; determines face attribute information from the user image information; and adjusts the initial three-dimensional face model based on the face attribute information so that the adjusted target three-dimensional face model contains information matching the face attribute information. The adjusted three-dimensional face model serves as the face image of the target user; further, the projection equipment adds an animated body to the face image to obtain a three-dimensional avatar, making the avatar more lifelike.
For ease of understanding, this embodiment presents two specific implementations of avatar generation:
the implementation mode is as follows:
(1) analyzing the user image information to obtain face feature information and figure proportion information;
(2) inputting the face feature information and the figure proportion information into a preset image generation model to obtain a virtual image;
(3) sequentially extracting each answering voice phoneme in the answering voice information, and querying a preset phoneme and lip mapping relation to obtain a target lip shape matched with each answering voice phoneme;
(4) and adjusting the mouth shape and the expression of the virtual image according to the target lip shape, and outputting the answering voice information and the adjusted virtual image synchronously through projection equipment so as to perform virtual image voice interaction.
The projection equipment inputs the user image information into a preset image analysis model, which processes and analyzes the user image information to obtain the face feature information and stature proportion information. Face feature information refers to the skin color, contour, inter-ocular distance, and positions of the facial features; stature proportion information refers to the head-to-body ratio, height, leg-length ratio, and so on. The projection equipment counts the pixel values of the pixel points in the face region of the user image information and determines the skin color of the face from those pixel values; it marks feature points in the face region, obtains the coordinates of each feature point, and derives the contour, eye distance, and positions of the facial features from those coordinates. The projection equipment also analyzes multiple frames of user image information, selects a reference object in them, determines the user's height according to the reference object, and thereby determines the stature proportion information.
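A minimal Python sketch of these pixel-statistics steps follows, assuming an H x W x 3 image array and a marked feature-point array; the function names, landmark indices, and data layouts are illustrative assumptions, not taken from this disclosure.

```python
import numpy as np

def skin_color(image: np.ndarray, face_box: tuple) -> np.ndarray:
    """Mean RGB over the face region; face_box is (x, y, w, h)."""
    x, y, w, h = face_box
    return image[y:y + h, x:x + w].reshape(-1, 3).mean(axis=0)

def eye_distance(landmarks: np.ndarray, left: int = 0, right: int = 1) -> float:
    """Euclidean distance between two eye-centre feature points (N x 2 array)."""
    return float(np.linalg.norm(landmarks[left] - landmarks[right]))

def user_height(user_px: float, ref_px: float, ref_height_m: float) -> float:
    """User height estimated against a reference object of known height."""
    return user_px / ref_px * ref_height_m
```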
The projection equipment is internally provided with a preset image generation model, the preset image generation model refers to a virtual image generation algorithm which is constructed in advance, and the projection equipment inputs the face feature information and the figure proportion information into the preset image generation model to obtain a virtual image.
Implementation mode two:
(1) inputting the user image information into a preset three-dimensional image model to obtain three-dimensional animation information;
(2) rendering the three-dimensional animation information to generate an avatar, fusing the answering voice information and the avatar and outputting the fused answering voice information and the avatar through projection equipment to perform avatar voice interaction.
The projection equipment is preset with a three-dimensional image model, which refers to a preset three-dimensional conversion model, for example a 3DMM (3D Morphable Model). The projection equipment inputs the user image information into the preset three-dimensional image model to obtain three-dimensional animation information: through the model it stretches, cuts, and rotates basic geometric bodies (cubes, cylinders, or spheres) according to the image information to synthesize an initial avatar model; determines the control points of the initial avatar according to the motion relationships of the human skeleton and binds the control points directly to the muscles and bones of the initial avatar; collects multiple frames of user image information to determine human motion information; and displays the bound initial avatar according to the human motions to form the three-dimensional animation information, which the projection equipment renders to generate the avatar.
204, fusing the answering voice information and the virtual image and outputting the information through a projection device to perform virtual image voice interaction.
The projection equipment fuses the answering voice information with the avatar and outputs the result through the projection equipment for avatar voice interaction. That is, the projection equipment adjusts the mouth shape of the avatar according to the answering voice information: it acquires the time of each phoneme or syllable in the answering voice information and adjusts the avatar's mouth shape according to those times, so that the answering voice information and the avatar are synchronized and their fusion makes the avatar voice interaction more realistic.
Specifically, this embodiment does not limit the exact way in which the projection device generates the avatar and fuses it with the answering voice information to implement avatar voice interaction. The projection device sequentially extracts each answering voice phoneme in the answering voice information and queries a preset phoneme and lip mapping relation, which records the mouth shape corresponding to each phoneme under a normal speaking standard, to obtain the target lip shape matching each answering voice phoneme. The projection device adjusts the avatar's mouth shape according to the target lip shape and synchronously outputs the answering voice information and the adjusted avatar through the projection device for avatar voice interaction; fusing the avatar with the answering voice information matches the voice to the avatar's expression and enhances the realism of human-computer interaction.
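The following sketch illustrates one plausible form of the preset phoneme and lip mapping relation and the per-phoneme timing step; the mapping entries and lip-shape labels are assumed for illustration only.

```python
# Assumed phoneme -> lip-shape (viseme) table; not the patent's actual table.
PHONEME_TO_LIP = {
    "AA": "open_wide",
    "IY": "spread",
    "UW": "rounded",
    "M":  "closed",
    "F":  "lip_teeth",
}

def lip_track(phonemes: list) -> list:
    """Map (phoneme, start, end) tuples to (lip shape, start, end),
    falling back to a neutral mouth shape for unmapped phonemes."""
    return [(PHONEME_TO_LIP.get(p, "neutral"), start, end)
            for p, start, end in phonemes]

# Example: schedule mouth shapes for a short answering-speech fragment.
print(lip_track([("M", 0.00, 0.08), ("AA", 0.08, 0.22), ("UW", 0.22, 0.35)]))
```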
It can be understood that this embodiment describes obtaining a preset avatar; in addition, a person skilled in the art may adjust the execution steps of the technical scheme of the present disclosure to generate the avatar in real time from the collected user voice information and user image information and then perform voice interaction, so that the generated avatar changes along with the user image information and remains true to the user. Generating the answering voice information from the user voice information and fusing the avatar with it matches the voice to the avatar's expression and enhances the realism of human-computer interaction.
As shown in fig. 3, fig. 3 is a schematic flow chart illustrating an embodiment of performing voice interaction by querying a preset database with a projection device in the avatar voice interaction method according to the embodiment of the present disclosure.
In some embodiments of the present disclosure, the avatar voice interaction method includes the following steps 301-304:
301, inputting the user voice information to a preset voice recognition model to obtain voiceprint feature information, and/or inputting the user image information to a preset face recognition model to obtain face feature information.
The projection equipment is internally provided with a voice recognition model (a preset voice recognition algorithm) and a face recognition model (a preset face recognition algorithm). The projection equipment inputs the user voice information into the preset voice recognition model to obtain voiceprint feature information, inputs the user image information into the preset face recognition model to obtain face feature information, or obtains both simultaneously.
302, querying a preset database, and acquiring the voiceprint feature information and/or a virtual image corresponding to the face feature information;
after the projection equipment acquires the voiceprint characteristic information and/or the face characteristic information, the projection equipment queries a preset database to acquire a virtual image corresponding to the voiceprint characteristic information and/or the face characteristic information.
Specifically, in this embodiment, before step 302, a preset database is pre-constructed in the projection device and the user's avatar is stored in it, so that after collecting the user voice information the projection device can directly use the generated avatar for voice interaction. The steps of constructing the preset database in this embodiment include:
(1) picking up user voice information through a microphone device, and identifying the user voice information through a preset voice identification model to obtain voiceprint characteristic information;
(2) acquiring user image information through a preset camera device, and identifying the user image information through a preset face identification model to obtain face characteristic information;
(3) and constructing an avatar according to the user image information, and storing the voiceprint feature information, the face feature information and the avatar in a preset database in an associated manner.
When the projection equipment is used, firstly, account registration is carried out, and after the account registration is successful, the projection equipment picks up the voice information of the user through a microphone device, and the voice information of the user is identified through a preset voice identification model to obtain voiceprint characteristic information; the projection equipment acquires user image information through a preset camera device, and identifies the user image information through a preset face identification model to obtain face characteristic information; the projection device constructs an avatar from the user image information.
In this embodiment, the projection device pre-constructs a preset database storing the avatars of the users corresponding to multiple accounts. For example, if the projection device is a home theater projector, then when it is used, dad logs in and registers to form avatar A, mom logs in and registers to form avatar B, and the child logs in and registers to form avatar C; the projection device stores each user's avatar in the preset database, so avatars need not be generated in real time during voice interaction.
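A minimal sketch of such a preset database follows, using SQLite as an assumed store; the table layout and blob encodings are illustrative, and a real system would match voiceprints by similarity rather than exact equality.

```python
import sqlite3

conn = sqlite3.connect("avatars.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS avatars (
        account_id    TEXT PRIMARY KEY,
        voiceprint    BLOB NOT NULL,  -- serialized voiceprint feature vector
        face_features BLOB NOT NULL,  -- serialized face feature vector
        avatar        BLOB NOT NULL   -- rendered avatar asset
    )
""")

def register(account_id: str, voiceprint: bytes, face: bytes, avatar: bytes) -> None:
    """Store the account identification, features and avatar in association."""
    conn.execute("INSERT OR REPLACE INTO avatars VALUES (?, ?, ?, ?)",
                 (account_id, voiceprint, face, avatar))
    conn.commit()

def avatar_for(voiceprint: bytes):
    """Exact-match lookup by voiceprint (nearest-neighbour in practice)."""
    row = conn.execute("SELECT avatar FROM avatars WHERE voiceprint = ?",
                       (voiceprint,)).fetchone()
    return row[0] if row else None
```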
303, analyzing the user voice information to determine answering text information, and performing voice conversion on the answering text information according to the voiceprint feature information of the user voice information to generate answering voice information;
and 304, fusing the answering voice information and the virtual image and outputting the information through a projection device to perform virtual image voice interaction.
The projection equipment analyzes the user voice information to determine the answering text information, and carries out voice conversion on the answering text information according to the voiceprint characteristic information of the user voice information to generate answering voice information; and then, the projection equipment fuses the answering voice information and the virtual image and outputs the information through the projection equipment so as to carry out virtual image voice interaction.
In this embodiment, the preset database is pre-constructed and the answering voice information is fused with the avatar taken from the preset database; the avatar need not be generated in real time, which reduces the hardware requirements on the projection device and makes voice interaction more efficient.
Because the avatar is pre-stored in the preset database, after a period of time it may differ considerably from the user's actual appearance, which reduces the realism of human-computer interaction. To address this problem, the avatar in the preset database may be updated in real time in this embodiment, specifically:
referring to fig. 4, fig. 4 is a schematic flowchart illustrating an embodiment of updating an avatar in a preset database in the avatar voice interaction method provided in the embodiment of the present disclosure.
In some embodiments of the present disclosure, an embodiment scenario in which the projection device compares the user image information with an avatar in a preset database and updates the avatar is specifically described, which specifically includes steps 401 to 404:
401, comparing the virtual image with the user image information;
the projection equipment queries a preset database to obtain voiceprint characteristic information and/or a virtual image corresponding to the face characteristic information; the projection equipment analyzes the user voice information to determine answer text information, voice conversion is carried out on the answer text information according to voiceprint feature information of the user voice information, after the answer voice information is generated, the projection equipment compares the virtual image with the user image information to judge whether the virtual image is matched with the user image information, wherein the matching of the virtual image and the user image information can be set according to specific scenes, for example, the similarity between the feature data of the virtual image and the feature data of the user image information is higher than 80%, and the virtual image is judged to be matched with the user image information; and if the similarity of the feature data of the virtual image and the feature data of the user image information is not higher than 80%, judging that the virtual image is not matched with the user image information.
And if the virtual image is matched with the user image information, the projection equipment fuses the answering voice information and the virtual image and outputs the information through the projection equipment so as to carry out virtual image voice interaction.
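A sketch of the matching test follows; cosine similarity over feature vectors is an assumed metric, since this disclosure fixes only the 80% threshold, not the measure.

```python
import numpy as np

def matches(avatar_feat: np.ndarray, user_feat: np.ndarray,
            threshold: float = 0.8) -> bool:
    """True if the avatar's feature data is similar enough to the user's."""
    cos = np.dot(avatar_feat, user_feat) / (
        np.linalg.norm(avatar_feat) * np.linalg.norm(user_feat))
    return cos > threshold
```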
402, if the avatar is not matched with the user image information, generating a new avatar according to the user image information.
If the avatar does not match the user image information, the projection equipment generates a new avatar from the user image information; the new avatar can be generated directly through the avatar generation model, or the projection equipment can update the existing avatar.
And 403, storing the new virtual image, the voiceprint characteristic information and the face characteristic information in a preset database in a correlated manner.
The projection equipment stores the new avatar, the voiceprint feature information, and the face feature information in the preset database in an associated manner; since the projection equipment can hold multiple avatars, the avatars can be displayed in a diversified way.
404, merging the answering voice information and the new avatar and outputting the merged information through a projection device to perform avatar voice interaction.
The projection equipment fuses the answering voice information and the new virtual image and outputs the information through the projection equipment to perform virtual image voice interaction.
The above embodiments describe a single-avatar scene. In some specific use scenes, a user may need to perform voice interaction with the avatars of other people; this embodiment therefore presents a scene in which the user performs voice interaction when the preset database contains multiple avatars.
Referring to fig. 5, fig. 5 is a schematic flow chart of an embodiment of voice interaction between a projection device and different avatars in the avatar voice interaction method provided in the embodiment of the present disclosure.
In some embodiments of the present disclosure, specifically, the projection device selects a target avatar for voice interaction according to the user's requirement, further comprising the following steps 501-504:
501, outputting each virtual image stored in the preset database for the user to select the target virtual image.
The projection equipment outputs and displays a plurality of virtual images stored in a preset database, and outputs prompt information to prompt a user to select a target virtual image needing voice interaction.
502, obtaining the target virtual image selected by the user and the target voiceprint feature information associated with the target virtual image.
The projection equipment acquires the target avatar selected by the user and the target voiceprint feature information associated with it, and simulates the target avatar's answering according to the target voiceprint feature information, making voice interaction more intelligent. Specifically:
503, analyzing the user voice information to determine answer text information, and performing voice conversion on the answer text information according to the target voiceprint feature information to generate answer voice information.
And 504, fusing the answering voice information and the target avatar and outputting the fused information and the target avatar through projection equipment to perform avatar voice interaction.
The projection equipment analyzes the user voice information to determine the answering text information and performs voice conversion on it according to the target voiceprint feature information to generate the answering voice information; that is, the projection equipment converts the answering text information into initial answering voice information and then adjusts it according to the target voiceprint features to obtain answering voice information whose pitch and timbre accord with the target avatar. The projection equipment fuses the answering voice information with the target avatar and outputs the result through the projection equipment for avatar voice interaction.
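As a toy illustration of adjusting the initial answering voice toward the target voiceprint, the following sketch scales only the pitch, by naive resampling (which also changes duration); a practical system would use a pitch-synchronous or neural voice-conversion method instead.

```python
import numpy as np

def scale_pitch(samples: np.ndarray, source_f0: float, target_f0: float) -> np.ndarray:
    """Resample so the signal, played at the same rate, shifts in pitch by
    target_f0 / source_f0. Crude: duration changes by the same factor."""
    ratio = target_f0 / source_f0
    idx = np.arange(0, len(samples) - 1, ratio)
    return np.interp(idx, np.arange(len(samples)), samples)
```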
The projection equipment in this embodiment holds multiple avatars and can select a target avatar for voice interaction as required, which both preserves the realism of the interaction and meets users' personalized needs.
Referring to fig. 6 and 7, fig. 6 is a flow chart illustrating an embodiment of an avatar interaction application scenario in the avatar voice interaction method provided in an embodiment of the present disclosure; fig. 7 is a specific scene diagram of an embodiment of avatar generation in the avatar voice interaction method provided in the embodiment of the present disclosure.
It can be understood that the avatar voice interaction method in the embodiments of the present disclosure can further learn the user's voice characteristics and pronunciation habits through deep learning of the user voice information, so that the fused avatar better meets the requirements.
The above embodiments describe a single-speaker avatar voice interaction scene. In some specific use scenes the projection device may acquire voice information from multiple users; this embodiment presents a scene in which the projection device performs voice interaction on the voice information of multiple users.
Referring to fig. 8, fig. 8 is a flowchart illustrating an embodiment of a projection device performing sound addressing to achieve target user interaction in the avatar voice interaction method provided in the embodiments of the present disclosure.
In some embodiments of the present disclosure, it is specifically described that the projection device performs sound addressing to achieve target user voice interaction, further comprising the following steps 601-603:
601, obtaining the sound source position of the user voice information according to the positions of at least two microphone devices and the acquisition time of each microphone device for acquiring the user voice information.
The projection apparatus in this embodiment may determine the sound source location from the positions of the microphone devices and the acquisition times of the user voice information. For example, at least two microphone devices are disposed at different locations in the projection apparatus; the projection apparatus obtains the positions of the at least two microphone devices and the times at which each microphone picked up the same user voice information, calculates the time difference between those acquisition times, and determines the sound source location of the user voice information from the microphone positions and the time difference.
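The two-microphone case reduces to the standard time-difference-of-arrival estimate; the following sketch, offered as one plausible reading of this step, derives the source bearing from the arrival-time difference and the microphone spacing.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s at room temperature

def source_angle(mic_spacing_m: float, t_left: float, t_right: float) -> float:
    """Bearing of the source in degrees (0 = straight ahead), from the
    arrival times of the same user voice at two microphones."""
    tdoa = t_right - t_left
    # Path-length difference, clamped to the physically possible range.
    ratio = max(-1.0, min(1.0, SPEED_OF_SOUND * tdoa / mic_spacing_m))
    return math.degrees(math.asin(ratio))

# Example: mics 0.2 m apart; the left mic hears the voice 0.25 ms earlier.
print(f"{source_angle(0.2, 0.000_00, 0.000_25):.1f} degrees")
```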
And 602, adjusting the virtual image to face the sound source position, and rendering the adjusted virtual image.
The projection device adjusts the avatar to face the sound source; for example, the projection device establishes a polar coordinate system, adjusts the avatar's face orientation according to the coordinate bearing, and then renders the adjusted avatar.
603, merging the answering voice information and the adjusted virtual image and outputting the merged information through projection equipment to perform virtual image voice interaction.
The projection equipment fuses the answering voice information with the adjusted avatar and outputs the result through the projection equipment for avatar voice interaction. In this embodiment the projection equipment can determine the user's position from the user voice information and adjust the avatar's orientation to face the user, increasing the realism of human-computer interaction.
In some embodiments of the present disclosure, the avatar is pre-stored in the projection device; since the user's appearance may change continuously, the stored avatar may no longer conform to it, so this embodiment provides a way to update the avatar automatically.
Referring to fig. 9, fig. 9 is a schematic flowchart of an embodiment in which the projection device combines sound source localization with avatar updating in the avatar voice interaction method provided by the embodiments of the present disclosure.
701, determining a sound source position according to the position of the microphone device;
702, acquiring, through a camera device, user image information of a target user at the position corresponding to the sound source position;
703, inputting the user image information into a preset face recognition model to obtain face feature information;
704, updating the virtual image according to the face feature information, fusing the answering voice information with the updated virtual image, and outputting the result through the projection device to perform virtual image voice interaction.
The projection device determines the sound source position according to the positions of the microphone devices. In this embodiment, the sound source position may be determined from the time differences of sound acquisition, or from sound intensity information, tone information, and the like; the specific manner in which the projection device determines the sound source position is not limited. The projection device then acquires, through the camera device, user image information of the target user at the position corresponding to the sound source position, inputs the user image information into a preset face recognition model to obtain face feature information, updates the avatar according to the face feature information, fuses the answering voice information with the updated avatar, and outputs the result through the projection device to perform avatar voice interaction.
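As one hypothetical illustration of the intensity-based alternative mentioned above, the sketch below simply picks the microphone whose latest audio frame has the highest RMS energy and uses that microphone's known bearing as a coarse sound source position; the frame data and bearings are invented for the example.

```python
import math

def rms(samples):
    """Root-mean-square energy of one audio frame."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def coarse_bearing_by_intensity(frames_by_bearing):
    """frames_by_bearing maps each microphone's bearing (degrees)
    to its latest frame; return the bearing of the loudest one."""
    return max(frames_by_bearing, key=lambda b: rms(frames_by_bearing[b]))

frames = {-45.0: [0.01, -0.02, 0.015],
            0.0: [0.20, -0.18, 0.22],
           45.0: [0.05, -0.04, 0.06]}
print(coarse_bearing_by_intensity(frames))  # -> 0.0 (loudest straight ahead)
```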
In this embodiment, the projection device combines sound source localization with avatar updating: in a scene where multiple people are speaking, the projection device can locate the sound source position from the user voice information and then update the avatar according to the user image information at that position, so that the avatar matches the target user. In this way the avatar can be updated in real time, and different avatars can be displayed for different users on the projection device, improving the realism of avatar voice interaction.
As shown in fig. 10, fig. 10 is a schematic structural diagram of an embodiment of an avatar voice interaction apparatus provided in an embodiment of the present disclosure.
In order to better implement the avatar voice interaction method in the embodiments of the present disclosure, an avatar voice interaction apparatus is further provided on the basis of the avatar voice interaction method, the avatar voice interaction apparatus comprising the following modules 801 to 804:
an information collecting module 801, configured to pick up voice information of a user through a microphone device;
the information adjusting module 802 is configured to parse the user voice information to obtain answering text information, perform voice conversion on the answering text information according to the voiceprint feature information of the user voice information, and generate answering voice information;
an image obtaining module 803, configured to obtain an avatar corresponding to the user voice information;
and the fusion output module 804 is configured to fuse the answering voice information with the virtual image and output the result through the projection device to perform virtual image voice interaction.
In some embodiments of the present disclosure, the information adjusting module 802 is specifically configured for the following (an illustrative sketch follows this list):
inputting the user voice information into a preset voice recognition model to obtain voice text information;
recognizing the voice text information to obtain answering text information;
and extracting voiceprint characteristic information in the user voice information, and carrying out voice conversion on the answering text information according to the voiceprint characteristic information to obtain answering voice information.
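The pipeline of the information adjusting module can be pictured as in the sketch below; every model here is a stand-in stub returning canned values (a real system would plug in actual speech recognition, dialogue, and voice conversion models), so all names and return values are assumptions for illustration only.

```python
from dataclasses import dataclass

@dataclass
class Voiceprint:
    embedding: tuple  # speaker features such as pitch and timbre (assumed)

def speech_to_text(audio: bytes) -> str:
    return "what is the weather today"       # stub for the recognition model

def generate_answer(text: str) -> str:
    return "It is sunny today."              # stub for answer generation

def extract_voiceprint(audio: bytes) -> Voiceprint:
    return Voiceprint(embedding=(0.12, 0.87, 0.45))  # stub extractor

def text_to_speech(text: str, vp: Voiceprint) -> bytes:
    # A real converter would condition synthesis on vp.embedding so the
    # answering voice matches the user's voiceprint feature information.
    return f"<audio:{text}|vp={vp.embedding}>".encode()

def answer_user(audio: bytes) -> bytes:
    question = speech_to_text(audio)            # user voice -> voice text
    answer_text = generate_answer(question)     # voice text -> answering text
    voiceprint = extract_voiceprint(audio)      # voiceprint features
    return text_to_speech(answer_text, voiceprint)  # answering voice

print(answer_user(b"...pcm samples..."))
```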
In some embodiments of the present disclosure, the fusion output module 804 is specifically configured for the following (see the sketch after this list):
sequentially extracting each answering voice phoneme in the answering voice information, and querying a preset phoneme-to-lip mapping relation to obtain a target lip shape matched with each answering voice phoneme;
and adjusting the mouth shape of the virtual image according to the target lip shape, and synchronously outputting the answering voice information and the adjusted virtual image through the projection device to perform virtual image voice interaction.
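A minimal sketch of the preset phoneme-to-lip mapping might look like the following; the phoneme symbols and lip-shape labels are illustrative, and a real system would cover the full phoneme inventory and interpolate mouth-shape frames over time.

```python
# Illustrative phoneme-to-lip (viseme) table; entries are assumptions.
PHONEME_TO_LIP = {
    "AA": "open_wide", "IY": "spread", "UW": "rounded",
    "M": "closed", "B": "closed", "F": "teeth_on_lip", "S": "narrow",
}

def lip_sequence(phonemes, default="neutral"):
    """Map each answering voice phoneme to its target lip shape."""
    return [PHONEME_TO_LIP.get(p, default) for p in phonemes]

print(lip_sequence(["B", "AY", "M", "UW"]))
# -> ['closed', 'neutral', 'closed', 'rounded']
```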
In some embodiments of the present disclosure, the image obtaining module 803 is specifically configured for the following (see the sketch after this list):
inputting the user voice information into a preset voice recognition model to obtain voiceprint characteristic information;
and inquiring a preset database to obtain the virtual image corresponding to the voiceprint characteristic information.
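Querying the preset database can be illustrated as a nearest-neighbor match over voiceprint embeddings, as sketched below; the cosine-similarity measure, the threshold, and the database rows are assumptions for the example rather than details given in the present disclosure.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# (account id, enrolled voiceprint embedding, avatar id) -- invented rows
DATABASE = [
    ("alice", (0.9, 0.1, 0.3), "avatar_alice"),
    ("bob",   (0.1, 0.8, 0.5), "avatar_bob"),
]

def find_avatar(voiceprint, threshold=0.85):
    """Return the avatar of the closest enrolled voiceprint, or None if
    no enrollment is similar enough (caller falls back to a default)."""
    best = max(DATABASE, key=lambda row: cosine(voiceprint, row[1]))
    return best[2] if cosine(voiceprint, best[1]) >= threshold else None

print(find_avatar((0.88, 0.12, 0.31)))  # -> avatar_alice
```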
In some embodiments of the present disclosure, the avatar voice interaction apparatus is further configured for the following (an illustrative sketch follows this list):
receiving an account registration request, acquiring an account identification input by a user, and acquiring user voice information and user image information corresponding to the account identification;
analyzing the user image information to obtain face feature information and stature proportion information;
inputting the face feature information and the stature proportion information into a preset three-dimensional image model to obtain three-dimensional animation information;
rendering the three-dimensional animation information to generate a virtual image;
and extracting voiceprint feature information of the user voice information, and storing the account identification, the voiceprint feature information, the face feature information, and the virtual image in a preset database in association with one another.
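The record produced by this registration flow can be pictured as one associated row in the preset database, as in the following sketch; the field values mirror the items listed above, while the class names and the in-memory storage scheme are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class EnrollmentRecord:
    account_id: str
    voiceprint: tuple     # voiceprint feature information from the voice sample
    face_features: tuple  # face feature information from the recognition model
    avatar_id: str        # identifier of the avatar rendered at registration

class PresetDatabase:
    """Toy in-memory stand-in for the 'preset database'."""
    def __init__(self):
        self._rows = {}

    def save(self, record: EnrollmentRecord) -> None:
        # All four items are stored in association, keyed by account id.
        self._rows[record.account_id] = record

    def get(self, account_id: str):
        return self._rows.get(account_id)

db = PresetDatabase()
db.save(EnrollmentRecord("alice", (0.9, 0.1, 0.3), (0.2, 0.7), "avatar_alice"))
print(db.get("alice").avatar_id)  # -> avatar_alice
```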
In some embodiments of the present disclosure, the fusion output module 804 is further configured for:
obtaining the sound source position of the user voice information according to the positions of at least two microphone devices and the acquisition time of each microphone device for acquiring the user voice information;
adjusting the virtual image to face the sound source position, and rendering the adjusted virtual image;
and fusing the answering voice information and the adjusted virtual image and outputting the fused information through projection equipment to perform virtual image voice interaction.
In some embodiments of the present disclosure, the fusion output module 804 is further configured for:
determining a sound source position from the position of the microphone device;
collecting user image information of a target user at a position corresponding to the sound source position through a camera device;
inputting the user image information into a preset face recognition model to obtain face feature information;
and updating the virtual image according to the face feature information, fusing the answering voice information and the updated virtual image and outputting the fused answering voice information and the updated virtual image through projection equipment so as to perform virtual image voice interaction.
In the embodiments of the present disclosure, the avatar voice interaction apparatus picks up user voice information through a microphone device, parses the user voice information to obtain answering text information, and performs voice conversion on the answering text information according to the voiceprint feature information of the user voice information to generate answering voice information. It then acquires the avatar corresponding to the user voice information, fuses the answering voice information with the avatar, and outputs the result through the projection device to perform avatar voice interaction. Because the answering voice information is generated from the user voice information and fused with the avatar, the voice matches the avatar, which enhances the realism of human-computer interaction.
Fig. 11 is a schematic structural diagram of an embodiment of the projection apparatus provided in the embodiment of the present disclosure.
The projection equipment integrates any virtual image voice interaction device provided by the embodiment of the disclosure, and the projection equipment is provided with: one or more processors; a memory; and
one or more application programs, wherein the one or more application programs are stored in the memory and configured to be executed by the processor for performing the steps of the avatar voice interaction method as described in any of the avatar voice interaction method embodiments above.
Specifically, the projection device may include components such as a processor 901 with one or more processing cores, a memory 902 of one or more computer media, a power supply 903, and an input unit 904. Those skilled in the art will appreciate that the projection device structure shown in fig. 11 does not constitute a limitation of the projection device, which may include more or fewer components than shown, combine certain components, or arrange the components differently. Wherein:
the processor 901 is a control center of the projection apparatus, connects various parts of the entire projection apparatus by using various interfaces and lines, and performs various functions of the projection apparatus and processes data by running or executing software programs and/or modules stored in the memory 902 and calling data stored in the memory 902, thereby performing overall monitoring of the projection apparatus. Optionally, processor 901 may include one or more processing cores; preferably, the processor 901 may integrate an application processor, which mainly handles operating user interfaces, application programs, etc., and a modem processor, which mainly handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 901.
The memory 902 may be used to store software programs and modules, and the processor 901 executes various functional applications and data processing by running the software programs and modules stored in the memory 902. The memory 902 may mainly include a program storage area and a data storage area, wherein the program storage area may store the application programs required for at least one function (such as a sound playing function, an image playing function, and the like), and the data storage area may store data created according to the use of the projection device, and the like. Further, the memory 902 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device. Accordingly, the memory 902 may also include a memory controller to provide the processor 901 with access to the memory 902.
The projection device further comprises a power supply 903 for supplying power to the components. Preferably, the power supply 903 may be logically connected to the processor 901 through a power management system, so that charging, discharging, power consumption management, and other functions are managed through the power management system. The power supply 903 may also include any component such as one or more direct current or alternating current power supplies, a recharging system, a power failure detection circuit, a power converter or inverter, and a power status indicator.
The projection device may also include an input unit 904, where the input unit 904 may be used to receive input numeric or character information and generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control.
Although not shown, the projection device may further include a display unit and the like, which are not described in detail here. Specifically, in this embodiment, the processor 901 in the projection device loads the executable file corresponding to the process of one or more application programs into the memory 902 according to the following instructions, and the processor 901 runs the application programs stored in the memory 902 to implement the following functions:
picking up user voice information through a microphone device;
analyzing the user voice information to obtain answering text information, and carrying out voice conversion on the answering text information according to the voiceprint feature information of the user voice information to generate answering voice information;
acquiring an avatar corresponding to the user voice information;
and fusing the answering voice information and the virtual image and outputting the information through projection equipment to perform virtual image voice interaction.
It will be understood by those skilled in the art that all or part of the steps of the methods of the above embodiments may be performed by instructions, which may be stored in a computer medium and loaded and executed by a processor, or by related hardware under the control of such instructions.
To this end, embodiments of the present disclosure provide a computer medium, which may include: a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, and the like. A computer program stored thereon is loaded by a processor to execute the steps of any of the avatar voice interaction methods provided by the embodiments of the present disclosure. For example, the computer program may be loaded by a processor to perform the following steps:
picking up user voice information through a microphone device;
analyzing the user voice information to obtain answering text information, and carrying out voice conversion on the answering text information according to the voiceprint feature information of the user voice information to generate answering voice information;
acquiring an avatar corresponding to the user voice information;
and fusing the answering voice information and the virtual image and outputting the information through projection equipment to perform virtual image voice interaction.
In the above embodiments, the description of each embodiment has its own emphasis; for parts not described in detail in one embodiment, reference may be made to the detailed descriptions of the other embodiments, which are not repeated here.
In a specific implementation, each unit or structure may be implemented as an independent entity, or may be combined arbitrarily to be implemented as one or several entities, and the specific implementation of each unit or structure may refer to the foregoing method embodiment, which is not described herein again.
The above operations can be implemented in the foregoing embodiments, and are not described in detail herein.
The avatar voice interaction method provided by the embodiments of the present disclosure has been described in detail above. Specific examples are used herein to explain the principles and implementation of the present disclosure, and the description of the above embodiments is intended only to help understand the method and its core idea. Meanwhile, those skilled in the art may, according to the idea of the present disclosure, make changes to the specific embodiments and the application scope. In summary, the content of this specification should not be construed as limiting the present disclosure.

Claims (10)

1. An avatar voice interaction method, applied to a projection device, the method comprising:
picking up user voice information through a microphone device;
analyzing the user voice information to obtain answering text information, and carrying out voice conversion on the answering text information according to the voiceprint feature information of the user voice information to generate answering voice information;
acquiring an avatar corresponding to the user voice information;
and fusing the answering voice information and the virtual image and outputting the information through projection equipment to perform virtual image voice interaction.
2. The avatar voice interaction method of claim 1, wherein said parsing said user voice information to obtain answering text information, and performing voice conversion on said answering text information according to the voiceprint feature information of said user voice information to generate answering voice information comprises:
inputting the user voice information into a preset voice recognition model to obtain voice text information;
recognizing the voice text information to obtain answering text information;
and extracting voiceprint characteristic information in the user voice information, and carrying out voice conversion on the answering text information according to the voiceprint characteristic information to obtain answering voice information.
3. The avatar voice interaction method of claim 1, wherein said fusing and outputting said answering voice information and said avatar through a projection device for avatar voice interaction comprises:
sequentially extracting each answering voice phoneme in the answering voice information, and querying a preset phoneme-to-lip mapping relation to obtain a target lip shape matched with each answering voice phoneme;
and adjusting the mouth shape of the virtual image according to the target lip shape, and outputting the answering voice information and the adjusted virtual image synchronously through projection equipment to perform virtual image voice interaction.
4. The avatar voice interaction method of claim 1, wherein said obtaining an avatar corresponding to said user voice information comprises:
inputting the user voice information into a preset voice recognition model to obtain voiceprint characteristic information;
and inquiring a preset database to obtain the virtual image corresponding to the voiceprint characteristic information.
5. The avatar voice interaction method of claim 4, wherein before said querying a preset database to obtain the avatar corresponding to said voiceprint feature information, said method comprises:
receiving an account registration request, acquiring an account identification input by a user, and acquiring user voice information and user image information corresponding to the account identification;
analyzing the user image information to obtain face feature information and stature proportion information;
inputting the face feature information and the stature proportion information into a preset three-dimensional image model to obtain three-dimensional animation information;
rendering the three-dimensional animation information to generate a virtual image;
and extracting voiceprint characteristic information of the user voice information, and storing the account identification, the voiceprint characteristic information, the face characteristic information and the virtual image in a preset database in a correlated manner.
6. The avatar voice interaction method of claim 1, wherein said fusing and outputting said answering voice information and said avatar through a projection device for avatar voice interaction comprises:
obtaining the sound source position of the user voice information according to the positions of at least two microphone devices and the acquisition time of each microphone device for acquiring the user voice information;
adjusting the virtual image to face the sound source position, and rendering the adjusted virtual image;
and fusing the answering voice information and the adjusted virtual image and outputting the fused information through projection equipment to perform virtual image voice interaction.
7. The avatar voice interaction method of claim 1, wherein said fusing and outputting said answering voice information and said avatar through a projection device for avatar voice interaction comprises:
determining a sound source position from the position of the microphone device;
collecting user image information of a target user at a position corresponding to the sound source position through a camera device;
inputting the user image information into a preset face recognition model to obtain face feature information;
and updating the virtual image according to the face feature information, fusing the answering voice information and the updated virtual image and outputting the fused answering voice information and the updated virtual image through projection equipment so as to perform virtual image voice interaction.
8. An avatar voice interaction apparatus, comprising:
the information acquisition module is used for picking up voice information of a user through the microphone device;
the information adjusting module is used for analyzing the user voice information to obtain answering text information, and carrying out voice conversion on the answering text information according to the voiceprint feature information of the user voice information to generate answering voice information;
the image acquisition module is used for acquiring a virtual image corresponding to the user voice information;
and the fusion output module is used for fusing the answering voice information and the virtual image and outputting the fused information and the virtual image through projection equipment so as to perform virtual image voice interaction.
9. A projection device, characterized by comprising:
one or more processors;
a memory; and
one or more application programs, wherein the one or more application programs are stored in the memory and configured to be executed by the processor to implement the avatar voice interaction method of any of claims 1-7.
10. A computer medium having stored thereon a computer program to be loaded by a processor for performing the steps of the avatar voice interaction method of any of claims 1 to 7.
CN202110680196.3A 2021-06-18 2021-06-18 Virtual image voice interaction method and device, projection equipment and computer medium Pending CN113436602A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110680196.3A CN113436602A (en) 2021-06-18 2021-06-18 Virtual image voice interaction method and device, projection equipment and computer medium

Publications (1)

Publication Number Publication Date
CN113436602A 2021-09-24

Family ID: 77756619

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110680196.3A Pending CN113436602A (en) 2021-06-18 2021-06-18 Virtual image voice interaction method and device, projection equipment and computer medium

Country Status (1)

Country Link
CN (1) CN113436602A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110673716A (en) * 2018-07-03 2020-01-10 百度在线网络技术(北京)有限公司 Method, device and equipment for interaction between intelligent terminal and user and storage medium
CN111724789A (en) * 2019-03-19 2020-09-29 华为终端有限公司 Voice interaction method and terminal equipment
CN111124123A (en) * 2019-12-24 2020-05-08 苏州思必驰信息科技有限公司 Voice interaction method and device based on virtual robot image and intelligent control system of vehicle-mounted equipment
CN111325851A (en) * 2020-02-28 2020-06-23 腾讯科技(深圳)有限公司 Image processing method and device, electronic equipment and computer readable storage medium
CN112286366A (en) * 2020-12-30 2021-01-29 北京百度网讯科技有限公司 Method, apparatus, device and medium for human-computer interaction

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114549706A (en) * 2022-02-21 2022-05-27 成都工业学院 Animation generation method and animation generation device
CN114693848A (en) * 2022-03-23 2022-07-01 山西灌木文化传媒有限公司 Method, device, electronic equipment and medium for generating two-dimensional animation
CN114693848B (en) * 2022-03-23 2023-09-12 山西灌木文化传媒有限公司 Method, device, electronic equipment and medium for generating two-dimensional animation
CN114911381A (en) * 2022-04-15 2022-08-16 青岛海尔科技有限公司 Interactive feedback method and device, storage medium and electronic device
CN114911381B (en) * 2022-04-15 2023-06-16 青岛海尔科技有限公司 Interactive feedback method and device, storage medium and electronic device
CN115438212A (en) * 2022-08-22 2022-12-06 蒋耘晨 Image projection system, method and equipment
CN115438212B (en) * 2022-08-22 2023-03-31 蒋耘晨 Image projection system, method and equipment

Similar Documents

Publication Publication Date Title
JP7408048B2 (en) Anime character driving method and related device based on artificial intelligence
CN112379812B (en) Simulation 3D digital human interaction method and device, electronic equipment and storage medium
CN113436602A (en) Virtual image voice interaction method and device, projection equipment and computer medium
CN102332090B (en) Compartmentalizing focus area within field of view
CN112560605B (en) Interaction method, device, terminal, server and storage medium
CN109086860B (en) Interaction method and system based on virtual human
US20150155006A1 (en) Method, system, and computer-readable memory for rhythm visualization
JP2018014094A (en) Virtual robot interaction method, system, and robot
CN113760100B (en) Man-machine interaction equipment with virtual image generation, display and control functions
CN108052250A (en) Virtual idol deductive data processing method and system based on multi-modal interaction
CN109343695A (en) Exchange method and system based on visual human's behavioral standard
CN111741370A (en) Multimedia interaction method, related device, equipment and storage medium
CN109032328A (en) A kind of exchange method and system based on visual human
CN103945140A (en) Method and system for generating video captions
CN111383642A (en) Voice response method based on neural network, storage medium and terminal equipment
CN114419205B (en) Driving method of virtual digital person and training method of pose acquisition model
CN116880701A (en) Multimode interaction method and system based on holographic equipment
CN108681398A (en) Visual interactive method and system based on visual human
CN112149599B (en) Expression tracking method and device, storage medium and electronic equipment
CN113375295A (en) Method for generating virtual character interaction, interaction system, electronic device and medium
CN109087644B (en) Electronic equipment, voice assistant interaction method thereof and device with storage function
CN117370605A (en) Virtual digital person driving method, device, equipment and medium
JP2001128134A (en) Presentation device
CN114425162A (en) Video processing method and related device
CN112767520A (en) Digital human generation method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination