CN117612529A - Virtual digital person interaction method and device - Google Patents

Virtual digital person interaction method and device

Info

Publication number
CN117612529A
Authority
CN
China
Prior art keywords
voice
virtual digital
text
digital person
semantic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311612001.7A
Other languages
Chinese (zh)
Inventor
张金山
张桓
尹建伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
School of Software Technology of ZJU
Original Assignee
Zhejiang University ZJU
School of Software Technology of ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.): 2023-11-29
Filing date: 2023-11-29
Publication date: 2024-02-27
Application filed by Zhejiang University ZJU and School of Software Technology of ZJU
Priority to CN202311612001.7A
Publication of CN117612529A
Legal status: Pending (current)


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/08 Speech classification or search
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/1822 Parsing for meaning understanding
    • G10L15/26 Speech to text systems
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention discloses a virtual digital person interaction method and device. The method initializes a semantic-incompleteness flag and collects voice and images; uses a voice activity detection algorithm to detect whether the noise-reduced valid user sound is speech; converts the user's speech at the client into text; preprocesses the text according to the semantic-incompleteness flag and judges whether the saved text's semantics are complete; generates a streaming reply, or a clarifying question aimed at completing the semantics, and updates the flag accordingly; generates speech, adds it to the virtual digital person's to-be-generated list, and generates consecutive frames of the virtual digital person's head and body either talking or silent; and transmits the generated images and speech to the client for display before looping back to the collection stage. The invention reduces the rate at which noise is misrecognized, ensures that user sentences are semantically complete, and improves the speed of speech generation.

Description

Virtual digital person interaction method and device
Technical Field
The invention relates to the field of artificial intelligence, and in particular to a virtual digital person interaction method and device.
Background
With the development of artificial intelligence, the concept of the virtual digital person has emerged. A virtual digital person is a virtual character with a digitized appearance that appears on the screen of a computer or mobile device as a simulated 2D or 3D human image. Virtual digital people are widely used in many fields, including games, virtual reality, and human-machine interaction.
However, although virtual digital person rendering methods continue to multiply, interaction with the user still lacks realism. Users often cannot experience natural, near-real interaction with a virtual digital person, so the virtual digital person feels false and unconvincing to them. Improving the realism of virtual digital people is therefore one of the problems that currently needs to be solved.
In virtual digital person interaction scenarios, the following problems arise: 1. the virtual digital person reacts to any human voice picked up by the microphone in the scene, so speech that is not directed at it may still be received and answered, degrading the user experience; 2. the virtual digital person's speech is generally generated from the entire reply text, so as the text grows longer, speech generation slows down, the virtual digital person's response time increases, and the user perceives greater response delay.
Disclosure of Invention
The invention aims to provide a virtual digital person interaction method and device that address the shortcomings of existing virtual digital person interaction methods.
The invention is realized by the following technical scheme: the invention discloses a virtual digital person interaction method comprising the following steps:
S1, initializing a semantic-incompleteness flag;
S2, running the virtual digital person interaction loop, specifically comprising:
S2.1, collecting sound and images, judging via a gaze-on-screen detection algorithm whether the sound collected at that moment is valid user sound, and denoising the valid user sound;
S2.2, using a voice activity detection algorithm to detect whether the denoised valid user sound is speech, until the duration of the detected speech, spliced with previously accumulated speech, is greater than or equal to the recognition threshold;
S2.3, recognizing the speech segments exceeding the recognition threshold; if the speech can be recognized, saving it as text, and if it cannot, discarding the speech and ending the current loop iteration;
S2.4, preprocessing according to the semantic-incompleteness flag and judging whether the saved text's semantics are complete; for semantically complete speech, generating a streaming reply, truncating it at punctuation marks, and setting the semantic-incompleteness flag to False; for semantically incomplete speech, generating a clarifying question aimed at completing the semantics, truncating it at punctuation marks, and setting the semantic-incompleteness flag to True;
S2.5, generating speech from the punctuation-truncated text, adding it to the virtual digital person's to-be-generated list, and generating consecutive frames from that list: consecutive frames of the virtual digital person's head and body either talking or silent;
S2.6, transmitting the generated images and speech to the client for display, then returning to step S2.1 to collect sound and images again.
Further, initializing the semantic-incompleteness flag means setting the semantic-incompleteness flag to False.
Further, the recognition threshold is 0.6 seconds.
Further, in S2.4, preprocessing according to the semantic-incompleteness flag specifically comprises:
taking the text saved in step S2.3 as the first text; if the semantic-incompleteness flag is True, merging the first text with the previously saved third text to form the second text; and if the flag is False, taking the first text as the second text.
further, in S2.4, it is determined whether the saved text semantics are complete or not:
judging whether the semantics of the second text is complete or not by using a full duplex sentence breaking module, if the semantics are incomplete, the meaning expressed by a user cannot be known, generating a question sentence aiming at the complete semantics and saving the question sentence to a fifth text, and simultaneously, cutting the fifth text by using a punctuation mark cutting algorithm to obtain a plurality of sixth texts, saving the second text to a third text, and setting a semantic incomplete mark as True;
if so, setting a semantic incomplete mark as False to indicate that the semantic is complete, storing the second text into a fourth text, and generating a stream reply to the fourth text by using a language big model to obtain a fifth text, and cutting when punctuation marks are encountered to obtain a cut sixth text.
Further, in S2.5, generating speech from the punctuation-truncated text and adding it to the virtual digital person's to-be-generated list specifically comprises: rapidly synthesizing each sixth text with the FastSpeech acoustic model and the HiFi-GAN vocoder to obtain the virtual digital person's speech, and pushing each generated utterance to the to-be-generated list queue of the virtual digital person generation module as soon as it is produced.
Further, generating consecutive frames from the to-be-generated list in S2.5 specifically comprises: fetching the virtual digital person's input speech from the to-be-generated list; if the list is not empty, mapping each frame of speech to a phoneme vector via a wav2vec 2.0 model and combining it with other parameters as input to a neural radiance field, generating consecutive frames of the virtual digital person's head and body talking; if the list is empty, feeding the silence-frame phoneme vector into the neural radiance field to obtain consecutive frames of the virtual digital person's head and body silent with mouth closed.
Further, in S2.6, transmitting the generated images and speech to the client for display specifically comprises: packing each frame of image generated in step S2.5 into JSON format; for the first frame of the virtual digital person speaking, the JSON stores under separate keys the sixth text, the duration of the virtual digital person's speech, and the image content; if the frame is not the first frame of the virtual digital person speaking, or shows the virtual digital person in the silent standby state, the JSON contains only the image content and its key; the packed JSON data is transmitted to the client via a Socket;
when the client receives the transmitted data, it decodes the JSON packet and judges whether this is the first frame of the virtual digital person speaking by checking for the speech-duration key; if so, the image content is displayed via PyQt6's QLabel and QPixmap components, the reply text is shown via a QLabel for the duration of the speech, and the speech is played via pygame's mixer module for that duration, the displayed reply text being cleared when the duration elapses; if it is not the first frame, only the image content is displayed.
According to another aspect of the specification, there is provided an apparatus for implementing the virtual digital person interaction method, the apparatus comprising: a voice collection module, an image collection module, a speech recognition module, a full-duplex sentence-breaking module, a large language model module, a speech generation module, a virtual digital person generation module, and a client display module. The voice collection module collects the user's speech audio; the image collection module collects the user's face images; the speech recognition module converts the user's speech audio into text; the full-duplex sentence-breaking module judges whether the user's semantics are complete; the large language model module replies to the user's questions and produces streaming answers; the speech generation module generates the virtual digital person's speech; the virtual digital person generation module generates real-time images of the virtual digital person's head and body; and the client display module displays the virtual digital person's images and plays its speech.
According to another aspect of the specification, there is provided a computer-readable storage medium having a program stored thereon which, when executed by a processor, implements the virtual digital person interaction method.
The beneficial effects of the invention are as follows:
1. A gaze-on-screen detection algorithm detects whether the user is looking at the virtual digital person on the screen; when the user is not looking, the virtual digital person does not respond to any sound, which reduces the rate at which noise is misrecognized;
2. The full-duplex sentence-breaking module judges whether the semantics are complete, which reduces the rate at which the downstream large language model misunderstands incomplete sentences, ensures that user sentences are semantically complete, and lets the virtual digital person better answer the user's questions;
3. When the large language model produces a streaming reply, truncating at punctuation marks keeps each speech-synthesis input sentence short, which improves speech generation speed and reduces the virtual digital person's response time;
4. The virtual digital person's generated frames are transmitted to the client display front end in real time, which reduces the virtual digital person's response delay.
Drawings
FIG. 1 is an interactive cycle flow chart of a virtual digital human interaction method provided by an embodiment of the invention;
FIG. 2 is a block diagram of a speech acquisition implementation provided by an embodiment of the present invention;
FIG. 3 is a block diagram of a speech recognition implementation provided by an embodiment of the present invention;
FIG. 4 is a block diagram of the full-duplex sentence-breaking module determining whether semantics are complete, provided by an embodiment of the present invention;
FIG. 5 is a block diagram of a language big model module generation reply implementation provided by an embodiment of the present invention;
FIG. 6 is a block diagram of a speech generation implementation provided by an embodiment of the present invention;
FIG. 7 is a block diagram of a virtual digital person generation implementation provided by an embodiment of the present invention;
FIG. 8 is a block diagram of a data transfer and client presentation implementation according to an embodiment of the present invention.
Detailed Description
In order that the above objects, features and advantages of the invention will be readily understood, a more particular description of the invention will be rendered by reference to the appended drawings.
In the following description, numerous specific details are set forth to provide a thorough understanding of the present invention, but the present invention may be practiced in ways other than those described herein, and persons skilled in the art will readily appreciate that it is not limited to the specific embodiments disclosed below.
The invention realizes virtual digital person interaction as follows: a speech recognition module converts the user's speech into text; the converted text is fed into a full-duplex sentence-breaking module to judge whether the user's sentence is complete; if complete, the text is fed into the large language model module; if incomplete, the user's words are recorded and the user is queried until the sentence is complete. After obtaining the reply text, or the clarifying question aimed at semantic completeness, the speech generation module produces the virtual digital person's speech, which is passed into the virtual digital person generation module to obtain consecutive frames of the virtual digital person's head and body talking. Finally, the large language model's reply text, the virtual digital person's speech, the duration of that speech, and the talking frames are transmitted to the client display module, which presents the corresponding content to the user, achieving real-time interaction between the user and the virtual digital person.
The virtual digital person interaction method provided by this embodiment comprises the following steps:
S1, initializing the semantic-incompleteness flag to False;
S2, running the virtual digital person interaction loop, as shown in FIG. 1, specifically:
S2.1, collecting sound and images, judging via a gaze-on-screen detection algorithm whether the sound collected at that moment is valid user sound, and denoising the valid user sound; specifically:
In the client, the recording function is started with the pyaudio package and sound is monitored continuously; meanwhile, the camera is started with the cv2 package, continuously capturing user frames at 25 FPS. A gaze-on-screen detection algorithm, implemented by detecting the eye-center direction from the 68 facial landmark coordinates of a 3DMM, judges each captured frame: if the eyes are gazing at the virtual digital person on the screen, the sound being monitored is valid user sound; otherwise it is invalid. Each frame (20 ms) of valid user sound is then denoised with a noise reduction algorithm implemented with the noisereduce package, yielding denoised valid user sound.
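For illustration only, a minimal Python sketch of this collection stage follows. It assumes a 16 kHz mono microphone stream (a sample rate the patent does not specify) and a hypothetical gaze_on_screen(frame) -> bool helper standing in for the 3DMM-landmark eye-center detector.

```python
import cv2
import numpy as np
import noisereduce as nr
import pyaudio

RATE = 16000                       # assumed sample rate (not given in the patent)
FRAME_MS = 20                      # the 20 ms frame named in the text
FRAME_SAMPLES = RATE * FRAME_MS // 1000

def collect_valid_sound(gaze_on_screen):
    """Yield denoised 20 ms frames, but only while the user gazes at the screen."""
    pa = pyaudio.PyAudio()
    mic = pa.open(format=pyaudio.paInt16, channels=1, rate=RATE,
                  input=True, frames_per_buffer=FRAME_SAMPLES)
    cam = cv2.VideoCapture(0)      # 25 FPS capture is configured on the camera side
    try:
        while True:
            ok, frame = cam.read()
            audio = np.frombuffer(mic.read(FRAME_SAMPLES), dtype=np.int16)
            if not ok or not gaze_on_screen(frame):
                continue           # invalid user sound: user is not looking at the avatar
            # per-frame noise reduction via the noisereduce package
            yield nr.reduce_noise(y=audio.astype(np.float32), sr=RATE)
    finally:
        mic.stop_stream(); mic.close(); pa.terminate(); cam.release()
```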
s2.2, detecting whether the effective user sound after noise reduction is voice or not by using a voice activity detection algorithm until the duration after the detected voice is spliced with the voice accumulated before is more than or equal to an identification threshold value, wherein the voice activity detection algorithm specifically comprises the following steps:
performing two-layer nested detection by using a voice activity detection algorithm realized by a webrtmvad packet and a voice activity detection algorithm realized by a pynnote packet, judging whether the effective user voice after noise reduction in S2.1 is a voice of a human speaking, if so, inputting the voice of the human speaking after the concatenation into a voice recognition module if the voice of the human speaking is the voice of the human speaking and the time length of the voice after the voice of the human speaking which is accumulated and received before is spliced is more than or equal to 0.6 seconds; if not, continuing to monitor and record until the voice activity detection algorithm detects that the accumulated and spliced voice of the person speaking is greater than or equal to 0.6 seconds, as shown in fig. 2;
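A sketch of this accumulation loop, under the same assumptions: webrtcvad provides the first detection layer, while second_stage_is_speech is a hypothetical callable standing in for the pyannote-based second layer (whose real interface is a full pipeline and is not shown here); the aggressiveness setting is also an assumption.

```python
import numpy as np
import webrtcvad

RATE = 16000
RECOG_THRESHOLD_S = 0.6            # recognition threshold from the patent

def accumulate_speech(frames, second_stage_is_speech=None):
    """Splice detected speech frames; yield once at least 0.6 s has accumulated."""
    vad = webrtcvad.Vad(2)         # aggressiveness 0 (least) to 3 (most); 2 is assumed
    spliced = bytearray()
    for frame in frames:           # 20 ms frames from the collection stage
        pcm = np.asarray(frame).astype(np.int16).tobytes()  # back to int16 PCM
        if not vad.is_speech(pcm, RATE):
            continue               # first layer: not speech
        if second_stage_is_speech and not second_stage_is_speech(frame):
            continue               # second layer: not speech
        spliced.extend(pcm)        # splice with previously accumulated speech
        if len(spliced) / 2 / RATE >= RECOG_THRESHOLD_S:
            yield bytes(spliced)   # hand the spliced segment to recognition
            spliced = bytearray()
```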
s2.3, recognizing the voice fragments larger than the recognition threshold, if the voice fragments can be recognized, storing the voice as a text, and if the voice fragments cannot be recognized, discarding the voice and ending the cycle, specifically:
the voice recognition module utilizes a Conformer deep learning model to recognize the voice of the effective person speaking obtained in the step 1.2, if the voice can be recognized as a text, the text is saved as a first text, and the first text is input into the full duplex sentence-breaking module; if not, discarding the voice of the current speaker, namely exiting the cycle, as shown in fig. 3;
s2.4, preprocessing according to the incomplete flag bit of the semantic, judging whether the saved text semantic is complete, generating streaming reply to the voice with the complete semantic, cutting off the punctuation, and setting the incomplete flag bit of the semantic as false; generating a question aiming at complete semantics for the voices without complete semantics, performing punctuation and cutting off and setting a flag bit with incomplete semantics as wire, wherein the method specifically comprises the following steps:
if the semantic incomplete mark is True, merging the first text with the previous third text to serve as a second text; if the incomplete flag bit of the semantic meaning is False, the first text is used as the second text;
judging whether the semantics of the second text is complete or not by utilizing a BERT Chinese basic model based on a Transformer to judge the fine tuning of the semantic integrity, if not, obtaining the meaning expressed by a user, generating a question sentence aiming at the complete semantics by using a GPT algorithm and saving the question sentence to a fifth text, and simultaneously, cutting off the fifth text by using a punctuation mark cutting algorithm to obtain a plurality of sixth texts, saving the second text to the third text, setting a semantic incompleteness mark as True, directly inputting the plurality of sixth texts in sequence for voice generation without stream type reply generation;
if so, setting a semantic incomplete flag as False to indicate that the semantic is complete without asking again, and storing the second text into a fourth text for stream reply generation, as shown in fig. 4;
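The bookkeeping around the flag and the numbered texts can be captured in a few lines. In this sketch, is_complete stands in for the fine-tuned BERT classifier and make_question for the GPT-based question generator; both names are hypothetical, not APIs named by the patent.

```python
def preprocess_and_route(first_text, state, is_complete, make_question):
    """Merge per the semantic-incompleteness flag and route the text (S2.4)."""
    # merge with the saved third text when the previous utterance was incomplete
    if state["incomplete"]:
        second_text = state["third_text"] + first_text
    else:
        second_text = first_text

    if is_complete(second_text):
        state["incomplete"] = False
        return "reply", second_text              # fourth text -> streaming LLM reply
    state["incomplete"] = True
    state["third_text"] = second_text            # remember the incomplete fragment
    return "ask", make_question(second_text)     # fifth text -> clarifying question
```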
The large language model module generates a streaming reply to the fourth text with the ChatGLM-6b model, obtaining the fifth text; because generation is streaming, the output is truncated whenever a punctuation mark is encountered, and each truncated sixth text is sent to speech generation, as shown in FIG. 5.
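The punctuation truncation that keeps each synthesis input short can be sketched as a generator over the streaming output. The set of punctuation marks is an assumption, and token_stream is any iterable of incremental text pieces, such as those produced by ChatGLM-6b's streaming interface.

```python
PUNCTUATION = set("，。！？；：,.!?;:")   # assumed truncation marks

def punctuation_chunks(token_stream):
    """Cut a streaming reply into short sixth-text segments at punctuation."""
    buf = ""
    for piece in token_stream:
        for ch in piece:
            buf += ch
            if ch in PUNCTUATION:      # truncate as soon as a mark arrives
                yield buf
                buf = ""
    if buf:
        yield buf                      # flush trailing text with no final mark
```

Each yielded chunk can be sent to speech synthesis immediately, so the virtual digital person starts speaking before the full reply has been generated.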
Speech generation: the speech generation module rapidly synthesizes each sixth text with the FastSpeech acoustic model and the HiFi-GAN vocoder to obtain the virtual digital person's speech, and each generated utterance is pushed to the to-be-generated list queue of the virtual digital person generation module as soon as it is produced, as shown in FIG. 6.
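The handoff to the to-be-generated list is naturally a producer over a queue. Here synthesize(text) -> waveform is a hypothetical wrapper around the FastSpeech acoustic model and HiFi-GAN vocoder; the patent does not name a concrete synthesis API.

```python
import queue
import threading

to_generate = queue.Queue()        # the virtual digital person's to-be-generated list

def tts_worker(chunks, synthesize):
    """Synthesize each sixth text and enqueue the waveform as soon as it exists."""
    for chunk in chunks:
        to_generate.put(synthesize(chunk))

# usage sketch: run the producer alongside the frame generator
# threading.Thread(target=tts_worker, args=(chunks, synthesize), daemon=True).start()
```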
s2.5, performing voice generation on the text after punctuation cutting, adding the text into a list to be generated of the virtual digital person, performing continuous frame picture generation according to the list to be generated, and generating talking virtual head body continuous frame pictures or silent virtual head body continuous frame pictures, wherein the method specifically comprises the following steps of:
acquiring input virtual digital person speaking voice in a list to be generated by utilizing a pre-trained meshed nerve radiation field, if the list is not empty, acquiring voice, mapping each frame of voice onto a phoneme vector through a wav2vec2.0 model, and combining other parameters as input of the nerve radiation field to generate continuous frame pictures to obtain continuous frame pictures of the head and the body of the virtual digital person speaking; if the list is not obtained, that is, the list is empty, the task that the virtual digital person does not speak at the moment, that is, the silent state of being in standby at the moment, the silence frame phoneme vector is input into the nerve radiation field, and continuous frame pictures of the head and the body of the virtual digital person, which are not speaking when the mouth is closed, are obtained, as shown in fig. 7;
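The rendering loop can be sketched as follows, with phoneme_vectors(speech) standing in for the wav2vec 2.0 mapping, render(vec, **params) for the neural radiance field, and silence_vec for the silence-frame phoneme vector; all three are hypothetical stand-ins, since the patent does not expose their interfaces.

```python
import queue

def frame_stream(to_generate, phoneme_vectors, render, silence_vec, params):
    """Yield talking frames from queued speech, or silent frames when idle (S2.5)."""
    while True:
        try:
            speech = to_generate.get_nowait()
        except queue.Empty:
            # standby: closed-mouth frame driven by the silence phoneme vector
            yield render(silence_vec, **params), None
            continue
        for vec in phoneme_vectors(speech):      # one vector per speech frame
            yield render(vec, **params), speech  # talking head-and-body frames
```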
s2.6, transmitting the generated images and voice to a client for display, and turning to the step S2.1 for sound collection and image collection again.
Each generated frame of image is packed and integrated through a JSON format every time, and if the frame of image is the first frame of image of the virtual digital person speaking generated in the step six, a sixth text, the virtual digital person speaking voice duration and the image content are stored by keys in the JSON; if the frame of picture is not the first frame of picture of the virtual digital person speaking or is the picture of the virtual digital person in a standby silent state, only picture content and keys thereof exist in the JSON, and keys of other content do not exist, and the packed JSON format data is transmitted to a client display module in a Socket mode;
when JSON format data is received, decoding a JSON packet, judging whether the frame is the first frame of the virtual digital person speaking according to the existence of a virtual digital person speaking voice key, if yes, performing image playing on picture content through a QLabel and QPixmap component of PyQT6, displaying a reply user text with the duration being the speaking voice duration of the virtual digital person through the QLabel, and performing playing with the duration being the speaking voice duration of the virtual digital person through a mixer component of a pygame packet, and when the speaking voice duration of the virtual digital person is reached, clearing the displayed reply user text; if not, the image content is image-played as shown in fig. 8.
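A sketch of this transport convention follows. The JSON key names (image, text, duration) are assumptions for illustration; the patent only specifies that the first talking frame additionally carries the sixth text and the speech duration.

```python
import json
import socket

def send_frame(sock: socket.socket, image_b64: str,
               sixth_text: str | None = None, duration: float | None = None):
    """Pack one frame per S2.6; only the first talking frame carries extra keys."""
    packet = {"image": image_b64}
    if sixth_text is not None:               # first frame of an utterance
        packet["text"] = sixth_text
        packet["duration"] = duration
    sock.sendall((json.dumps(packet) + "\n").encode("utf-8"))

def handle_packet(line: str, show_image, show_text, play_speech):
    """Client side: the presence of the duration key marks the first frame."""
    msg = json.loads(line)
    show_image(msg["image"])                      # e.g. QLabel + QPixmap in PyQt6
    if "duration" in msg:
        show_text(msg["text"], msg["duration"])   # clear the text after `duration`
        play_speech(msg["duration"])              # e.g. pygame's mixer
```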
Another embodiment of the present invention also discloses an apparatus comprising: a voice collection module for collecting the user's speech audio; an image collection module for collecting the user's face images; a speech recognition module that recognizes the valid human speech obtained in step S2.2 with a Conformer deep learning model and converts it into text; a full-duplex sentence-breaking module for judging whether the user's semantics are complete; a large language model module, the ChatGLM-6b model, for replying to the user's questions; a speech generation module comprising the FastSpeech acoustic model and the HiFi-GAN vocoder, for generating the virtual digital person's speech; a virtual digital person generation module for generating real-time images of the virtual digital person's head and body; and a client display module, written with PyQt6, for displaying the virtual digital person's images and playing its speech.
Another embodiment of the present invention also discloses a storage medium having a computer program stored thereon which, when executed by a processor, performs the steps of the virtual digital person interaction method described above.
The computer-readable storage medium may be an internal storage unit of any of the devices with data processing capability described in the previous embodiments, such as a hard disk or memory. It may also be any external storage device of such a device, such as a plug-in hard disk, a Smart Media Card (SMC), an SD card, or a Flash memory card (Flash Card). Further, the computer-readable storage medium may include both the internal storage unit and an external storage device of the device. The computer-readable storage medium is used to store the computer program and the other programs and data required by the device, and may also be used to temporarily store data that has been or will be output.
The above-described embodiments are intended to illustrate the present invention, not to limit it, and any modifications and variations made thereto are within the spirit of the invention and the scope of the appended claims.

Claims (10)

1. A virtual digital person interaction method, the method comprising the following steps:
S1, initializing a semantic-incompleteness flag;
S2, running a virtual digital person interaction loop, specifically comprising:
S2.1, collecting sound and images, judging via a gaze-on-screen detection algorithm whether the sound collected at that moment is valid user sound, and denoising the valid user sound;
S2.2, using a voice activity detection algorithm to detect whether the denoised valid user sound is speech, until the duration of the detected speech, spliced with previously accumulated speech, is greater than or equal to the recognition threshold;
S2.3, recognizing the speech segments exceeding the recognition threshold; if the speech can be recognized, saving it as text, and if it cannot, discarding the speech and ending the current loop iteration;
S2.4, preprocessing according to the semantic-incompleteness flag and judging whether the saved text's semantics are complete; for semantically complete speech, generating a streaming reply, truncating it at punctuation marks, and setting the semantic-incompleteness flag to False; for semantically incomplete speech, generating a clarifying question aimed at completing the semantics, truncating it at punctuation marks, and setting the semantic-incompleteness flag to True;
S2.5, generating speech from the punctuation-truncated text, adding it to the virtual digital person's to-be-generated list, and generating consecutive frames from that list: consecutive frames of the virtual digital person's head and body either talking or silent;
S2.6, transmitting the generated images and speech to the client for display, then returning to step S2.1 to collect sound and images again.
2. The virtual digital person interaction method according to claim 1, wherein initializing the semantic-incompleteness flag comprises: setting the semantic-incompleteness flag to False.
3. The method of claim 1, wherein the recognition threshold is 0.6 seconds.
4. The virtual digital person interaction method according to claim 1, wherein in S2.4, preprocessing according to the semantic-incompleteness flag specifically comprises:
taking the text saved in step S2.3 as the first text; if the semantic-incompleteness flag is True, merging the first text with the previously saved third text to form the second text; and if the flag is False, taking the first text as the second text.
5. The virtual digital person interaction method according to claim 1, wherein in S2.4, judging whether the saved text's semantics are complete specifically comprises:
using a full-duplex sentence-breaking module to judge whether the second text's semantics are complete; if the semantics are incomplete, the user's intended meaning cannot be determined, so a clarifying question aimed at completing the semantics is generated and saved as the fifth text, the fifth text is truncated at punctuation marks to obtain several sixth texts, the second text is saved as the third text, and the semantic-incompleteness flag is set to True;
if the semantics are complete, the flag is set to False to indicate completeness, the second text is saved as the fourth text, a large language model generates a streaming reply to the fourth text as the fifth text, and the reply is truncated whenever a punctuation mark is encountered, yielding the truncated sixth texts.
6. The virtual digital person interaction method according to claim 5, wherein in S2.5, generating speech from the punctuation-truncated text and adding it to the virtual digital person's to-be-generated list specifically comprises: rapidly synthesizing each sixth text with the FastSpeech acoustic model and the HiFi-GAN vocoder to obtain the virtual digital person's speech, and pushing each generated utterance to the to-be-generated list queue of the virtual digital person generation module as soon as it is produced.
7. The virtual digital person interaction method according to claim 1, wherein generating consecutive frames from the to-be-generated list in S2.5 specifically comprises: fetching the virtual digital person's input speech from the to-be-generated list; if the list is not empty, mapping each frame of speech to a phoneme vector via a wav2vec 2.0 model and combining it with other parameters as input to a neural radiance field, generating consecutive frames of the virtual digital person's head and body talking; if the list is empty, feeding the silence-frame phoneme vector into the neural radiance field to obtain consecutive frames of the virtual digital person's head and body silent with mouth closed.
8. The virtual digital person interaction method according to claim 6, wherein in S2.6, transmitting the generated images and speech to the client for display specifically comprises: packing each frame of image generated in step S2.5 into JSON format, wherein for the first frame of the virtual digital person speaking, the JSON stores under separate keys the sixth text, the duration of the virtual digital person's speech, and the image content; if the frame is not the first frame of the virtual digital person speaking, or shows the virtual digital person in the silent standby state, the JSON contains only the image content and its key; and the packed JSON data is transmitted to the client via a Socket;
when the client receives the transmitted data, it decodes the JSON packet and judges whether this is the first frame of the virtual digital person speaking by checking for the speech-duration key; if so, the image content is displayed via PyQt6's QLabel and QPixmap components, the reply text is shown via a QLabel for the duration of the speech, and the speech is played via pygame's mixer module for that duration, the displayed reply text being cleared when the duration elapses; if it is not the first frame, only the image content is displayed.
9. An apparatus for carrying out the method of any one of claims 1-8, the apparatus comprising: a voice collection module, an image collection module, a speech recognition module, a full-duplex sentence-breaking module, a large language model module, a speech generation module, a virtual digital person generation module, and a client display module; wherein the voice collection module collects the user's speech audio, the image collection module collects the user's face images, the speech recognition module converts the user's speech audio into text, the full-duplex sentence-breaking module judges whether the user's semantics are complete, the large language model module replies to the user's questions to obtain streaming answers, the speech generation module generates the virtual digital person's speech, the virtual digital person generation module generates real-time images of the virtual digital person's head and body, and the client display module displays the virtual digital person's images and plays its speech.
10. A computer-readable storage medium on which a program is stored, wherein the program, when executed by a processor, implements the virtual digital person interaction method of any one of claims 1-8.
CN117612529A (en): Virtual digital person interaction method and device. Application CN202311612001.7A, priority and filing date 2023-11-29. Status: Pending.

Priority Applications (1)

Application Number: CN202311612001.7A
Priority Date: 2023-11-29
Filing Date: 2023-11-29
Title: Virtual digital person interaction method and device

Applications Claiming Priority (1)

Application Number: CN202311612001.7A
Priority Date: 2023-11-29
Filing Date: 2023-11-29
Title: Virtual digital person interaction method and device

Publications (1)

Publication Number: CN117612529A (en)
Publication Date: 2024-02-27

Family

ID=89951145

Family Applications (1)

Application Number: CN202311612001.7A
Status: Pending
Publication: CN117612529A (en)

Country Status (1)

Country: CN
Publication: CN117612529A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination