CN117456064A - Method and system for rapidly generating intelligent companion based on photo and short audio - Google Patents
- Publication number
- CN117456064A (application number CN202311228650.7A)
- Authority
- CN
- China
- Prior art keywords
- user
- video
- reply
- character
- audio
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G10L17/02: Speaker identification or verification; preprocessing operations, pattern representation or modelling, feature selection or extraction
- G06T13/40: 3D animation of characters, e.g. humans, animals or virtual beings
- G06V40/10: Human or animal bodies; body parts, e.g. hands
- G06V40/161: Human faces; detection, localisation, normalisation
- G06V40/20: Movements or behaviour, e.g. gesture recognition
- G10L15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
- G06T2207/10016: Image analysis indexing scheme; video, image sequence
- G10L2015/225: Feedback of the input speech
Abstract
The invention provides a method and a system for rapidly generating an intelligent companion from a photo and a short audio clip, in the technical field of digital human video applications. The method uses a facial feature mapping algorithm and a multi-resolution occlusion mask method to match key features between an action-guide video and the character image, predict the behavior of non-portrait regions, and restore missing regions, achieving effective feature fusion and intelligent synthesis of character video. This enables fast, low-cost replication of a person's likeness and saves the cost and time of creating digital character assets through video recording, motion capture, and the like. A time-reverse splicing method processes the video segments in reverse order along the time axis when generating the character video, so that the joins between clips transition smoothly. In addition, by connecting to an intelligent chatbot tool, the invention generates reply text that takes both the current context and the previous dialogue history into account, producing more coherent, accurate, and targeted replies.
Description
Technical Field
The invention relates to the technical field of digital human video applications, and in particular to a method and a system for rapidly generating an intelligent companion from a photo and a short audio clip.
Background
As modern society develops, many families face the problem that, because of work pressure, parents cannot accompany their children's growth for long periods. To address this, many child-companion robot products have come onto the market. These robots use artificial intelligence and machine learning to simulate human actions, speech, and expressions, interact with children, and provide companionship and entertainment. However, current companion robots cannot faithfully reproduce a parent's appearance and voice, cannot simulate a parent's emotions and expressions, and therefore struggle to arouse children's interest and emotional resonance.
Therefore, an important research direction is how to improve the user experience of companion robots: enabling a user to quickly and faithfully reproduce a parent's appearance and voice from nothing more than a simple static photo (such as an ID photo) and a short audio sample, reducing the cost of that reproduction, making the robot more engaging to children, and letting it fully serve its companionship function.
Disclosure of Invention
The invention aims to provide a method and a system for rapidly generating an intelligent companion from a photo and a short audio clip, so as to solve the problems in the prior art.
In a first aspect, a method for rapidly generating an intelligent companion from a photo and a short audio clip comprises the following steps:
Step S1: extract key-point position feature vectors of the user's character image from a photo uploaded by the user, and generate a user character dynamic video from those feature vectors; the user character dynamic video is a sequence of dynamic character changes obtained by driving the user photo with a preset action-guide video.
Step S2: extract the user's voiceprint features from a short audio clip uploaded by the user, and model those features to form a user-specific timbre conversion model.
Step S3: acquire a voice stream file input by an interactor, pass it to an intelligent chat API, and obtain a reply text.
Step S4: feed the reply text obtained in step S3 into the user-specific timbre conversion model of step S2 to generate reply audio with the user's timbre characteristics.
Step S5: based on the reply audio generated in step S4 and the character dynamic change sequence generated in step S1, synthesize a reply video whose lip movements match the pronunciation, and output it to the interactor.
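The data flow of steps S3 to S5 can be sketched as plain function composition. All callables here are hypothetical stand-ins for the models and APIs the patent names (the chat API, the TTS interface, the timbre conversion model, the lip-sync fusion); they are injected as parameters so the flow itself stays readable and testable, not tied to any particular vendor.

```python
def generate_reply_video(voice_stream, persona_video, timbre_model,
                         transcribe, chat, tts, convert_timbre, lip_sync):
    """Steps S3-S5 of the method: interactor audio in, reply video out.

    Every callable is a placeholder for an external component:
    transcribe (speech-to-text), chat (reply-text generation),
    tts (generic speech synthesis), convert_timbre (user-timbre
    inference), lip_sync (lip-motion fusion with the persona video).
    """
    question_text = transcribe(voice_stream)                   # S3: speech -> text
    reply_text = chat(question_text)                           # S3: chatbot reply
    neutral_audio = tts(reply_text)                            # S4: generic TTS audio
    reply_audio = convert_timbre(neutral_audio, timbre_model)  # S4: apply user timbre
    return lip_sync(persona_video, reply_audio)                # S5: fuse lip motion
```

With stub callables this runs end to end, which is how the pipeline's wiring can be unit-tested before real models are plugged in.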
In a further embodiment of the first aspect, extracting the key-point feature vectors of the user's character image from the uploaded photo and generating the user character dynamic video specifically comprises the following steps:
Step S11: acquire a photo containing a frontal, full-face image of the user.
Step S12: perform character detection with the S3FD algorithm, locate the character region in the photo, and crop it to obtain the user's character rectangular selection data.
Step S13: drive the cropped character data with a preset action-guide video to perform character pose transfer, generating a set of character dynamic change sequences that together form the user character dynamic video.
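The cropping of step S12 can be illustrated in isolation. The detector itself (S3FD) is an external model and is assumed to have already returned a bounding box; only the rectangular-selection crop is sketched here, with the image represented as a plain 2D list standing in for a real pixel array.

```python
def crop_character(image, bbox):
    """Crop the detected character rectangle from an image.

    image: 2D list of pixel rows (a stand-in for a real array).
    bbox:  (x1, y1, x2, y2) in pixel coordinates, as an S3FD-style
           detector would return for the character region.
    """
    x1, y1, x2, y2 = bbox
    # Slice the row range first, then the column range within each row.
    return [row[x1:x2] for row in image[y1:y2]]
```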
In a further embodiment of the first aspect, driving the user's character rectangular selection data with the preset action-guide video to form the user character dynamic video specifically comprises the following steps:
presetting an action-guide video, which is either a character motion video recorded by a real model or an interview video containing character motion;
feeding the action-guide video and the user's character rectangular selection data together into an image motion transformation model to obtain the user character dynamic video.
The user character dynamic video is obtained through key-feature extraction and matching, feature-weight calculation, and model inference of key points; a time-reverse splicing method then makes the joins between clips smoother and the transitions more natural.
The time-reverse splicing method works as follows: decode the user character dynamic video into a series of image frames and store them in order in a list A; store the same frames in reverse order in a list B, so that if the video has 25 frames, frame 1 goes into the last position, frame 2 into the second-to-last position, and so on, until the original last frame is first; concatenate the sequential list A and the reversed list B into a list C, and re-encode the frames in list C into a new video file.
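The list manipulation above is simple enough to state exactly. Decoding and re-encoding are left to an external tool (e.g. ffmpeg); this sketch covers only the A/B/C frame reordering, which is what makes the clip end where it began and therefore loop without a visible seam.

```python
def time_reverse_splice(frames):
    """Time-reverse splicing of decoded video frames.

    frames: the decoded frames in playback order. Returns the spliced
    sequence (list C): the original order (list A) followed by the
    reversed order (list B), ready to be re-encoded as a new video.
    """
    a = list(frames)   # sequential storage: list A
    b = a[::-1]        # reverse storage: list B (frame 1 ends up last)
    return a + b       # spliced list C
```

For a 3-frame clip `[f1, f2, f3]` this yields `[f1, f2, f3, f3, f2, f1]`, so looping the result returns smoothly to the first frame.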
A character detector extracts key points from the character region and computes key-point descriptors, matching the key features of the action-guide video to the character image.
The image motion transformation model predicts the behavior of non-portrait regions to achieve effective feature fusion, finally yielding a character dynamic video matched to the cropped character image from the user's photo. The multi-resolution occlusion mask method gives finer control over which parts are preserved or occluded during image restoration and synthesis, producing a higher-quality result image.
During training of the image motion transformation model, feature maps at different resolutions emphasize different things: low-resolution feature maps capture abstract structure, while high-resolution feature maps capture detail. Applying a single occlusion mask at every scale therefore biases the training result. To mitigate this, following the TPSMM algorithm (Thin-Plate-Spline Motion Model), a dedicated occlusion mask at the matching resolution is used for each feature-map scale. This is the multi-resolution occlusion mask method, and it yields better prediction results.
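The core idea, one mask per feature scale rather than one shared mask, can be illustrated with a toy downsampler. Average pooling here is a simple stand-in for however the real model derives its lower-resolution masks; the point is only that each scale receives a mask at its own resolution.

```python
def downsample_mask(mask, factor):
    """Average-pool a 2D mask by `factor` (illustrative stand-in for
    producing an occlusion mask at a lower feature-map resolution)."""
    h, w = len(mask) // factor, len(mask[0]) // factor
    out = []
    for i in range(h):
        row = []
        for j in range(w):
            block = [mask[i * factor + di][j * factor + dj]
                     for di in range(factor) for dj in range(factor)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out

def multi_resolution_masks(mask, scales=(1, 2, 4)):
    """One occlusion mask per feature scale, in the spirit of the
    multi-resolution occlusion mask method: instead of reusing the
    full-resolution mask everywhere, each scale gets its own copy
    at the matching resolution."""
    return {s: downsample_mask(mask, s) for s in scales}
```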
In a further embodiment of the first aspect, extracting the user's voiceprint features from the uploaded short audio and modeling them into a user-specific timbre conversion model specifically comprises: extracting the voiceprint features of the uploaded short audio with a HuBERT encoding model, binding the voiceprint feature information to a vocoder for adversarial training, and generating a timbre conversion inference model carrying the user's voiceprint characteristics.
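As a purely illustrative sketch of the shape of this step (the real system would feed the audio through a HuBERT-style encoder, which is far beyond a toy example): the audio is split into overlapping frames, each frame is reduced to a feature value, and the frame-level features are pooled into one utterance-level "voiceprint". Every function and constant here is a hypothetical simplification.

```python
def frame_signal(samples, frame_len, hop):
    """Split a sample list into overlapping frames of length frame_len."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]

def toy_voiceprint(samples, frame_len=4, hop=2):
    """Toy utterance-level feature: each frame becomes its mean absolute
    amplitude, and the frame values are averaged. A real voiceprint
    would be a learned embedding vector, not a scalar."""
    frames = frame_signal(samples, frame_len, hop)
    per_frame = [sum(abs(x) for x in f) / len(f) for f in frames]
    return sum(per_frame) / len(per_frame)
```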
In a further embodiment of the first aspect, acquiring the interactor's voice stream file, passing it to the intelligent chat API, and obtaining the reply text specifically comprises the following steps:
Step S31: the system collects the audio file of the question posed by the interactor.
Step S32: call a Whisper speech-to-text engine interface to convert the audio question into its text form.
Step S33: feed the text question to an intelligent chatbot tool, which produces an appropriate reply text using the context and the interaction history.
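Step S33's use of "context and history interaction" amounts to assembling the prior dialogue into the chat request. A sketch, using the common role/content message convention; the persona wording and field names are illustrative, not a specific vendor's API.

```python
def build_chat_messages(history, question, persona="the child's parent"):
    """Assemble a chat request carrying prior dialogue history.

    history: list of (question, answer) pairs from earlier turns.
    question: the interactor's current text question (from step S32).
    The system-prompt text is a hypothetical example of steering the
    chatbot to answer in the companion's persona.
    """
    messages = [{"role": "system",
                 "content": f"You are a companion speaking as {persona}."}]
    for past_q, past_a in history:
        messages.append({"role": "user", "content": past_q})
        messages.append({"role": "assistant", "content": past_a})
    messages.append({"role": "user", "content": question})
    return messages
```

Carrying the history forward this way is what lets the reply stay coherent with earlier turns rather than answering each question in isolation.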
In a further embodiment of the first aspect, feeding the reply text obtained in step S3 into the user-specific timbre conversion model of step S2 to generate reply audio with the user's timbre characteristics specifically comprises the following steps:
Step S41: obtain the reply text generated in step S33.
Step S42: based on the reply text, select the speaker's gender, volume, and speech-rate parameters, and call a text-to-speech (TTS) synthesis interface to convert the reply text into a corresponding audio stream file.
Step S43: feed the audio stream file into the timbre conversion inference model, select the configuration file and pitch-shift parameters corresponding to the model, and synthesize by inference a reply audio consistent with the user's timbre.
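The parameter selection in steps S42 and S43 can be made concrete as two small request builders. All field names are illustrative assumptions, not any particular TTS or conversion engine's real API.

```python
def tts_request(reply_text, gender="female", volume=80, speed=1.0):
    """Parameters for the TTS call of step S42: speaker gender,
    volume, and speech rate are chosen alongside the text."""
    return {"text": reply_text, "gender": gender,
            "volume": volume, "speed": speed}

def conversion_request(audio_ref, model_id, config_path, pitch_shift=0):
    """Parameters for the timbre-conversion inference of step S43:
    the model's configuration file and a pitch-shift amount are
    selected alongside the input audio stream."""
    return {"audio": audio_ref, "model": model_id,
            "config": config_path, "pitch": pitch_shift}
```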
In a further embodiment of the first aspect, synthesizing a reply video whose lip movements match the pronunciation, based on the reply audio file generated in step S4 and the character dynamic change sequence generated in step S1, and outputting it to the interactor specifically comprises the following steps:
reading the user character dynamic video generated in step S1, traversing each frame and extracting its character feature vector; reading the reply audio with the user's timbre generated in step S4 and extracting its audio spectrum data; and calling a lip-motion transfer algorithm to fuse them into a reply video file whose lip movements are consistent with the pronunciation. A restoration-and-upscaling tool then improves the display quality of the reply video through character-image enhancement: each frame is extracted, the detected portrait is enhanced and restored frame by frame (with several restoration profiles such as sharp, natural, and detail-preserving), non-portrait regions are super-resolved, and finally the processed frames are spliced by ffmpeg into a high-quality reply video.
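The final ffmpeg splice can be expressed as a command-line builder. The flags are standard ffmpeg options (an image-sequence input with a frame rate, a second audio input, H.264 video, AAC audio, and `-shortest` to end when the shorter stream runs out); the paths are placeholders.

```python
def ffmpeg_splice_cmd(frames_pattern, audio_path, out_path, fps=25):
    """Build the ffmpeg invocation that reassembles the enhanced
    frames and the reply audio into the final reply video.

    frames_pattern: printf-style image sequence, e.g. 'frame_%04d.png'.
    """
    return ["ffmpeg", "-y",
            "-framerate", str(fps), "-i", frames_pattern,  # image sequence in
            "-i", audio_path,                              # reply audio in
            "-c:v", "libx264", "-pix_fmt", "yuv420p",      # broadly playable video
            "-c:a", "aac",                                 # encode the audio track
            "-shortest",                                   # stop at the shorter stream
            out_path]
```

The list form is ready to hand to `subprocess.run(cmd)` without shell quoting concerns.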
A second aspect of the present invention proposes an intelligent companion quick generation system, the system comprising:
the character construction module is used for extracting key point position feature vectors of the character of the user according to the pictures uploaded by the user and generating a dynamic video of the character of the user according to the key point position feature vectors; wherein, the user character dynamic video is a character dynamic change sequence of guiding the video according to the preset action by the user photo;
the voiceprint construction module is used for extracting voiceprint characteristics of a user according to short audio uploaded by the user and modeling aiming at the voiceprint characteristics to form a user-specific tone conversion model;
the reply text acquisition module is used for acquiring a voice stream file input by an interactor, importing the voice stream file into the intelligent chat API interface and acquiring a reply text;
the reply audio construction module is used for importing the reply text acquired by the reply text acquisition module into the voiceprint construction module to generate reply audio with the tone characteristics of the user;
and the output module, which synthesizes a reply video with lip movements matched to the pronunciation, based on the reply audio file and the character dynamic change sequence, and outputs it to the interactor.
The invention has the following beneficial effects:
1. The invention uses a facial feature mapping algorithm and a multi-resolution occlusion mask method to match key features between the action-guide video and the character image, predict the behavior of non-portrait regions, and restore missing regions, achieving effective feature fusion and intelligent synthesis of character video. This enables fast, low-cost replication of a person's likeness and saves the cost and time of creating digital character assets through video recording, motion capture, and the like.
2. Using the time-reverse splicing method, the invention processes video clips in reverse order along the time axis when generating the character video, achieving smooth transitions at the joins.
3. By connecting to an intelligent chatbot tool, the invention generates reply text that combines the context with the previous dialogue history, providing more coherent, accurate, and targeted replies.
4. By extracting the user's voiceprint feature vector and training a timbre conversion inference model, the invention achieves fast, low-cost reproduction of a person's voice, lowers the demands on the user's spoken delivery and the barrier to intelligent speech synthesis, personalizes the user experience, improves the interactive communication effect, and strengthens the emotional connection between the intelligent companion and the interactors.
5. Through the lip-motion transfer, facial feature mapping, intelligent semantic generation, and intelligent speech synthesis algorithms, the invention converts the complex, personalized problems of character design, modeling, and interaction into low-cost, low-threshold problems of motion transfer and voice reproduction. This removes the need for users to record video, reduces the user effort required to reproduce their appearance and voice, lowers the generation and usage cost of the intelligent companion, saves modeling labor and materials, and makes creating an intelligent digital avatar simple and convenient enough to enter thousands of households.
Drawings
Fig. 1 is a general flow chart of the method of the present invention.
FIG. 2 is a flow chart of the method of the present invention for reproducing a character video based on a user's uploaded photograph.
FIG. 3 is a flow chart of a method of the present invention for training a sound conversion inference model based on user uploaded audio.
FIG. 4 is a flow chart of the method of the present invention for generating reply text based on the voice stream file input by the interactor in combination with the intelligent chat robot tool.
FIG. 5 is a flow chart of synthesizing reply audio with user tone characteristics according to reply text and tone conversion inference model in the method of the invention.
Fig. 6 is a flow chart of generating a high-quality lip-consistent reply video according to character images and reply audio with user tone characteristics in the method of the invention.
Fig. 7 is a schematic structural diagram of the intelligent companion quick generation system of the present invention.
Description of the embodiments
In the following description, numerous specific details are set forth in order to provide a more thorough understanding of the present invention. It will be apparent, however, to one skilled in the art that the invention may be practiced without one or more of these details. In other instances, well-known features have not been described in detail in order to avoid obscuring the invention.
The applicant's research has found that current companion robots cannot faithfully reproduce a parent's appearance and voice, cannot simulate a parent's emotions and expressions, and therefore struggle to arouse children's interest and emotional resonance.
An important research direction is thus how to improve the user experience of companion robots: enabling a user to quickly and faithfully reproduce a parent's appearance and voice from nothing more than a simple static photo (such as an ID photo) and a short audio sample, reducing the cost of that reproduction, making the robot more engaging to children, and letting it fully serve its companionship function.
Therefore, the applicant provides a low-cost, efficient method for reproducing a parent's appearance and voice, solving problems such as the high cost of generating virtual avatars, the time and labor of connecting motion-capture equipment, and the difficulty and expense of cloning an individual user's voice. By also connecting an intelligent question-answering tool, the system can answer in real time, in the parent's appearance and voice, the various questions an interactor poses, providing a more comprehensive and engaging companionship service.
A first aim of the method is to perform character detection and key-point feature extraction on the photo uploaded by the user, and to match features by comparing the character image against the key-point descriptors of a chosen action-guide video, thereby replicating the character's likeness. This spares the user from recording video, saves shooting costs, lowers the barrier to using the product, and provides an efficient, convenient way to create video content.
A second aim is to integrate intelligent speech synthesis: quickly extract the user's voiceprint features from a short uploaded audio clip and train a timbre conversion inference model, reproducing the character's voice at low cost and high efficiency. This improves the user's interactive experience with the companion robot and strengthens the interactors' sense of familiarity and emotional connection, making communication with the robot feel more natural and intimate.
As shown in fig. 1, the present application provides a method for quickly generating an intelligent companion based on a photo and a short audio, and the specific generation process is as follows:
When using the device for the first time, the user must set up a character image and a voice model, i.e. upload a photo containing the user's face and a clip of the user speaking. Note that some embodiments support initializing multiple users, with each user's uploaded photo corresponding one-to-one to that user's uploaded audio file, so that a dedicated image and voice model are generated for each.
S1, extracting user figure features according to a photo uploaded by a user, and generating a user figure video according to the extracted feature vector; the user character video is a character dynamic change sequence of guiding the video according to a preset action by the user photo.
In some embodiments, the method generates the user character dynamic video according to the user uploading photo, as shown in fig. 2, and specifically includes the following steps: s11, obtaining a photo uploaded by a user; s12, inputting the photo uploaded by the user into a character detection model, extracting character key point feature vectors, and cutting to obtain character region data; s13, inputting preset action guiding videos and cut character frame selection data into an image action transformation model, and calculating and matching key point descriptors to obtain a character dynamic time sequence; and S14, performing time reverse splicing processing on the character dynamic time sequence obtained by model reasoning to obtain a final character dynamic video.
Specifically, step S13 further includes: and extracting key points of the character image area by using a character image detector, calculating a key point descriptor, and realizing the matching of the action guiding video and the key characteristics of the character image.
The image motion transformation model predicts the behavior of non-portrait regions based on the multi-resolution occlusion mask method, finally yielding a character dynamic video matched to the cropped character image from the user's photo. During training, feature maps at different resolutions emphasize different things: low-resolution feature maps capture abstract structure, while high-resolution feature maps capture detail, so applying a single occlusion mask at every scale would bias the training result. To mitigate this, following the TPSMM algorithm (Thin-Plate-Spline Motion Model), a dedicated occlusion mask at the matching resolution is used for each feature-map scale. This is the multi-resolution occlusion mask method, and it yields better prediction results.
The character detection model may specifically be S3FD or the like: the photo containing the user's character is fed into a preset detection model, character key-point feature vectors are extracted, a rectangular character range is framed automatically, and the rectangular selection is cropped and saved. If multiple characters are detected, all rectangular selections are output for the user to choose from, finally determining the desired character; if no character is detected, the initialization is terminated.
The preset action-guide video may be a character motion video recorded by a real model that matches human interaction and communication habits, or another video containing a character, such as an interview video. The action-guide video and the saved cropped character image are fed into the image motion transformation model, and the user character dynamic video is obtained through character-region feature weighting, key-point matching, non-character-region feature fusion, and similar processing. The image motion transformation model may specifically be a Thin-Plate-Spline Motion Model or the like. Because the video finally output to the interactors is built by chaining these user character dynamic videos, the generated videos are concatenated in time-reversed order to obtain the final character dynamic video; this improves the interactive experience, reduces perceptible jitter at frame switches, and improves video smoothness.
S2, extracting the user's voiceprint features from the audio uploaded by the user, and modeling the timbre to form a user-specific timbre conversion model.
In some embodiments, the user-specific timbre conversion inference model is trained from the short audio uploaded by the user, as shown in fig. 3. The specific process is as follows: S21, obtain the audio uploaded by the user; S22, extract the user's voiceprint features with an encoding model; S23, bind the voiceprint feature information to a vocoder for adversarial training; S24, train to produce the user-specific timbre conversion inference model.
The audio file uploaded by the user is either audio recorded by the user from preset texts, with uniform emotion and a consistent pronunciation style, or audio data from everyday conversation.
After the user uploads the audio file, it is processed to reduce environmental noise, remove interference, and enhance the human voice; the relatively clean voice signal extracted in this way is then used for subsequent processing.
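A toy stand-in for this preprocessing step: peak-normalise the waveform, then gate out low-amplitude samples (a crude noise gate). Real systems would use spectral denoising and voice-activity detection; this only illustrates the shape of the step. Samples are plain floats in [-1, 1].

```python
def clean_voice(samples, gate=0.05):
    """Peak-normalise a waveform and zero out samples below the gate threshold."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)
    normalised = [s / peak for s in samples]                    # peak normalisation
    return [s if abs(s) >= gate else 0.0 for s in normalised]   # noise gate

out = clean_voice([0.0, 0.01, 0.5, -0.25])
```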
The encoding model extracts the user's voiceprint feature vector from the uploaded audio and is used to learn the user's pronunciation patterns. The encoding model may specifically be HuBERT or similar. The voiceprint feature information is bound to a vocoder for adversarial training, yielding the user-specific timbre conversion inference model. The vocoder may specifically be NSF-HiFiGAN or similar.
The system automatically obtains the user id established during initialization, retrieves the user character image id and the timbre conversion inference model id bound to that user id, and performs subsequent interactions based on them.
S3, obtaining the voice stream file input by the interactor, and obtaining a reply text by means of an intelligent chatbot tool.
In some embodiments, the voice file input by the interactor is obtained and a suitable reply text is produced, as shown in fig. 4. The specific process is as follows: S31, the system collects the question voice file posed by the interactor; S32, the Whisper speech-to-text engine interface is called to convert the audio question into a text question; S33, the text question is input into the intelligent chatbot tool, which produces a suitable reply text using context and interaction history.
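The S31–S33 flow can be sketched with the speech-to-text engine and the chat model injected as plain callables, so the pipeline can be exercised without network access. In a real deployment `transcribe` might wrap a Whisper interface and `chat` a large-language-model API; both names and the stub engines below are hypothetical placeholders.

```python
def answer_question(audio_bytes, transcribe, chat, history):
    """S31-S33: audio question -> transcribed text -> contextual reply."""
    question = transcribe(audio_bytes)          # S32: audio -> text
    history.append({"role": "user", "content": question})
    reply = chat(history)                       # S33: text + history -> reply text
    history.append({"role": "assistant", "content": reply})
    return reply

# Stub engines for demonstration only.
history = []
reply = answer_question(
    b"...",                                     # placeholder audio payload
    transcribe=lambda audio: "What is the capital of France?",
    chat=lambda msgs: "Paris.",
    history=history,
)
```

Keeping the history list outside the function is what lets the chat model answer "through context and history interaction" across turns.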
The intelligent chatbot tool accessed by this method is a large language model, such as OpenAI's GPT-3.5 or GPT-4, or GLM. It supports knowledge question answering in fields such as history, geography, science, culture, the humanities, and sports, and declines to answer questions about political tendencies, highly specialized topics, user privacy, and the like.
S4, obtaining the reply text of step S3, converting it into an audio stream file, and generating reply audio with the user's timbre characteristics by means of the timbre conversion inference model.
In some embodiments, reply audio with the user's timbre characteristics is generated from the reply text, as shown in fig. 5. The specific process is as follows: S41, obtain the reply text generated in step S3; S42, call a text-to-speech (TTS) synthesis interface to convert the reply text into its corresponding audio stream file; S43, input the audio stream file into the timbre conversion inference model to infer and synthesize reply audio consistent with the user's timbre.
In step S42 above, parameters such as speaker gender, volume, and speaking rate are selected according to the reply text, the text-to-speech (TTS) synthesis interface is called, and the reply text is converted into its corresponding audio stream file.
In step S43 above, the audio stream file is input into the timbre conversion inference model, the configuration file, pitch, and other parameters corresponding to the inference model are selected, and reply audio consistent with the user's timbre is inferred and synthesized.
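Assembling the S42 call parameters might look like the following sketch. The parameter names and value ranges are illustrative assumptions, not any specific TTS engine's API; a real synthesis interface defines its own schema.

```python
def build_tts_params(text, gender="female", volume=1.0, rate=1.0):
    """Validate and package the speaker parameters selected in step S42."""
    if gender not in ("female", "male"):
        raise ValueError("gender must be 'female' or 'male'")
    if not 0.0 < volume <= 2.0 or not 0.5 <= rate <= 2.0:
        raise ValueError("volume/rate outside the supported range")
    return {"text": text, "gender": gender, "volume": volume, "rate": rate}

params = build_tts_params("Hello, nice to meet you.", gender="male", rate=1.2)
```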
S5, synthesizing, from the reply audio file bearing the user's timbre characteristics and the user character video file, a reply video whose lip shape is fully consistent with the pronunciation, and outputting it to the interactor.
In some embodiments, the reply audio file and the character dynamic video are fused to generate the reply video for the interactor, as shown in fig. 6. The specific process is as follows: S51, obtain the reply audio file generated in step S4 and the user character dynamic video file generated in step S1; S52, synchronously input the audio file and the video file into a lip action migration model, fusing the character dynamic video with the reply audio bearing the user's timbre characteristics to form a reply video whose lip actions are consistent with the pronunciation of the reply content; S53, enhance and restore the portrait region of the reply video, super-resolve the non-portrait region, and synthesize a high-quality reply video output to the interactor.
In step S52 above, for the user character dynamic video, each frame is traversed and its character feature vector is extracted one by one; the reply audio with the user's timbre generated in step S4 is read and its audio spectrum data extracted; the lip action migration algorithm is then called to generate a reply video file in which lip shape and pronunciation are consistent. The lip action migration algorithm is based on a fully pre-trained tool, obtained by training a deep learning model such as a recurrent neural network on a large amount of paired audio and lip-motion data; specifically, Wav2Lip with a GAN discriminator (Wav2Lip + GAN) or similar may be used.
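Invoking such a pre-trained lip-sync tool typically happens through its command-line inference script. The flags below follow the public Wav2Lip repository's inference script as commonly documented; the paths and checkpoint name are assumptions for illustration, and the command is only constructed here, not executed.

```python
def wav2lip_command(face_video, reply_audio, outfile,
                    checkpoint="checkpoints/wav2lip_gan.pth"):
    """Build the command line for a Wav2Lip-style inference script (assumed flags)."""
    return ["python", "inference.py",
            "--checkpoint_path", checkpoint,
            "--face", face_video,       # user character dynamic video (S1)
            "--audio", reply_audio,     # timbre-converted reply audio (S4)
            "--outfile", outfile]       # lip-synced reply video

cmd = wav2lip_command("user_dynamic.mp4", "reply.wav", "reply_video.mp4")
# subprocess.run(cmd, check=True)  # would launch the actual synthesis
```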
In step S53 above, the portrait region of the reply video is enhanced and restored, the non-portrait region is super-resolved, and a high-quality reply video is synthesized and output to the interactor. A restoration-and-upscaling tool is used to improve the display quality of the reply video: each frame is extracted, and detected portraits are enhanced and restored frame by frame, with several restoration profiles available, such as sharpness, naturalness, and detail preservation. Portrait enhancement and restoration may specifically use Real-ESRGAN, CodeFormer, GPEN, or similar. Super-resolution of the non-portrait region in each frame may specifically use Real-ESRNet or similar at 1x, 2x, or 4x magnification; finally the processed frames are stitched into the high-quality reply video with ffmpeg.
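The final stitching step can be sketched as an ffmpeg invocation that re-encodes the processed frames together with the reply audio. The frame-pattern, frame rate, and codec settings are illustrative assumptions; the command is only constructed here, not run.

```python
def ffmpeg_stitch_command(frame_pattern, audio_path, out_path, fps=25):
    """Build an ffmpeg command that muxes processed frames with the reply audio."""
    return ["ffmpeg", "-y",
            "-framerate", str(fps), "-i", frame_pattern,   # processed video frames
            "-i", audio_path,                              # reply audio track
            "-c:v", "libx264", "-pix_fmt", "yuv420p",      # widely playable encoding
            "-shortest", out_path]                         # stop at the shorter stream

cmd = ffmpeg_stitch_command("frames/%05d.png", "reply.wav", "reply_hq.mp4")
# subprocess.run(cmd, check=True)  # would perform the actual stitching
```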
As shown in fig. 7, this embodiment further proposes an intelligent companion quick-generation system 600, which includes a character construction module 601, a voiceprint construction module 602, a reply text acquisition module 603, a reply audio construction module 604, and an output module 605. The character construction module 601 is configured to extract the key-point feature vector of the user's character from the photo uploaded by the user, and to generate the user character dynamic video from that feature vector; the user character dynamic video is the sequence of character dynamic changes obtained by driving the user photo with the preset action-guiding video. The voiceprint construction module 602 is configured to extract the user's voiceprint features from the short audio uploaded by the user and to model them, forming the user-specific timbre conversion model. The reply text acquisition module 603 is configured to obtain the voice stream file input by the interactor, import it into the intelligent chat API interface, and obtain the reply text. The reply audio construction module 604 is configured to import the reply text acquired by the reply text acquisition module 603 into the voiceprint construction module, generating reply audio with the user's timbre characteristics. The output module 605 synthesizes a reply video whose lip shape matches the pronunciation, based on the reply audio file and the sequence of character dynamic changes, and outputs it to the interactor.
As described above, although the present invention has been shown and described with reference to certain preferred embodiments, this is not to be construed as limiting the invention. Various changes in form and detail may be made without departing from the spirit and scope of the invention as defined by the appended claims.
Claims (11)
1. A method for rapidly generating an intelligent companion based on a photo and short audio, characterized by comprising the following steps:
step S1, extracting a key-point position feature vector of the user's character from a photo uploaded by the user, and generating a user character dynamic video from the key-point position feature vector; wherein the user character dynamic video is a sequence of character dynamic changes driven from the user photo by a preset action-guiding video;
step S2, extracting the user's voiceprint features from short audio uploaded by the user, and modeling the voiceprint features to form a user-specific timbre conversion model;
step S3, obtaining a voice stream file input by an interactor, importing it into an intelligent chat API interface, and obtaining a reply text;
step S4, importing the reply text obtained in step S3 into the user-specific timbre conversion model of step S2, and generating reply audio with the user's timbre characteristics;
and step S5, synthesizing, based on the reply audio file generated in step S4 and the sequence of character dynamic changes generated in step S1, a reply video whose lip shape matches the pronunciation, and outputting it to the interactor.
2. The method for rapidly generating an intelligent companion based on a photo and short audio according to claim 1, wherein step S1 further comprises:
step S11, obtaining a photo containing a full frontal face image of the user;
step S12, detecting the character image region in the photo, and cropping it to obtain the user character rectangular selection data;
and step S13, executing a character pose migration operation on the user character rectangular selection data by means of a preset action-guiding video, generating several sequences of character dynamic changes, which together form the user character dynamic video.
3. The method for rapidly generating an intelligent companion based on a photo and short audio according to claim 2, wherein step S13 further comprises:
presetting an action-guiding video, the action-guiding video being a character dynamic video recorded by a real model or an interview video containing character actions;
synchronously inputting the action-guiding video and the user character rectangular selection data into an image motion transformation model, finally obtaining the user character dynamic video;
and smoothing the joints between segments of the user character dynamic video by a temporal-reverse splicing method.
4. The method for rapidly generating an intelligent companion based on a photo and short audio according to claim 3, wherein step S13 further comprises:
extracting key points of the character image region with a character image detector, computing key-point descriptors, and matching the action-guiding video to the key features of the character image;
and predicting the behavior of the non-portrait region with the image motion transformation model, finally obtaining the character dynamic video matched to the character image cropped from the user photo.
5. The method for rapidly generating an intelligent companion based on a photo and short audio according to claim 1, wherein step S2 further comprises:
step S21, obtaining short audio uploaded by the user;
step S22, extracting the user's voiceprint features with an encoding model;
step S23, binding the voiceprint feature information to a vocoder for adversarial training;
and step S24, training to generate a user-specific timbre conversion inference model.
6. The method for rapidly generating an intelligent companion based on a photo and short audio according to claim 1, wherein step S3 further comprises:
step S31, the system collects the question voice file posed by the interactor;
step S32, a Whisper speech-to-text engine interface is called to convert the audio question into a text question;
and step S33, the text question is input into an intelligent chatbot tool, which produces a suitable reply text using context and interaction history.
7. The method for rapidly generating an intelligent companion based on a photo and short audio according to claim 6, wherein step S4 further comprises:
step S41, obtaining the reply text generated in step S33;
step S42, calling a text-to-speech (TTS) synthesis interface to convert the reply text into its corresponding audio stream file;
and step S43, inputting the audio stream file into the timbre conversion inference model to infer and synthesize reply audio consistent with the user's timbre.
8. The method for rapidly generating an intelligent companion based on a photo and short audio according to claim 7, wherein in step S42: the speaker gender, volume, and speaking-rate parameters are selected, the text-to-speech (TTS) synthesis interface is called, and the reply text is converted into its corresponding audio stream file;
and in step S43, the audio stream file is input into the timbre conversion inference model, and the configuration file and pitch parameters corresponding to the inference model are selected to infer and synthesize reply audio consistent with the user's timbre.
9. The method for rapidly generating an intelligent companion based on a photo and short audio according to claim 7, wherein step S5 further comprises:
step S51, obtaining the reply audio file generated in step S43 and the user character dynamic video file generated in step S1; traversing each frame of the video and extracting each frame's character feature vector one by one;
step S52, synchronously inputting the audio file and the video file into a lip action migration model, fusing the character dynamic video with the reply audio bearing the user's timbre characteristics to form a reply video whose lip actions are consistent with the pronunciation of the reply content;
and step S53, enhancing and restoring the portrait region of the reply video, super-resolving the non-portrait region, synthesizing a high-quality reply video, and outputting it to the interactor.
10. The method for rapidly generating an intelligent companion based on a photo and short audio according to claim 3, wherein smoothing the joints between segments of the user character dynamic video by a temporal-reverse splicing method specifically comprises:
decoding the user character dynamic video into a series of image frames and storing them in order in a list A;
storing the image frames in a list B in reverse order: if the current video has 25 frames, the 1st frame is placed last, the 2nd frame second to last, and so on, with the original last frame placed first;
and splicing the sequential list A and the reversed list B into a list C, re-encoding the image frames in list C, and outputting a new video file.
11. An intelligent companion quick-generation system, characterized by comprising:
a character construction module, configured to extract a key-point position feature vector of the user's character from a photo uploaded by the user and generate a user character dynamic video from the key-point position feature vector; wherein the user character dynamic video is a sequence of character dynamic changes driven from the user photo by a preset action-guiding video;
a voiceprint construction module, configured to extract the user's voiceprint features from short audio uploaded by the user and model the voiceprint features to form a user-specific timbre conversion model;
a reply text acquisition module, configured to obtain a voice stream file input by an interactor, import it into an intelligent chat API interface, and obtain a reply text;
a reply audio construction module, configured to import the reply text acquired by the reply text acquisition module into the voiceprint construction module, generating reply audio with the user's timbre characteristics;
and an output module, which synthesizes a reply video whose lip shape matches the pronunciation based on the reply audio file and the sequence of character dynamic changes, and outputs it to the interactor.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311228650.7A CN117456064A (en) | 2023-09-22 | 2023-09-22 | Method and system for rapidly generating intelligent companion based on photo and short audio |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117456064A true CN117456064A (en) | 2024-01-26 |
Legal Events
Date | Code | Title | Description
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |