CN115148187B - System implementation method of intelligent character re-engraving terminal - Google Patents

System implementation method of intelligent character re-engraving terminal

Info

Publication number
CN115148187B
CN115148187B (application CN202210773471.0A)
Authority
CN
China
Prior art keywords
target
video
past
deceased
language
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210773471.0A
Other languages
Chinese (zh)
Other versions
CN115148187A (en)
Inventor
司马华鹏
刘杰
周雪兰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Silicon Intelligence Technology Co Ltd
Original Assignee
Nanjing Silicon Intelligence Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Silicon Intelligence Technology Co Ltd
Priority to CN202210773471.0A
Publication of CN115148187A
Application granted
Publication of CN115148187B
Legal status: Active

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/02 Methods for producing synthetic speech; Speech synthesisers
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L 13/086 Detection of language

Abstract

The embodiment of the application provides a system implementation method of an intelligent character re-engraving terminal, which comprises the following steps: acquiring a past image corresponding to a deceased person to be recalled by a user, and generating a face video of the deceased according to the past image, wherein the past image comprises a face image of the deceased, and the face video of the deceased is used for indicating the face image to change expression in a preset manner; training a preset language model according to past language fragments of the deceased to obtain a target language model; inputting a target recall topic into the target language model, and acquiring target text content corresponding to the target recall topic; training a preset speech synthesis model according to past audio fragments of the deceased to obtain a target speech synthesis model; inputting the target text content into the target speech synthesis model to generate target audio; and outputting the target audio to the user in synchronization with the facial expression changes of the deceased in the face video of the deceased.

Description

System implementation method of intelligent character re-engraving terminal
Technical Field
The application relates to the field of intelligent terminals, in particular to a system implementation method of an intelligent character re-engraving terminal.
Background
In a traditional ancestral shrine, the objects that a user worships or makes offerings to, such as portraits of the deceased and memorial tablets, are all static. As the times change, descendants of newer generations cannot learn about the lives of their ancestors, and the ancestors no longer feel vivid; the worship experience is therefore poor, and the sense of estrangement from the ancestors during worship is greatly increased.
Moreover, an existing ancestral shrine reveals only limited information about the deceased, such as a name and dates, or at most a photograph, and it is difficult to preserve the ancestors' history. This increases the difficulty of educating the younger generation about their forebears through ancestor worship. For the young generation, there is often only a vague memory of an ancestor, if any. Over time, this can weaken, and may even break, the bonds within a family.
For the problem in the related art that users feel a sense of estrangement from their ancestors during worship, no effective solution has yet been proposed.
Disclosure of Invention
The embodiment of the application provides a system implementation method of an intelligent character re-engraving terminal, which at least solves the technical problem in the related art that a user feels a sense of estrangement from ancestors during worship.
The application provides a system implementation method of an intelligent character re-engraving terminal, which comprises the following steps: acquiring a past image corresponding to a deceased person to be recalled by a user, and generating a face video of the deceased according to the past image, wherein the past image comprises a face image of the deceased, and the face video of the deceased is used for indicating the face image to change expression in a preset manner; training a preset language model according to past language fragments of the deceased to obtain a target language model; inputting a target recall topic into the target language model, and acquiring target text content corresponding to the target recall topic; training a preset speech synthesis model according to past audio fragments of the deceased to obtain a target speech synthesis model; inputting the target text content into the target speech synthesis model to generate target audio; and outputting the target audio to the user in synchronization with the facial expression changes of the deceased in the face video of the deceased.
In one implementation manner, acquiring a past image corresponding to the deceased person to be recalled by the user and generating a face video of the deceased according to the past image specifically includes: acquiring a past image corresponding to the deceased; inputting the past image into a pre-trained face recognition model, recognizing the face region in the past image of the deceased, and extracting the face image corresponding to the deceased; performing restoration processing on the face image to improve its definition; performing facial expression migration on the face image through a preset driving video to obtain an expression migration video corresponding to the face image; raising the resolution of the expression migration video through super-resolution processing to obtain a super-resolved video; and performing sharpening processing on the super-resolved video to obtain the face video of the deceased.
In one implementation manner, performing facial expression migration on the face image through a preset driving video to obtain an expression migration video corresponding to the face image specifically includes: setting a driving video, wherein the driving video is a video of a real person recorded while performing expression changes according to a preset expression-change mode, or another video comprising the preset expression-change mode; synchronously inputting the driving video and the face image into a pre-trained expression migration model; and migrating the expression changes of the person in the driving video onto the face image through the expression migration model, and obtaining the expression migration video corresponding to the face image.
In one implementation manner, raising the resolution of the expression migration video through super-resolution processing to obtain a super-resolved video specifically includes: setting the initial resolution of the expression migration video; and upscaling the expression migration video frame by frame to raise its resolution to the target resolution.
In one implementation, training a preset language model according to the past language fragments of the deceased to obtain a target language model includes: acquiring past language fragments of the deceased; training the preset language model with the past language fragments of the deceased as training samples to generate a target language model having the language features and language habits of the deceased; and inputting a target recall topic into the target language model to generate target text content corresponding to the target recall topic, wherein the target text content is text content generated in accordance with the language features and language habits of the deceased learned by the target language model.
In one implementation, the past language fragments of the deceased are past communication texts between the deceased and the user provided by the user, or past language material texts of communications about a particular event, or texts written by the deceased.
In one implementation, the language model may also be trained with professional samples from different professional fields, so that the language model generates professional topic texts for the corresponding field.
In one implementation, training a preset speech synthesis model according to the past audio fragments corresponding to the deceased to obtain a target speech synthesis model specifically includes: acquiring past audio fragments corresponding to the deceased; training the preset speech synthesis model with the past audio fragments of the deceased as training samples to generate a target speech synthesis model having the audio features of the deceased; and inputting the target text content into the target speech synthesis model, generating audio corresponding to the target text content and outputting the audio to the user.
In one implementation, the deceased to be recalled by the user may be one person, or the same deceased person at different age ranges, or a plurality of different deceased persons.
In one implementation, the past images corresponding to the deceased further include images of other parts of the deceased, and the images of the other parts are trained through a preset image synthesis model to generate dynamic videos of the other parts of the deceased.
According to the above technical scheme, the system implementation method of the intelligent character re-engraving terminal provided by the application realizes real-time video, audio and language interaction between the user and the recalled person: the recalled person is reconstructed in the three dimensions of video, audio and language style and presented to the user, effective interaction is formed between the user and the recalled person, and the user gains a sense that the recalled person is still present.
Drawings
In order to more clearly illustrate the technical solution of the present application, the drawings that are needed in the embodiments will be briefly described below, and it will be obvious to those skilled in the art that other drawings can be obtained from these drawings without inventive effort.
FIG. 1 is a schematic flow chart of a system implementation method of an intelligent character re-engraving terminal according to an embodiment of the application;
FIG. 2 is a flow chart of acquiring a past image corresponding to the deceased person to be recalled by a user and generating a face video of the deceased according to the past image, according to an embodiment of the application;
FIG. 3 is a schematic flow chart of performing facial expression migration on a face image through a preset driving video to obtain an expression migration video corresponding to the face image, according to an embodiment of the present application;
FIG. 4 is a flow chart of training a preset language model according to past language fragments of the deceased to obtain a target language model, according to an embodiment of the present application;
FIG. 5 is a flow chart of training a preset speech synthesis model according to past audio fragments corresponding to the deceased to obtain a target speech synthesis model, according to an embodiment of the present application.
Detailed Description
The application will be described in detail hereinafter with reference to the drawings in conjunction with embodiments. It should be noted that, without conflict, the embodiments of the present application and features of the embodiments may be combined with each other.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present application and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order.
As shown in FIG. 1, an embodiment of the present application provides a system implementation method of an intelligent character re-engraving terminal, where the method includes:
s1, acquiring a past image corresponding to an evanescent person to be recalled by a user, and generating a face video of the evanescent person according to the past image; the past image comprises a face image of the deceased, and the face video of the deceased is used for indicating the face image to change expression according to a preset mode.
In some embodiments, the deceased person the user wants to recall may be a single person, or the same deceased person at different age ranges, or a plurality of different deceased persons.
It should be noted that, in the present application, the deceased to be recalled may be an ancestor of the user, other relatives and friends of the user, or a figure recalled by the user together with the public. Further, there may be a plurality of deceased persons to be recalled, corresponding to different age ranges of the same person; for example, three subjects may correspond to images of the deceased at ages 18 to 30, 40 to 60, and 70 to 90, respectively.
In some embodiments, acquiring a past image corresponding to the deceased person to be recalled by the user and generating a face video of the deceased according to the past image, as shown in FIG. 2, specifically includes: S11, acquiring a past image corresponding to the deceased; S12, inputting the past image into a pre-trained face recognition model, recognizing the face region in the past image of the deceased, and extracting the face image corresponding to the deceased; S13, performing restoration processing on the face image to improve its definition; S14, performing facial expression migration on the face image through a preset driving video to obtain an expression migration video corresponding to the face image; S15, raising the resolution of the expression migration video through super-resolution processing to obtain a super-resolved video; S16, performing sharpening processing on the super-resolved video to obtain the face video of the deceased.
Illustratively, the past image is a photograph of the deceased provided by the user. The past image corresponding to the deceased person to be recalled is input into a pre-trained face recognition model to identify the face region in the past image and extract the corresponding face image; the face recognition model may specifically adopt OpenFace and the like. If the resolution of the extracted face image is low, restoration processing can be performed to improve the definition of the face image.
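The following is a minimal, illustrative sketch of this face-extraction step (S11 to S13) in Python. It assumes OpenCV's bundled Haar cascade as the detector, and bicubic upscaling stands in for a learned face-restoration model; the function name and the size threshold are hypothetical.

    import cv2

    def extract_face(past_image_path, min_size=512):
        """Detect the face region in a past image and return a restored crop."""
        image = cv2.imread(past_image_path)
        gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
        # Any pre-trained detector would do here (the patent names OpenFace as
        # one option); the Haar cascade is used because it ships with OpenCV.
        detector = cv2.CascadeClassifier(
            cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
        faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
        if len(faces) == 0:
            raise ValueError("no face found in the past image")
        x, y, w, h = faces[0]
        face = image[y:y + h, x:x + w]
        # Restoration placeholder: if the crop is low-resolution, upscale it.
        # A learned face-restoration model would replace this interpolation.
        if max(w, h) < min_size:
            face = cv2.resize(face, (min_size, min_size),
                              interpolation=cv2.INTER_CUBIC)
        return face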
In some embodiments, performing facial expression migration on the face image through a preset driving video to obtain an expression migration video corresponding to the face image, as shown in FIG. 3, specifically includes: S111, setting a driving video, wherein the driving video is a video of a real person recorded while performing expression changes according to a preset expression-change mode, or another video comprising the preset expression-change mode; S112, synchronously inputting the driving video and the face image into a pre-trained expression migration model; S113, migrating the expression changes of the person in the driving video onto the face image through the expression migration model, and obtaining an expression migration video corresponding to the face image.
Specifically, a driving video is preset, and the expression changes of the person in the driving video are migrated onto the face image to obtain an expression migration video corresponding to the face image; the expression migration video is a video in which the face image undergoes the same expression changes as the person in the driving video.
The driving video may be a video of a real person recorded while performing expression changes according to a preset expression-change mode, or another video comprising the preset expression-change mode, such as an animated-character video or a film or television clip. The specific process of expression migration is to synchronously input the driving video and the face image into a pre-trained expression migration model, which outputs the expression migration video. The expression migration model may specifically adopt ReenactGAN and the like.
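Below is a minimal sketch of this step, assuming the pre-trained expression migration model is wrapped as a callable migrate(face, frame) that returns a frame the same size as the face image; that callable is a hypothetical stand-in for a model such as ReenactGAN.

    import cv2

    def build_expression_video(face, driving_path, out_path, migrate):
        """Transfer the driving video's expression changes onto a still face."""
        cap = cv2.VideoCapture(driving_path)
        fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
        h, w = face.shape[:2]
        writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"),
                                 fps, (w, h))
        while True:
            ok, frame = cap.read()          # one frame of the driving video
            if not ok:
                break
            # Migrate the expression of this driving frame onto the face image.
            writer.write(migrate(face, frame))
        cap.release()
        writer.release()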
In some embodiments, raising the resolution of the expression migration video through super-resolution processing to obtain a super-resolved video includes: setting the initial resolution of the expression migration video; and upscaling the expression migration video frame by frame to raise its resolution to the target resolution.
Specifically, if the original resolution of the expression migration video is 256x256, the video can be upscaled frame by frame so that its resolution is raised to 512x512, thereby realizing super-resolution processing of the expression migration video.
The information of adjacent frames in the super-resolved expression migration video is then combined to further improve its definition, yielding the target video; the target video is, in effect, a dynamic photograph in which the face in the past image changes expression in the manner set in S1.
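A rough sketch of the frame-by-frame super-resolution step follows; bicubic interpolation is only a stand-in for a learned super-resolution model, and the temporal blend plus unsharp mask is a crude illustration of combining adjacent-frame information.

    import cv2

    def upscale_video(in_path, out_path, target=(512, 512)):
        """Upscale a 256x256 expression migration video to 512x512, frame by frame."""
        cap = cv2.VideoCapture(in_path)
        fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
        writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"),
                                 fps, target)
        prev = None
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            up = cv2.resize(frame, target, interpolation=cv2.INTER_CUBIC)
            if prev is not None:
                # Blend in the previous frame, then unsharp-mask to recover edges.
                up = cv2.addWeighted(up, 0.75, prev, 0.25, 0)
                blur = cv2.GaussianBlur(up, (0, 0), 2)
                up = cv2.addWeighted(up, 1.5, blur, -0.5, 0)
            prev = up
            writer.write(up)
        cap.release()
        writer.release()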
In some embodiments, the past images corresponding to the deceased further include images of other parts of the deceased, and the images of the other parts are trained through a preset image synthesis model to generate dynamic videos of the other parts of the deceased.
It should be noted that the system implementation method of the intelligent character re-engraving terminal provided by the application is not limited to the face alone; dynamic images of other parts of the deceased can also be generated in the same manner through corresponding models.
The system implementation method of the intelligent character re-engraving terminal provided by the application can perform the above steps separately on the same past image of the deceased to generate a plurality of face videos, each corresponding to the same past image but adopting different expression changes; in this way, different face videos of the deceased can be displayed under different settings.
S2, training a preset language model according to the past language fragments of the deceased so as to obtain a target language model;
in some embodiments, training a preset language model according to the past language fragments of the deceased to obtain a target language model, as shown in FIG. 4, includes: S21, acquiring past language fragments of the deceased; S22, training the preset language model with the past language fragments of the deceased as training samples to generate a target language model having the language features and language habits of the deceased; S23, inputting a target recall topic into the target language model to generate target text content corresponding to the target recall topic, wherein the target text content is text content generated in accordance with the language features and language habits of the deceased learned by the target language model.
Further, the past language fragments are communication texts between the deceased and the user provided by the user, or language material texts of communications about a specific event, or texts written by the deceased.
The training samples may further comprise professional samples for different professional fields; training the language model with such samples enables it to generate professional topic texts for the corresponding field.
Specifically, past language fragments corresponding to the deceased are acquired. Generally, the past language fragments are language materials, provided by the user, of the deceased's daily communications with the user while alive, language materials of communications about specific events, or notes, letters and the like of the deceased; the past language fragments are presented in text form.
Further, by training the language model on the past language fragments, the content output by the language model can follow the language style, habits and the like of the deceased to whom the past language fragments belong. The language model may specifically adopt GPT-2, GPT-3 and the like.
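As a hedged illustration of this training step, fine-tuning a GPT-2 checkpoint on the past language fragments with Hugging Face Transformers could look like the sketch below; the file name past_fragments.txt, the base checkpoint and the hyperparameters are assumptions, not values from the patent.

    from datasets import load_dataset
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer,
                              TrainingArguments)

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    tokenizer.pad_token = tokenizer.eos_token   # GPT-2 has no pad token
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    # past_fragments.txt: one past communication, note or letter per line.
    dataset = load_dataset("text", data_files={"train": "past_fragments.txt"})
    tokenized = dataset["train"].map(
        lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
        batched=True, remove_columns=["text"])

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="target_language_model",
                               num_train_epochs=3,
                               per_device_train_batch_size=2),
        train_dataset=tokenized,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False))
    trainer.train()
    trainer.save_model("target_language_model")
    tokenizer.save_pretrained("target_language_model")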
S3, inputting a target recall topic into the target language model, and acquiring target text content corresponding to the target recall topic;
specifically, the target recall subject is input into the target language model, and corresponding target text content which is generated correspondingly according to the language style, habit and the like of the deceased learned by the language model can be generated according to the target recall subject.
Generally, the language model can determine a corresponding output for common-knowledge questions from the user's input, namely the target recall topic; thus, for the common-knowledge domain, an existing language model can be employed directly to generate the corresponding target topic from user input. If the user expects output generated according to the experiences or views of the deceased, those materials are organized as question-answer samples to train the language model, so that it can generate a target topic according to the experiences or views of the deceased and then produce the target text content in the deceased's language habits and style. If the user expects output for a specific field, such as psychology, professional samples from that field are needed to train the language model, so that it can generate a target topic based on the professional knowledge of the field and then produce the target text in the deceased's language habits and style.
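A generation sketch for step S3, under the same assumptions as the training sketch above; the prompt template and decoding parameters are illustrative choices.

    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("target_language_model")
    model = AutoModelForCausalLM.from_pretrained("target_language_model")

    # The target recall topic, wrapped in a simple prompt template.
    prompt = "Topic: childhood in the old family home\nAnswer:"
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=120, do_sample=True,
                            top_p=0.9, temperature=0.8,
                            pad_token_id=tokenizer.eos_token_id)
    # Target text content, generated in the style learned from the fragments.
    target_text = tokenizer.decode(output[0], skip_special_tokens=True)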
S4, training a preset speech synthesis model according to the past audio clips of the deceased person to obtain a target speech synthesis model;
in some embodiments, training a preset speech synthesis model according to the past audio fragments corresponding to the deceased to obtain a target speech synthesis model, as shown in FIG. 5, specifically includes: S41, acquiring past audio fragments corresponding to the deceased; S42, training the preset speech synthesis model with the past audio fragments of the deceased as training samples to generate a target speech synthesis model having the audio features of the deceased; S43, inputting the target text content into the target speech synthesis model, generating audio corresponding to the target text content and outputting the audio to the user.
In the above embodiment, past audio corresponding to the deceased is acquired; in general, the past audio is audio data of the deceased provided by the user. The preset speech synthesis model is trained with the past audio as training samples. By training the speech synthesis model on the past audio, the audio it outputs can follow the voice of the deceased to whom the past audio belongs. The speech synthesis model may specifically employ the Merlin system, an end-to-end system, and the like.
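As one hedged possibility, and not the patent's own toolchain (which names Merlin and end-to-end systems), an off-the-shelf voice-cloning TTS such as Coqui TTS with an XTTS checkpoint can approximate steps S41 to S43 at inference time, conditioning directly on a past audio fragment instead of training from scratch; the model name and file paths are assumptions.

    from TTS.api import TTS

    target_text = "..."  # target text content produced in step S3 (placeholder)

    tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
    tts.tts_to_file(
        text=target_text,
        speaker_wav="past_audio.wav",   # past audio fragment of the deceased
        language="zh-cn",               # language of the target text
        file_path="target_audio.wav")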
In some embodiments, the past language fragments of the deceased are past communication texts between the deceased and the user provided by the user, or past language material texts of communications about a particular event, or texts written by the deceased.
In some embodiments, the language model may also be trained with professional samples from different professional fields, so that the language model generates professional topic texts for the corresponding field.
S5, inputting the target text content into the target speech synthesis model to generate target audio;
specifically, the target audio is audio in which the target text content is to be formulated in the sound of the deceaser.
Specifically, the target text content may be preset, for example a preset blessing or greeting actively broadcast to the user on a preset holiday or date; the target text content may also be feedback to the user, e.g., feedback content determined by a keyword library or by the language model in response to user input.
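A trivial sketch of the preset-broadcast case; the holiday table, date keys and fallback behaviour are assumptions for illustration.

    import datetime

    # Hypothetical (month, day) -> preset greeting table.
    HOLIDAY_GREETINGS = {
        (1, 1): "Happy New Year to the whole family.",
        (4, 5): "It is Qingming again; take good care of yourselves.",
    }

    def preset_text_for_today(default=None):
        """Return the preset target text for today's date, if any."""
        today = datetime.date.today()
        return HOLIDAY_GREETINGS.get((today.month, today.day), default)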
S6, synchronously outputting the target audio to a user according to facial expression changes of the deceased in the face video of the deceased.
Specifically, the target language model obtained through training generates the corresponding target text content, and the target speech synthesis model obtained through training then generates the corresponding audio from that text content; the audio is output in synchronization with the facial expression changes in the face video of the deceased.
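One simple way to realize the synchronized output, assuming the face video and the target audio already exist as files, is to mux the two streams with the ffmpeg command line (an assumed external dependency):

    import subprocess

    subprocess.run([
        "ffmpeg", "-y",
        "-i", "face_video.mp4",      # face video of the deceased
        "-i", "target_audio.wav",    # synthesized target audio
        "-c:v", "copy",              # keep the video stream as-is
        "-c:a", "aac",               # encode the audio for the mp4 container
        "-shortest",                 # stop at the shorter stream to keep sync
        "synchronized_output.mp4",
    ], check=True)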
The above is a detailed description of each step in the system implementation method of the intelligent character re-engraving terminal provided by the application. The following are specific examples provided herein.
Example embodiment 1
In this exemplary embodiment, the deceased to be recalled is a single ancestor of the user.
(1) Acquiring a past image corresponding to a user ancestor, and generating a face video according to the past image;
(2) Training a preset language model according to the past language fragments corresponding to the user ancestors to obtain a target language model;
(3) Inputting a target recall topic into the target language model, and acquiring target text content corresponding to the target recall topic;
(4) Training a preset speech synthesis model according to the past audio clips corresponding to the user ancestors to obtain a target speech synthesis model;
(5) Inputting the target text content into the target voice synthesis model, and generating audio corresponding to the target text content;
(6) And synchronously outputting the audio to the user according to facial expression changes of the ancestors of the user in the facial video.
In this exemplary embodiment, a past image, past audio fragments, and past language fragments of the user's ancestor are obtained respectively; the video is generated from the past image, training of the speech synthesis model is completed with the past audio fragments, and training of the language model is completed with the past language fragments. The past audio fragments may be audio data of the ancestor while alive, and the past language fragments may be records of the user's communications with the ancestor through social software while the ancestor was alive, articles written personally by the ancestor, and the like.
Example embodiment 2
In this exemplary embodiment, the deceased to be recalled is a single ancestor of the user.
(1) Acquiring a past image corresponding to a user ancestor, and generating a face video according to the past image;
(2) Training a preset language model according to the past language fragments corresponding to the user ancestors to obtain a target language model;
(3) Inputting a target recall topic into the target language model, and acquiring target text content corresponding to the target recall topic;
(4) Training a preset speech synthesis model according to the past audio clips corresponding to the user ancestors to obtain a target speech synthesis model;
(5) Inputting the target text content into the target voice synthesis model, and generating audio corresponding to the target text content;
(6) And synchronously outputting the audio to the user according to facial expression changes of the ancestors of the user in the facial video.
In this exemplary embodiment, the past image may correspondingly generate a plurality of facial videos (i.e. by selecting a plurality of driving videos, corresponding to different expression changes respectively) so as to correspond to different expression changes of the ancestor of the user.
In this exemplary embodiment, different expression changes may be selected for output according to seasons, holidays, and user inputs.
Example embodiment 3
In this exemplary embodiment, there are three recall subjects, corresponding to three age ranges of the user's ancestor: 18 to 30, 40 to 60, and 70 to 90 years old.
(1) Acquiring past images corresponding to three age groups of 18 to 30 years old, 40 to 60 years old and 70 to 90 years old of the user ancestor, and generating a facial video according to the past images;
(2) Training a preset language model according to past language fragments corresponding to three age groups of 18 to 30 years old, 40 to 60 years old and 70 to 90 years old of the user ancestor to obtain a target language model;
(3) Inputting a target recall topic into the target language model, and acquiring target text content corresponding to the target recall topic;
(4) Training a preset voice synthesis model according to the past audio clips corresponding to the three ages of 18 to 30 years old, 40 to 60 years old and 70 to 90 years old of the user ancestor so as to obtain a target voice synthesis model;
(5) Inputting the target text content into the target voice synthesis model, and generating audio corresponding to the target text content;
(6) And synchronously outputting the audio to the user according to facial expression changes of the ancestors of the user in the facial video.
In this exemplary embodiment, for age ranges where past audio fragments and past language fragments are available, the corresponding audio and text are used; for age ranges where they are unavailable, for example if the ancestor recorded no audio or text between the ages of 18 and 30, audio of a person of a similar age is used, and publications of the ancestor from the corresponding years, ages 18 to 30, are used as the past audio and language of the user's ancestor.
Example embodiment 4
In this exemplary embodiment, the recall subjects correspond to a plurality of different ancestors in the user's family (hereinafter described as family members A, B, C, D, where A is the father of B, B is the sibling of C, B is the father of D, and D is the father of the user).
(1) Respectively acquiring a past image corresponding to A, B, C, D, and generating a face video according to the past image;
(2) Training a preset language model according to the past language segments corresponding to A, B, C, D to obtain a target language model;
(3) Inputting a target recall topic into the target language model, and acquiring target text content corresponding to the target recall topic;
(4) Training a preset voice synthesis model according to the corresponding past audio fragment of A, B, C, D to obtain a target voice synthesis model;
(5) Inputting the target text content into the target voice synthesis model, and generating audio corresponding to the target text content;
(6) And synchronously outputting the audio to a user according to the facial expression change corresponding to A, B, C, D in the target video.
Through the above exemplary embodiments, the system implementation method of the intelligent character re-engraving terminal provided by the application enables the user's ancestors and the user to interact on the three levels of video, audio and language, so that descendants can better know the voices and countenances of their ancestors, strengthening family bonds.
Reference throughout this specification to "an embodiment," "some embodiments," "one embodiment," or "an embodiment," etc., means that a particular feature, component, or characteristic described in connection with the embodiment is included in at least one embodiment, and thus the phrases "in embodiments," "in some embodiments," "in at least another embodiment," or "in embodiments," etc., appearing throughout the specification do not necessarily all refer to the same embodiment. Furthermore, the particular features, components, or characteristics may be combined in any suitable manner in one or more embodiments. Thus, a particular feature, component, or characteristic shown or described in connection with one embodiment may be combined, in whole or in part, with features, components, or characteristics of one or more other embodiments, without limitation. Such modifications and variations are intended to be included within the scope of the present application.
The above-provided detailed description is merely a few examples under the general inventive concept and does not limit the scope of the present application. Any other embodiments which are extended according to the solution of the application without inventive effort fall within the scope of protection of the application for a person skilled in the art.
The foregoing is merely a preferred embodiment of the present application and it should be noted that modifications and adaptations to those skilled in the art may be made without departing from the principles of the present application, which are intended to be comprehended within the scope of the present application.

Claims (7)

1. A system implementation method of an intelligent character re-engraving terminal, characterized by comprising the following steps:
acquiring a past image corresponding to a deceased person to be recalled by a user, and generating a face video of the deceased according to the past image; the past image comprises a face image of the deceased, and the face video of the deceased is used for indicating the face image to change expression in a preset manner;
training a preset language model according to past language fragments of the deceased and professional samples of different professional fields to generate a target language model having the language features and language habits of the deceased, wherein the past language fragments of the deceased are past communication texts between the deceased and the user provided by the user, or past language material texts of communications about specific events, or texts written by the deceased;
inputting a target recall topic into the target language model, and acquiring target text content corresponding to the target recall topic, wherein the target text content is text content generated in accordance with the language features and language habits of the deceased learned by the target language model;
training a preset speech synthesis model according to past audio fragments of the deceased to obtain a target speech synthesis model;
inputting the target text content into the target speech synthesis model to generate target audio;
and outputting the target audio to the user in synchronization with the facial expression changes of the deceased in the face video of the deceased.
2. The method according to claim 1, wherein acquiring a past image corresponding to the deceased person to be recalled by the user and generating a face video of the deceased according to the past image specifically comprises:
acquiring a past image corresponding to the deceased;
inputting the past image into a pre-trained face recognition model, recognizing the face region in the past image of the deceased, and extracting the face image corresponding to the deceased;
performing restoration processing on the face image to improve the definition of the face image;
performing facial expression migration on the face image through a preset driving video to obtain an expression migration video corresponding to the face image;
raising the resolution of the expression migration video through super-resolution processing to obtain a super-resolved video;
and performing sharpening processing on the super-resolved video to obtain the face video of the deceased.
3. The method according to claim 2, wherein performing facial expression migration on the face image through a preset driving video to obtain an expression migration video corresponding to the face image specifically comprises:
setting a driving video, wherein the driving video is a video of a real person recorded while performing expression changes according to a preset expression-change mode, or another video comprising the preset expression-change mode;
synchronously inputting the driving video and the face image into a pre-trained expression migration model;
and migrating the expression changes of the person in the driving video onto the face image through the expression migration model, and obtaining the expression migration video corresponding to the face image.
4. The method according to claim 2, wherein raising the resolution of the expression migration video through super-resolution processing to obtain the super-resolved video specifically comprises:
setting the initial resolution of the expression migration video;
and upscaling the expression migration video frame by frame to raise its resolution to the target resolution.
5. The method of claim 1, wherein training a preset speech synthesis model according to the past audio fragments corresponding to the deceased to obtain a target speech synthesis model specifically comprises:
acquiring past audio fragments corresponding to the deceased;
training the preset speech synthesis model with the past audio fragments of the deceased as training samples to generate a target speech synthesis model having the audio features of the deceased;
and inputting the target text content into the target speech synthesis model, generating audio corresponding to the target text content and outputting the audio to the user.
6. The method of claim 1, wherein the deceased to be recalled by the user is one person, or the same deceased person at different age ranges, or a plurality of different deceased persons.
7. The method of claim 1, wherein the past images corresponding to the deceased further comprise images of other parts of the deceased, and the images of the other parts are trained through a preset image synthesis module to generate dynamic videos of the other parts of the deceased.
CN202210773471.0A 2022-07-01 2022-07-01 System implementation method of intelligent character re-engraving terminal Active CN115148187B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210773471.0A CN115148187B (en) 2022-07-01 2022-07-01 System implementation method of intelligent character re-engraving terminal

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210773471.0A CN115148187B (en) 2022-07-01 2022-07-01 System implementation method of intelligent character re-engraving terminal

Publications (2)

Publication Number Publication Date
CN115148187A CN115148187A (en) 2022-10-04
CN115148187B (en) 2023-08-22

Family

ID=83409741

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210773471.0A Active CN115148187B (en) 2022-07-01 2022-07-01 System implementation method of intelligent character re-engraving terminal

Country Status (1)

Country Link
CN (1) CN115148187B (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3093733A1 (en) * 2015-05-13 2016-11-16 King's Metal Fiber Technologies Co., Ltd. Situational simulation system
CN111317642A (en) * 2018-12-13 2020-06-23 南京硅基智能科技有限公司 Cinerary casket based on AI simulation is gone through voice and is carried out man-machine conversation
CN111652121A (en) * 2020-06-01 2020-09-11 腾讯科技(深圳)有限公司 Training method of expression migration model, and expression migration method and device
CN111857343A (en) * 2020-07-21 2020-10-30 潘晓明 System capable of partially realizing digital perpetual and interacting with user
CN112669422A (en) * 2021-01-07 2021-04-16 深圳追一科技有限公司 Simulated 3D digital human generation method and device, electronic equipment and storage medium
CN112750185A (en) * 2021-01-19 2021-05-04 清华大学 Portrait video generation method and device, electronic equipment and storage medium
CN112926338A (en) * 2021-03-04 2021-06-08 北京云迹科技有限公司 Voice generation method and device
CN114048299A (en) * 2021-11-23 2022-02-15 深圳前海微众银行股份有限公司 Dialogue method, apparatus, device, computer-readable storage medium, and program product
CN114168713A (en) * 2021-12-10 2022-03-11 中国人民解放军空军军医大学 Intelligent voice AI pacifying method
CN114283783A (en) * 2021-12-31 2022-04-05 科大讯飞股份有限公司 Speech synthesis method, model training method, device and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11275948B2 (en) * 2019-12-10 2022-03-15 Accenture Global Solutions Limited Utilizing machine learning models to identify context of content for policy compliance determination

Also Published As

Publication number Publication date
CN115148187A (en) 2022-10-04

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant