CN113868472A - Method for generating digital human video and related equipment - Google Patents

Method for generating digital human video and related equipment

Info

Publication number
CN113868472A
Authority
CN
China
Prior art keywords
reply
image
limb key
frame
key points
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111212350.0A
Other languages
Chinese (zh)
Inventor
杨国基
刘致远
穆少垒
刘炫鹏
王鑫宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Zhuiyi Technology Co Ltd
Original Assignee
Shenzhen Zhuiyi Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Zhuiyi Technology Co Ltd filed Critical Shenzhen Zhuiyi Technology Co Ltd
Priority to CN202111212350.0A
Publication of CN113868472A
Legal status: Pending (Critical, Current)


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/70: Information retrieval of video data
    • G06F 16/78: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/783: Retrieval using metadata automatically derived from the content
    • G06F 16/7834: Retrieval using audio features
    • G06F 16/7837: Retrieval using objects detected or recognised in the video content
    • G06F 16/784: Retrieval where the detected or recognised objects are people
    • G06F 16/7844: Retrieval using original textual content or text extracted from visual content or a transcript of audio data
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/22: Matching criteria, e.g. proximity measures
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00: Machine learning
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Library & Information Science (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the present application discloses a method for generating a digital human video, and related equipment, which are used to improve the user experience. The method in the embodiment of the present application comprises the following steps: collecting a reply action video of a customer service person; inputting the reply action video into a pre-trained action detection model to obtain the initial limb key points corresponding to each frame of reply action image; detecting, among the multiple frames of reply action images, any abnormal reply action image whose initial limb key points do not meet a preset condition; processing the initial limb key points of the abnormal reply action image to obtain target limb key points; inputting the target limb key points of the abnormal reply action image and the initial limb key points of the reply action images other than the abnormal reply action image into a pre-trained image generation model to obtain multiple frames of reply limb images of the digital character model; and generating a digital person reply video according to the multiple frames of reply limb images.

Description

Method for generating digital human video and related equipment
Technical Field
The embodiment of the application relates to the technical field of computers, in particular to a method for generating a digital human video and related equipment.
Background
In recent years, digital-human customer service has appeared on the market to serve users; compared with text-based customer service robots, digital-human customer service can give users a better impression.
Currently, the reply audio information and each frame of reply image information of a real-person customer service agent are generally acquired. The mouth shape key points of each frame are obtained through a mouth shape parameter model, and the limb key points of each frame are obtained through an action detection model. Multiple frames of total key points are generated from the multiple frames of mouth shape key points and limb key points; an image generation model then generates multiple digital person image frames from the total key points, and finally a complete digital person reply video can be synthesized from those image frames.
However, when a frame of the acquired real-person reply image information contains an unqualified action, for example a limb key point exceeds the image range, the action detection model cannot obtain the limb key points of that frame, so the image generation model cannot generate the corresponding image frame. The user then receives an abnormal digital person reply video with that frame missing, which gives a poor experience.
Disclosure of Invention
In a first aspect, an embodiment of the present application provides a method for generating a digital human video, including:
collecting a reply action video of a customer service person, wherein the reply action video comprises a plurality of reply action images;
inputting the reply action video into a pre-trained action detection model to obtain initial limb key points corresponding to each frame of reply action image, wherein the initial limb key points are limb key points of a digital character model;
detecting, among the multiple frames of reply action images, an abnormal reply action image whose initial limb key points do not meet a preset condition;
processing the initial limb key points of the abnormal reply action image to obtain target limb key points;
inputting the target limb key points of the abnormal reply action image and the initial limb key points of the reply action images other than the abnormal reply action image into a pre-trained image generation model to obtain multiple frames of reply limb images of the digital character model;
and generating a digital person reply video according to the multiple frames of reply limb images.
Optionally, the method further includes:
collecting the reply audio of the customer service personnel;
inputting the reply audio into a pre-trained tone migration model to obtain a target audio after tone conversion;
inputting the reply audio into a pre-trained mouth shape parameter model to obtain a target mouth shape matched with the reply audio;
inputting the target mouth shape into a pre-trained image generation model to obtain multiple frames of reply mouth shape images of the digital character model;
wherein the generating a digital person reply video according to the multiple frames of reply limb images further comprises:
generating a digital person reply video according to the target audio, the multiple frames of reply mouth shape images and the multiple frames of reply limb images.
Optionally, the detecting, among the multiple frames of reply action images, an abnormal reply action image whose initial limb key points do not meet the preset condition includes:
judging whether the initial limb key points corresponding to each frame of reply action image all fall within a legal working range;
or, alternatively,
judging whether the initial limb key points corresponding to each frame of reply action image indicate a legal action;
and if not, determining that the reply action image is an abnormal reply action image.
Optionally, the processing the initial limb key points of the abnormal reply action image to obtain target limb key points includes:
determining the initial limb key points of the frame preceding the abnormal reply action image as the related limb key points of the abnormal reply action image;
and inputting the related limb key points into an action correction model to obtain the target limb key points of the abnormal reply action image.
Optionally, the inputting the related limb key points into the action correction model to obtain the target limb key points of the abnormal reply action image includes:
inputting the related limb key points into the action correction model so that the action correction model calculates the similarity between the related limb key points and each group of preset standard limb key points;
and determining the standard limb key points indicated by the highest of the calculated similarities as the target limb key points of the abnormal reply action image.
A second aspect of the embodiments of the present application provides a device for generating a digital human video, including:
a collecting unit, configured to collect a reply action video of a customer service staff member, wherein the reply action video comprises a plurality of reply action images;
the input unit is used for inputting the reply action video into a pre-trained action detection model to obtain initial limb key points corresponding to each frame of reply action image, wherein the initial limb key points are limb key points of a digital character model;
the detection unit is used for detecting an abnormal reply action image of which the initial limb key point does not accord with a preset condition in the multi-frame reply action image;
the processing unit is used for processing the initial limb key points of the abnormal reply action image to obtain target limb key points;
the input unit is further configured to input the target limb key points of the abnormal reply action image and the initial limb key points of the reply action images other than the abnormal reply action image into a pre-trained image generation model, so as to obtain multiple frames of reply limb images of the digital character model;
and the generating unit is used for generating a digital person reply video according to the multi-frame reply limb image.
Optionally, the collecting unit is further configured to collect a reply audio of the customer service staff;
the input unit is further used for inputting the reply audio to a pre-trained tone migration model to obtain a target audio after tone conversion;
the input unit is further used for inputting the reply audio to a pre-trained mouth shape parameter model to obtain a target mouth shape matched with the reply audio;
the input unit is also used for inputting the target mouth shape into a pre-trained image generation model to obtain multiple frames of reply mouth shape images of the digital character model;
the generating unit is further used for generating a digital person reply video according to the target audio, the multiple frames of reply mouth shape images and the multiple frames of reply limb images.
Optionally, the detection unit is specifically configured to judge whether the initial limb key points corresponding to each frame of reply action image all fall within a legal working range;
or, alternatively,
to judge whether the initial limb key points corresponding to each frame of reply action image indicate a legal action;
and if not, to determine that the reply action image is an abnormal reply action image.
Optionally, the processing unit is specifically configured to determine the initial limb key points of the frame preceding the abnormal reply action image as the related limb key points of the abnormal reply action image;
and to input the related limb key points into an action correction model to obtain the target limb key points of the abnormal reply action image.
Optionally, the processing unit is specifically configured to input the related limb key points into an action correction model, so that the action correction model calculates the similarity between the related limb key points and each group of preset standard limb key points;
and to determine the standard limb key points indicated by the highest of the calculated similarities as the target limb key points of the abnormal reply action image.
A third aspect of the embodiments of the present application provides a device for generating a digital human video, including:
the system comprises a central processing unit, a memory and an input/output interface;
the memory is a transient memory or a persistent memory;
the central processor is configured to communicate with the memory and execute the operations of the instructions in the memory to perform the method of the first aspect.
A fourth aspect of the embodiments of the present application provides a computer storage medium, where instructions are stored in the computer storage medium, and when the instructions are executed on a computer, the instructions cause the computer to perform the method according to the first aspect.
According to the technical scheme, the embodiment of the present application has the following advantages: the target limb key points of an abnormal reply action image can be obtained by processing its initial limb key points with the action correction model, and the final digital person reply video is then obtained according to the target limb key points of the abnormal reply action image and the initial limb key points of the other reply action images. Even when a frame of reply action image is unusable, a digital person reply video with no missing frames can be generated, ensuring the user experience.
Drawings
Fig. 1 is an architecture diagram of a method for generating a digital human video according to an embodiment of the present application;
FIG. 2 is a flowchart of a method for generating a digital human video according to an embodiment of the present application;
FIG. 3 is another flow chart of a method for generating a digital human video according to an embodiment of the present application;
FIG. 4 is a block diagram of an apparatus for generating digital human video according to an embodiment of the present application;
fig. 5 is another structural diagram of a device for generating digital human video according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
At present, mobile terminal devices such as mobile phones are increasingly widespread, and smartphones have become essential personal belongings when people go out. With the rapid development of the mobile internet, various applications have appeared on mobile terminals, many of which provide customer service functions so that users can obtain services such as product consultation through customer service.
As technology develops, people increasingly expect a humanized experience when using intelligent products. When communicating with customer service, users hope not only to receive text or voice replies but also to communicate in a more natural, interpersonal way, similar to face-to-face communication in daily life.
The inventor found in research that having the digital customer service agent collect the reply information of customer service staff in real time, such as reply audio and reply action images, can improve the approachability of customer service. For example, when a customer service staff member converses with a user, the staff member's replies to the user's consultation can be presented as video through a virtual character, so that the user intuitively sees a customer service robot with a virtual human image speaking on the user interface. The user then notices no obvious difference when the digital human service is switched to human service, and all of an enterprise's customer service staff can serve users through a uniform digital human image, improving the user experience to a certain extent.
However, in the actual research process, the inventor found that when a customer service staff member encounters an unexpected event, the collected reply action image information may be incomplete, so generation of the digital person reply video corresponding to that frame of reply action image is limited: the frame of the digital person image the user receives may be completely black, or may not be received at all, which affects the user experience.
In order to improve on the above problems, the inventor studied the difficult points in the implementation process, comprehensively considered the usage requirements of actual interactive scenarios, and proposed the method for generating a digital human video and the related equipment of the embodiments of the present application.
In order to better understand the method for generating digital human video and the related device provided in the embodiments of the present application, an application environment suitable for the embodiments of the present application is described below.
Referring to fig. 1, fig. 1 is a schematic diagram illustrating an application environment suitable for the embodiment of the present application. The data processing method provided by the embodiment of the application can be applied to the interactive system 100 shown in fig. 1. The interactive system 100 comprises a terminal device 101 and a server 102, wherein the server 102 is in communication connection with the terminal device 101. The server 102 may be a conventional server or a cloud server, and is not limited herein.
The terminal device 101 may be various electronic devices that have a display screen, a data processing module, a camera, an audio input/output function, and the like, and support data input, including but not limited to a smart phone, a tablet computer, a laptop portable computer, a desktop computer, a self-service terminal, a wearable electronic device, and the like. Specifically, the data input may be inputting voice based on a voice module provided on the electronic device, inputting characters based on a character input module, and the like.
The terminal device 101 may have a client application installed on it (for example, an APP or a WeChat applet); the conversation robot in the embodiment of the present application is likewise a client application configured in the terminal device 101. A user may register a user account with the server 102 through the client application and communicate with the server 102 based on that account: the user logs in to the account in the client application and inputs information through it, such as text or voice. After receiving the information input by the user, the client application sends it to the server 102, which receives, processes and stores it. At the same time, the server 102 displays the information input by the user to the customer service staff so that they can reply, and returns the corresponding output information of the customer service staff to the terminal device 101 in real time.
When the digital customer service cannot answer a user's question, the user can choose human customer service. To ensure that the user notices no obvious difference when the digital person service is switched to human service, and to preserve the user's impression to a certain extent, the following digital person video generation method can be used to map the reply information of the customer service staff onto the digital character model and generate the digital person reply video; the reply video the user receives then shows the digital person image.
Referring to fig. 2, an embodiment of the present application provides a method for generating a digital human video, including:
201. Collecting a reply action video of the customer service staff, wherein the reply action video comprises multiple frames of reply action images.
When the customer service staff member starts to answer the user, each frame of the staff member's reply action image is collected in real time; the sequentially ordered multiple frames of reply action images constitute the reply action video.
202. Inputting the reply action video into a pre-trained action detection model to obtain the initial limb key points corresponding to each frame of reply action image, wherein the initial limb key points are the limb key points of the digital character model.
The action detection model is trained in advance on a number of standard action images, so that inputting each frame of reply action image yields its initial limb key points. The initial limb key points are the limb key points of the digital character model and correspond one-to-one with the reply action images, i.e., each reply action image corresponds to one group of initial limb key points. From the action detection model, the limb key points the digital person needs in order to reproduce the limb actions of the customer service staff can be obtained. The action detection model may be, for example, a 3D morphable model (3DMM) or a mesh model, which is not limited here.
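The per-frame key-point extraction described above can be sketched as follows. This is a minimal illustration with a pluggable detector callable standing in for the trained action detection model; the patent does not specify a concrete interface, so `detector` and `dummy_detector` are assumptions.

```python
from typing import Callable, List, Optional, Tuple

Keypoints = List[Tuple[float, float]]  # one (x, y) pair per limb key point

def extract_keypoints(frames: list,
                      detector: Callable[[object], Optional[Keypoints]]
                      ) -> List[Optional[Keypoints]]:
    """Run the action detection model over every frame of the reply
    action video, keeping a one-to-one mapping between frames and
    key-point groups; None marks a frame where detection failed."""
    return [detector(frame) for frame in frames]

# Toy detector used only for illustration (a real one would be a
# trained pose-estimation network).
def dummy_detector(frame) -> Optional[Keypoints]:
    return None if frame == "occluded" else [(0.5, 0.5)]
```

A frame whose detector output is missing, or which fails the checks of step 203, would then be routed to the correction of step 204 rather than straight to the image generation model.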
203. Detecting, among the multiple frames of reply action images, abnormal reply action images whose initial limb key points do not meet a preset condition.
For each frame of reply action image, it is detected whether the initial limb key points are abnormal, that is, whether they can be reproduced in the digital character model.
204. Processing the initial limb key points of the abnormal reply action image to obtain target limb key points.
The initial limb key points that cannot be reproduced in the digital character model are processed to obtain target limb key points for the corresponding frame of reply action image that can be reproduced in the digital character model.
205. Inputting the target limb key points of the abnormal reply action image and the initial limb key points of the reply action images other than the abnormal reply action image into a pre-trained image generation model to obtain multiple frames of reply limb images of the digital character model.
The image generation model is trained in advance and can generate digital person image frames from limb key points. Inputting the target limb key points of the abnormal reply action image together with the initial limb key points of the other reply action images into the pre-trained image generation model yields each frame of reply limb image of the digital character model. The image generation model may be, for example, a 3D morphable model (3DMM) or a mesh model, which is not limited here.
206. Generating a digital person reply video according to the multiple frames of reply limb images.
A digital person reply video can be obtained by arranging the multiple frames of reply limb images obtained in step 205 in order. Specifically, the reply limb images may be arranged according to the acquisition order of their corresponding reply action images.
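The frame ordering of step 206 can be sketched like this. The dictionary keyed by acquisition index is an assumption for illustration; real code would additionally encode the ordered frames with a video codec.

```python
def assemble_reply_video(limb_images: dict) -> list:
    """Arrange the generated reply limb images by the acquisition index
    of their source reply action images, yielding the digital person
    reply video as an ordered frame sequence."""
    return [limb_images[index] for index in sorted(limb_images)]
```

Ordering by the source frame's acquisition index is what guarantees a corrected frame slots back into its original position, so the video has no missing or reordered frames.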
In the embodiment of the present application, the action detection model can be trained with a neural network on a large number of training samples consisting of customer service staff action videos (containing action images of the staff) and the corresponding limb feature points when the staff reply. It can be understood that the action detection model is a model for converting video into the corresponding initial limb key points. By inputting the previously collected reply action video into the action detection model, the initial limb key points corresponding to each frame of reply action image can be output by the model. A group of limb key points is a set of key points describing the whole or part of the shape of a limb.
It can be seen that, in practical application, each frame of the customer service staff's image can be acquired in real time, converted into a digital person image in real time according to the above steps, and sent to the user's client. Each frame of the staff member's reply video is collected continuously, the digital person image frames are generated in real time, and the user's client continuously receives and displays them; the continuously played image frames the user watches constitute the digital person reply video.
In the embodiment of the present application, the target limb key points of the abnormal reply action image can be obtained by processing its initial limb key points with the action correction model, and the final digital person reply video is then obtained according to the target limb key points of the abnormal reply action image and the initial limb key points of the other reply action images. Even when a frame of reply action image is unusable, a digital person reply video with no missing frames can be generated, ensuring the user experience.
In a specific embodiment, on the basis of steps 201 to 205, the method for generating a digital human video according to the embodiment of the present application further includes: collecting the reply audio of the customer service personnel; inputting the reply audio into a pre-trained tone migration model to obtain a target audio after tone conversion; inputting the reply audio into a pre-trained mouth shape parameter model to obtain a target mouth shape matched with the reply audio; and inputting the target mouth shape into a pre-trained image generation model to obtain multiple frames of reply mouth shape images of the digital character model.
Specifically, in practical application, in addition to the reply action images of the customer service staff, their audio information may also be acquired and input into the pre-trained tone migration model to obtain the target audio after tone conversion; the reply audio (or the target audio) may be input into the pre-trained mouth shape parameter model to obtain the matched target mouth shape. Further, the image generation of step 205 combines the target mouth shape with the limb key points of each frame to obtain the digital person image frames, and step 206 generates the digital person reply video according to the target audio and the multiple image frames.
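Combining the mouth shape and limb key points frame by frame, as described above, might look like the following sketch; the flat-list representation of key points and the function name are assumptions for illustration, not taken from the patent.

```python
def merge_frame_keypoints(mouth_frames: list, limb_frames: list) -> list:
    """Merge per-frame mouth shape key points with per-frame limb key
    points into the total key points fed to the image generation model."""
    if len(mouth_frames) != len(limb_frames):
        raise ValueError("mouth and limb sequences must align frame by frame")
    return [mouth + limb for mouth, limb in zip(mouth_frames, limb_frames)]
```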
The tone migration model or the mouth shape parameter model may be, for example, a 3D morphable model or a Mesh model, which is not limited here.
In this embodiment of the application, the audio of the customer service person can be collected, converted into target audio in the digital human's timbre, and then synthesized into the digital human video in sync with the images. All customer service staff of an enterprise can thus serve users with a unified timbre, improving the user experience.
In a specific embodiment, step 203 may be implemented as follows: judging whether the initial limb key points corresponding to each frame of reply action image are all within a legal working range; or judging whether the initial limb key points corresponding to each frame of reply action image indicate a legal action. If not, the reply action image is determined to be an abnormal reply action image.
In practical application, the legal working range is the range the camera can capture, or a preset range within which a given limb key point is allowed to appear — for example, a hand may not rise above the forehead. A legal action is one consistent with the customer service person's normal working state, excluding actions such as drinking water or wiping sweat that should not be reproduced on the digital human for users to see. An abnormal reply action image may also be flagged when an object that is not allowed to appear, such as a mobile phone or a water cup, enters the capture range. The criterion for an abnormal reply action image is not specifically limited here.
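One plausible reading of the legal-working-range check can be sketched in code. The bounding-box representation of the range and the equality-based action match are assumptions for illustration; the patent deliberately leaves the criterion open, and a real system would use a pose classifier rather than exact matching.

```python
# Sketch of the step-203 abnormality check. `legal_range` is an assumed
# (x_min, y_min, x_max, y_max) bounding box; `illegal_actions` is an assumed
# set of disallowed key-point configurations standing in for a pose classifier.

def is_abnormal_frame(keypoints, legal_range, illegal_actions=()):
    """Return True when any initial limb key point leaves the legal working
    range, or when the frame's key points match a disallowed action."""
    x_min, y_min, x_max, y_max = legal_range
    for x, y in keypoints:
        if not (x_min <= x <= x_max and y_min <= y <= y_max):
            return True  # key point outside the capture/allowed range
    return tuple(keypoints) in illegal_actions  # placeholder pose matching
```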
In a specific embodiment, step 204 may be implemented as follows: determining the initial limb key points of the frame preceding the abnormal reply action image as the related limb key points of the abnormal reply action image; inputting the related limb key points into the action correction model, so that the model calculates the similarity between the related limb key points and each preset group of standard limb key points; and determining the group of standard limb key points with the highest similarity as the target limb key points of the abnormal reply action image.
Specifically, the frame preceding the abnormal reply action image is identified first, and its initial limb key points are taken as the related limb key points of the abnormal reply action image. The action correction model then computes the similarity between the related limb key points and each preset group of standard limb key points, and the group with the highest similarity is determined as the target limb key points of the abnormal reply action image.
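The highest-similarity selection can be sketched as follows. The similarity metric used here (negated mean point-to-point Euclidean distance) is an assumption, since the patent does not name a metric; any measure that ranks closer key-point groups higher would fit the description.

```python
import math

# Sketch of the action correction model's selection step. The metric is an
# illustrative assumption; the patent only requires picking the preset
# standard group most similar to the preceding frame's key points.

def correct_keypoints(related_keypoints, standard_sets):
    """Choose the preset standard key-point group most similar to the key
    points of the frame preceding the abnormal frame."""
    def similarity(candidate):
        dists = [math.dist(p, q) for p, q in zip(related_keypoints, candidate)]
        return -sum(dists) / len(dists)  # closer points => higher similarity
    return max(standard_sets, key=similarity)
```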
This embodiment provides a concrete way of determining the target limb key points, improving the realizability of the scheme.
Referring to fig. 3, in one embodiment, the present application is implemented as follows.
First, the reply audio and reply action video of the real-person customer service are acquired and processed separately. For the audio, speaker voice recognition and background noise elimination are performed; the cleaned reply audio is then fed into the tone migration model to obtain target audio in the digital human's timbre, and the target audio is input into the mouth shape parameter model to obtain mouth shape key points, matched with the target audio, for reproducing the customer service person's mouth shape on the digital character model. For the video, the reply action video is input into the action detection module to obtain initial limb key points, and the initial limb key points of any abnormal reply action image are processed by the action correction module to obtain target limb key points. The image generation model then produces multi-frame reply image frames of the digital character model from the mouth shape key points, the target limb key points of the abnormal reply action images, and the initial limb key points of the remaining reply action images. Finally, the multi-frame reply image frames are encoded by a video encoder to obtain the digital human reply video reproduced on the digital character model.
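The Fig. 3 flow can be sketched end to end with each module passed in as a callable. All of the interfaces below are hypothetical stand-ins for the detection, correction, generation, and encoding modules named above; the sketch only shows how their outputs chain together, including the rule that an abnormal frame falls back to the preceding frame's key points.

```python
# End-to-end sketch of the Fig. 3 pipeline. Every callable parameter is a
# hypothetical stand-in for a module named in the description.

def generate_reply_video(frames, target_audio, detect, is_abnormal,
                         correct, generate_image, encode):
    """Detect limb key points per frame, correct abnormal frames using the
    preceding frame's key points, render image frames, then encode them
    together with the target audio."""
    keypoints = [detect(f) for f in frames]
    fixed = []
    for i, kp in enumerate(keypoints):
        if is_abnormal(kp) and i > 0:
            fixed.append(correct(keypoints[i - 1]))  # use prior frame's points
        else:
            fixed.append(kp)
    images = [generate_image(kp) for kp in fixed]
    return encode(images, target_audio)
```

In use, `detect` would wrap the action detection model, `correct` the action correction model, `generate_image` the image generation model, and `encode` the video encoder.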
Referring to fig. 4, an apparatus for generating a digital human video according to an embodiment of the present application includes:
the collecting unit 401 is configured to collect a reply action video of the customer service staff, where the reply action video includes multiple frames of reply action images;
an input unit 402, configured to input the reply action video to a pre-trained action detection model, to obtain an initial limb key point corresponding to each frame of reply action image, where the initial limb key point is a limb key point of the digital character model;
the detecting unit 403 is configured to detect, in the multi-frame reply action image, an abnormal reply action image in which the initial limb key point does not meet the preset condition;
the processing unit 404 is configured to process the initial limb key points of the abnormal reply action image to obtain target limb key points;

the input unit 402 is further configured to input the target limb key points of the abnormal reply action image and the initial limb key points of the reply action images other than the abnormal reply action image into a pre-trained image generation model, to obtain multi-frame reply limb images of the digital character model;

the generating unit 405 is configured to generate a digital human reply video from the multi-frame reply limb images.
Optionally, the collecting unit 401 is further configured to collect a reply audio of the customer service staff;
the input unit 402 is further configured to input the reply audio to a pre-trained tone migration model to obtain a target audio after tone conversion;
the input unit 402 is further configured to input the reply audio to a pre-trained mouth shape parameter model to obtain a target mouth shape matched with the reply audio;
the input unit 402 is further configured to input the reply mouth shape into a pre-trained image generation model to obtain a multi-frame reply mouth shape image of the digital character model;
the generating unit 405 is further configured to generate a digital person reply video according to the target audio, the multiple frames of reply mouth images, and the multiple frames of reply limb images.
Optionally, the detecting unit 403 is specifically configured to determine whether the initial limb key points corresponding to each frame of reply motion image are all within a legal working range;
or the like, or, alternatively,
judging whether the initial limb key points corresponding to each frame of reply action image indicate legal actions or not;
if not, determining that the reply action image is an abnormal reply action image.
Optionally, the processing unit 404 is specifically configured to determine the initial limb key points of the frame preceding the abnormal reply action image as the related limb key points of the abnormal reply action image;
and inputting the related limb key points into the action correction model to obtain the target limb key points of the abnormal response action image.
Optionally, the processing unit 404 is specifically configured to input the related limb key points into the action correction model, so that the model calculates the similarity between the related limb key points and each preset group of standard limb key points;

and determine the group of standard limb key points with the highest similarity as the target limb key points of the abnormal reply action image.
In this embodiment of the application, the processing unit 404 can process the initial limb key points of the abnormal reply action image to obtain its target limb key points, and the final digital human reply video is then obtained from those target limb key points together with the initial limb key points, determined via the input unit 402, of the reply action images other than the abnormal one. Even when a frame of the reply action video is unusable, a digital human reply video without missing frames can be generated, preserving the user experience.
Fig. 5 is a schematic structural diagram of a device for generating a digital human video according to an embodiment of the present disclosure, where the device 500 for generating a digital human video may include one or more Central Processing Units (CPUs) 501 and a memory 505, and one or more applications or data are stored in the memory 505.
The memory 505 may be transient storage or persistent storage. The program stored in the memory 505 may include one or more modules, each of which may include a series of instruction operations for the device for generating a digital human video. Furthermore, the central processing unit 501 may be configured to communicate with the memory 505 and execute the series of instruction operations in the memory 505 on the device 500 for generating a digital human video.
The digital human video generating device 500 may also include one or more power supplies 502, one or more wired or wireless network interfaces 503, one or more input/output interfaces 504, and/or one or more operating systems, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, etc.
The central processing unit 501 may perform the operations performed by the digital human video generating apparatus in the embodiments shown in fig. 1 to fig. 3, which are not described herein again.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence or in the part contributing beyond the prior art, may be wholly or partly embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk, and the like.

Claims (10)

1. A method for generating a digital human video, comprising:
collecting a reply action video of a customer service person, wherein the reply action video comprises a plurality of reply action images;
inputting the reply action video into a pre-trained action detection model to obtain initial limb key points corresponding to each frame of reply action image, wherein the initial limb key points are limb key points of a digital character model;
detecting an abnormal reply action image of which the initial limb key point does not accord with a preset condition in the multi-frame reply action image;
processing the initial limb key points of the abnormal response action image to obtain target limb key points;
inputting the target limb key point of the abnormal response motion image and the initial limb key points of other response motion images except the abnormal response motion image into a pre-trained image generation model to obtain multi-frame response limb images of the digital character model;
and generating a digital person reply video according to the multi-frame reply limb image.
2. The method of claim 1, further comprising:
collecting the reply audio frequency of the customer service personnel;
inputting the reply audio to a pre-trained tone migration model to obtain a target audio after tone conversion;
inputting the reply audio to a pre-trained mouth shape parameter model to obtain a target mouth shape matched with the reply audio;
inputting the reply mouth shape into a pre-trained image generation model to obtain a multi-frame reply mouth shape image of the digital character model;
and the generating a digital person reply video according to the multi-frame reply limb image comprises:
and generating a digital person reply video according to the target audio, the multi-frame reply mouth shape image and the multi-frame reply limb image.
3. The method according to claim 1, wherein the detecting, in the multi-frame reply motion image, an abnormal reply motion image in which an initial limb key point does not meet a preset condition comprises:
judging whether the initial limb key points corresponding to each frame of the reply action image are all in a legal working range;
or the like, or, alternatively,
judging whether the initial limb key point corresponding to each frame of the reply action image indicates legal action or not,
if not, determining that the reply action image is an abnormal reply action image.
4. The method according to claim 1, wherein the processing of the initial limb key points of the abnormal reply motion image to obtain target limb key points comprises:
determining the initial limb key point of the abnormal reply motion image in the previous frame of the abnormal reply motion image as the related limb key point of the abnormal reply motion image;
and inputting the related limb key points into an action correction model to obtain target limb key points of the abnormal response action image.
5. The method according to claim 4, wherein the inputting the relevant limb key points into a motion correction model to obtain target limb key points of the abnormal answer motion image comprises:
inputting the related limb key points into a motion correction model so that the motion correction model calculates the similarity between the related limb key points and each group of preset standard limb key points;
and determining the standard limb key point indicated by the highest similarity in the plurality of similarities as the target limb key point of the abnormal reply action image.
6. An apparatus for generating a digital human video, comprising:
the system comprises a collecting unit, a processing unit and a processing unit, wherein the collecting unit is used for collecting a reply action video of a customer service worker, and the reply action video comprises a plurality of reply action images;
the input unit is used for inputting the reply action video into a pre-trained action detection model to obtain initial limb key points corresponding to each frame of reply action image, wherein the initial limb key points are limb key points of a digital character model;
the detection unit is used for detecting an abnormal reply action image of which the initial limb key point does not accord with a preset condition in the multi-frame reply action image;
the processing unit is used for processing the initial limb key points of the abnormal response motion image to obtain target limb key points;
the input unit is further configured to input the target limb key point of the abnormal reply motion image and the initial limb key point of another reply motion image except the abnormal reply motion image into a pre-trained image generation model, so as to obtain a multi-frame reply limb image of the digital character model;
and the generating unit is used for generating a digital person reply video according to the multi-frame reply limb image.
7. The apparatus of claim 6, wherein the collecting unit is further configured to collect an audio response of the customer service person;
the input unit is further used for inputting the reply audio to a pre-trained tone migration model to obtain a target audio after tone conversion;
the input unit is further used for inputting the reply audio to a pre-trained mouth shape parameter model to obtain a target mouth shape matched with the reply audio;
the input unit is also used for inputting the reply mouth shape into a pre-trained image generation model to obtain a multi-frame reply mouth shape image of the digital character model;
the generating unit is further used for generating a digital person reply video according to the target audio, the multi-frame reply mouth shape image and the multi-frame reply limb image.
8. The device according to claim 6, wherein the detecting unit is specifically configured to determine whether the initial limb key points corresponding to each frame of the reply motion image are all within a legal working range;
or the like, or, alternatively,
judging whether the initial limb key points corresponding to each frame of the reply action image indicate legal actions or not;
if not, determining that the reply action image is an abnormal reply action image.
9. An apparatus for generating a digital human video, comprising:
the system comprises a central processing unit, a memory and an input/output interface;
the memory is a transient memory or a persistent memory;
the central processor is configured to communicate with the memory and execute the instructions in the memory to perform the method of any of claims 1 to 5.
10. A computer storage medium having stored therein instructions that, when executed on a computer, cause the computer to perform the method of any one of claims 1 to 5.
CN202111212350.0A 2021-10-18 2021-10-18 Method for generating digital human video and related equipment Pending CN113868472A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111212350.0A CN113868472A (en) 2021-10-18 2021-10-18 Method for generating digital human video and related equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111212350.0A CN113868472A (en) 2021-10-18 2021-10-18 Method for generating digital human video and related equipment

Publications (1)

Publication Number Publication Date
CN113868472A true CN113868472A (en) 2021-12-31

Family

ID=79000162

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111212350.0A Pending CN113868472A (en) 2021-10-18 2021-10-18 Method for generating digital human video and related equipment

Country Status (1)

Country Link
CN (1) CN113868472A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114726910A (en) * 2022-03-24 2022-07-08 中国银行股份有限公司 Customer service obtaining method and device, electronic equipment and computer storage medium
WO2023241289A1 (en) * 2022-06-13 2023-12-21 中兴通讯股份有限公司 Method and device for generating virtual reality service video, and storage medium



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination