WO2023231712A1 - Digital human driving method, digital human driving device, and storage medium - Google Patents

Digital human driving method, digital human driving device, and storage medium Download PDF

Info

Publication number
WO2023231712A1
WO2023231712A1 PCT/CN2023/092794
Authority
WO
WIPO (PCT)
Prior art keywords
digital human
image
motion
information
feature
Prior art date
Application number
PCT/CN2023/092794
Other languages
English (en)
French (fr)
Inventor
陆建国
石挺干
申光
李军
郑清芳
Original Assignee
中兴通讯股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中兴通讯股份有限公司
Publication of WO2023231712A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00 Animation
    • G06T13/80 2D [Two Dimensional] animation, e.g. using sprites
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G06T7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/57 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for processing of video signals

Definitions

  • the present application relates to the field of digital human technology, in particular to a digital human driving method, digital human driving device and storage medium.
  • Virtual digital humans refer to comprehensive products that exist in the non-physical world, are created and used by computer means, and have multiple human characteristics (such as appearance characteristics, human performance capabilities, interaction capabilities, etc.).
  • digital people can be divided into 2D cartoon digital people, 2D real-life digital people, 3D cartoon digital people and 3D hyper-realistic digital people.
  • 2D real-life digital people have the characteristics of high fidelity and natural movements and expressions, so they have been widely used in film and television, media, education, finance and other fields.
  • Embodiments of the present application provide a digital human driving method, a digital human driving device and a storage medium.
  • embodiments of the present application provide a digital human driving method.
  • the method includes: collecting image information and audio information of a target object; performing recognition and judgment on the image information and the audio information to obtain a judgment result; performing feature extraction processing on the image information and/or the audio information according to the judgment result to obtain a first motion feature and/or a second motion feature; inputting the first motion feature and/or the second motion feature and a digital human base image to a character generator; and driving the digital human base image through the character generator to output a first digital-human-driven image.
  • embodiments of the present application also provide a digital human driving device, including: a memory, a processor, and a computer program stored in the memory and executable on the processor.
  • when the processor executes the computer program, the digital human driving method described in the first aspect is implemented.
  • embodiments of the present application also provide a computer-readable storage medium that stores computer-executable instructions, and the computer-executable instructions are used to execute the digital human driving method as described above.
  • Figure 1 is a schematic flowchart of a digital human driving method provided by an embodiment of the present application
  • FIG. 2 is a schematic flowchart of the method of step S130 in Figure 1;
  • FIG. 3 is a schematic flowchart of the method of step S140 in Figure 1;
  • FIG 4 is another schematic flowchart of the method of step S130 in Figure 1;
  • FIG. 5 is another schematic flowchart of the method of step S130 in Figure 1;
  • Figure 6 is a schematic flowchart of a digital human driving method provided by another embodiment of the present application.
  • Figure 7 is a schematic flowchart of a digital human driving method provided by another embodiment of the present application.
  • Figure 8 is a schematic flowchart of a digital human driving method provided by another embodiment of the present application.
  • Figure 9 is a schematic diagram of digital human driving work applied in a virtual anchor scenario provided by an embodiment of the present application.
  • Figure 10 is a schematic diagram of a digital human driving device provided by an embodiment of the present application.
  • This application provides a digital human driving method, a digital human driving device and a computer-readable storage medium.
  • the image information and audio information of the target object can be collected, the image information and audio information can be recognized and judged to obtain a judgment result, feature extraction processing can be performed on the image information and/or the audio information according to the different judgment results to obtain the first motion feature and/or the second motion feature, and the obtained first motion feature and/or second motion feature together with the digital human base image can be processed by the character generator to output the driven digital human image; the motion features used to drive the digital human can thus be selected flexibly according to how the image information and audio information were collected, and corresponding digital human driving processing can be performed based on the motion features selected under different collection situations to obtain a digital human with a better representation effect.
  • Figure 1 is a schematic flowchart of a digital human driving method provided by an embodiment of the present application.
  • the digital human driving method includes but is not limited to step S110, step S120, step S130, step S140 and step S150.
  • Step S110 Collect image information and audio information of the target object.
  • the image information and audio information of the target object are collected through the information collection device.
  • a camera and a microphone are used to collect image information and audio information of the target object.
  • the target object is a real background anchor. This application does not place specific restrictions on the equipment used to collect image information and audio information.
  • Step S120 Perform recognition and judgment on the image information and audio information to obtain the judgment result.
  • the image information and audio information are identified and judged to obtain the judgment result.
  • the information collection equipment may collect image information and/or audio information of poor quality, and it is difficult to obtain valid motion features of the target object from such poor-quality image information and/or audio information.
  • For example, when the head posture in the collected image information exceeds a certain range, the generated image will be deformed to some extent, making it difficult to obtain valid motion features from the image information; or there may be loud noise in the real scene that contaminates the audio information, making it difficult to obtain motion features from the collected audio information.
  • a posture judgment network can be used to detect the head posture of the driving person; when the head posture exceeds a certain range, the generated image will be deformed to some extent, and in this case the quality of the collected image information is low and the information is invalid.
  • the image information and audio information are identified and judged to obtain a judgment result on whether the image information and/or audio information is valid, so as to facilitate the subsequent selection of a valid driving mode for corresponding digital human driving processing.
  • the collected image information and audio information can also be pre-processed to improve the accuracy of the identification and judgment.
  • This application does not impose specific restrictions on the method used to recognize and judge the image information and audio information, as long as it can recognize and judge the image information and audio information and obtain a judgment result on whether the image information and/or the audio information is valid.
  • Step S130 Perform feature extraction processing on the image information and/or audio information according to the judgment result to obtain the first motion feature and/or the second motion feature.
  • feature extraction processing is performed on the image information and/or audio information according to the judgment result to obtain the first motion feature and/or the second motion feature.
  • the first motion feature represents the first facial motion feature
  • the second motion feature represents the second facial motion feature.
  • before performing feature extraction processing on the image information and/or audio information according to the judgment results it further includes: obtaining an image-driven digital human network and a voice-driven digital human network. Then, feature extraction processing is performed on the image information through the image-driven digital human network to obtain the first motion feature, and feature extraction processing is performed on the audio information through the voice-driven digital human network to obtain the second motion feature.
  • the first motion feature and the second motion feature extracted through the image-driven digital human network and the voice-driven digital human network are located in the same feature space; that is, the two motion features describe the motion of the human face in the same way.
  • For example, when the target object says the word "ah", feature extraction is performed on the image information and voice information collected at that moment, and both the obtained first motion feature and the second motion feature can characterize the motion state of the target object saying "ah".
  • the image-driven digital human network may be a first-order motion model.
  • the first-order motion model is used to extract facial motion features from the image information.
  • when voice information is used for digital human driving processing, a self-designed voice-driven digital human network can be used to extract facial motion features from the audio information.
  • This application does not place specific restrictions on the generation methods of image-driven digital human networks and voice-driven digital human networks, as long as they can complete feature extraction processing.
  • Step S140 Input the first motion feature and/or the second motion feature and the digital human basic image to the character generator.
  • the first motion feature and/or the second motion feature and the digital human base image are input to the character generator. Since the first motion feature and the second motion feature are located in the same feature space and the motion states they describe are consistent, the same generator can be used to perform subsequent synthetic image processing based on the first motion feature and/or the second motion feature.
  • the digital human base image represents the reference image to be driven, and may be an image such as an ID photo or a portrait of a person.
  • the network used in the feature extraction processing and the generator that performs synthetic image processing based on the motion features must be matched.
  • when the image-driven digital human network is the keypoint detector of a first-order motion model, the generator used should be the generator of the first-order motion model.
  • This application does not impose specific restrictions on the models used for the image-driven digital human network and its matching generator.
  • for example, a practical facial landmark detector (Practical Facial Landmark Detector, PFLD) can be used for the feature extraction processing, and a Neural Talking Heads facial animation generator can be used as the generator in the decoder.
  • Step S150 Perform driving processing on the digital human basic image through the character generator, and output the first digital human driving image.
  • the digital human basic image is driven and processed through the character generator, and the first digital human driven image is output.
  • the basic image of the digital human is a virtual anchor image.
  • What is obtained through the processing of steps S110 to S150 is a single frame of the first digital human driven image.
  • by repeating the processing of steps S110 to S150, multiple frame images can be obtained, that is, a frame sequence of digital-human-driven images can be obtained.
  • the image information and audio information of the target object can be collected and recognized and judged to obtain a judgment result; feature extraction processing is performed on the image information and/or the audio information according to the different judgment results to obtain the first motion feature and/or the second motion feature; the obtained first motion feature and/or second motion feature together with the digital human base image are processed by the character generator, and the driven first digital human image is output.
  • the motion features used to drive the digital human can thus be selected flexibly according to how the image information and audio information were collected, and corresponding digital human driving processing can be performed based on the motion features selected under different collection situations to obtain a digital human with a better representation effect.
  • Step S130 Perform feature extraction processing on the image information and/or audio information according to the judgment result to obtain the first motion feature and/or the second motion feature, including but not limited to step S210:
  • Step S210: When the judgment result is that both the image information and the audio information are valid, perform feature extraction processing on the image information and the audio information respectively to obtain the first motion feature and a second motion feature located in the same feature space as the first motion feature.
  • step S140 input the first motion feature and/or the second motion feature, and the digital human basic image to the character generator, including but not limited to step S310 and step S320.
  • Step S310 Perform fusion feature processing according to the preset weighted fusion coefficient, the first motion feature, and the second motion feature to obtain the fusion motion feature.
  • fusion feature processing is performed based on the preset weighted fusion coefficient, the first motion feature, and the second motion feature to obtain the fused motion feature. Since the first motion feature extracted from the image information and the second motion feature extracted from the audio information are located in the same feature space, the two motion features can be weighted to obtain the fused motion feature.
  • the fused motion feature represents the character's motion more accurately. For example, when the mouth of the real anchor in the background is blocked by a hand, the mouth shape of the character in the generated image cannot be generated accurately; in this case the second motion feature extracted from the voice information can effectively compensate for the inaccurate mouth shape.
  • the fusion process can be expressed as F = a·F1 + (1 − a)·F2, where F is the fused motion feature, a is the preset weighted fusion coefficient, F1 is the first motion feature, and F2 is the second motion feature; the value of the preset weighted fusion coefficient a should lie between 0 and 1. It can be understood that the specific value of the preset weighted fusion coefficient can be set according to actual synthesis requirements, and this application does not impose specific restrictions on this.
  • Step S320 Input the fused motion features and the digital human basic image to the character generator.
  • the fused motion feature and the digital human base image are input to the character generator.
  • based on the fused motion feature and the digital human base image, the character generator is able to synthesize a first digital-human-driven image that represents the real person more accurately.
  • feature fusion processing is performed on the multiple modal data to obtain fused motion features, and the fused motion features are used to generate a more accurate representation.
  • video information may be missing data in some regions due to problems such as occlusion of the person, making it difficult to generate an accurate digital human.
  • for example, when the mouth of the person in the image information used for driving is blocked, the image-driven digital human network cannot estimate the motion of the mouth.
  • in this case, feature extraction processing can be performed on the audio information to supplement the missing mouth motion features, which can improve the accuracy of the generated digital human.
  • Step S130 Perform feature extraction processing on the image information and/or audio information according to the judgment result to obtain the first motion feature and/or the second motion feature, including but not limited to step S410:
  • Step S410 If the judgment result is that the image information is valid and the audio information is invalid, perform feature extraction processing on the image information to obtain the first motion feature.
  • in step S130, when the judgment result is that the image information is valid and the audio information is invalid, feature extraction processing is performed on the image information through the image-driven digital human network to obtain the first motion feature.
  • after the first motion feature is obtained, the first motion feature and/or the second motion feature and the digital human base image are input to the character generator; that is, step S130 also includes: inputting the first motion feature and the digital human base image to the character generator.
  • the character generator performs subsequent synthetic image processing based on the first motion features and the digital human basic image.
  • when the audio information is invalid and cannot be used, the solution of this application can still perform digital human driving processing based on the image information, cope with some unexpected situations in actual application scenarios of digital humans, and ensure that the digital human continues to work normally in actual application scenarios.
  • Step S130 Perform feature extraction processing on the image information and/or audio information according to the judgment result to obtain the first motion feature and/or the second motion feature, including but not limited to step S510.
  • Step S510 If the judgment result is that the image information is invalid and the audio information is valid, perform feature extraction processing on the audio information to obtain the second motion feature.
  • in step S130, when the judgment result is that the image information is invalid and the audio information is valid, feature extraction processing is performed on the audio information through the audio-driven digital human network to obtain the second motion feature.
  • after the second motion feature is obtained, the first motion feature and/or the second motion feature and the digital human base image are input to the character generator; that is, step S130 also includes: inputting the second motion feature and the digital human base image to the character generator. The character generator performs subsequent synthetic image processing based on the second motion feature and the digital human base image.
  • when the image information is invalid and cannot be used, the solution of this application can still perform digital human driving processing based on the voice information, cope with some unexpected situations in actual application scenarios of digital humans, and ensure that the digital human continues to work normally in actual application scenarios.
  • Figure 6 is a schematic flowchart of a digital human driving method provided by another embodiment of the present application.
  • the digital human driving method also includes step S610 and step S620:
  • Step S610 When the image information and audio information of the target object are not collected, or the judgment result is that both the image information and the audio information are invalid, obtain the preset action sequence for feature extraction to obtain the third motion feature.
  • a preset action sequence is obtained for feature extraction, and the third motion feature is obtained.
  • the preset action sequence can be one or more expression states, such as smiling, mouth opening and closing, etc., which can ensure that the digital human image sequence can drive the display normally when neither image information nor audio information is available.
  • Step S620 Input the third motion characteristics and the digital human basic image to the character generator, perform driving processing on the digital human basic image through the character generator, and output the first digital human driving image.
  • the third motion feature and the digital human basic image are input to the character generator, and the digital human basic image is driven through the character generator to output the first digital human driven image.
  • the character generator performs subsequent synthetic image processing based on the third motion features and the digital human basic image.
  • FIG. 7 is a schematic flowchart of a digital human driving method provided by another embodiment of the present application.
  • the digital human driving method also includes step S710, step S720 and step S730.
  • Step S710 Determine the first driving mode information according to the first digital human driving image.
  • the first driving modality information is determined based on the first digital-human-driven image; the first driving modality information is the modality information corresponding to the fused motion feature, the first motion feature, the second motion feature, or the third motion feature.
  • Step S720 Determine the second driving mode information according to the second digital human driving image, and the second digital human driving image is the previous frame image of the first digital human driving image.
  • the second driving modality information is determined based on the second digital-human-driven image, the second digital-human-driven image is the previous frame of the first digital-human-driven image, and the second driving modality information is the modality information corresponding to the fused motion feature, the first motion feature, the second motion feature, or the third motion feature.
  • Step S730 When the first driving mode information is different from the second driving mode information, interpolation processing is performed according to the motion characteristics of the first digital human driving image and the motion characteristics of the second digital human driving image to obtain the digital human transition driver image.
  • the first driving mode information and the second driving mode information are compared and judged, and interpolation processing is performed based on the motion characteristics of the first digital human driving image and the motion characteristics of the second digital human driving image to obtain the digital human transition driver image.
  • the image information collected is a continuous frame
  • the audio information collected is an audio stream.
  • the image information of multiple consecutive frames may be unavailable or lost.
  • in this case, digital human driving processing needs to be performed based on the available audio information to generate the next frame of the digital human image, so as to ensure that digital human driving proceeds normally.
  • interpolation processing is performed on the motion features of the first digital-human-driven image and the motion features of the second digital-human-driven image to obtain transition motion features; after a digital human transition-driven image is generated based on the transition motion features, the digital human transition-driven image is displayed within a preset transition time so that the second digital human image of the previous frame transitions smoothly to the first digital human image.
  • the driving modalities involved in this application include: image information, voice information, a preset action sequence, and image information plus voice information; the modality switching process can be performed between any two driving modalities, for example smoothly switching from image-information-based driving to voice-information-based driving, or from preset-action-sequence-based driving to image-information-based driving, and so on.
  • text information can be converted into voice information through a voice converter, and the driving mode of the present application can also include text information.
  • the preset transition time can be set to about 0.5 seconds to 1 second; fitting interpolation and neural network interpolation methods can be used for the interpolation processing. This application does not impose specific restrictions on the method used in the interpolation processing.
  • Embodiments of the present application allow multiple modalities of data to coexist, that is, both image information and voice information to be present, and allow the driving modality used to drive the digital human (image information, or voice information, or a preset action sequence, or image information plus voice information) to be selected flexibly; corresponding digital human driving processing can be performed based on the driving modality selected under different acquisition situations to obtain a digital human with a better representation effect.
  • When audio information or image information is unavailable or unsuitable for use, embodiments of the present application can switch to another available driving modality to drive the digital human, which keeps the displayed digital human coherent.
  • the digital human generated based on the embodiments of the present application not only has a strong sense of interaction driven by image information, but also has the advantages of stable image generation driven by voice.
  • Figure 8 is a schematic flowchart of a digital human driving method provided by another embodiment of the present application, which also includes step S810, step S820 and step S830.
  • Step S810 Perform audio and video synchronization processing on the first digital human driven image to obtain a virtual anchor video
  • Step S820 Splice the virtual anchor video and the real anchor video, and output the target video
  • Step S830 Push the target video to the client.
  • through steps S810 to S830, after the first digital-human-driven image is obtained, audio and video synchronization processing is performed on it to obtain a virtual anchor video; the virtual anchor video and the real anchor video are then spliced to obtain the target video, which is pushed to the client.
  • portrait segmentation can be performed on the virtual anchor image and the character can be blended into the video of the real anchor, so that the state of the digital human is closer to that of the real anchor, improving the realism of the digital human.
  • FIG. 9 is a schematic diagram of digital human driving operation applied in a virtual anchor scenario provided by an embodiment of the present application.
  • Figure 9 illustrates the processing flow of a single-frame digital human-driven image.
  • the frame sequence of the digital human-driven image can be obtained after multiple processes.
  • the information collection device collects image information and audio information from the real anchor in the background; when both the image information and the audio information are valid, the image-driven digital human network and the audio-driven digital human network perform feature extraction processing on the image information and the audio information respectively to obtain the first motion feature and the second motion feature.
  • the first motion feature and the second motion feature are weighted to obtain the fused motion feature.
  • the fused motion feature and the virtual anchor image are input into the character generator.
  • the character generator performs digital human driven processing, outputs the driven virtual anchor, and then integrates the driven virtual anchor with the real anchor and pushes the stream to the client.
  • the digital human driving method of the embodiment of the present application can be applied not only to the virtual anchor scenario, but also to the video conferencing scenario and the virtual guest application scenario.
  • the image-driven digital human network for image feature extraction and the audio-driven digital human network for voice feature extraction can be deployed at the sending end of a video conference as the video encoder, and the generator can be deployed at the receiving end as the video decoder.
  • Since the extracted motion feature representation is a real-time compact motion representation, it can significantly reduce the bandwidth of video conference communication and improve the user experience in weak network environments.
  • FIG 10 is a schematic diagram of a digital human driving device provided by an embodiment of the present application.
  • the digital human driving device 1000 in the embodiment of the present application includes one or more control processors 1010 and memories 1020.
  • one control processor 1010 and one memory 1020 are taken as an example.
  • the control processor 1010 and the memory 1020 may be connected through a bus or other means. In FIG. 10 , the connection through a bus is taken as an example.
  • the memory 1020 can be used to store non-transitory software programs and non-transitory computer executable programs.
  • memory 1020 may include high-speed random access memory and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device.
  • the memory 1020 may include memories 1020 located remotely relative to the control processor 1010 , and these remote memories 1020 may be connected to the digital human driving device 1000 through a network. Examples of the above-mentioned networks include but are not limited to the Internet, intranets, local area networks, mobile communication networks and combinations thereof.
  • the device structure shown in Figure 10 does not constitute a limitation on the digital human driving device 1000, and may include more or less components than shown in the figure, or combine certain components, or different components. layout.
  • the non-transient software programs and instructions required to implement the digital human driving method applied to the digital human driving device 1000 in the above embodiment are stored in the memory 1020.
  • the digital human driving method applied to the digital human driving device 1000 in the above embodiment is executed.
  • the digital human driving method of the digital human driving device 1000 performs, for example, the above-described method steps S110 to S150 in Figure 1, method step S210 in Figure 2, method steps S310 to S320 in Figure 3, method step S410 in Figure 4, method step S510 in Figure 5, method steps S610 to S620 in Figure 6, method steps S710 to S730 in Figure 7, and method steps S810 to S830 in Figure 8.
  • the device embodiments described above are only illustrative, and the units described as separate components may or may not be physically separate, that is, they may be located in one place, or they may be distributed to multiple network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • one embodiment of the present application also provides a computer-readable storage medium that stores computer-executable instructions; when the computer-executable instructions are executed by one or more control processors, for example by one control processor 1010 in Figure 10, the one or more control processors 1010 are caused to execute the control method in the above method embodiments, for example, the above-described method steps S110 to S150 in Figure 1, method step S210 in Figure 2, method steps S310 to S320 in Figure 3, method step S410 in Figure 4, method step S510 in Figure 5, method steps S610 to S620 in Figure 6, method steps S710 to S730 in Figure 7, and method steps S810 to S830 in Figure 8.
  • Embodiments of the present application include: collecting image information and audio information of the target object; performing identification and judgment on the image information and audio information to obtain a judgment result; performing feature extraction processing on the image information and/or audio information according to the judgment result to obtain the first motion features and/or second motion features; input the first motion features and/or second motion features, and the digital human basic image to the character generator; drive the digital human basic image through the character generator to output the first digital human driver image.
  • the solution of the embodiment of the present application can collect the image information and audio information of the target object, identify and judge the image information and audio information, obtain the judgment results, and perform feature extraction on the image information and/or audio information according to different judgment results.
  • The first motion feature and/or the second motion feature are thus obtained; the obtained first motion feature and/or second motion feature together with the digital human base image are processed by the character generator, and the driven digital human image is output; the motion features used to drive the digital human can be selected flexibly according to how the image information and audio information were collected, and corresponding digital human driving processing can be performed based on the motion features selected under different collection situations to obtain a digital human with a better representation effect.
  • Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disk (DVD) or other optical disk storage, magnetic cassettes, tapes, disk storage or other magnetic storage devices, or may Any other medium used to store the desired information and that can be accessed by a computer.
  • communication media typically embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery media.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The present application discloses a digital human driving method, a digital human driving device, and a storage medium. The digital human driving method includes: collecting image information and audio information of a target object (S110); performing recognition and judgment on the image information and the audio information to obtain a judgment result (S120); performing feature extraction processing on the image information and/or the audio information according to the judgment result to obtain a first motion feature and/or a second motion feature (S130); inputting the first motion feature and/or the second motion feature and a digital human base image to a character generator (S140); and driving the digital human base image through the character generator to output a first digital-human-driven image (S150).

Description

Digital human driving method, digital human driving device, and storage medium
CROSS-REFERENCE TO RELATED APPLICATION
This application is filed on the basis of Chinese patent application No. 202210599184.2, filed on May 30, 2022, and claims priority to that Chinese patent application, the entire contents of which are incorporated herein by reference.
TECHNICAL FIELD
The present application relates to the field of digital human technology, and in particular to a digital human driving method, a digital human driving device, and a storage medium.
BACKGROUND
With the rise of the metaverse concept, digital human technology, as an important carrier of the metaverse, has attracted much attention, and the digital human industry as a whole is developing rapidly. A virtual digital human is a comprehensive product that exists in the non-physical world, is created and used by computer means, and has multiple human characteristics (for example, appearance, performance capability, and interaction capability). According to the dimension of the character image, digital humans can be divided into 2D cartoon digital humans, 2D real-person digital humans, 3D cartoon digital humans, and 3D hyper-realistic digital humans. Among them, 2D real-person digital humans are highly realistic and have natural movements and expressions, and have therefore been widely used in film and television, media, education, finance, and other fields.
In the related art, a digital human can only be driven by single-modality data based on images, speech, or text; even when multiple modalities of data exist, in some situations only one of them can be selected to drive the digital human. When the digital human is driven by images, strict requirements are placed on the posture of the subject, and driving often fails because the subject leaves the camera view, the posture is too large, or the face is unclear. When the digital human is driven by text, the text is usually converted into speech and the digital human is then driven by the speech; although speech-driven digital humans are relatively reliable to implement, the generated digital human suffers from weak interactivity, and in scenarios where a virtual digital human needs to interact with a real person, a conventional speech-driven digital human can hardly meet the interaction requirements. Multiple modalities of data cannot be used flexibly for digital human driving; in practical applications, when modality data is corrupted by unexpected situations, picture jumps occur when the digital human is displayed, which reduces the realism of the synthesized digital human and degrades the user experience. How to drive a digital human more effectively to obtain a digital human with a better representation effect is a problem to be solved urgently.
SUMMARY
Embodiments of the present application provide a digital human driving method, a digital human driving device, and a storage medium.
In a first aspect, an embodiment of the present application provides a digital human driving method. The method includes: collecting image information and audio information of a target object; performing recognition and judgment on the image information and the audio information to obtain a judgment result; performing feature extraction processing on the image information and/or the audio information according to the judgment result to obtain a first motion feature and/or a second motion feature; inputting the first motion feature and/or the second motion feature and a digital human base image to a character generator; and driving the digital human base image through the character generator to output a first digital-human-driven image.
In a second aspect, an embodiment of the present application further provides a digital human driving device, including: a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the computer program, implements the digital human driving method according to the first aspect.
In a third aspect, an embodiment of the present application further provides a computer-readable storage medium storing computer-executable instructions, where the computer-executable instructions are used to execute the digital human driving method described above.
BRIEF DESCRIPTION OF THE DRAWINGS
Figure 1 is a schematic flowchart of a digital human driving method provided by an embodiment of the present application;
Figure 2 is a schematic flowchart of the method of step S130 in Figure 1;
Figure 3 is a schematic flowchart of the method of step S140 in Figure 1;
Figure 4 is another schematic flowchart of the method of step S130 in Figure 1;
Figure 5 is another schematic flowchart of the method of step S130 in Figure 1;
Figure 6 is a schematic flowchart of a digital human driving method provided by another embodiment of the present application;
Figure 7 is a schematic flowchart of a digital human driving method provided by another embodiment of the present application;
Figure 8 is a schematic flowchart of a digital human driving method provided by another embodiment of the present application;
Figure 9 is a schematic diagram of digital human driving applied in a virtual anchor scenario, provided by an embodiment of the present application;
Figure 10 is a schematic diagram of a digital human driving device provided by an embodiment of the present application.
DETAILED DESCRIPTION
To make the objectives, technical solutions, and advantages of the present application clearer, the present application is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present application and are not intended to limit it.
It should be noted that in the description of the present application, "several" means one or more, "multiple" means two or more, and "greater than", "less than", "exceeding", and the like are understood as excluding the stated number. In addition, although a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in an order different from that in the flowcharts. The terms "first", "second", and the like in the specification are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence.
The present application provides a digital human driving method, a digital human driving device, and a computer-readable storage medium. With the solutions of the embodiments of the present application, image information and audio information of a target object can be collected, the image information and the audio information can be recognized and judged to obtain a judgment result, feature extraction processing can be performed on the image information and/or the audio information according to different judgment results to obtain a first motion feature and/or a second motion feature, and the obtained first motion feature and/or second motion feature together with a digital human base image can be processed by a character generator to output a driven digital human image. The motion features used to drive the digital human can thus be selected flexibly according to how the image information and audio information were collected, and corresponding digital human driving processing can be performed based on the motion features selected under different collection conditions, so as to obtain a digital human with a better representation effect.
The embodiments of the present application are further described below with reference to the accompanying drawings.
Referring to Figure 1, Figure 1 is a schematic flowchart of a digital human driving method provided by an embodiment of the present application. The digital human driving method includes, but is not limited to, step S110, step S120, step S130, step S140, and step S150.
Step S110: collect image information and audio information of a target object.
In this step, the image information and audio information of the target object are collected by an information collection device. In one embodiment, a camera and a microphone are used to collect the image information and audio information of the target object. When the digital human driving method is applied in a virtual anchor scenario, the target object is the real anchor in the background. The present application places no specific restrictions on the devices used to collect the image information and audio information.
Step S120: perform recognition and judgment on the image information and the audio information to obtain a judgment result.
In this step, the image information and audio information are recognized and judged to obtain a judgment result. It can be understood that real application scenarios are relatively complex and the behavior of the target object varies, which means the information collection device may collect image information and/or audio information of poor quality, from which it is difficult to obtain valid motion features of the target object. For example, when the head posture in the collected image information exceeds a certain range, the generated image will be deformed to some extent, making it difficult to obtain valid motion features from the image information; or there may be loud noise in the real scene that contaminates the audio information, making it difficult to obtain motion features from the collected audio information. In one embodiment, a posture judgment network can be used to detect the head posture of the driving person; when the head posture exceeds a certain range, the generated image will be deformed, and in this case the quality of the collected image information is low and the information is invalid. The image information and audio information are recognized and judged to obtain a judgment result on whether the image information and/or the audio information is valid, so that a valid driving modality can subsequently be selected for corresponding digital human driving processing.
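One way such a validity judgment could be implemented is sketched below in Python; the pose thresholds, the energy-based audio check, and the helper names are illustrative assumptions rather than details taken from this disclosure.

```python
import numpy as np

def image_is_valid(yaw_deg: float, pitch_deg: float, roll_deg: float,
                   max_angle_deg: float = 35.0) -> bool:
    """Treat a frame as valid only while the head pose stays inside a range in
    which the generator is assumed not to deform the output (the threshold is
    hypothetical; the disclosure only says the pose must not exceed a certain range)."""
    return max(abs(yaw_deg), abs(pitch_deg), abs(roll_deg)) <= max_angle_deg

def audio_is_valid(samples: np.ndarray, noise_floor: float = 1e-3,
                   min_snr_db: float = 10.0) -> bool:
    """Crude energy-based check: reject audio chunks whose estimated SNR is too low."""
    rms = float(np.sqrt(np.mean(np.square(samples))))
    snr_db = 20.0 * np.log10(max(rms, 1e-12) / noise_floor)
    return snr_db >= min_snr_db
```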
It can be understood that, before the recognition and judgment are performed on the image information and audio information, the collected image information and audio information can also be pre-processed to improve the accuracy of the recognition and judgment.
The present application places no specific restrictions on the method used to recognize and judge the image information and audio information, as long as it can recognize and judge them and obtain a judgment result on whether the image information and/or the audio information is valid.
Step S130: perform feature extraction processing on the image information and/or the audio information according to the judgment result to obtain a first motion feature and/or a second motion feature.
In this step, feature extraction processing is performed on the image information and/or the audio information according to the judgment result to obtain the first motion feature and/or the second motion feature. In one embodiment, the first motion feature represents a first facial motion feature, and the second motion feature represents a second facial motion feature. In a feasible implementation, before the feature extraction processing is performed according to the judgment result, the method further includes: obtaining an image-driven digital human network and a voice-driven digital human network. Then, feature extraction processing is performed on the image information through the image-driven digital human network to obtain the first motion feature, and feature extraction processing is performed on the audio information through the voice-driven digital human network to obtain the second motion feature. In one embodiment, the first motion feature and the second motion feature extracted by the image-driven digital human network and the voice-driven digital human network lie in the same feature space; that is, the two motion features describe facial motion in the same way. For example, when the target object says the word "ah", feature extraction is performed on the image information and voice information collected at that moment, and both the obtained first motion feature and second motion feature can characterize the motion state of the target object saying "ah".
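The key requirement is that both extractors emit vectors in one shared motion-feature space. A minimal sketch of that contract, with stand-in networks and an assumed feature dimension (neither the real architectures nor the dimensionality are specified here), could look like this:

```python
import numpy as np

FEATURE_DIM = 20  # e.g. 10 facial keypoints x (dx, dy); illustrative size only

class ImageDrivenNet:
    """Stand-in for the image-driven digital human network (e.g. the keypoint
    detector of a first-order motion model). Maps a frame to a motion vector."""
    def extract(self, frame: np.ndarray) -> np.ndarray:
        # A real network would regress keypoint displacements here.
        return np.zeros(FEATURE_DIM, dtype=np.float32)

class VoiceDrivenNet:
    """Stand-in for the voice-driven digital human network; it is trained so
    that its output lives in the SAME feature space as ImageDrivenNet."""
    def extract(self, audio_chunk: np.ndarray) -> np.ndarray:
        return np.zeros(FEATURE_DIM, dtype=np.float32)
```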
In one embodiment, the image-driven digital human network may be a first-order motion model; when image information is used for digital human driving processing, the first-order motion model is used to extract facial motion features from the image information. When voice information is used for digital human driving processing, a self-designed voice-driven digital human network can be used to extract facial motion features from the audio information.
The present application places no specific restrictions on how the image-driven digital human network and the voice-driven digital human network are generated, as long as they can complete the feature extraction processing.
Step S140: input the first motion feature and/or the second motion feature and a digital human base image to a character generator.
In this step, the first motion feature and/or the second motion feature and the digital human base image are input to the character generator. Since the first motion feature and the second motion feature lie in the same feature space and the motion states they describe are consistent, the same generator can be used to perform subsequent synthetic image processing based on the first motion feature and/or the second motion feature.
In one embodiment, the digital human base image represents the reference image to be driven, and may be an image such as an ID photo or a portrait of a person. According to feasible embodiments of the present application, the network used in the feature extraction processing and the generator that performs synthetic image processing based on the motion features must be matched. When the image-driven digital human network is the keypoint detector of a first-order motion model, the generator used should be the generator of the first-order motion model; the present application places no specific restrictions on the models used for the image-driven digital human network and its matching generator. In addition, for example, a practical facial landmark detector (Practical Facial Landmark Detector, PFLD) can be used for the feature extraction processing, and a Neural Talking Heads facial animation generator can be used as the generator in the decoder.
Step S150: drive the digital human base image through the character generator, and output a first digital-human-driven image.
In this step, the digital human base image is driven through the character generator, and the first digital-human-driven image is output. In the virtual anchor application scenario, the digital human base image is a virtual anchor image. The processing of steps S110 to S150 yields a single frame of the first digital-human-driven image; repeating the processing of steps S110 to S150 yields multiple frames, that is, a frame sequence of digital-human-driven images.
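Repeating step S150 once per extracted motion feature yields the driven frame sequence. A schematic stand-in for the character generator and the per-frame loop, using a hypothetical generate() interface, might look like:

```python
class CharacterGenerator:
    """Stand-in for the character generator (e.g. a first-order-motion-model
    generator): it warps/synthesizes the base image according to a motion feature."""
    def generate(self, base_image, motion_feature):
        return base_image  # a real generator would synthesize a new frame here

def drive_sequence(base_image, motion_features, generator):
    """Repeat step S150 for each motion feature to obtain the driven frame sequence."""
    return [generator.generate(base_image, f) for f in motion_features]
```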
According to the method shown in Figure 1 of the present application, the image information and audio information of the target object can be collected and recognized and judged to obtain a judgment result; feature extraction processing is performed on the image information and/or the audio information according to the different judgment results to obtain the first motion feature and/or the second motion feature; the obtained first motion feature and/or second motion feature together with the digital human base image are processed by the character generator, and the driven first digital human image is output. The motion features used to drive the digital human can thus be selected flexibly according to how the image information and audio information were collected, and corresponding digital human driving processing can be performed based on the motion features selected under different collection conditions, so as to obtain a digital human with a better representation effect.
In one embodiment, referring to Figure 2, Figure 2 is a schematic flowchart of the method of step S130 in Figure 1. Step S130, performing feature extraction processing on the image information and/or the audio information according to the judgment result to obtain the first motion feature and/or the second motion feature, includes but is not limited to step S210:
Step S210: when the judgment result is that both the image information and the audio information are valid, perform feature extraction processing on the image information and the audio information respectively to obtain the first motion feature and a second motion feature located in the same feature space as the first motion feature.
In this step, when the judgment result is that both the image information and the audio information are valid, feature extraction processing is performed on the image information and the audio information through the image-driven digital human network and the voice-driven digital human network respectively, to obtain the first motion feature and a second motion feature located in the same feature space as the first motion feature.
In one embodiment, referring to Figure 3, Figure 3 is a schematic flowchart of the method of step S140 in Figure 1. When the judgment result is that both the image information and the audio information are valid, step S140, inputting the first motion feature and/or the second motion feature and the digital human base image to the character generator, includes but is not limited to step S310 and step S320.
Step S310: perform fusion feature processing according to a preset weighted fusion coefficient, the first motion feature, and the second motion feature to obtain a fused motion feature.
In this step, fusion feature processing is performed according to the preset weighted fusion coefficient, the first motion feature, and the second motion feature to obtain the fused motion feature. Since the first motion feature extracted from the image information and the second motion feature extracted from the audio information lie in the same feature space, the two motion features can be weighted to obtain the fused motion feature, which represents the character's motion more accurately. For example, when the mouth of the real anchor in the background is blocked by a hand, the mouth shape of the character in the generated image cannot be generated accurately; in this case the second motion feature extracted from the voice information can effectively compensate for the inaccurate mouth shape. The fusion process can be expressed as F = a·F1 + (1 − a)·F2, where F is the fused motion feature, a is the preset weighted fusion coefficient, F1 is the first motion feature, and F2 is the second motion feature, and the value of the preset weighted fusion coefficient a should lie between 0 and 1. It can be understood that the specific value of the preset weighted fusion coefficient can be set according to actual synthesis requirements, and the present application places no specific restrictions on this.
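The weighted fusion F = a·F1 + (1 − a)·F2 maps directly to a few lines of code; the default coefficient below is only an example value, not one prescribed by this disclosure.

```python
import numpy as np

def fuse_motion_features(f1: np.ndarray, f2: np.ndarray, a: float = 0.7) -> np.ndarray:
    """F = a*F1 + (1-a)*F2, with F1 from the image and F2 from the audio.
    Both vectors must lie in the same feature space, and a must lie in [0, 1]."""
    if not 0.0 <= a <= 1.0:
        raise ValueError("the weighted fusion coefficient a must be in [0, 1]")
    return a * f1 + (1.0 - a) * f2
```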
Step S320: input the fused motion feature and the digital human base image to the character generator.
In this step, after the fused motion feature is obtained, the fused motion feature and the digital human base image are input to the character generator. Based on the fused motion feature and the digital human base image, the character generator can synthesize a first digital-human-driven image that represents the real person more accurately. When multiple modalities of data coexist, that is, when both image information and voice information are present, feature fusion processing is performed on the multiple modalities of data to obtain the fused motion feature, which is used to generate a more accurate representation.
In some scenarios, the video information may be missing data in some regions due to problems such as occlusion of the person, making it difficult to generate an accurate digital human. For example, when the mouth of the person in the image information used for driving is blocked, the image-driven digital human network cannot estimate the motion of the mouth; in this case, feature extraction processing can be performed on the audio information to supplement the missing mouth motion features, which can improve the accuracy of the generated digital human.
In one embodiment, referring to Figure 4, Figure 4 is another schematic flowchart of the method of step S130 in Figure 1. Step S130, performing feature extraction processing on the image information and/or the audio information according to the judgment result to obtain the first motion feature and/or the second motion feature, includes but is not limited to step S410:
Step S410: when the judgment result is that the image information is valid and the audio information is invalid, perform feature extraction processing on the image information to obtain the first motion feature.
In this step, when the judgment result is that the image information is valid and the audio information is invalid, feature extraction processing is performed on the image information through the image-driven digital human network to obtain the first motion feature. In one embodiment, after the first motion feature is obtained, the first motion feature and/or the second motion feature and the digital human base image are input to the character generator; that is, step S130 further includes: inputting the first motion feature and the digital human base image to the character generator. The character generator performs subsequent synthetic image processing based on the first motion feature and the digital human base image. When the audio information is invalid and cannot be used, the solution of the present application can still perform digital human driving processing based on the image information, can cope with some unexpected situations in practical application scenarios of the digital human, and ensures that the digital human continues to work normally in practical application scenarios.
In one embodiment, referring to Figure 5, Figure 5 is another schematic flowchart of the method of step S130 in Figure 1. Step S130, performing feature extraction processing on the image information and/or the audio information according to the judgment result to obtain the first motion feature and/or the second motion feature, includes but is not limited to step S510.
Step S510: when the judgment result is that the image information is invalid and the audio information is valid, perform feature extraction processing on the audio information to obtain the second motion feature.
In this step, when the judgment result is that the image information is invalid and the audio information is valid, feature extraction processing is performed on the audio information through the audio-driven digital human network to obtain the second motion feature. In one embodiment, after the second motion feature is obtained, the first motion feature and/or the second motion feature and the digital human base image are input to the character generator; that is, step S130 further includes: inputting the second motion feature and the digital human base image to the character generator. The character generator performs subsequent synthetic image processing based on the second motion feature and the digital human base image. When the image information is invalid and cannot be used, the solution of the present application can still perform digital human driving processing based on the voice information, can cope with some unexpected situations in practical application scenarios of the digital human, and ensures that the digital human continues to work normally in practical application scenarios.
In one embodiment, referring to Figure 6, Figure 6 is a schematic flowchart of a digital human driving method provided by another embodiment of the present application. The digital human driving method further includes step S610 and step S620:
Step S610: when the image information and audio information of the target object are not collected, or the judgment result is that both the image information and the audio information are invalid, obtain a preset action sequence for feature extraction to obtain a third motion feature.
In this step, when the image information and audio information of the target object are not collected, or the judgment result is that both are invalid, a preset action sequence is obtained and feature extraction is performed on it to obtain the third motion feature. In real application scenarios, the information collection device may partially or completely fail, so that the image information and audio information of the target object cannot be read from the device, or the collected image information and audio information are both invalid and no valid feature extraction can be performed on them; in that case, the preset action sequence is obtained and feature extraction is performed on it to obtain the third motion feature. The preset action sequence can be one or more expression states, such as smiling or mouth opening and closing, which ensures that the digital human image sequence can still be driven and displayed normally when neither image information nor audio information is usable.
Step S620: input the third motion feature and the digital human base image to the character generator, drive the digital human base image through the character generator, and output the first digital-human-driven image.
In this step, after the third motion feature is obtained, the third motion feature and the digital human base image are input to the character generator, the digital human base image is driven through the character generator, and the first digital-human-driven image is output. The character generator performs subsequent synthetic image processing based on the third motion feature and the digital human base image. When both the image information and the voice information are invalid and cannot be used, the solution of the present application can still perform digital human driving processing based on the preset action sequence, can cope with some unexpected situations in practical application scenarios of the digital human, and ensures that the digital human continues to work normally in practical application scenarios.
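Putting the four branches together (both modalities valid, image only, audio only, neither), and reusing the helpers sketched earlier, the modality selection of steps S210/S410/S510/S610 could be organized roughly as follows; the function and parameter names are illustrative.

```python
def select_motion_feature(image_ok, audio_ok, frame, audio_chunk,
                          image_net, voice_net, preset_sequence, t, a=0.7):
    """Return (motion_feature, modality_tag) following the four branches of
    steps S210/S410/S510/S610. preset_sequence is a list of precomputed motion
    features (e.g. smiling, mouth opening/closing); t indexes output frames."""
    if image_ok and audio_ok:
        f1 = image_net.extract(frame)
        f2 = voice_net.extract(audio_chunk)
        return fuse_motion_features(f1, f2, a), "image+audio"
    if image_ok:
        return image_net.extract(frame), "image"
    if audio_ok:
        return voice_net.extract(audio_chunk), "audio"
    return preset_sequence[t % len(preset_sequence)], "preset"
```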
In one embodiment, referring to Figure 7, Figure 7 is a schematic flowchart of a digital human driving method provided by another embodiment of the present application. The digital human driving method further includes step S710, step S720, and step S730.
Step S710: determine first driving modality information according to the first digital-human-driven image.
In this step, the first driving modality information is determined according to the first digital-human-driven image; the first driving modality information is the modality information corresponding to the fused motion feature, the first motion feature, the second motion feature, or the third motion feature.
Step S720: determine second driving modality information according to a second digital-human-driven image, where the second digital-human-driven image is the previous frame of the first digital-human-driven image.
In this step, the second driving modality information is determined according to the second digital-human-driven image, the second digital-human-driven image is the previous frame of the first digital-human-driven image, and the second driving modality information is the modality information corresponding to the fused motion feature, the first motion feature, the second motion feature, or the third motion feature.
Step S730: when the first driving modality information is different from the second driving modality information, perform interpolation processing according to the motion feature of the first digital-human-driven image and the motion feature of the second digital-human-driven image to obtain a digital human transition-driven image.
In this step, the first driving modality information and the second driving modality information are compared, and interpolation processing is performed according to the motion feature of the first digital-human-driven image and the motion feature of the second digital-human-driven image to obtain the digital human transition-driven image. It can be understood that when the target object is captured, the collected image information consists of multiple consecutive frames and the collected audio information is an audio stream. When multiple modalities of data exist, the image information of multiple consecutive frames may be unavailable or lost; in this case, digital human driving processing needs to be performed based on the available audio information to generate the next frame of the digital human image, so as to ensure that digital human driving proceeds normally. When the driving modality used for the current frame differs from that used for the previous frame, the displayed digital human will exhibit picture jumps in real scenarios, making the display not smooth and affecting the user experience. Therefore, when the first driving modality information differs from the second driving modality information, modality switching processing needs to be performed, which helps the previous frame of the digital human image transition smoothly to the current frame and improves the realism of the displayed digital human.
In one embodiment, interpolation processing is performed according to the motion feature of the first digital-human-driven image and the motion feature of the second digital-human-driven image to obtain a transition motion feature; after a digital human transition-driven image is generated from the transition motion feature, the digital human transition-driven image is displayed within a preset transition time, so that the second digital human image of the previous frame transitions smoothly to the first digital human image.
The driving modalities involved in the present application include: image information, voice information, a preset action sequence, and image information plus voice information; modality switching processing can be performed between any two driving modalities, for example, smoothly switching from image-information-based driving to voice-information-based driving, or smoothly switching from preset-action-sequence-based driving to image-information-based driving, and so on.
It can be understood that text information can be converted into voice information by a speech converter, so the driving modalities of the present application can also include text information.
In one embodiment, the preset transition time can be set to about 0.5 to 1 second; fitting interpolation and neural network interpolation methods can be used for the interpolation processing, and the present application places no specific restrictions on the method used in the interpolation processing.
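A simple linear version of such a transition, interpolating motion features over a 0.5-1 s window at a given frame rate (fitting or neural-network interpolation could be substituted), might look like:

```python
import numpy as np

def transition_features(prev_feature: np.ndarray, new_feature: np.ndarray,
                        fps: float = 25.0, transition_s: float = 0.5):
    """Yield one transition motion feature per frame, moving linearly from the
    previous frame's feature to the new modality's feature over the preset
    transition window, so the displayed digital human does not jump."""
    n = max(1, int(round(fps * transition_s)))
    for k in range(1, n + 1):
        w = k / n
        yield (1.0 - w) * prev_feature + w * new_feature
```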
When multiple modalities of data coexist, that is, when both image information and voice information exist, the embodiments of the present application can flexibly select the driving modality used to drive the digital human (image information, or voice information, or a preset action sequence, or image information plus voice information) according to how the image information and audio information were collected, and perform corresponding digital human driving processing based on the driving modality selected under different collection conditions, so as to obtain a digital human with a better representation effect. When audio information or image information is unavailable or unsuitable for use, the embodiments of the present application can switch to another available driving modality to drive the digital human, so that the driven digital human is displayed coherently. A digital human generated based on the embodiments of the present application has both the strong sense of interaction of image-information-based driving and the stable image generation of voice-based driving.
In one embodiment, when applied in an anchor scenario, the processing flow shown in Figure 8 can be performed so that the virtual anchor and the real anchor broadcast together. Referring to Figure 8, Figure 8 is a schematic flowchart of a digital human driving method provided by another embodiment of the present application, which further includes step S810, step S820, and step S830.
Step S810: perform audio-video synchronization processing on the first digital-human-driven image to obtain a virtual anchor video;
Step S820: splice the virtual anchor video and the real anchor video, and output a target video;
Step S830: push the target video to a client.
Through the processing of steps S810 to S830, after the first digital-human-driven image is obtained, audio-video synchronization processing is performed on it to obtain the virtual anchor video; the virtual anchor video and the real anchor video are then spliced to obtain the target video; the target video is pushed to the client, where the user can see a 2D digital human driven by the image of a real person. In practice, portrait segmentation can be performed on the virtual anchor image and the character can be blended into the video of the real anchor, so that the state presented by the digital human is closer to that of the real anchor and the realism of the digital human is improved.
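Assuming a soft portrait matte is available from any portrait segmentation model, blending the driven virtual anchor into the real anchor's frame reduces to standard alpha compositing, for example:

```python
import numpy as np

def composite_virtual_anchor(virtual_frame: np.ndarray, real_frame: np.ndarray,
                             matte: np.ndarray) -> np.ndarray:
    """Alpha-composite the segmented virtual anchor over the real anchor frame.
    matte is a soft portrait mask in [0, 1] with shape (H, W); both frames are
    (H, W, 3) arrays of the same size."""
    alpha = matte[..., None].astype(np.float32)
    out = alpha * virtual_frame.astype(np.float32) + (1.0 - alpha) * real_frame.astype(np.float32)
    return out.astype(real_frame.dtype)
```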
Referring to Figure 9, Figure 9 is a schematic diagram of digital human driving applied in a virtual anchor scenario, provided by an embodiment of the present application. Figure 9 illustrates the processing flow for a single frame of the digital-human-driven image; in the virtual anchor scenario, the frame sequence of digital-human-driven images can be obtained after multiple rounds of processing. In one embodiment, the information collection device collects image information and audio information from the real anchor in the background; when both the image information and the audio information are valid, the image-driven digital human network and the audio-driven digital human network perform feature extraction processing on the image information and the audio information respectively to obtain the first motion feature and the second motion feature; the first motion feature and the second motion feature are weighted to obtain the fused motion feature; the fused motion feature and the virtual anchor image are input into the character generator; the character generator performs digital human driving processing and outputs the driven virtual anchor, which is then blended with the real anchor and streamed to the client.
In one embodiment, in addition to the virtual anchor scenario, the digital human driving method of the embodiments of the present application can also be applied in video conferencing scenarios and virtual guest application scenarios. In an actual deployment, the image-driven digital human network for image feature extraction and the audio-driven digital human network for voice feature extraction can be deployed at the sending end of a video conference as the video encoder, and the generator can be deployed at the receiving end as the video decoder. Since the extracted motion feature representation is a real-time, compact motion representation, it can significantly reduce the bandwidth of video conference communication and improve the user experience in weak network environments.
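A rough back-of-the-envelope comparison illustrates why streaming the compact motion representation instead of raw frames saves bandwidth; the sizes below are assumed example values, not figures taken from this disclosure.

```python
# Illustrative comparison: compact motion features vs. one uncompressed frame.
feature_floats_per_frame = 20           # assumed size of the motion representation
raw_frame_bytes = 256 * 256 * 3         # uncompressed 256x256 RGB frame
feature_bytes = feature_floats_per_frame * 4  # float32
print(f"motion features: {feature_bytes} B/frame vs raw frame: {raw_frame_bytes} B/frame")
print(f"reduction factor ~ {raw_frame_bytes / feature_bytes:.0f}x before any entropy coding")
```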
Referring to Figure 10, Figure 10 is a schematic diagram of a digital human driving device provided by an embodiment of the present application. The digital human driving device 1000 of the embodiment of the present application includes one or more control processors 1010 and a memory 1020; in Figure 10 one control processor 1010 and one memory 1020 are taken as an example. The control processor 1010 and the memory 1020 may be connected through a bus or in another manner; in Figure 10 connection through a bus is taken as an example.
As a non-transitory computer-readable storage medium, the memory 1020 can be used to store non-transitory software programs and non-transitory computer-executable programs. In addition, the memory 1020 may include high-speed random access memory and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some implementations, the memory 1020 may include memory located remotely from the control processor 1010, and such remote memory may be connected to the digital human driving device 1000 through a network. Examples of the above network include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
Those skilled in the art can understand that the device structure shown in Figure 10 does not constitute a limitation on the digital human driving device 1000, which may include more or fewer components than shown in the figure, or combine certain components, or use a different arrangement of components.
The non-transitory software programs and instructions required to implement the digital human driving method applied to the digital human driving device 1000 in the above embodiments are stored in the memory 1020; when executed by the control processor 1010, they execute the digital human driving method applied to the digital human driving device 1000 in the above embodiments, for example, performing the above-described method steps S110 to S150 in Figure 1, method step S210 in Figure 2, method steps S310 to S320 in Figure 3, method step S410 in Figure 4, method step S510 in Figure 5, method steps S610 to S620 in Figure 6, method steps S710 to S730 in Figure 7, and method steps S810 to S830 in Figure 8.
The device embodiments described above are merely illustrative; the units described as separate components may or may not be physically separate, that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, an embodiment of the present application further provides a computer-readable storage medium storing computer-executable instructions. When the computer-executable instructions are executed by one or more control processors, for example by one control processor 1010 in Figure 10, the one or more control processors 1010 can be made to execute the control method in the above method embodiments, for example, performing the above-described method steps S110 to S150 in Figure 1, method step S210 in Figure 2, method steps S310 to S320 in Figure 3, method step S410 in Figure 4, method step S510 in Figure 5, method steps S610 to S620 in Figure 6, method steps S710 to S730 in Figure 7, and method steps S810 to S830 in Figure 8.
Embodiments of the present application include: collecting image information and audio information of a target object; performing recognition and judgment on the image information and the audio information to obtain a judgment result; performing feature extraction processing on the image information and/or the audio information according to the judgment result to obtain a first motion feature and/or a second motion feature; inputting the first motion feature and/or the second motion feature and a digital human base image to a character generator; and driving the digital human base image through the character generator to output a first digital-human-driven image. With the solutions of the embodiments of the present application, the image information and audio information of the target object can be collected, recognized, and judged to obtain a judgment result; feature extraction is performed on the image information and/or the audio information according to the different judgment results to obtain the first motion feature and/or the second motion feature; the obtained first motion feature and/or second motion feature together with the digital human base image are processed by the character generator, and the driven digital human image is output. The motion features used to drive the digital human can be selected flexibly according to how the image information and audio information were collected, and corresponding digital human driving processing can be performed based on the motion features selected under different collection conditions, so as to obtain a digital human with a better representation effect.
Those of ordinary skill in the art can understand that all or some of the steps and systems in the methods disclosed above can be implemented as software, firmware, hardware, and appropriate combinations thereof. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, a digital signal processor, or a microprocessor, or as hardware, or as an integrated circuit, such as an application-specific integrated circuit. Such software may be distributed on computer-readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). As is well known to those of ordinary skill in the art, the term computer storage media includes volatile and non-volatile, removable and non-removable media implemented in any method or technology for storage of information (such as computer-readable instructions, data structures, program modules, or other data). Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technologies, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and can be accessed by a computer. In addition, as is well known to those of ordinary skill in the art, communication media typically embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery media.

Claims (11)

  1. A digital human driving method, the method comprising:
    collecting image information and audio information of a target object;
    performing recognition and judgment on the image information and the audio information to obtain a judgment result;
    performing feature extraction processing on the image information and/or the audio information according to the judgment result to obtain a first motion feature and/or a second motion feature;
    inputting the first motion feature and/or the second motion feature and a digital human base image to a character generator; and
    driving the digital human base image through the character generator to output a first digital-human-driven image.
  2. The digital human driving method according to claim 1, wherein performing feature extraction processing on the image information and/or the audio information according to the judgment result to obtain a first motion feature and/or a second motion feature comprises:
    when the judgment result is that both the image information and the audio information are valid, performing feature extraction processing on the image information and the audio information respectively to obtain the first motion feature and the second motion feature located in the same feature space as the first motion feature.
  3. The digital human driving method according to claim 2, wherein inputting the first motion feature and/or the second motion feature and the digital human base image to the character generator further comprises:
    performing fusion feature processing according to a preset weighted fusion coefficient, the first motion feature, and the second motion feature to obtain a fused motion feature; and
    inputting the fused motion feature and the digital human base image to the character generator.
  4. The digital human driving method according to claim 1, wherein performing feature extraction processing on the image information and/or the audio information according to the judgment result to obtain a first motion feature and/or a second motion feature comprises:
    when the judgment result is that the image information is valid and the audio information is invalid, performing feature extraction processing on the image information to obtain the first motion feature.
  5. The digital human driving method according to claim 4, wherein inputting the first motion feature and/or the second motion feature and the digital human base image to the character generator further comprises:
    inputting the first motion feature and the digital human base image to the character generator.
  6. The digital human driving method according to claim 1, wherein performing feature extraction processing on the image information and/or the audio information according to the judgment result to obtain a first motion feature and/or a second motion feature comprises:
    when the judgment result is that the image information is invalid and the audio information is valid, performing feature extraction processing on the audio information to obtain the second motion feature.
  7. The digital human driving method according to claim 6, wherein inputting the first motion feature and/or the second motion feature and the digital human base image to the character generator further comprises:
    inputting the second motion feature and the digital human base image to the character generator.
  8. The digital human driving method according to claim 1, further comprising:
    when the image information and the audio information of the target object are not collected, or the judgment result is that both the image information and the audio information are invalid, obtaining a preset action sequence for feature extraction to obtain a third motion feature; and
    inputting the third motion feature and the digital human base image to the character generator, driving the digital human base image through the character generator, and outputting the first digital-human-driven image.
  9. The digital human driving method according to claim 1, 3, 5, 7, or 8, further comprising:
    determining first driving modality information according to the first digital-human-driven image;
    determining second driving modality information according to a second digital-human-driven image, the second digital-human-driven image being the previous frame of the first digital-human-driven image; and
    when the first driving modality information is different from the second driving modality information, performing interpolation processing according to a motion feature of the first digital-human-driven image and a motion feature of the second digital-human-driven image to obtain a digital human transition-driven image.
  10. A digital human driving device, comprising: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the digital human driving method according to any one of claims 1 to 9.
  11. A computer storage medium storing computer-executable instructions, wherein the computer-executable instructions are used to execute the digital human driving method according to any one of claims 1 to 9.
PCT/CN2023/092794 2022-05-30 2023-05-08 Digital human driving method, digital human driving device, and storage medium WO2023231712A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210599184.2 2022-05-30
CN202210599184.2A CN117197308A (zh) 2022-05-30 2022-05-30 数字人驱动方法、数字人驱动设备及存储介质

Publications (1)

Publication Number Publication Date
WO2023231712A1 true WO2023231712A1 (zh) 2023-12-07

Family

ID=88991116

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/092794 WO2023231712A1 (zh) 2022-05-30 2023-05-08 Digital human driving method, digital human driving device, and storage medium

Country Status (2)

Country Link
CN (1) CN117197308A (zh)
WO (1) WO2023231712A1 (zh)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103218842A (zh) * 2013-03-12 2013-07-24 西南交通大学 Method for synchronously driving three-dimensional facial mouth shape and facial posture animation with speech
US20160292898A1 (en) * 2015-03-30 2016-10-06 Fujifilm Corporation Image processing device, image processing method, program, and recording medium
CN111862277A (zh) * 2020-07-22 2020-10-30 北京百度网讯科技有限公司 Method, apparatus, device and storage medium for generating animation
CN113886641A (zh) * 2021-09-30 2022-01-04 深圳追一科技有限公司 Digital human generation method, apparatus, device and medium
CN115497150A (zh) * 2022-10-21 2022-12-20 小哆智能科技(北京)有限公司 Virtual anchor video generation method and apparatus, electronic device and storage medium
CN116137673A (zh) * 2023-02-22 2023-05-19 广州欢聚时代信息科技有限公司 Digital human expression driving method and apparatus, device and medium

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103218842A (zh) * 2013-03-12 2013-07-24 西南交通大学 Method for synchronously driving three-dimensional facial mouth shape and facial posture animation with speech
US20160292898A1 (en) * 2015-03-30 2016-10-06 Fujifilm Corporation Image processing device, image processing method, program, and recording medium
CN111862277A (zh) * 2020-07-22 2020-10-30 北京百度网讯科技有限公司 Method, apparatus, device and storage medium for generating animation
CN113886641A (zh) * 2021-09-30 2022-01-04 深圳追一科技有限公司 Digital human generation method, apparatus, device and medium
CN115497150A (zh) * 2022-10-21 2022-12-20 小哆智能科技(北京)有限公司 Virtual anchor video generation method and apparatus, electronic device and storage medium
CN116137673A (zh) * 2023-02-22 2023-05-19 广州欢聚时代信息科技有限公司 Digital human expression driving method and apparatus, device and medium

Also Published As

Publication number Publication date
CN117197308A (zh) 2023-12-08

Similar Documents

Publication Publication Date Title
CN112562433B (zh) 一种基于全息终端的5g强互动远程专递教学系统的工作方法
CN110446000B (zh) 一种生成对话人物形象的方法和装置
JP2022528294A (ja) 深度を利用した映像背景減算法
US11551393B2 (en) Systems and methods for animation generation
CN110401810B (zh) 虚拟画面的处理方法、装置、系统、电子设备及存储介质
JP2009510877A (ja) 顔検出を利用したストリーミングビデオにおける顔アノテーション
CN113973190A (zh) 视频虚拟背景图像处理方法、装置及计算机设备
CN112633208A (zh) 一种唇语识别方法、服务设备及存储介质
US20230047858A1 (en) Method, apparatus, electronic device, computer-readable storage medium, and computer program product for video communication
WO2017094527A1 (ja) 動画生成システムおよび動画表示システム
WO2023221684A1 (zh) 数字人生成方法和装置及存储介质
US7257538B2 (en) Generating animation from visual and audio input
CN112601120B (zh) 字幕显示方法及装置
US10224073B2 (en) Auto-directing media construction
CN113709545A (zh) 视频的处理方法、装置、计算机设备和存储介质
CN114286021B (zh) 渲染方法、装置、服务器、存储介质及程序产品
CN113395569B (zh) 视频生成方法及装置
KR20180129339A (ko) 영상 압축 방법 및 영상 복원 방법
Chen et al. Sound to visual: Hierarchical cross-modal talking face video generation
WO2023231712A1 (zh) 数字人驱动方法、数字人驱动设备及存储介质
WO2023088276A1 (zh) 漫画化模型构建方法、装置、设备、存储介质及程序产品
CN114727120A (zh) 直播音频流的获取方法、装置、电子设备及存储介质
CN113840158B (zh) 虚拟形象的生成方法、装置、服务器及存储介质
CN111144287A (zh) 视听辅助交流方法、装置及可读存储介质
US11935323B2 (en) Imaging device and imaging method using feature compensation

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23814898

Country of ref document: EP

Kind code of ref document: A1