CN115695943A - Digital human video generation method, device, equipment and storage medium

Info

Publication number
CN115695943A (application CN202211347701.3A)
Authority
CN (China)
Prior art keywords
lip, digital human, picture frame, data, target
Legal status (assumed; not a legal conclusion): Pending
Application number
CN202211347701.3A
Other languages
Chinese (zh)
Inventors
张演龙 (Zhang Yanlong), 李彤辉 (Li Tonghui), 杨尊程 (Yang Zuncheng)
Current Assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202211347701.3A
Publication of CN115695943A

Landscapes

  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The present disclosure provides a digital human video generation method, apparatus, device, and storage medium, relating to the field of artificial intelligence, and in particular to deep learning, image processing, and computer vision. The specific implementation scheme is as follows: when receiving a target voice input by a user, the electronic device retrieves the response content corresponding to the target voice from a local target database. Pulse code modulation (PCM) data corresponding to the response content is then generated, and a plurality of lip-shape picture frames corresponding to the PCM data are retrieved from the target database. Each of the lip-shape picture frames is fused with a pre-recorded backplane video to obtain a digital human video frame corresponding to that lip-shape picture frame. Finally, the digital human video frame corresponding to each lip-shape picture frame is displayed.

Description

Digital human video generation method, device, equipment and storage medium
Technical Field
The present disclosure relates to the field of artificial intelligence, in particular to deep learning, image processing, and computer vision, and specifically to a digital human video generation method, apparatus, device, and storage medium.
Background
Currently, virtual digital human products are widely used in the news broadcasting and banking service industries. To generate a virtual digital human, a model (the required digital human image) must first be filmed to obtain a segment of backplane video, and a preset virtual digital human is then generated based on that backplane video. When a user uses a virtual digital human product, an Artificial Intelligence Internet of Things (AIOT) device collects the user's voice and sends it to a digital human server, which analyzes the voice to obtain voice data. Based on the voice data, the digital human server generates the response content corresponding to the user's voice, renders multiple image frames from the response content and the preset virtual digital human, encodes those frames into a digital human video stream, and pushes the stream to a streaming media server, from which the AIOT device pulls the stream for playback, thereby serving the user through the virtual digital human.
Disclosure of Invention
The present disclosure provides a digital human video generation method, apparatus, device, and storage medium.
According to a first aspect of the present disclosure, there is provided a digital human video generating method, including:
when receiving a target voice input by a user, the electronic device retrieves the response content corresponding to the target voice from a local target database; generates PCM data corresponding to the response content and retrieves a plurality of lip-shape picture frames corresponding to the PCM data from the target database; fuses each of the lip-shape picture frames with a pre-recorded backplane video to obtain a digital human video frame corresponding to each lip-shape picture frame; and finally displays the digital human video frame corresponding to each lip-shape picture frame.
According to a second aspect of the present disclosure, there is provided a digital human video generation apparatus comprising: a retrieval unit configured to retrieve response content corresponding to a target voice from a target database, where the target voice is a voice input by a user in an electronic device and the target database is a local database of the electronic device; a processing unit configured to generate pulse code modulation (PCM) data corresponding to the response content; the retrieval unit being further configured to retrieve a plurality of lip-shape picture frames corresponding to the PCM data from the target database; the processing unit being further configured to fuse, for each lip-shape picture frame of the plurality of lip-shape picture frames, the lip-shape picture frame with a pre-recorded backplane video to obtain a digital human video frame corresponding to each lip-shape picture frame; and a display unit configured to display the digital human video frame corresponding to each lip-shape picture frame.
According to a third aspect of the present disclosure, there is provided an electronic device comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform any one of the methods of the first aspect.
According to a fourth aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions, comprising:
the computer instructions are for causing a computer to perform any one of the methods of the first aspect.
According to a fifth aspect of the present disclosure, there is provided a computer program product comprising:
a computer program which, when executed by a processor, performs any of the methods of the first aspect.
The technology of the present disclosure solves the problems of heavy network dependence and long human-computer interaction latency that arise when the electronic device must exchange data with a server. Furthermore, the technical scheme of the present disclosure can also shorten the cycle and reduce the cost of updating the digital human image.
It should be understood that the statements in this section are not intended to identify key or critical features of the embodiments of the present disclosure, nor are they intended to limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
fig. 1 is a schematic flowchart of a method for generating a digital human video according to an embodiment of the present disclosure;
fig. 2 is a schematic flow chart of another digital human video generation method provided by the embodiment of the present disclosure;
FIG. 3 is a schematic flow chart diagram illustrating a further method for generating a digital human video according to an embodiment of the present disclosure;
FIG. 4 is a schematic flow chart diagram illustrating a further method for generating a digital human video according to an embodiment of the present disclosure;
fig. 5 is a diagram illustrating an example of a lip key provided by an embodiment of the present disclosure;
FIG. 6 is a schematic flow chart diagram illustrating a further method for generating a digital human video according to an embodiment of the present disclosure;
fig. 7 is a schematic structural diagram of a digital human video generating device provided by an embodiment of the present disclosure;
fig. 8 is a block diagram of an electronic device of a digital human video generation method provided by an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In the technical scheme of the present disclosure, the collection, storage, use, processing, transmission, provision, and disclosure of the personal information involved all comply with the relevant laws and regulations and do not violate public order and good morals.
Before the digital human video generation method of the embodiments of the present disclosure is described in detail, an application scenario of the embodiments is described first.
In recent years, virtual digital human products have been widely applied in the news broadcasting and banking service industries. In the related art, when a virtual digital human is produced, a backplane video is recorded in advance, and a preset virtual digital human can be generated based on that backplane video. Then, when the user uses the virtual digital human product, the AIOT device cooperates with the digital human server to carry out a question-and-answer interaction with the user.
However, in the related art, the digital human video stream of the response content corresponding to the user's voice input can only be generated through interactive cooperation between the AIOT device and the digital human server before being played on the AIOT device to realize the human-computer interaction process. In this arrangement, the data interaction between the AIOT device and the digital human server depends heavily on the network, and the waiting latency of the human-computer interaction is long; that is, after the user inputs voice, it takes a long time to get feedback from the AIOT device.
In addition, the backplane video is filmed in advance. When the digital human image needs to be replaced, a new model (a new digital human image) must be filmed again to obtain a new backplane video, and the digital human video stream corresponding to the new digital human image is then generated from the newly obtained backplane video through data interaction between the AIOT device and the digital human server. As a result, the cycle for updating the digital human image is long and the cost is high.
In order to solve the above problems, an embodiment of the present disclosure provides a digital human video generation method applied to scenarios in which a digital human video is generated. In this method, when receiving a target voice input by a user, the electronic device retrieves the response content corresponding to the target voice from a local target database, generates PCM data corresponding to the response content, and retrieves a plurality of lip-shape picture frames corresponding to the PCM data from the target database. Each of the lip-shape picture frames is fused with a pre-recorded backplane video to obtain a digital human video frame corresponding to each lip-shape picture frame. Finally, the digital human video frame corresponding to each lip-shape picture frame is displayed.
It can be understood that the response content corresponding to a plurality of voice contents may be stored in the local target database of the electronic device. When the electronic device receives the target voice input by the user, it can directly retrieve the corresponding response content from the local target database, which improves the efficiency of obtaining the response content and thus reduces the waiting latency of the human-computer interaction. Corresponding PCM data can then be generated from the retrieved response content, and the lip-shape picture frames corresponding to the PCM data can be retrieved from the target database; that is, the target database may store lip-shape picture frames corresponding to different PCM data. After obtaining the lip-shape picture frames corresponding to the PCM data from the target database, the electronic device fuses each of them with the pre-recorded backplane video to obtain the digital human video frame corresponding to each lip-shape picture frame, and finally displays those digital human video frames. This removes the heavy dependence on the network that data interaction with a server entails, reduces the waiting latency of human-computer interaction, and improves the user experience.
The embodiments of the present disclosure do not limit the type of electronic device. The electronic device may be a self-service terminal (e.g., a bank self-service terminal, a medical self-service terminal, or a ticketing self-service terminal) or a small smart terminal, and may also be a tablet computer, a mobile phone, a desktop computer, a laptop, a handheld computer, a notebook computer, an ultra-mobile personal computer (UMPC), a netbook, a cellular phone, a Personal Digital Assistant (PDA), an Augmented Reality (AR) device, a vehicle-mounted device, or another such device.
The execution subject of the digital human video generation method provided by the present disclosure may be a digital human video generation apparatus, and the execution device may be an electronic device. The execution device may also be the Central Processing Unit (CPU) of the electronic device, or a control module in the electronic device for generating the digital human video. In the embodiments of the present disclosure, the method is described taking an electronic device executing the digital human video generation method as an example.
As shown in fig. 1, a method for generating a digital human video provided by an embodiment of the present disclosure includes:
s101, the electronic equipment searches response content corresponding to the target voice from the target database.
The target voice is the voice input by the user in the electronic device, and the target database is a local database of the electronic device.
It should be noted that, in the embodiment of the present disclosure, when a user needs to acquire required information (i.e., response content), the user may perform voice interaction with the electronic device to input a target voice in the electronic device, so that the electronic device outputs corresponding response content based on the target voice input by the user.
In one application scenario, when the user needs to obtain the answer to a question through the electronic device, the user may operate a human-computer interaction control of the electronic device to trigger it to start its voice acquisition function; then, for example to obtain today's weather information, the user may speak a question such as "How is the weather today?", and the electronic device collects the target voice input by the user. In another application scenario, when the electronic device is in a sleep state, the user may speak a wake-up word, such as "Xiaodu", to trigger the electronic device to start its voice acquisition function, then speak a question such as "How is the weather today?", and the electronic device collects the voice input by the user through its microphone.
In one possible implementation, the target database may include a plurality of sets of first correspondences, where a set of first correspondences includes a piece of voice data and its corresponding response content.
Based on this, after the electronic device receives the target voice input by the user, the electronic device may retrieve the response content corresponding to the target voice from the target database based on the received target voice and the plurality of sets of first correspondence relationships stored in the target database.
As an example, the electronic device may retrieve the response content corresponding to the target voice by sequentially searching the stored sets of first correspondences in chronological order, from oldest to newest (sequential search); by searching them in reverse chronological order, from newest to oldest (reverse search); or by searching only the first correspondences stored within a certain time period (spot check).
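As a minimal sketch of such a lookup (the data layout and all names below are illustrative assumptions, not taken from the patent), the forward and reverse search orders reduce to iterating a list of stored (voice text, response content) pairs:

```python
# Illustrative sketch only: stored "first correspondences" as (voice, response)
# pairs kept in insertion (chronological) order.
from typing import Optional

first_correspondences = [
    ("how is the weather today", "Today's weather is sunny, 8 to 22 degrees."),
    ("what time is it", "It is ten o'clock in the morning."),
]

def retrieve_response(query: str, newest_first: bool = False) -> Optional[str]:
    """Sequential search (oldest first) or reverse search (newest first)."""
    entries = reversed(first_correspondences) if newest_first else first_correspondences
    for voice_text, response in entries:
        if voice_text == query:
            return response
    return None  # no stored correspondence matched the target voice
```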
In another possible implementation manner, the target database may include a plurality of sets of second correspondences, where a set of second correspondences includes one or more semantic keywords and their corresponding response content.
Based on this, after the electronic device receives the target voice input by the user, the electronic device may retrieve the response content corresponding to the target voice from the target database based on the received target voice and the plurality of sets of second correspondence relationships stored in the target database.
As an example, the electronic device may perform semantic parsing on the target voice to obtain a plurality of semantic keywords, and then retrieve the response content corresponding to the target voice from the target database based on those semantic keywords and the stored sets of second correspondences.
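A hedged sketch of this keyword matching follows; the overlap-scoring rule and all names are assumptions made for illustration, since the patent does not specify how keywords are matched:

```python
# Each "second correspondence" maps a set of semantic keywords to a response.
second_correspondences = [
    ({"weather", "today"}, "Today's weather is sunny, 8 to 22 degrees."),
    ({"time", "now"}, "It is ten o'clock in the morning."),
]

def retrieve_by_keywords(parsed_keywords: set) -> str:
    """Return the response whose keyword set overlaps the parsed keywords most."""
    best_response, best_overlap = "", 0
    for keywords, response in second_correspondences:
        overlap = len(keywords & parsed_keywords)
        if overlap > best_overlap:
            best_response, best_overlap = response, overlap
    return best_response

print(retrieve_by_keywords({"weather", "today"}))  # prints the weather response
```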
In another possible implementation manner, the target database may include a plurality of sets of third correspondences, where each set of third correspondences includes a piece of text data and its corresponding response content.
Based on this, as shown in fig. 2, S101 may specifically include:
s201, the electronic equipment converts the target voice into text data through ASR.
S202, the electronic device retrieves the response content corresponding to the text data from the target database.
Optionally, after the user triggers the electronic device to start its voice acquisition function, the electronic device may invoke an Automatic Speech Recognition (ASR) module to recognize the collected target voice and convert it into text data.
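For concreteness, the sketch below performs this fully local speech-to-text step with the open-source Vosk recognizer; the patent does not name a specific ASR engine, so the library choice, model path, and file names are assumptions:

```python
# Offline speech-to-text sketch using Vosk (an assumption; any local ASR works).
import json
import wave

from vosk import KaldiRecognizer, Model

wf = wave.open("target_voice.wav", "rb")              # captured user voice
recognizer = KaldiRecognizer(Model("vosk-model"), wf.getframerate())

while True:
    chunk = wf.readframes(4000)                       # feed audio in chunks
    if not chunk:
        break
    recognizer.AcceptWaveform(chunk)

text_data = json.loads(recognizer.FinalResult())["text"]
print(text_data)                                      # e.g. "how is the weather today"
```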
It should be noted that, once the ASR module is invoked, the electronic device can transcribe the user's continuous speech, realizing the conversion from "voice" to "text". Digital signal processing techniques are used to automatically extract and determine the most basic and meaningful information in the speech signal. The main recognition frameworks of the ASR module are pattern matching based on Dynamic Time Warping (DTW) and statistical modeling based on Hidden Markov Models (HMM).
Specifically, when invoking the ASR module to recognize the target voice input by the user, the electronic device goes through training, recognition, and distortion measurement. During training, the speech feature parameters are analyzed in advance to produce speech templates, which are stored in a speech parameter library. During recognition, the speech to be recognized is analyzed in the same way as during training to obtain its speech parameters; these are compared one by one against the reference templates in the speech parameter library, and a decision method finds the template closest to the speech features, yielding the recognition result. The distortion measure supplies the criterion used when comparing speech feature parameter vectors.
In the training stage, the user can speak each word of the vocabulary into the electronic device. The electronic device receives the training speech and, using the DTW-based ASR module, treats each word as a recognition unit, extracts its features, and stores the result in a template library as a template.
In the recognition stage, the features of each word to be recognized are extracted and matched against every template in the template library using the DTW algorithm. The template at the shortest distance identifies the most similar word, i.e., the recognition result, after which the response content corresponding to the text data is retrieved from the target database.
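The DTW matching described above can be written out directly. The following self-contained sketch computes the classic DTW alignment cost between two feature sequences and picks the nearest template; it is a generic illustration of the algorithm rather than the patent's exact implementation:

```python
import numpy as np

def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """a, b: (frames, feature_dim) arrays; returns the optimal alignment cost."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])   # local frame distance
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return float(cost[n, m])

def recognize(features: np.ndarray, templates: dict) -> str:
    """templates: {word: feature array}; return the closest-matching word."""
    return min(templates, key=lambda word: dtw_distance(features, templates[word]))
```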
After the electronic device converts the target voice into the text data, the electronic device may retrieve response content corresponding to the text data from the target database. That is, the electronic device may retrieve the response content corresponding to the target voice from the target database based on the text data corresponding to the target voice and the multiple sets of third correspondence stored in the target database.
It should be noted that, in one possible implementation manner, the multiple sets of first, second, or third correspondences may be entered into the target database by the user in advance. For example, the user may store a plurality of pieces of voice data and their corresponding response content in the target database in advance. In another possible implementation manner, the multiple sets of first, second, or third correspondences may also be downloaded in advance by the electronic device over a network and stored in the target database.
Illustratively, in connection with the above example, the electronic device obtains the user's voice input "How is the weather today?". After receiving it, the electronic device retrieves from the target database the response content corresponding to "How is the weather today?", for example "Today's weather is sunny, the temperature is 8 to 22 degrees, with a level 3 to 4 southeast wind".
In the embodiments of the present disclosure, the electronic device converts the target speech into text data through ASR and then retrieves the corresponding response content from the target database based on that text data; retrieving on the basis of text content improves the accuracy of retrieving the response content corresponding to the target speech from the target database.
S102, the electronic device generates PCM data corresponding to the response content and retrieves a plurality of lip-shape picture frames corresponding to the PCM data from the target database.
As shown in fig. 3, the step S102 of generating PCM data corresponding to response content by the electronic device may specifically include:
s301, the electronic device generates PCM data corresponding to the response content through TTS.
It can be understood that, after the electronic device retrieves the response content corresponding to the target voice from the target database, it may invoke a Text-To-Speech (TTS) module to generate the corresponding Pulse Code Modulation (PCM) data based on the response content.
It should be noted that TTS is a technology that intelligently converts text into a natural speech stream through neural network design. TTS converts text files in real time with short conversion latency. Under a dedicated intelligent speech controller, the prosody of the synthesized text output is smooth, so the listener perceives the information as natural rather than as the cold, stilted output of a machine voice. TTS also provides an English interface, automatically distinguishes Chinese and English, and supports mixed Chinese-English reading. All sounds use Mandarin as the standard pronunciation, achieving fast speech synthesis of 120 to 150 Chinese characters per minute at a reading speed of 3 to 4 characters per second, with clear, pleasant timbre and coherent, fluent intonation.
PCM is a sampling technique for digitizing analog signals, i.e., a coding scheme that converts an analog voice signal into a digital signal. Classic PCM samples the signal 8000 times per second, with 8 bits per sample, for 64 kbit/s in total. PCM samples, quantizes, and encodes a continuously varying analog signal to produce a digital signal: sampling turns the continuous-time analog signal into a discrete-time, continuous-amplitude sampled signal; quantization turns the sampled signal into a discrete-time, discrete-amplitude digital signal; and encoding turns the quantized values into binary code groups for output. Quantization discretizes the instantaneous sample values against a set of predetermined levels, encoding each instantaneous sample with the nearest level, so that a fixed-level binary code group represents each quantized value. The advantage of PCM is good sound quality. PCM can provide private line services for digital data at rates from 2M to 155M, as well as services for voice, image transmission, distance learning, and the like. PCM is the most common and simplest waveform coding: the samples are uniformly quantized, A/D converted, and then directly encoded.
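A short sketch of this step, assuming the offline pyttsx3 engine as the TTS implementation (the patent does not mandate one) and illustrative file names: the response text is synthesized to a WAV file, whose payload is then read back as raw PCM bytes.

```python
# TTS -> PCM sketch; pyttsx3 is an assumed engine choice, not the patent's.
import wave

import pyttsx3

engine = pyttsx3.init()
engine.save_to_file("Today's weather is sunny, 8 to 22 degrees.", "reply.wav")
engine.runAndWait()                              # blocks until synthesis finishes

with wave.open("reply.wav", "rb") as wf:
    pcm_data = wf.readframes(wf.getnframes())    # raw PCM byte stream
    sample_rate = wf.getframerate()              # samples per second
    sample_width = wf.getsampwidth()             # bytes per sample
```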
After the electronic device invokes the TTS module to generate the corresponding PCM data based on the response content, it can retrieve the plurality of lip-shape picture frames corresponding to the PCM data from the target database according to the generated PCM data.
In the embodiments of the present disclosure, the electronic device may generate the PCM data corresponding to the response content through TTS, thereby converting the text content into PCM data; this improves the accuracy of obtaining the lip-shape picture frames based on the PCM data and reduces the probability that the lip-shape picture frames obtained later do not match the response content.
As shown in fig. 4, the "retrieving multiple lip-shaped picture frames corresponding to PCM data from the target database" in S102 may specifically include:
s401, inputting the PCM data into a preset deep learning network model by the electronic equipment to obtain a plurality of morpheme data corresponding to the PCM data.
Wherein the morpheme data are used to indicate lip shape parameters.
S402, the electronic device retrieves a plurality of lip-shape picture frames corresponding to the plurality of morpheme data from the target database.
Wherein one morpheme data item corresponds to one lip-shape picture frame.
In one possible implementation, the preset deep learning network model may be a Deep Neural Network (DNN), a Recurrent Neural Network (RNN), or a Convolutional Neural Network (CNN). The present disclosure does not specifically limit the preset deep learning network model.
Morpheme data refers to the smallest sound-meaning unit in a language. A language unit can be called a morpheme only if it simultaneously satisfies three conditions: it is minimal, it has sound, and it has meaning. "Minimal" refers to the size of the unit relative to higher-level sound-meaning units such as words and phrases; "meaning" (the signified aspect of the symbol) includes lexical or grammatical meaning; "sound" refers to the phonetic form of the morpheme (the signifier aspect of the symbol). Syllables, as basic phonetic units, are also often the reference units for describing a morpheme's phonetic form, and morphemes may be divided into monosyllabic, disyllabic, trisyllabic, and multi-syllabic according to the morpheme-syllable correspondence. For Chinese, in which monosyllabic morphemes dominate, monosyllabic language units can be identified directly from the definition of a morpheme, while disyllabic and multisyllabic language units can be identified by substitution, i.e., using known morphemes to substitute for the language unit being tested.
Illustratively, when the PCM data corresponds to the syllables "a", "ai", and "dai", inputting the PCM data into the preset deep learning network model yields one morpheme "a" for "a", two morphemes "a" and "i" for "ai", and three morphemes "d", "a", and "i" for "dai". Further, the electronic device may determine the lip-shape key points corresponding to each morpheme (fig. 5 shows the lip-shape key points corresponding to "a"), so that the electronic device can retrieve the lip-shape picture frames corresponding to each morpheme data from the target database based on those lip-shape key points.
In this embodiment, in combination with the above example, after the electronic device obtains the response content "Today's weather is sunny, the temperature is 8 to 22 degrees, with a level 3 to 4 southeast wind" for "How is the weather today?", the TTS module may generate the corresponding PCM data from that response content, for example PCM data for the syllables "jin", "tian", "qi", and "qing". The electronic device then inputs the generated PCM data into the preset deep learning network model to obtain the corresponding morpheme data, for example "j", "i", "n", "t", "a", "q", and "g". Finally, the electronic device retrieves the corresponding lip-shape picture frames from the target database according to the obtained morpheme data.
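Putting S401 and S402 together, a minimal sketch of the retrieval pipeline might look as follows; the model stub, database layout, and all names are hypothetical, since the patent only requires that one morpheme maps to one lip-shape picture frame:

```python
# Hypothetical sketch: PCM bytes -> morphemes -> stored lip-shape frames.
lip_frame_db = {"j": "lip_j.png", "i": "lip_i.png", "n": "lip_n.png"}

def predict_morphemes(pcm_data: bytes) -> list:
    """Stub for the preset deep learning network model (DNN, RNN, or CNN)."""
    raise NotImplementedError  # a trained model would run inference here

def retrieve_lip_frames(pcm_data: bytes) -> list:
    frames = []
    for morpheme in predict_morphemes(pcm_data):
        frame = lip_frame_db.get(morpheme)    # one morpheme -> one lip frame
        if frame is not None:
            frames.append(frame)
    return frames
```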
In the embodiment of the disclosure, the electronic device obtains a plurality of morpheme data corresponding to the PCM data through a preset deep learning network model, and then retrieves a lip-shaped picture frame corresponding to each morpheme data from the target database based on a lip-shaped parameter indicated by each morpheme data. Therefore, the accuracy of the obtained lip shape picture frame can be improved based on the morpheme data.
S103, for each lip-shape picture frame of the plurality of lip-shape picture frames, the electronic device fuses that lip-shape picture frame with a pre-recorded backplane video to obtain the digital human video frame corresponding to each lip-shape picture frame.
In a possible implementation manner, the fusion of a lip-shape picture frame with the pre-recorded backplane video may specifically use a GLSL shader, exploiting the high performance of the Graphics Processing Unit (GPU) to achieve a 1080P image quality effect even on lower-end electronic devices.
It should be noted that, by means of the GLSL shader, the fusion of the lip-shape picture frame and the pre-recorded backplane video is executed on the GPU. On one hand, the high concurrency of the GPU speeds up the fusion; on the other hand, it reduces the load on the Central Processing Unit (CPU), so that higher image quality can be achieved on the electronic device.
For example, after the electronic device retrieves the lip-shape picture frame corresponding to each morpheme data item from the target database, the lip-shape picture frames corresponding to "j", "i", "n", "t", "a", "q", and "g" are each fused with the pre-recorded backplane video using the GLSL shader and the high performance of the GPU, yielding video frames in which the digital human in the backplane video utters "j", "i", "n", "t", "a", "q", and "g" (i.e., the digital human video frame corresponding to each lip-shape picture frame).
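The per-pixel operation such a shader performs is essentially an alpha blend of the lip region over the backplane frame. The CPU-side sketch below mirrors that blend in NumPy for clarity; the mask-based formulation is an assumption, and a production path would run the same arithmetic in the GLSL fragment shader on the GPU:

```python
import numpy as np

def fuse(base_frame: np.ndarray, lip_frame: np.ndarray,
         mask: np.ndarray) -> np.ndarray:
    """base_frame, lip_frame: (H, W, 3) uint8; mask: (H, W) floats in [0, 1].

    Returns the backplane frame with the lip region blended in.
    """
    alpha = mask[..., None]                       # broadcast over color channels
    blended = alpha * lip_frame + (1.0 - alpha) * base_frame
    return blended.astype(np.uint8)
```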
As shown in fig. 6, for another method for generating a digital human video according to an embodiment of the present disclosure, the step S103 may specifically include:
s501, the electronic equipment fuses each lip-shaped picture frame, the pre-recorded bottom plate video and the target character image to obtain a digital human video frame corresponding to each lip-shaped picture frame.
In one application scenario, when the digital human image needs to be replaced, the electronic device may obtain in advance the target person image corresponding to the replacement digital human image. Then, while fusing each lip-shape picture frame with the pre-recorded backplane video, it also fuses the target person image with them, replacing the digital human image in the pre-recorded backplane video with the target person image.
In a possible implementation, the target person image and the pre-recorded backplane video can be processed by an end-to-end sequence learning model to obtain the face feature parameters of the target person; the target person image is then fused into the pre-recorded backplane video according to those face feature parameters, yielding the face-swapped image.
It should be noted that the end-to-end sequence learning model may be an end-to-end text-to-speech conversion model, such as the FastSpeech model, a sequence learning model built as a feed-forward network from Transformer self-attention and one-dimensional convolution.
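As one possible compositing step for this replacement (an illustration only: the patent derives face feature parameters with the sequence learning model and does not prescribe a blending routine), OpenCV's Poisson blending can paste the target face onto a backplane frame while keeping lighting consistent at the seam. File names and the whole-patch mask are assumptions:

```python
import cv2
import numpy as np

base_frame = cv2.imread("backplane_frame.png")     # frame from the backplane video
target_face = cv2.imread("target_person.png")      # target person image

mask = 255 * np.ones(target_face.shape[:2], dtype=np.uint8)   # blend whole patch
center = (base_frame.shape[1] // 2, base_frame.shape[0] // 2) # paste location

# Poisson (seamless) cloning blends the target face into the frame.
swapped = cv2.seamlessClone(target_face, base_frame, mask, center,
                            cv2.NORMAL_CLONE)
cv2.imwrite("swapped_frame.png", swapped)
```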
It should be noted that, for each of the lip-shape picture frames, the electronic device needs to perform the step in S103 separately, so as to obtain the digital human video frame corresponding to each lip-shape picture frame.
In the embodiments of the present disclosure, in a scenario where the digital human image needs to be replaced, fusing the lip-shape picture frame, the pre-recorded backplane video, and the target person image directly generates the digital human video stream corresponding to the required digital human image, without filming the new digital human image again to obtain a new backplane video. This further improves the efficiency of generating the virtual digital human video stream and reduces the cycle time and cost of generating it.
S104, the electronic device displays the digital human video frame corresponding to each lip-shape picture frame.
In a possible implementation manner, the electronic device may sequentially display the digital human video frame corresponding to each lip-shape picture frame, forming a digital human video stream played on the screen, so as to complete the human-computer interaction with the user.
In another possible implementation manner, the electronic device may further synthesize the digital human video frames corresponding to each lip-shaped picture frame to obtain a digital human video stream, and display the digital human video stream.
In the embodiments of the present disclosure, after the digital human video frame corresponding to each lip-shape picture frame is obtained, the frames may be combined by video synthesis technology into a digital human video stream.
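A minimal sketch of this synthesis step, assuming OpenCV's VideoWriter with illustrative codec, frame rate, and output path:

```python
import cv2

def write_video(frames, path="digital_human.mp4", fps=25):
    """frames: list of (H, W, 3) BGR uint8 arrays, one per digital human frame."""
    height, width = frames[0].shape[:2]
    fourcc = cv2.VideoWriter_fourcc(*"mp4v")          # illustrative codec choice
    writer = cv2.VideoWriter(path, fourcc, fps, (width, height))
    for frame in frames:
        writer.write(frame)
    writer.release()                                  # finalize the stream
```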
Further, after the digital human video stream is obtained, the electronic device can directly play the digital human video stream in the screen to complete human-computer interaction with the user.
In a possible implementation manner, the obtained digital human video stream may further correspond to a Non-Fungible Token (NFT) on a blockchain. An NFT is unique: based on blockchain technology, the obtained digital human video stream establishes a unique mapping with a specific digital commodity from the moment of its creation, and the NFT can serve as the corresponding, unique certificate of rights for that digital commodity on the specific blockchain. Under a limited NFT sale, works bearing different serial numbers on the blockchain are individually owned, and during a transaction the rights certificate's transaction information can be recorded on the blockchain through a smart contract, enabling trusted traceability. An NFT is unique only on its own specific blockchain.
In this embodiment, the electronic device may independently complete all the steps of the above technical scheme, realizing the human-computer interaction function between the electronic device and the user without any data interaction with a server. Alternatively, the electronic device may work with a server, with the server completing part of the steps of the above technical scheme (for example, executing the contents of steps S102 and S103), so that the server and the electronic device jointly realize the human-computer interaction function with the user. The specific scheme may be determined according to the specific application scenario, and the present disclosure does not specifically limit it.
Based on the above technical scheme, the electronic device can obtain the corresponding response content from the local target database based on the target voice input by the user, generate the PCM data corresponding to the response content, obtain the corresponding lip-shape picture frames from the local target database based on the PCM data, and then obtain the required digital human video stream by fusing the lip-shape picture frames with the preset backplane video. Because the lip-shape picture frames of the response content corresponding to the user's target voice are obtained from the electronic device's local database and fused with the preset backplane video, the required digital human video stream can be generated locally without any data interaction with a digital human server; no network dependence is required, the efficiency of generating the virtual digital human video stream is improved, and the waiting latency of human-computer interaction is reduced. Furthermore, in a scenario where the digital human image needs to be replaced, fusing the lip-shape picture frame, the pre-recorded backplane video, and the target person image directly generates the digital human video stream corresponding to the required digital human image, which further improves the efficiency of generating the virtual digital human video stream and reduces its cost.
The foregoing describes the solution provided by embodiments of the present disclosure, primarily from the perspective of a computer device. It will be appreciated that the computer device, in order to implement the above-described functions, comprises corresponding hardware structures and/or software modules for performing the respective functions. Those of skill in the art will readily appreciate that the exemplary digital human video generation method steps described in connection with the embodiments disclosed herein may be implemented as hardware or a combination of hardware and computer software. Whether a function is performed in hardware or computer software drives hardware depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.
The embodiment of the present disclosure may perform division of functional modules or functional units on the digital human video generation manner according to the above method example, for example, each functional module or functional unit may be divided corresponding to each function, or two or more functions may be integrated into one processing module. The integrated module may be implemented in a form of hardware, or may be implemented in a form of a software functional module or a functional unit. The division of the modules or units in the embodiments of the present disclosure is schematic, and is only a logic function division, and there may be another division manner in actual implementation.
As shown in fig. 7, a schematic structural diagram of a digital human video generating device according to an embodiment of the present disclosure is provided. The digital human video generating apparatus may include: a retrieval unit 701, a processing unit 702, and a display unit 703.
A retrieving unit 701, configured to retrieve response content corresponding to a target voice from a target database, where the target voice is a voice input by a user in an electronic device, and the target database is a local database of the electronic device; a processing unit 702, configured to generate pulse code modulation (PCM) data corresponding to the response content; the retrieving unit 701, further configured to retrieve, from the target database, a plurality of lip-shape picture frames corresponding to the PCM data; the processing unit 702, further configured to fuse, for each lip-shape picture frame of the plurality of lip-shape picture frames, the lip-shape picture frame with a pre-recorded backplane video to obtain a digital human video frame corresponding to each lip-shape picture frame; and the display unit 703, configured to display the digital human video frame corresponding to each lip-shape picture frame.
Optionally, the processing unit 702 is specifically configured to convert the target speech into text data through an Automatic Speech Recognition (ASR) technique; the retrieving unit 701 is specifically configured to retrieve the response content corresponding to the text data from the target database.
Optionally, the processing unit 702 is specifically configured to generate the PCM data corresponding to the response content through a Text-To-Speech (TTS) technique.
Optionally, the processing unit 702 is specifically configured to input the PCM data into a preset deep learning network model to obtain a plurality of morpheme data corresponding to the PCM data, where the morpheme data are used to indicate lip shape parameters; the retrieving unit 701 is specifically configured to retrieve, from the target database, the plurality of lip-shape picture frames corresponding to the plurality of morpheme data, where one morpheme data item corresponds to one lip-shape picture frame.
Optionally, the processing unit 702 is specifically configured to fuse each lip-shape picture frame, the pre-recorded backplane video, and the target person image to obtain the digital human video frame corresponding to each lip-shape picture frame.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 8 illustrates a schematic block diagram of an example electronic device 800 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown here, their connections and relationships, and their functions are meant to be examples only and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 8, the device 800 includes a computing unit 801, which can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 802 or a computer program loaded from a storage unit 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data necessary for the operation of the device 800 can also be stored. The computing unit 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
A number of components in the device 800 are connected to the I/O interface 805, including: an input unit 806, such as a keyboard, a mouse, or the like; an output unit 807 such as various types of displays, speakers, and the like; a storage unit 808, such as a magnetic disk, optical disk, or the like; and a communication unit 809 such as a network card, modem, wireless communication transceiver, etc. The communication unit 809 allows the device 800 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The computing unit 801 may be any of various general and/or special purpose processing components with processing and computing capabilities. Some examples of the computing unit 801 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and the like. The computing unit 801 performs the respective methods and processes described above, such as the digital human video generation method. For example, in some embodiments, the digital human video generation method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 808. In some embodiments, part or all of the computer program can be loaded and/or installed onto the device 800 via the ROM 802 and/or the communication unit 809. When loaded into the RAM 803 and executed by the computing unit 801, the computer program may perform one or more steps of the digital human video generation method described above. Alternatively, in other embodiments, the computing unit 801 may be configured to perform the digital human video generation method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems On a Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user may provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (13)

1. A digital human video generation method, comprising:
retrieving response content corresponding to a target voice from a target database, wherein the target voice is a voice input by a user in an electronic device, and the target database is a local database of the electronic device;
generating Pulse Code Modulation (PCM) data corresponding to the response content, and retrieving a plurality of lip shape picture frames corresponding to the PCM data from the target database;
for each lip-shape picture frame of the plurality of lip-shape picture frames, fusing the lip-shape picture frame with a pre-recorded backplane video to obtain a digital human video frame corresponding to each lip-shape picture frame;
and displaying the digital human video frame corresponding to each lip shape picture frame.
2. The method of claim 1, wherein said retrieving response content corresponding to the target voice from the target database comprises:
converting the target voice into text data through an Automatic Speech Recognition (ASR) technique;
and retrieving the response content corresponding to the text data from the target database.
3. The method according to claim 1 or 2, wherein said generating Pulse Code Modulation (PCM) data corresponding to the response content comprises:
generating the PCM data corresponding to the response content through a Text-To-Speech (TTS) technique.
4. The method of any of claims 1 to 3, wherein said retrieving a plurality of lip picture frames corresponding to the PCM data from the target database comprises:
inputting the PCM data into a preset deep learning network model to obtain a plurality of morpheme data corresponding to the PCM data, wherein the morpheme data are used for indicating lip parameters;
and retrieving the lip shape picture frames corresponding to the morpheme data from the target database, wherein one morpheme data corresponds to one lip shape picture frame.
5. The method according to any one of claims 1 to 4, wherein the fusing each lip-like picture frame with the pre-recorded backplane video to obtain a digital human video frame corresponding to each lip-like picture frame comprises:
and fusing each lip-shape picture frame, the pre-recorded backplane video, and the target person image to obtain a digital human video frame corresponding to each lip-shape picture frame.
6. A digital human video generating apparatus, comprising:
a retrieval unit, configured to retrieve response content corresponding to a target voice from a target database, wherein the target voice is a voice input by a user into an electronic device, and the target database is a local database of the electronic device;
a processing unit, configured to generate Pulse Code Modulation (PCM) data corresponding to the response content;
the retrieval unit is further configured to retrieve, from the target database, a plurality of lip-shaped picture frames corresponding to the PCM data;
the processing unit is further configured to, for each lip-shaped picture frame of the plurality of lip-shaped picture frames, fuse the lip-shaped picture frame with a pre-recorded bottom plate video to obtain a digital human video frame corresponding to the lip-shaped picture frame; and
a display unit, configured to display the digital human video frame corresponding to each lip-shaped picture frame.
7. The digital human video generating apparatus of claim 6, wherein:
the processing unit is specifically configured to convert the target voice into text data through automatic speech recognition (ASR); and
the retrieval unit is specifically configured to retrieve the response content corresponding to the text data from the target database.
8. The digital human video generating apparatus of claim 6 or 7, wherein
the processing unit is specifically configured to generate the PCM data corresponding to the response content through text-to-speech (TTS) synthesis.
9. The digital human video generating apparatus of any one of claims 6 to 8, wherein:
the processing unit is specifically configured to input the PCM data into a preset deep learning network model to obtain a plurality of pieces of morpheme data corresponding to the PCM data, wherein the morpheme data are used for indicating lip parameters; and
the retrieval unit is specifically configured to retrieve, from the target database, the lip-shaped picture frames corresponding to the morpheme data, wherein each piece of morpheme data corresponds to one lip-shaped picture frame.
10. The digital human video generating apparatus of any one of claims 6 to 9, wherein
the processing unit is specifically configured to fuse each lip-shaped picture frame, the pre-recorded bottom plate video, and a target person image to obtain the digital human video frame corresponding to the lip-shaped picture frame.
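The apparatus of claims 6 to 10 maps naturally onto one object per claimed unit. The classes, method names, and toy placeholder bodies below are hypothetical; the sketch only shows how the units could be wired the way the claims wire them.

```python
# Claims 6-10 sketch: one plain class per claimed unit, with toy internals.

class RetrievalUnit:
    def __init__(self, db: dict):
        self.db = db
    def response_for(self, text: str) -> str:
        return self.db["responses"][text]          # claim 6: response lookup
    def lip_frames_for(self, morphemes) -> list:
        return [self.db["lip_frames"][m] for m in morphemes]  # claim 9: frame lookup

class ProcessingUnit:
    def pcm_for(self, text: str) -> bytes:
        return text.encode()                       # claim 8: placeholder TTS -> PCM
    def fuse(self, lip_frame, base) -> str:
        return f"{base}+{lip_frame}"               # claim 10: placeholder fusion

class DisplayUnit:
    def show(self, frame) -> None:
        print("displaying", frame)                 # claim 6: display the frame

ru = RetrievalUnit({"responses": {"hi": "Hello!"},
                    "lip_frames": {0: "closed", 1: "open"}})
pu, du = ProcessingUnit(), DisplayUnit()
pcm = pu.pcm_for(ru.response_for("hi"))
for lip in ru.lip_frames_for([b % 2 for b in pcm]):  # byte parity as a fake morpheme id
    du.show(pu.fuse(lip, "base_frame"))
```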
11. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-5.
12. A non-transitory computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-5.
13. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-5.
CN202211347701.3A 2022-10-31 2022-10-31 Digital human video generation method, device, equipment and storage medium Pending CN115695943A (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202211347701.3A | 2022-10-31 | 2022-10-31 | Digital human video generation method, device, equipment and storage medium

Publications (1)

Publication Number | Publication Date
CN115695943A | 2023-02-03

Family

ID=85045307

Family Applications (1)

Application Number | Title | Priority Date | Filing Date | Status
CN202211347701.3A | Digital human video generation method, device, equipment and storage medium | 2022-10-31 | 2022-10-31 | Pending

Country Status (1)

Country | Link
CN | CN115695943A

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110131041A1 (en) * 2009-11-27 2011-06-02 Samsung Electronica Da Amazonia Ltda. Systems And Methods For Synthesis Of Motion For Animation Of Virtual Heads/Characters Via Voice Processing In Portable Devices
US20200184316A1 (en) * 2017-06-09 2020-06-11 Deepmind Technologies Limited Generating discrete latent representations of input data items
CN112131988A (en) * 2020-09-14 2020-12-25 北京百度网讯科技有限公司 Method, device, equipment and computer storage medium for determining virtual character lip shape
CN112233210A (en) * 2020-09-14 2021-01-15 北京百度网讯科技有限公司 Method, device, equipment and computer storage medium for generating virtual character video
CN113011301A (en) * 2021-03-10 2021-06-22 北京百度网讯科技有限公司 Living body identification method and device and electronic equipment
US20210201886A1 (en) * 2020-09-14 2021-07-01 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and device for dialogue with virtual object, client end, and storage medium
CN114242037A (en) * 2020-09-08 2022-03-25 华为技术有限公司 Virtual character generation method and device
CN114429767A (en) * 2022-01-26 2022-05-03 北京百度网讯科技有限公司 Video generation method and device, electronic equipment and storage medium
CN114663556A (en) * 2022-03-29 2022-06-24 北京百度网讯科技有限公司 Data interaction method, device, equipment, storage medium and program product
CN114998489A (en) * 2022-05-26 2022-09-02 中国平安人寿保险股份有限公司 Virtual character video generation method and device, computer equipment and storage medium

Similar Documents

Publication Publication Date Title
US10614803B2 (en) Wake-on-voice method, terminal and storage medium
CN111627418B (en) Training method, synthesizing method, system, device and medium for speech synthesis model
CN109817213B (en) Method, device and equipment for performing voice recognition on self-adaptive language
KR102484967B1 (en) Voice conversion method, electronic device, and storage medium
US20240021202A1 (en) Method and apparatus for recognizing voice, electronic device and medium
JP2020112787A (en) Real-time voice recognition method based on cutting attention, device, apparatus and computer readable storage medium
KR20170022445A (en) Apparatus and method for speech recognition based on unified model
US20230127787A1 (en) Method and apparatus for converting voice timbre, method and apparatus for training model, device and medium
KR20160077190A (en) Natural expression processing method, processing and response method, device, and system
US20220383876A1 (en) Method of converting speech, electronic device, and readable storage medium
US20230178067A1 (en) Method of training speech synthesis model and method of synthesizing speech
CN117099157A (en) Multitasking learning for end-to-end automatic speech recognition confidence and erasure estimation
JP7335569B2 (en) Speech recognition method, device and electronic equipment
CN117043856A (en) End-to-end model on high-efficiency streaming non-recursive devices
US20240273311A1 (en) Robust Direct Speech-to-Speech Translation
JP2024512605A (en) Mixed-model attention for flexible streaming and non-streaming automatic speech recognition
US20230410794A1 (en) Audio recognition method, method of training audio recognition model, and electronic device
EP4156174A2 (en) Voice generating method and apparatus, electronic device and storage medium
Gilbert et al. Intelligent virtual agents for contact center automation
CN115019787A (en) Interactive homophonic and heteronym word disambiguation method, system, electronic equipment and storage medium
CN114783428A (en) Voice translation method, voice translation device, voice translation model training method, voice translation model training device, voice translation equipment and storage medium
CN114512121A (en) Speech synthesis method, model training method and device
CN115695943A (en) Digital human video generation method, device, equipment and storage medium
CN113689866A (en) Training method and device of voice conversion model, electronic equipment and medium
CN114360558B (en) Voice conversion method, voice conversion model generation method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination