CN114663556A

CN114663556A - Data interaction method, device, equipment, storage medium and program product

Info

Publication number: CN114663556A
Application number: CN202210327776.9A
Authority: CN
Inventors: 张演龙; 李彤辉; 孙静静
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2022-03-29
Filing date: 2022-03-29
Publication date: 2022-06-24
Also published as: JP2023059937A; KR20230005079A

Abstract

The present disclosure provides a data interaction method, apparatus, device, storage medium, and program product, which relate to the technical field of artificial intelligence, specifically to the technical field of deep learning, image processing, and computer vision, and can be applied to scenes such as face recognition. The specific implementation scheme is as follows: determining phoneme data corresponding to the response data in response to the response data; determining target lip shape image frames corresponding to the phoneme data one by one; respectively fusing the target lip-shaped image frame with the basic video frame to obtain a target video frame; and rendering the target video frame to obtain target display data.

Description

Data interaction method, device, equipment, storage medium and program product

Technical Field

The present disclosure relates to the field of artificial intelligence, in particular to deep learning, image processing, and computer vision, and may be applied to scenarios such as face recognition. More specifically, it relates to a data interaction method, apparatus, device, storage medium, and program product.

Background

With the development of computer technology and internet technology, many intelligent products have a data interaction function so as to improve the use experience of users.

Disclosure of Invention

The disclosure provides a data interaction method, apparatus, device, storage medium and program product.

According to an aspect of the present disclosure, there is provided a data interaction method including determining phoneme data corresponding to response data in response to the response data; determining target lip shape image frames corresponding to the phoneme data one by one; respectively fusing the target lip-shaped image frame with the basic video frame to obtain a target video frame; and rendering the target video frame to obtain target display data.

According to another aspect of the present disclosure, there is provided a data interaction apparatus including a phoneme data determining module for determining phoneme data corresponding to response data in response to the response data; the target lip shape image frame determining module is used for determining target lip shape image frames corresponding to the phoneme data one by one; the fusion module is used for fusing the target lip-shaped image frame with the basic video frame respectively to obtain a target video frame; and the rendering module is used for rendering the target video frame to obtain target display data.

According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor and a memory communicatively coupled to the at least one processor. The memory stores instructions executable by the at least one processor to cause the at least one processor to perform the method of the disclosed embodiments.

According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of the embodiments of the present disclosure.

According to another aspect of the present disclosure, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the method of an embodiment of the present disclosure.

It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.

Drawings

The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:

FIG. 1 schematically illustrates a system architecture diagram of a data interaction method and apparatus according to an embodiment of the present disclosure;

FIG. 2 schematically shows a flow diagram of a data interaction method according to an embodiment of the present disclosure;

FIG. 3 schematically shows a schematic diagram of a data interaction method according to another embodiment of the present disclosure;

FIG. 4 schematically illustrates a schematic diagram of determining phoneme data according to an embodiment of the present disclosure;

FIG. 5 schematically illustrates a schematic diagram of determining a target lip image frame according to an embodiment of the present disclosure;

FIG. 6 schematically shows a schematic diagram of obtaining a target video frame according to an embodiment of the present disclosure;

FIG. 7 schematically illustrates a schematic diagram of obtaining target display data according to an embodiment of the present disclosure;

FIG. 8 schematically shows a flow chart of a method of data interaction according to a further embodiment of the present disclosure;

FIG. 9 schematically shows a block diagram of a data interaction device according to an embodiment of the present disclosure; and

FIG. 10 schematically shows a block diagram of an electronic device that can implement the data interaction method of the embodiment of the present disclosure.

Detailed Description

Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. The terms "comprises," "comprising," and the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.

All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It is noted that the terms used herein should be interpreted as having a meaning that is consistent with the context of this specification and should not be interpreted in an idealized or overly formal sense.

Where a convention analogous to "at least one of A, B and C, etc." is used, in general such a construction is intended in the sense one having skill in the art would understand the convention (e.g., "a system having at least one of A, B and C" would include but not be limited to systems that have A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B, C together, etc.).

For example, a variety of smart products exchange data with users through avatars such as digital humans. A digital human is a virtual image that can simulate the form and functions of the human body.

In some embodiments, the terminal device may collect input voice data of a user and send it to the digital human server background. After obtaining the input voice data, the server background may perform voice analysis and generate response content according to the analyzed input voice data. Each image frame of the avatar is then generated, driven by the response content; the image frames are encoded into a video stream, and the video stream is pushed to a streaming media server. The terminal device can pull the video stream from the streaming media server and play it, which completes the data interaction process.

Fig. 1 schematically shows a system architecture of a data interaction method and apparatus according to an embodiment of the present disclosure. It should be noted that fig. 1 is only an example of a system architecture to which the embodiments of the present disclosure may be applied to help those skilled in the art understand the technical content of the present disclosure, and does not mean that the embodiments of the present disclosure may not be applied to other devices, systems, environments or scenarios.

As shown in fig. 1, a system architecture 100 according to this embodiment may include clients 101, 102, 103, a network 104, and a server 105. Network 104 is the medium used to provide communication links between clients 101, 102, 103 and server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.

A user may use clients 101, 102, 103 to interact with server 105 over network 104 to receive or send messages, etc. Various messaging client applications, such as shopping-like applications, web browser applications, search-like applications, instant messaging tools, mailbox clients, social platform software, etc. (examples only) may be installed on the clients 101, 102, 103.

Clients 101, 102, 103 may be a variety of electronic devices having display screens and supporting web browsing, including but not limited to smart phones, tablets, laptop and desktop computers, and the like. The clients 101, 102, 103 of the disclosed embodiments may run applications, for example.

The server 105 may be a server that provides various services, such as a background management server (for example only) that provides support for websites browsed by users using the clients 101, 102, 103. The background management server may analyze and otherwise process received data such as a user request, and feed back a processing result (e.g., a webpage, information, or data obtained or generated according to the user request) to the client. In addition, the server 105 may also be a cloud server, i.e., the server 105 has a cloud computing function.

It should be noted that the data interaction method provided by the embodiments of the present disclosure may be executed by the clients 101, 102, 103. Accordingly, the data interaction device provided by the embodiment of the present disclosure may be disposed in the clients 101, 102, 103. The data interaction method provided by the embodiments of the present disclosure may also be performed by a client or a cluster of clients different from the clients 101, 102, 103 and capable of communicating with the server 105 and/or the clients 101, 102, 103. Correspondingly, the data interaction device provided by the embodiment of the present disclosure may also be disposed in a client or a client cluster different from the clients 101, 102, 103 and capable of communicating with the server 105 and/or the clients 101, 102, 103.

In one example, clients 101, 102, 103 may obtain response data from server 105 over network 104.

It should be understood that the number of clients, networks, and servers in FIG. 1 is merely illustrative. There may be any number of clients, networks, and servers, as desired for an implementation.

It should be noted that in the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, and disclosure of users' personal information all comply with the relevant laws and regulations and do not violate public order and good customs.

In the technical scheme of the disclosure, before the personal information of the user is acquired or collected, the authorization or the consent of the user is acquired.

The embodiment of the present disclosure provides a data interaction method, and a data interaction method according to an exemplary embodiment of the present disclosure is described below with reference to fig. 2 to 4 in conjunction with the system architecture of fig. 1. The data interaction method of the embodiment of the present disclosure may be performed by a client illustrated in fig. 1, for example.

Fig. 2 schematically shows a flow chart of a data interaction method according to an embodiment of the present disclosure.

As shown in fig. 2, the data interaction method 200 of the embodiment of the present disclosure may include, for example, operations S210 to S240.

In operation S210, phoneme data corresponding to the response data is determined in response to the response data.

The data interaction method of the embodiment of the present disclosure will be described by taking a digital human as an example. Response data may be understood as data for responding to input data. For example, a user asks a question through a client, the question is input data, the reply content to the question is response data, and the response data can be output to the user by a digital person displayed on the client.

A phoneme is understood to be the smallest unit of speech that is divided according to the natural properties of the speech, and each pronunciation action in a syllable may constitute a phoneme. Therefore, a phoneme is the smallest unit or the smallest speech segment constituting a syllable, and is the smallest linear speech unit divided from the viewpoint of sound quality.

When the response content is in a voice format, corresponding phoneme data can be obtained according to the response content. When the response content is in the text format, the response content in the text format may be converted into the speech format, and the corresponding phoneme data may be determined.

In operation S220, target lip image frames in one-to-one correspondence with the phoneme data are determined.

It is understood that phonemes are the smallest phonetic unit. Therefore, with respect to the response data, the target lip image frame determined from the phoneme data can accurately reflect the lip action when the response data is speech-output.

In operation S230, the target lip-shaped image frames are respectively fused with the base video frames to obtain target video frames.

A base video frame may be understood as a template video frame of a digital person, which may for example comprise the overall image and the background of the digital person. It can be appreciated that the digital human simulates a real human voice to output response data. Correspondingly, when the voice outputs the response data, the lip shape changes along with the change of different pronunciations, other parts of the digital person can be kept unchanged, and the target video frame obtained by fusing the target lip shape image frame and the basic video frame can accurately reflect the whole state of the digital person.

In operation S240, the target video frame is rendered, resulting in target display data.

The target display data may be encoded to form a video stream, for example, and the video stream may be played at the client.

The data interaction method of the embodiment of the disclosure can accurately represent the pronunciation process when the response data is output by voice by determining the phoneme data of the minimum voice unit. By determining the target lip shape image frame corresponding to the phoneme data, the lip shape when the voice is output as the response data can be accurately determined. By fusing the target lip-shaped image frame and the basic video frame, the target video frame can be determined quickly and efficiently, the target display data obtained by rendering the target video frame can accurately display the integral state of the digital person, and the use experience of a user is improved.

The data interaction method can perform data interaction in a network-free environment. For example, the relevant operations of the data interaction method of the embodiment of the present disclosure can all be executed by the client, and data interaction with the server through the network is not needed, so that the dependence on the network can be reduced, and the situations of reduced response speed, non-timely interaction and the like during data interaction caused by poor network quality are avoided. Therefore, the execution process of each operation of the data interaction method is faster and more efficient, the data interaction efficiency is higher, and the use experience of a user can be improved.

FIG. 3 schematically shows a schematic diagram of a data interaction method according to another embodiment of the present disclosure.

As shown in fig. 3, a data interaction method 300 according to another embodiment of the present disclosure may include operations S310 to S340.

In operation S310, in response to the response data 301, phoneme data corresponding to the response data 301 is determined. Fig. 3 schematically shows n pieces of phoneme data, for example, phoneme data Phone_1 to phoneme data Phone_n.

In operation S320, target lip image frames corresponding to the phoneme data one to one are determined. Fig. 3 schematically shows n target lip image frames, for example, a target lip image frame PL_1 to a target lip image frame PL_n.

In operation S330, the target lip image frames are respectively fused with the base video frame Pf to obtain target video frames. Fig. 3 schematically shows n target video frames, e.g. target video frame PT_1 to target video frame PT_n.

In operation S340, the target video frames are rendered, resulting in target display data. Fig. 3 schematically shows n pieces of target display data, for example, target display data V_1 to target display data V_n.
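The chaining of operations S310 to S340 can be sketched as follows. This is only an illustrative outline, not the implementation of the embodiment: the function `run_pipeline` and the four callables passed to it are hypothetical stand-ins, and strings are used in place of real phoneme, image, and video data.

```python
from typing import Callable, List, Sequence


def run_pipeline(
    response_data: str,
    to_phonemes: Callable[[str], Sequence[str]],
    lip_frame_for: Callable[[str], str],
    fuse: Callable[[str, str], str],
    render: Callable[[str], str],
    base_frame: str,
) -> List[str]:
    """Chain S310 to S340; every callable is a hypothetical placeholder."""
    phonemes = to_phonemes(response_data)                         # S310
    lip_frames = [lip_frame_for(p) for p in phonemes]             # S320: one frame per phoneme
    target_frames = [fuse(lf, base_frame) for lf in lip_frames]   # S330
    return [render(tf) for tf in target_frames]                   # S340


# Toy stand-ins that only label the data flowing through each stage.
display = run_pipeline(
    "n i h ao",
    to_phonemes=str.split,
    lip_frame_for=lambda p: f"PL[{p}]",
    fuse=lambda lip, base: f"{base}+{lip}",
    render=lambda frame: f"V({frame})",
    base_frame="Pf",
)
```

The one-to-one correspondence is visible in the list comprehensions: n phonemes yield n lip image frames, n target video frames, and n pieces of display data.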

Fig. 4 schematically shows a schematic diagram of determining phoneme data in a data interaction method according to still another embodiment of the present disclosure.

According to still another embodiment of the present disclosure, a specific example of determining phoneme data corresponding to response data in response to the response data in the data interaction method may be implemented using the following embodiments. The response data may include response voice data.

As shown in fig. 4, in operation S411, a speech feature vector 402 of a sound frame is determined according to response speech data 401.

There may be a plurality of sound frames, each of which is obtained by dividing the response voice data according to a division frequency.
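One way to read the division into sound frames is sketched below. It assumes that the division frequency means the number of sound frames per second and that frames do not overlap; neither assumption is stated in the text, and the function name is hypothetical.

```python
from typing import List, Sequence


def split_into_sound_frames(
    samples: Sequence[float], sample_rate: int, division_frequency: int
) -> List[List[float]]:
    """Cut samples into equal, non-overlapping sound frames.

    Assumption: division_frequency is the number of frames per second.
    A trailing partial frame is dropped.
    """
    frame_len = sample_rate // division_frequency
    if frame_len <= 0:
        raise ValueError("division_frequency must not exceed sample_rate")
    last_start = len(samples) - frame_len
    return [list(samples[i:i + frame_len]) for i in range(0, last_start + 1, frame_len)]


# One second of placeholder samples at 16 kHz, divided at 100 frames per second.
frames = split_into_sound_frames(list(range(16000)), 16000, 100)
```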

A speech feature vector may be understood as a feature vector extracted from speech data that is available for computer processing.

The voice characteristic vector accords with or is similar to the auditory perception characteristic of human ears, and can enhance voice information to a certain extent and inhibit non-voice signals.

Illustratively, speech feature vectors may be extracted from the response voice data by one of the following extraction methods: a linear prediction analysis method, a perceptual linear prediction coefficient method, a bottleneck feature extraction method, a linear prediction cepstrum coefficient method, and a mel-frequency cepstrum coefficient method.

The following description takes as an example the extraction of the speech feature vector of the response voice data using Mel-Frequency Cepstral Coefficients (MFCC).

According to research on the human auditory mechanism, the human ear has different sensitivities to sound waves of different frequencies. Speech signals from 200 Hz to 5000 Hz have a large impact on the intelligibility of speech. When two sounds of different loudness act on the human ear, the frequency components with higher loudness affect the perception of those with lower loudness and make them less noticeable; this is called the masking effect. Since lower-frequency sounds travel a greater distance along the basilar membrane of the cochlea than higher-frequency sounds, bass sounds generally tend to mask treble sounds, while it is more difficult for treble sounds to mask bass sounds. The critical bandwidth of sound masking is smaller at low frequencies than at high frequencies. Therefore, a group of band-pass filters is arranged from low frequency to high frequency according to the size of the critical bandwidth to filter the input signal. The signal energy output by each band-pass filter is used as a basic feature of the signal, and after further processing this feature can be used as an input feature of the speech. Since such features do not depend on the nature of the signal, make no assumptions about or restrictions on the input signal, and make use of the research results of the auditory model, parameters determined based on mel-frequency cepstral coefficients are more robust, better match the auditory characteristics of the human ear, and still provide good recognition performance when the signal-to-noise ratio decreases.

The extraction of the speech feature vector based on the mel-frequency cepstrum coefficient method comprises the following operations: pre-emphasis → framing → windowing → fast Fourier transform → triangular band-pass filter → mel-frequency filter bank → calculating the logarithmic energy output by each filter bank → obtaining the MFCC by discrete cosine transform.
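The chain of operations listed above can be sketched in plain Python as follows. This is a simplified illustration under several assumptions that are not taken from the text: frames do not overlap, a Hamming window is used, the pre-emphasis coefficient is 0.97, the common formula mel = 2595·log10(1 + f/700) defines the mel scale, and a direct discrete Fourier transform stands in for the fast Fourier transform to keep the example dependency-free. All function names and default parameters are hypothetical.

```python
import cmath
import math
from typing import List, Sequence


def hz_to_mel(hz: float) -> float:
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel: float) -> float:
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def power_spectrum(frame: Sequence[float]) -> List[float]:
    """Direct DFT (stand-in for an FFT); returns power for bins 0..n/2."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        spectrum.append(abs(acc) ** 2 / n)
    return spectrum


def mel_filterbank(n_filters: int, n_fft: int, sample_rate: int) -> List[List[float]]:
    """Triangular band-pass filters with centers spaced evenly on the mel scale."""
    top = hz_to_mel(sample_rate / 2)
    mels = [top * i / (n_filters + 1) for i in range(n_filters + 2)]
    bins = [int((n_fft + 1) * mel_to_hz(m) / sample_rate) for m in mels]
    bank = [[0.0] * (n_fft // 2 + 1) for _ in range(n_filters)]
    for j in range(n_filters):
        left, center, right = bins[j], bins[j + 1], bins[j + 2]
        for k in range(left, center):      # rising edge of the triangle
            bank[j][k] = (k - left) / (center - left)
        for k in range(center, right):     # falling edge of the triangle
            bank[j][k] = (right - k) / (right - center)
    return bank


def mfcc(signal: Sequence[float], sample_rate: int, frame_len: int,
         n_filters: int = 10, n_coeffs: int = 5) -> List[List[float]]:
    # Pre-emphasis: y[t] = x[t] - 0.97 * x[t-1] (0.97 is an assumed coefficient).
    emphasized = [signal[0]] + [signal[t] - 0.97 * signal[t - 1]
                                for t in range(1, len(signal))]
    bank = mel_filterbank(n_filters, frame_len, sample_rate)
    window = [0.54 - 0.46 * math.cos(2 * math.pi * i / (frame_len - 1))
              for i in range(frame_len)]   # Hamming window
    features = []
    for start in range(0, len(emphasized) - frame_len + 1, frame_len):  # framing
        frame = [emphasized[start + i] * window[i] for i in range(frame_len)]
        spectrum = power_spectrum(frame)
        # Log energy per filter; the floor avoids log(0) for empty filters.
        log_energy = [math.log(max(sum(w * p for w, p in zip(row, spectrum)), 1e-10))
                      for row in bank]
        # DCT-II of the log energies gives the cepstral coefficients.
        features.append([
            sum(e * math.cos(math.pi * c * (m + 0.5) / n_filters)
                for m, e in enumerate(log_energy))
            for c in range(n_coeffs)
        ])
    return features


# A 440 Hz test tone at 8 kHz, two frames of 64 samples each.
tone = [math.sin(2 * math.pi * 440 * t / 8000) for t in range(128)]
coeffs = mfcc(tone, sample_rate=8000, frame_len=64)
```

Each inner list in `coeffs` is the speech feature vector of one sound frame. A production system would normally use overlapping frames and an FFT routine from a numerical library.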

In operation S412, the speech feature vector 402 is input to a hidden Markov model (HMM), and the state data 403 of each sound frame is determined.

A hidden Markov model (HMM) is a statistical model that describes a Markov process with hidden, unknown parameters.

In the field of speech recognition technology, a hidden Markov model may determine state data of each sound frame according to an input speech feature vector, wherein a state may be determined according to a sound frame, and a state may be understood as a speech unit smaller than a phoneme. For example, one phoneme may be divided into three states.
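The text does not say how the hidden Markov model assigns a state to each sound frame. A common choice for this kind of decoding is the Viterbi algorithm, sketched below as an assumption. To keep the example small, each sound frame is represented by a discrete symbol rather than a real speech feature vector, and all probabilities are made-up toy values.

```python
import math
from typing import Dict, List, Sequence


def viterbi(
    observations: Sequence[str],
    states: Sequence[str],
    start_p: Dict[str, float],
    trans_p: Dict[str, Dict[str, float]],
    emit_p: Dict[str, Dict[str, float]],
) -> List[str]:
    """Return the most likely state for each frame (log domain, no zero probabilities)."""
    log = math.log
    scores = [{s: log(start_p[s]) + log(emit_p[s][observations[0]]) for s in states}]
    back = []
    for obs in observations[1:]:
        row, ptr = {}, {}
        for s in states:
            # Best predecessor state for reaching s at this frame.
            prev = max(states, key=lambda p: scores[-1][p] + log(trans_p[p][s]))
            row[s] = scores[-1][prev] + log(trans_p[prev][s]) + log(emit_p[s][obs])
            ptr[s] = prev
        scores.append(row)
        back.append(ptr)
    # Backtrack from the best final state.
    state = max(states, key=lambda s: scores[-1][s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]


# Toy model: two states and two observation symbols (all values hypothetical).
frame_states = viterbi(
    observations=["x", "x", "y", "y"],
    states=["s0", "s1"],
    start_p={"s0": 0.9, "s1": 0.1},
    trans_p={"s0": {"s0": 0.6, "s1": 0.4}, "s1": {"s0": 0.1, "s1": 0.9}},
    emit_p={"s0": {"x": 0.9, "y": 0.1}, "s1": {"x": 0.1, "y": 0.9}},
)
```

In a real recognizer the emission term would be computed from the continuous speech feature vector of each sound frame rather than looked up in a table.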

In operation S413, phoneme data 404 is determined according to the state data of the sound frame.

For example, one phoneme may be determined from three consecutive sound frames having the same state. Thus, phoneme data can be determined from the state data of the sound frames.
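A minimal sketch of this rule is given below, assuming that a run of at least three consecutive sound frames with the same state yields one phoneme and that shorter runs are ignored. The mapping from state labels to phonemes and the function name are hypothetical.

```python
from itertools import groupby
from typing import Dict, List, Sequence


def phonemes_from_states(
    frame_states: Sequence[str],
    state_to_phoneme: Dict[str, str],
    min_run: int = 3,
) -> List[str]:
    """Emit one phoneme for each run of at least min_run identical states."""
    phonemes = []
    for state, run in groupby(frame_states):
        if len(list(run)) >= min_run:
            phonemes.append(state_to_phoneme[state])
    return phonemes


# Hypothetical state labels for 11 sound frames; the single "s_h" frame is too short.
states = ["s_n"] * 3 + ["s_i"] * 4 + ["s_h"] + ["s_ao"] * 3
table = {"s_n": "n", "s_i": "i", "s_h": "h", "s_ao": "ao"}
phones = phonemes_from_states(states, table)
```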

The data interaction method of the embodiment of the disclosure can extract the speech feature vector according to the characteristics of the speech data aiming at the response speech data, and accurately and efficiently determine the phoneme data by using the specific vocal tract model of the hidden Markov model.

It is to be appreciated that the hidden Markov model can be located at the client. Therefore, the speech feature vector can be input into a local hidden Markov model, the state data of each sound frame is determined, and the phoneme data is further determined.

Fig. 5 schematically illustrates a schematic diagram of determining a target lip shape image frame in a data interaction method according to still another embodiment of the present disclosure.

According to still another embodiment of the present disclosure, a specific example of determining a target lip image frame in one-to-one correspondence with phoneme data in the data interaction method may be implemented using the following embodiments.

In operation S521, lip key point data 502 corresponding to the phoneme data 501 is determined.

Lip key points are understood to be points that can distinguish between different lips.

Illustratively, lip key point data corresponding to the phoneme data may be determined, for example, by a target detection model.

In operation S522, a target lip image frame 504 that matches the lip key point data 502 is determined from the lip image frame set 503 according to the lip key point data 502.

The lip image frame set may be stored locally at the client. With the data interaction method of the embodiment of the present disclosure, the local lip image frame set can be searched quickly and efficiently according to the lip key point data, and a lip image frame that closely matches the lip key point data is selected from the set as the target lip image frame. This does not depend on the network and can improve the user experience.
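The text does not define how the matching degree is measured. One simple possibility, shown here purely as an assumption, is to store key points alongside each lip image frame and pick the frame whose key points have the smallest sum of squared distances to the query key points. The frame identifiers and coordinates are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float]


def match_lip_frame(target_points: Sequence[Point],
                    lip_frame_set: Dict[str, List[Point]]) -> str:
    """Return the id of the stored frame whose key points are closest to the target."""
    def distance(points: Sequence[Point]) -> float:
        return sum((ax - bx) ** 2 + (ay - by) ** 2
                   for (ax, ay), (bx, by) in zip(target_points, points))

    return min(lip_frame_set, key=lambda frame_id: distance(lip_frame_set[frame_id]))


# A tiny local set with two key points per frame (mouth corners only, for brevity).
local_set = {
    "closed": [(0.0, 0.0), (4.0, 0.0)],
    "open_wide": [(0.0, 0.0), (4.0, 3.0)],
}
best = match_lip_frame([(0.1, 0.0), (4.0, 2.7)], local_set)
```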

Fig. 6 schematically shows a schematic diagram of obtaining a target video frame in a data interaction method according to another embodiment of the present disclosure.

According to another embodiment of the present disclosure, the following embodiments may be used to implement a specific example of fusing a target lip-shaped image frame with a base video frame respectively in a data interaction method to obtain a target video frame.

In operation S631, the lip mask 602 is determined from the target lip image frame 601.

A mask may be understood as a selected image or pattern that is used to cover the image being processed, so as to control the region or the process of image processing.
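To illustrate how a mask restricts processing to a region, the sketch below copies lip pixels into a base image only where a binary mask is set. It assumes equally sized grayscale images stored as nested lists and does not use the fusion path of the embodiment; the function name is hypothetical.

```python
from typing import List

Image = List[List[int]]


def composite(lip: Image, base: Image, mask: Image) -> Image:
    """Take the lip pixel where mask is 1, otherwise keep the base pixel."""
    return [
        [lip[r][c] if mask[r][c] else base[r][c] for c in range(len(base[0]))]
        for r in range(len(base))
    ]


base = [[10, 10, 10], [10, 10, 10]]
lip = [[99, 99, 99], [99, 99, 99]]
mask = [[0, 1, 0], [0, 1, 1]]
merged = composite(lip, base, mask)
```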

In operation S632, a fusion path 604 is determined according to the lip mask 602 and the base video frame 603.

Illustratively, the fusion path between the lip mask and the base video frame may be determined according to an energy minimization search strategy. For example, the fusion area of the lip mask and the base video frame can be predetermined, and "energy" can be understood as the sum of squares of the differences between image pixels on the two sides of the fusion area.
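One possible reading of the energy minimization search is a dynamic-programming seam search over the fusion area, sketched below. It assumes that the two images overlap in a rectangular grayscale region, that the energy of a pixel is the squared difference between the two images at that pixel, and that the path runs from top to bottom moving at most one column per row. None of these details come from the text, and the function name is hypothetical.

```python
from typing import List, Tuple

Image = List[List[float]]


def min_energy_seam(patch_a: Image, patch_b: Image) -> Tuple[List[int], float]:
    """Return the column index of the seam in each row, plus its total energy."""
    rows, cols = len(patch_a), len(patch_a[0])
    energy = [[(patch_a[r][c] - patch_b[r][c]) ** 2 for c in range(cols)]
              for r in range(rows)]
    # cost[r][c] is the cheapest path energy from the top row down to (r, c).
    cost = [energy[0][:]]
    for r in range(1, rows):
        row = []
        for c in range(cols):
            lo, hi = max(c - 1, 0), min(c + 1, cols - 1)
            row.append(energy[r][c] + min(cost[r - 1][lo:hi + 1]))
        cost.append(row)
    total = min(cost[-1])
    # Backtrack from the cheapest cell in the last row.
    c = cost[-1].index(total)
    seam = [c]
    for r in range(rows - 1, 0, -1):
        lo, hi = max(c - 1, 0), min(c + 1, cols - 1)
        c = min(range(lo, hi + 1), key=lambda j: cost[r - 1][j])
        seam.append(c)
    seam.reverse()
    return seam, total


# The two patches agree only in the middle column, so the seam runs through it.
lip_patch = [[1, 5, 9], [1, 5, 9], [1, 5, 9]]
base_patch = [[0, 5, 0], [0, 5, 0], [0, 5, 0]]
seam, total_energy = min_energy_seam(lip_patch, base_patch)
```

Pixels on one side of the seam would then be taken from the lip mask and pixels on the other side from the base video frame, so that the pixel difference across the path stays small.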

In operation S633, the lip mask 602 and the base video frame 603 are fused according to the fusion path 604, resulting in a target video frame 605.

According to the data interaction method, the lip mask can be determined according to the target lip image frame, and the lip mask and the basic video frame are fused according to the fusion path between the lip mask and the basic video frame, so that the lip mask and the basic video frame have smaller pixel difference on two sides of the fusion path, a better fusion effect is achieved, and a more natural target video frame can be obtained.

Fig. 7 schematically shows a schematic diagram of obtaining target display data in a data interaction method according to yet another embodiment of the present disclosure.

According to another embodiment of the present disclosure, the following embodiments may be used to implement a specific example of rendering a target video frame in a data interaction method to obtain target display data.

In operation S741, vertex coordinate data 702 based on the screen coordinate system is determined according to the vertex data 701 of the target video frame.

Illustratively, the vertex data may include the coordinates of each vertex, expressed in the form of an array. Vertex coordinate data based on a screen coordinate system may be determined from the vertex data of the target video frame using a vertex shader. The vertex shader may also perform some basic processing on the vertex attributes.
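The coordinate conversion can be illustrated on the CPU as follows. The sketch assumes that the vertex data holds two-dimensional coordinates normalized to the range [-1, 1] with the y axis pointing up, and that the screen origin is the top-left corner; these conventions are assumptions, and an actual vertex shader would run on the GPU in a shading language.

```python
from typing import List, Sequence, Tuple

Vertex = Tuple[float, float]


def to_screen(vertices: Sequence[Vertex], width: int, height: int) -> List[Vertex]:
    """Map normalized coordinates in [-1, 1] to pixel coordinates (origin top-left)."""
    screen = []
    for x, y in vertices:
        sx = (x + 1.0) * 0.5 * width
        sy = (1.0 - y) * 0.5 * height   # flip y: +1 is the top of the screen
        screen.append((sx, sy))
    return screen


# The four corners of a full-screen quad for a 1280x720 target video frame.
quad = [(-1.0, 1.0), (1.0, 1.0), (1.0, -1.0), (-1.0, -1.0)]
corners = to_screen(quad, 1280, 720)
```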

In operation S742, primitive data 703 is determined according to the vertex coordinate data 702.

Primitive data may be used as a reference for how vertex data is rendered. For example, primitive data may include: points, lines, triangles.

In operation S743, unitization processing is performed on the primitive data 703, and the target graphics data 704 is generated.

Illustratively, the primitive data may be subjected to unitization processing with a geometry shader to generate the target graphics data. The unitization processing may include, for example, generating new vertices and connecting the vertices to generate the target graphics data.

In operation S744, the pixel conversion process is performed on the target graphics data 704, resulting in the pixel data 705 of the target graphics data.

Operation S744 may be understood as a rasterization process that maps the primitives to the corresponding pixels on the final screen and generates fragments. A fragment is all the data needed to render one pixel.

In operation S745, the color data of each pixel point is determined according to the pixel data 705, and the target display data 706 is obtained.
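Operations S744 and S745 can be illustrated with a small software rasterizer. The sketch below is a CPU-side approximation of what a GPU does in hardware, under the assumptions that the primitive is a single triangle in screen coordinates, that a pixel is covered when its center lies inside the triangle, and that colors are interpolated from per-vertex colors. The function names are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Color = Tuple[int, int, int]
Fragment = Tuple[int, int, Tuple[float, float, float]]


def edge(a: Point, b: Point, p: Point) -> float:
    """Signed area term telling on which side of edge a->b the point p lies."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def rasterize_triangle(v0: Point, v1: Point, v2: Point,
                       width: int, height: int) -> List[Fragment]:
    """S744: produce one fragment (x, y, barycentric weights) per covered pixel."""
    area = edge(v0, v1, v2)
    if area == 0:
        return []  # degenerate triangle covers nothing
    fragments = []
    for y in range(height):
        for x in range(width):
            p = (x + 0.5, y + 0.5)  # sample at the pixel center
            w0, w1, w2 = edge(v1, v2, p), edge(v2, v0, p), edge(v0, v1, p)
            inside = (w0 >= 0 and w1 >= 0 and w2 >= 0) or (w0 <= 0 and w1 <= 0 and w2 <= 0)
            if inside:
                fragments.append((x, y, (w0 / area, w1 / area, w2 / area)))
    return fragments


def shade(fragments: List[Fragment], colors: Tuple[Color, Color, Color],
          width: int, height: int) -> List[List[Color]]:
    """S745: interpolate the vertex colors to get the color of each pixel."""
    frame = [[(0, 0, 0)] * width for _ in range(height)]
    for x, y, (b0, b1, b2) in fragments:
        frame[y][x] = tuple(round(b0 * c0 + b1 * c1 + b2 * c2)
                            for c0, c1, c2 in zip(*colors))
    return frame


# A right triangle covering the upper-left half of a 4x4 screen, drawn in red.
frags = rasterize_triangle((0, 0), (4, 0), (0, 4), 4, 4)
red = (255, 0, 0)
image = shade(frags, (red, red, red), 4, 4)
```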

According to the data interaction method disclosed by the embodiment of the disclosure, the target display data can be rendered on the screen of the client through the specific operation of rendering the target video frame. The above operation of rendering a target video frame is based on a graphics processor. A Graphics Processing Unit (GPU) is a microprocessor that performs image and Graphics related operations on personal computers, workstations, game machines, and some mobile devices. The GPU has strong computing power, can improve rendering efficiency when used for image rendering, and can reduce the resource use of the CPU. According to the data interaction method, the image is rendered through the GPU, and a good display effect can be rendered at a client with a low configuration.

According to the data interaction method of the embodiment of the disclosure, the response voice data can be obtained from the corresponding response text data, the response text data can be obtained from the corresponding input text data, and the input text data can be obtained from the corresponding input voice data.

Input speech data may be understood as input data in the form of speech uttered by a user.

For example, the input speech data may be subjected to speech recognition, resulting in input text data. The input speech data may be speech recognized, for example, by invoking an interface of a speech recognition module. The speech recognition module may be located on the server side. At the moment, the connection and data interaction between the client and the server depend on the network.

Illustratively, the response text data corresponding to the input text data may be retrieved from a local database or at the server side. For example, where the local database stores configured response text data, the configured response text data stored in the local database may be retrieved in response to the input data. In an online application scenario, the response text data can be retrieved at the server side in response to the input data; in that case the connection and data interaction between the client and the server depend on the network.
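The two retrieval options can be combined as in the sketch below, which tries the local database first and falls back to the server only when a server query function is available. The fallback order, the function names, and the use of a dictionary as the local database are all assumptions made for illustration.

```python
from typing import Callable, Dict, Optional


def retrieve_response_text(
    input_text: str,
    local_db: Dict[str, str],
    query_server: Optional[Callable[[str], str]] = None,
) -> Optional[str]:
    """Look up configured response text locally; use the server only if needed."""
    answer = local_db.get(input_text)
    if answer is not None:
        return answer                      # works without any network access
    if query_server is not None:
        return query_server(input_text)    # online scenario, depends on the network
    return None


# Hypothetical configured responses and a stand-in for the server call.
configured = {"What are your opening hours?": "We are open from 9 to 18."}
offline = retrieve_response_text("What are your opening hours?", configured)
online = retrieve_response_text("Where are you?", configured,
                                query_server=lambda q: "server answer for: " + q)
```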

For example, the response text data may be processed by Text To Speech (TTS) to obtain the response voice data. The response voice data may be, for example, in Pulse Code Modulation (PCM) format. For example, an interface of a text-to-speech module may be invoked to convert the response text into speech. The text-to-speech module may be located on the server side; in that case the connection and data interaction between the client and the server depend on the network.
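Before feature extraction, PCM response voice data has to be turned into sample values. The sketch below assumes 16-bit signed little-endian mono PCM, which is a common layout but is not specified in the text; the function name is hypothetical.

```python
import struct
from typing import List


def pcm16_to_floats(pcm_bytes: bytes) -> List[float]:
    """Decode 16-bit signed little-endian PCM into floats in [-1.0, 1.0)."""
    count = len(pcm_bytes) // 2            # two bytes per sample
    ints = struct.unpack("<%dh" % count, pcm_bytes[:count * 2])
    return [value / 32768.0 for value in ints]


# Three hand-made samples: silence, half scale, and the most negative value.
raw = struct.pack("<3h", 0, 16384, -32768)
samples = pcm16_to_floats(raw)
```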

Fig. 8 schematically shows a flow chart of a data interaction method according to yet another embodiment of the present disclosure.

As shown in fig. 8, the data interaction method 800 according to still another embodiment of the present disclosure may further include operation S850.

In operation S850, the response voice data is played in synchronization with the target display data.

As shown in fig. 8, the data interaction method 800 may further include operations S810 to S840 before operation S850. Operations S810 through S840 are the same as operations S210 through S240, respectively, and are not described herein again.

The data interaction method of the embodiment of the present disclosure is again described using a digital human as an example. Playing the response voice data in synchronization with the target display data provides the user with synchronized voice output and visual output, which improves the user experience.

For example, the target display data and the response voice data may be synchronized when a display frequency of an image frame corresponding to the target display data is the same as a play frequency of a sound frame corresponding to the response voice data.
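The frequency condition above can be illustrated with a small check. This is a sketch under an assumption not stated in the disclosure: frame i of either stream is taken to be presented at time i / frequency, and the function names are hypothetical.

```python
from typing import List


def presentation_times(frame_count: int, frequency_hz: float) -> List[float]:
    """Presentation time in seconds of each frame at a fixed frequency."""
    return [i / frequency_hz for i in range(frame_count)]


def is_synchronized(
    image_count: int,
    display_hz: float,
    sound_count: int,
    play_hz: float,
    tolerance: float = 1e-9,
) -> bool:
    """True when every image frame and its sound frame share a presentation time."""
    if image_count != sound_count:
        return False
    image_times = presentation_times(image_count, display_hz)
    sound_times = presentation_times(sound_count, play_hz)
    return all(abs(a - b) <= tolerance for a, b in zip(image_times, sound_times))
```

With equal frequencies the two time lists coincide, whereas any difference in frequency makes the streams drift apart from the second frame onward.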

FIG. 9 schematically shows a block diagram of a data interaction device according to an embodiment of the present disclosure.

As shown in fig. 9, the data interaction apparatus 900 of the embodiment of the present disclosure includes, for example, a phoneme data determination module 910, a target lip image frame determination module 920, a fusion module 930, and a rendering module 940.

The phoneme data determination module 910 is configured to determine, in response to response data, phoneme data corresponding to the response data.

The target lip image frame determination module 920 is configured to determine target lip image frames in one-to-one correspondence with the phoneme data.

The fusion module 930 is configured to fuse the target lip image frames with the basic video frames, respectively, to obtain target video frames.

The rendering module 940 is configured to render the target video frames to obtain target display data.

In the data interaction device according to an embodiment of the present disclosure, the response data includes response voice data, and the phoneme data determination module may include a voice feature vector determination submodule, a state data determination submodule, and a phoneme data determination submodule.

The voice feature vector determination submodule is configured to determine voice feature vectors of voice frames according to the response voice data, where there are a plurality of voice frames and each voice frame is obtained by dividing the response voice data according to a division frequency.

The state data determination submodule is configured to input the voice feature vectors into a hidden Markov model and determine state data of each voice frame.

The phoneme data determination submodule is configured to determine the phoneme data according to the state data of the voice frames.
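The three submodules above can be sketched as follows. This is a minimal illustration rather than the disclosed implementation: the disclosure does not say how the hidden Markov model is decoded, so Viterbi decoding is used here as a common choice, and the function names, the log-probability tables, and the state-to-phoneme table are all hypothetical.

```python
from typing import Callable, Dict, List, Sequence


def split_into_frames(samples: Sequence[int], sample_rate: int, division_frequency: int) -> List[list]:
    """Divide PCM samples into voice frames; division_frequency is frames per second."""
    frame_len = sample_rate // division_frequency
    return [list(samples[i:i + frame_len])
            for i in range(0, len(samples) - frame_len + 1, frame_len)]


def viterbi(
    observations: Sequence[float],
    states: Sequence[str],
    start_logp: Dict[str, float],
    trans_logp: Dict[str, Dict[str, float]],
    emit_logp: Callable[[str, float], float],
) -> List[str]:
    """Most likely HMM state per voice frame, given one feature per frame."""
    if not observations:
        return []
    best = [{s: start_logp[s] + emit_logp(s, observations[0]) for s in states}]
    back: List[Dict[str, str]] = []
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            # Best predecessor of state s at this frame.
            prev = max(states, key=lambda p: best[-1][p] + trans_logp[p][s])
            scores[s] = best[-1][prev] + trans_logp[prev][s] + emit_logp(s, obs)
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    path = [max(states, key=lambda s: best[-1][s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


def states_to_phonemes(path: Sequence[str], state_to_phoneme: Dict[str, str]) -> List[str]:
    """Map per-frame states to phonemes, merging consecutive repeats."""
    phonemes: List[str] = []
    for state in path:
        phoneme = state_to_phoneme[state]
        if not phonemes or phonemes[-1] != phoneme:
            phonemes.append(phoneme)
    return phonemes
```

In the sketch each observation is a single number per frame for brevity; a real voice feature vector would have many dimensions, and `emit_logp` would score the whole vector.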

In the data interaction device according to an embodiment of the present disclosure, the target lip image frame determination module may include a lip key point determination submodule and a target lip image frame determination submodule.

The lip key point determination submodule is configured to determine lip key point data corresponding to the phoneme data.

The target lip image frame determination submodule is configured to determine, from a lip image frame set and according to the lip key point data, a target lip image frame that matches the lip key point data.
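One way to picture the matching step is a nearest-neighbour search over the lip image frame set. The sketch below assumes, purely for illustration, that each frame in the set is annotated with its own key points and that "matching" means the smallest sum of squared distances; the disclosure does not fix the matching criterion, and all names here are hypothetical.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def keypoint_distance(a: List[Point], b: List[Point]) -> float:
    """Sum of squared distances between corresponding lip key points."""
    return sum((ax - bx) ** 2 + (ay - by) ** 2 for (ax, ay), (bx, by) in zip(a, b))


def match_lip_frame(target_keypoints: List[Point], lip_frame_set: Dict[str, List[Point]]) -> str:
    """Return the id of the lip image frame whose key points are closest to the target."""
    return min(
        lip_frame_set,
        key=lambda frame_id: keypoint_distance(target_keypoints, lip_frame_set[frame_id]),
    )
```

For example, a target with a wide vertical opening would select an "open" frame over a "closed" one, because its key points lie closer to those of the open lip shape.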

In the data interaction device according to an embodiment of the present disclosure, the fusion module may include a lip mask determination submodule, a fusion path determination submodule, and a fusion submodule.

The lip mask determination submodule is configured to determine a lip mask according to the target lip image frame.

The fusion path determination submodule is configured to determine a fusion path according to the lip mask and the basic video frame.

The fusion submodule is configured to fuse the lip mask and the basic video frame according to the fusion path to obtain the target video frame.
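The mask, path, and fusion steps can be sketched on small grayscale frames stored as nested lists. The interpretation here is an assumption: the lip mask is taken to mark non-background pixels of the lip image frame, and the fusion path is taken to be the list of pixel positions in the basic video frame that the masked lip pixels overwrite at a given offset. The disclosure does not define the fusion path in this passage, and every name below is hypothetical.

```python
from typing import List, Tuple

Frame = List[List[int]]


def lip_mask(lip_frame: Frame, background: int = 0) -> List[List[bool]]:
    """Mark the pixels of the lip image frame that belong to the lips."""
    return [[pixel != background for pixel in row] for row in lip_frame]


def fusion_path(mask: List[List[bool]], base_frame: Frame, top: int, left: int) -> List[Tuple[int, int, int, int]]:
    """List (lip_row, lip_col, base_row, base_col) for each masked pixel inside the base frame."""
    height, width = len(base_frame), len(base_frame[0])
    path = []
    for r, row in enumerate(mask):
        for c, inside in enumerate(row):
            y, x = top + r, left + c
            if inside and 0 <= y < height and 0 <= x < width:
                path.append((r, c, y, x))
    return path


def fuse(lip_frame: Frame, base_frame: Frame, path: List[Tuple[int, int, int, int]]) -> Frame:
    """Copy the masked lip pixels onto a copy of the basic video frame."""
    target = [row[:] for row in base_frame]
    for r, c, y, x in path:
        target[y][x] = lip_frame[r][c]
    return target
```

The basic video frame is copied rather than modified in place, so the same base frame can be reused for every target lip image frame.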

In the data interaction device according to an embodiment of the present disclosure, the rendering module may include a vertex coordinate data determination submodule, a primitive data determination submodule, a target graphic data determination submodule, a pixel data determination submodule, and a target display data determination submodule.

The vertex coordinate data determination submodule is configured to determine vertex coordinate data based on a screen coordinate system according to vertex data of the target video frame.

The primitive data determination submodule is configured to determine primitive data according to the vertex coordinate data.

The target graphic data determination submodule is configured to perform unitization processing on the primitive data to generate target graphic data.

The pixel data determination submodule is configured to perform pixel conversion processing on the target graphic data to obtain pixel data of the target graphic data.

The target display data determination submodule is configured to determine color data of each pixel point according to the pixel data to obtain the target display data.
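The rendering stages listed above resemble a conventional rasterization pipeline, and a toy version can make the data flow concrete. The sketch assumes vertices are given in normalized coordinates in [-1, 1], groups them into triangles as the primitives, and uses edge functions for the pixel conversion. The unitization step is not spelled out in this passage, so the sketch does not model it separately, and all function names are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Color = Tuple[int, int, int]


def to_screen(vertex: Point, width: int, height: int) -> Point:
    """Map a normalized vertex to screen coordinates (origin at top-left)."""
    x, y = vertex
    return ((x + 1) * 0.5 * (width - 1), (1 - y) * 0.5 * (height - 1))


def assemble_triangles(vertices: List[Point]) -> List[Tuple[Point, Point, Point]]:
    """Group consecutive vertex triples into triangle primitives."""
    return [tuple(vertices[i:i + 3]) for i in range(0, len(vertices) - 2, 3)]


def edge(a: Point, b: Point, p: Point) -> float:
    """Signed area test: which side of edge a->b the point p lies on."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def rasterize(triangle: Tuple[Point, Point, Point], width: int, height: int) -> List[Tuple[int, int]]:
    """Convert a triangle into the list of pixel positions it covers."""
    a, b, c = triangle
    covered = []
    for y in range(height):
        for x in range(width):
            w0, w1, w2 = edge(a, b, (x, y)), edge(b, c, (x, y)), edge(c, a, (x, y))
            if (w0 >= 0 and w1 >= 0 and w2 >= 0) or (w0 <= 0 and w1 <= 0 and w2 <= 0):
                covered.append((x, y))
    return covered


def shade(pixels: List[Tuple[int, int]], width: int, height: int,
          color: Color, background: Color = (0, 0, 0)) -> List[List[Color]]:
    """Assign color data to each covered pixel to form the displayable image."""
    image = [[background] * width for _ in range(height)]
    for x, y in pixels:
        image[y][x] = color
    return image
```

A production renderer would run these stages on a GPU through a graphics API; the pure-Python loop is only meant to show what each stage consumes and produces.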

According to the data interaction device of the embodiment of the disclosure, the response voice data is obtained from the corresponding response text data, the response text data is obtained from the corresponding input text data, and the input text data is obtained from the corresponding input voice data.

The data interaction device according to an embodiment of the present disclosure may further include a response voice data playing module.

The response voice data playing module is configured to play the response voice data in synchronization with the target display data.

It should be understood that the embodiments of the apparatus part of the present disclosure are the same as or similar to the embodiments of the method part of the present disclosure, and that the technical problems solved and the technical effects achieved are likewise the same or similar; they are not described in detail again here.

The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.

FIG. 10 illustrates a schematic block diagram of an example electronic device 1000 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.

As shown in fig. 10, the device 1000 includes a computing unit 1001 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 1002 or a computer program loaded from a storage unit 1008 into a Random Access Memory (RAM) 1003. The RAM 1003 can also store various programs and data necessary for the operation of the device 1000. The computing unit 1001, the ROM 1002, and the RAM 1003 are connected to each other by a bus 1004. An input/output (I/O) interface 1005 is also connected to the bus 1004.

A number of components in device 1000 are connected to I/O interface 1005, including: an input unit 1006 such as a keyboard, a mouse, and the like; an output unit 1007 such as various types of displays, speakers, and the like; a storage unit 1008 such as a magnetic disk, an optical disk, or the like; and a communication unit 1009 such as a network card, a modem, a wireless communication transceiver, or the like. The communication unit 1009 allows the device 1000 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.

The computing unit 1001 may be any of a variety of general-purpose and/or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 1001 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 1001 executes the respective methods and processes described above, such as the data interaction method. For example, in some embodiments, the data interaction method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 1008. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 1000 via the ROM 1002 and/or the communication unit 1009. When the computer program is loaded into the RAM 1003 and executed by the computing unit 1001, one or more steps of the data interaction method described above may be performed. Alternatively, in other embodiments, the computing unit 1001 may be configured to perform the data interaction method by any other suitable means (e.g., by means of firmware).

Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, and which receives data and instructions from, and transmits data and instructions to, a storage system, at least one input device, and at least one output device.

Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.

In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.

The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.

The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.

It should be understood that the various forms of flows shown above may be used with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in a different order, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved; no limitation is imposed herein.

The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims

1. A data interaction method, comprising:

determining, in response to response data, phoneme data corresponding to the response data;

determining target lip image frames in one-to-one correspondence with the phoneme data;

fusing the target lip image frames with basic video frames, respectively, to obtain target video frames; and

rendering the target video frames to obtain target display data.

2. The method of claim 1, wherein the response data comprises response voice data; the determining, in response to response data, phoneme data corresponding to the response data comprises:

determining voice feature vectors of voice frames according to the response voice data, wherein there are a plurality of voice frames, and each voice frame is obtained by dividing the response voice data according to a division frequency;

inputting the voice feature vectors into a hidden Markov model, and determining state data of each voice frame; and

determining the phoneme data according to the state data of the voice frames.

3. The method of claim 1, wherein the determining target lip image frames in one-to-one correspondence with the phoneme data comprises:

determining lip key point data corresponding to the phoneme data; and

determining, from a lip image frame set and according to the lip key point data, the target lip image frame that matches the lip key point data.

4. The method according to any one of claims 1-3, wherein the fusing the target lip image frames with basic video frames, respectively, to obtain target video frames comprises:

determining a lip mask according to the target lip image frame;

determining a fusion path according to the lip mask and the basic video frame; and

fusing the lip mask and the basic video frame according to the fusion path to obtain the target video frame.

5. The method according to any one of claims 1-3, wherein the rendering the target video frames to obtain target display data comprises:

determining vertex coordinate data based on a screen coordinate system according to vertex data of the target video frame;

determining primitive data according to the vertex coordinate data;

performing unitization processing on the primitive data to generate target graphic data;

performing pixel conversion processing on the target graphic data to obtain pixel data of the target graphic data; and

determining color data of each pixel point according to the pixel data to obtain the target display data.

6. The method of claim 2, wherein the response voice data is obtained from corresponding response text data, the response text data is obtained from corresponding input text data, and the input text data is obtained from corresponding input voice data.

7. The method of claim 2, further comprising:

playing the response voice data in synchronization with the target display data.

8. A data interaction device, comprising:

a phoneme data determination module, configured to determine, in response to response data, phoneme data corresponding to the response data;

a target lip image frame determination module, configured to determine target lip image frames in one-to-one correspondence with the phoneme data;

a fusion module, configured to fuse the target lip image frames with basic video frames, respectively, to obtain target video frames; and

a rendering module, configured to render the target video frames to obtain target display data.

9. The apparatus of claim 8, wherein the response data comprises response voice data; the phoneme data determination module comprises:

a voice feature vector determination submodule, configured to determine voice feature vectors of voice frames according to the response voice data, wherein there are a plurality of voice frames, and each voice frame is obtained by dividing the response voice data according to a division frequency;

a state data determination submodule, configured to input the voice feature vectors into a hidden Markov model and determine state data of each voice frame; and

a phoneme data determination submodule, configured to determine the phoneme data according to the state data of the voice frames.

10. The apparatus of claim 8, wherein the target lip image frame determination module comprises:

a lip key point determination submodule, configured to determine lip key point data corresponding to the phoneme data; and

a target lip image frame determination submodule, configured to determine, from a lip image frame set and according to the lip key point data, the target lip image frame that matches the lip key point data.

11. The apparatus of any one of claims 8-10, wherein the fusion module comprises:

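a lip mask determination submodule, configured to determine a lip mask according to the target lip image frame;

a fusion path determination submodule, configured to determine a fusion path according to the lip mask and the basic video frame; and

a fusion submodule, configured to fuse the lip mask and the basic video frame according to the fusion path to obtain the target video frame.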

12. The apparatus of any of claims 8-10, wherein the rendering module comprises:

a vertex coordinate data determination submodule, configured to determine vertex coordinate data based on a screen coordinate system according to vertex data of the target video frame;

a primitive data determination submodule, configured to determine primitive data according to the vertex coordinate data;

a target graphic data determination submodule, configured to perform unitization processing on the primitive data to generate target graphic data;

a pixel data determination submodule, configured to perform pixel conversion processing on the target graphic data to obtain pixel data of the target graphic data; and

a target display data determination submodule, configured to determine color data of each pixel point according to the pixel data to obtain the target display data.

13. The apparatus of claim 9, wherein the response voice data is obtained from corresponding response text data, the response text data is obtained from corresponding input text data, and the input text data is obtained from corresponding input voice data.

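14. The apparatus of claim 9, further comprising:

a response voice data playing module, configured to play the response voice data in synchronization with the target display data.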

15. An electronic device, comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-7.

16. A non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-7.

17. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-7.