CN117333404A - Rendering method and rendering engine - Google Patents

Rendering method and rendering engine

Info

Publication number
CN117333404A
CN117333404A · Application CN202210727011.4A
Authority
CN
China
Prior art keywords: rendering, image, frame image, module, server
Prior art date
Legal status
Pending
Application number
CN202210727011.4A
Other languages
Chinese (zh)
Inventor
林竹
郑洛
Current Assignee
Huawei Cloud Computing Technologies Co Ltd
Original Assignee
Huawei Cloud Computing Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by Huawei Cloud Computing Technologies Co Ltd
Priority to CN202210727011.4A
Publication of CN117333404A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 5/00: Image enhancement or restoration
    • G06T 5/20: Image enhancement or restoration using local operators
    • G06T 15/00: 3D [Three Dimensional] image rendering
    • G06T 15/10: Geometric effects
    • G06T 15/20: Perspective computation
    • G06T 15/205: Image-based rendering
    • G06T 2207/00: Indexing scheme for image analysis or image enhancement
    • G06T 2207/10: Image acquisition modality
    • G06T 2207/10016: Video; Image sequence
    • G06T 2207/30: Subject of image; Context of image processing
    • G06T 2207/30196: Human being; Person
    • G06T 2207/30201: Face

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Geometry (AREA)
  • Computer Graphics (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The present disclosure relates to the field of data processing, and in particular, to a rendering method and a rendering engine. The method includes: acquiring a first frame image, where the first frame image includes a target object; performing a target operation on the first frame image to obtain a second frame image, where the data volume of the second frame image is smaller than that of the first frame image; and sending the second frame image to the rendering server so that the rendering server renders the target object. Because the data volume of the image is reduced before the image is sent to the rendering server, the method saves network bandwidth between the rendering client and the rendering server and reduces the dependence of image rendering on the network performance between them.

Description

Rendering method and rendering engine
Technical Field
The present disclosure relates to the field of data processing, and in particular, to a rendering method and a rendering engine.
Background
The visual driving technology recognizes the motion of a person or an object in a video with a motion recognition model to obtain motion information, and uses the motion information to render a virtual object whose structure is similar to that of the person or object in the video. Visual driving involves a video acquisition side and a video application side. The acquisition side captures the video; the application side obtains the motion information from the video and uses it to render the virtual object.
In practice, the acquisition side and the application side are often not co-located. For example, a user captures a video with a terminal device such as a mobile phone and sends the video to a cloud device over a network. The cloud device identifies the motion information from the video and then analyzes and uses it.
The data volume of video is large, which places high demands on the bandwidth and stability of the network. As a result, frames are easily lost when the video is transmitted from the acquisition side to the application side, and the user experience is poor.
Disclosure of Invention
The embodiment of the application provides a rendering method and a rendering engine, which can save network bandwidth between a rendering client and a rendering server and improve user experience.
In a first aspect, a rendering method is provided, applied to a rendering client in a terminal device, where the rendering system in which the rendering client is located further includes a rendering server. The method includes: acquiring a first frame image, where the first frame image includes a target object; performing a target operation on the first frame image to obtain a second frame image, where the data volume of the second frame image is smaller than that of the first frame image; and sending the second frame image to the rendering server so that the rendering server renders the target object.
In this method, the rendering client reduces the data amount of the image and sends the reduced image to the rendering server. This saves network bandwidth between the rendering client and the rendering server, reduces the dependence of image rendering on the network performance between them, lowers the data transmission delay, reduces the probability of frame loss, and improves the user experience.
In one possible implementation, the method is applied to a virtual digital person rendering scene, and acquiring the first frame image includes: acquiring a first frame image including a human image from a camera, where the camera is used to capture an image corresponding to a human face.
In this implementation, the method can be applied to virtual digital person rendering scenes, which improves the synchronization between the real person and the virtual digital person and improves the user experience.
In one possible implementation, the target operation includes gray scale processing.
In this implementation, gray-scale processing removes information in the image that is irrelevant to the shape or motion of the target object, so the data volume of the image is reduced without affecting the rendering effect of the rendering server.
In one possible implementation, the target operation includes high pass filtering.
In this implementation, high-pass filtering removes the low-frequency information in the image and retains the high-frequency information, so the data volume of the image is reduced without affecting the rendering effect of the rendering server.
In one possible implementation, the method further includes: receiving and displaying a rendering result of the second frame image sent by the rendering server.
In this implementation, the rendering client sends the image to the rendering server, and the rendering server performs the rendering based on the image, which reduces the performance requirements on the rendering client. The rendering client can also receive and display the rendering result, so a user can see the rendering result through the rendering client without a dedicated display device, reducing the cost of viewing the rendering result.
In one possible implementation, the terminal device further includes an image processor, and performing the target operation on the first frame image includes: performing the target operation on the first frame image by the image processor.
In this implementation, the image processor is used to process the image, which improves the efficiency of image processing and further reduces the delay between the rendering client and the rendering server.
In a second aspect, a rendering engine configured on a terminal device is provided. The rendering engine includes: an acquisition module, configured to acquire a first frame image, where the first frame image includes a target object; an operation module, configured to perform a target operation on the first frame image to obtain a second frame image, where the data volume of the second frame image is smaller than that of the first frame image; and a communication module, configured to send the second frame image to the rendering server so that the rendering server renders the target object.
In one possible implementation, the rendering engine is applied to a virtual digital person rendering scene, and the acquisition module is configured to: acquire a first frame image including a human image from a camera, where the camera is used to capture an image corresponding to a human face.
In one possible implementation, the target operation includes gray scale processing.
In one possible implementation, the target operation includes high pass filtering.
In one possible implementation, the communication module is further configured to: receive and display a rendering result of the second frame image sent by the rendering server.
In one possible implementation, the terminal device further includes an image processor, and the operation module is configured to: perform the target operation on the first frame image by the image processor.
In a third aspect, a cluster of computing devices is provided, comprising at least one computing device, each computing device comprising a processor and a memory; the processor of the at least one computing device is configured to execute instructions stored in the memory of the at least one computing device to cause the cluster of computing devices to perform the method as provided in the first aspect.
In a fourth aspect, a computer-readable storage medium is provided, including computer program instructions that, when executed by a cluster of computing devices, cause the cluster of computing devices to perform the method provided in the first aspect.
In a fifth aspect, there is provided a computer program product comprising instructions which, when executed by a cluster of computing devices, cause the cluster of computing devices to perform the method as provided in the first aspect.
In a sixth aspect, a rendering system is provided, including the rendering engine and the rendering server provided in the second aspect.
In the rendering method provided in the embodiments of this application, the rendering client performs image processing on the acquired image to reduce its data volume and sends the reduced image to the rendering server. This saves network bandwidth between the rendering client and the rendering server, reduces the dependence of image rendering on the network performance between them, lowers the data transmission delay, reduces the probability of frame loss, and improves the user experience.
Drawings
FIG. 1 is a schematic diagram of a visual driving scheme;
FIG. 2 is a schematic diagram of another visual driving scheme;
FIG. 3 is a schematic structural diagram of a rendering system according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a rendering system according to an embodiment of the present application;
FIG. 5 is a flowchart of a rendering scheme according to an embodiment of the present application;
FIG. 6 is a flowchart of a rendering method according to an embodiment of the present application;
FIG. 7 is a schematic diagram of a rendering engine according to an embodiment of the present application;
FIG. 8 is a schematic diagram of a computing device according to an embodiment of the present application;
FIG. 9 is a schematic structural diagram of a computing device cluster according to an embodiment of the present application;
FIG. 10 is a schematic structural diagram of a computing device cluster according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present invention will be described below with reference to the accompanying drawings. It will be apparent that the described embodiments are only some, but not all, of the embodiments of the present specification.
Vision-based motion recognition has a wide range of applications, such as visual driving and video surveillance. Visual driving is a technique that uses the recognized motion information of a target object to drive the rendering of a virtual object whose shape or structure is similar to that of the target object; it is widely used in fields such as video production and virtual reality (VR). The target object may be a user, and the virtual object may be a virtual digital person corresponding to the user. A virtual digital person is a virtual character with a digital appearance. The virtual digital person has the appearance of a person, with a specific look, sex, personality, and other characteristics. The virtual digital person may also possess the behavior of a person, with the ability to express itself through language, facial expressions, and limb movements. In addition, the virtual digital person may possess the thinking of a person, with the ability to recognize the external environment and interact with people.
In one approach, as shown in FIG. 1, a user captures a video including an image of the user with a terminal device, and the terminal device identifies the motion information of the user from the video and renders a virtual digital person based on that motion information. Recognizing motion information and rendering a digital person require substantial computing power, which places high performance requirements on the terminal device. Moreover, the related programs are complicated to install and difficult to operate, so their availability to users is low.
In another approach, as shown in FIG. 2, a user captures a video including an image of the user with a terminal device and sends the video to a cloud device. The cloud device identifies the motion information of the user from the video, renders the virtual digital person based on the motion information, and then sends the rendering result to the terminal device, which displays it. Because the identification of motion information and the rendering of the digital person are executed by the cloud device, the performance dependence on the terminal device is reduced; the terminal device is only responsible for video capture and for displaying the rendering result, and the related programs are simple to install. However, the data volume of the video is large and the video needs to be transmitted to the cloud device in real time, which places high requirements on the bandwidth and stability of the network. Also because of the large data volume, the frame rate of the video may be too low or frames may be lost, which affects the user experience.
Referring to FIG. 3, an embodiment of the present application provides a rendering system, including a rendering client 100 and a rendering server 200. The rendering client 100 may acquire an image including an object A and perform image processing on the acquired image to remove information that is unrelated or only weakly related to the motion of object A, so that the processed image has a smaller data volume while retaining the motion-related information of object A (for example, the pose of object A). The processed images are then encoded as video frames of a video stream, and the video stream is transmitted to the rendering server 200. Because the data volume of each processed image is smaller, the data volume of the video stream is also smaller, which lowers the network performance requirements for transmitting the stream. After the rendering server 200 receives the video stream, it can obtain the motion information of object A from the stream: the images used to encode the video stream retain the motion-related information of object A, so the motion of object A can still be recognized. In addition, because information unrelated or only weakly related to the motion of object A has been removed, the amount of data that the rendering server 200 must process to recognize the motion of object A is reduced, which saves the computing resources of the rendering server.
In the embodiments of this application, object A may also be referred to as the target object; it is the rendering target of the rendering system. Object A may be an object or a person. For convenience of description, an image that contains a picture of object A is simply referred to as an image including object A.
Next, the schemes provided in the embodiments of the present application are described by way of example.
In some embodiments, the rendering client 100 may be configured on a terminal device, which may be a mobile phone, a tablet, a personal computer (PC), a smart wearable device, a smart television, a vehicle-mounted terminal, an aerial drone, or the like. The terminal device provides a hardware environment for the operation of the rendering client 100.
The terminal device may include one or more memories and one or more processors. The one or more memories may be used to store program instructions of the rendering client 100 (for example, image processing instructions) and data (for example, images acquired by the rendering client 100 and rendering results received from the rendering server 200). The one or more processors are configured to read the program instructions of the rendering client 100 from the memory and execute them to implement the corresponding operations of the rendering client 100, such as image processing operations. In some embodiments, the one or more processors may include a graphics processing unit (GPU), which can perform mathematical and geometric calculations to carry out image processing operations. In this embodiment, the image processing operation may also be referred to as the target operation.
In one illustrative example, the one or more processors may include an image processor that is provided specifically by the rendering system in which the rendering client 100 is located and is configured to perform the image processing operations of the rendering client 100. In other words, the image processor is a processor in the rendering system dedicated to the image processing operations performed by the rendering client 100.
In another illustrative example, the one or more processors may include an image processor that does not belong to the rendering system in which the rendering client 100 is located. The rendering client 100 may invoke the hardware capabilities or hardware resources of this image processor to perform image processing operations.
The terminal device may be configured with a video camera or a still camera, which can capture or otherwise acquire an image including object A. For example, object A may be a person, and an image including object A is an image including a person; the camera may capture an image corresponding to a face (that is, an image including a face) or an image of a person's body posture. Specifically, when object A is located within the shooting range of the camera, object A generates an optical image through the lens of the camera, which is projected onto the photosensitive element of the camera. The photosensitive element may be a charge-coupled device (CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor. The photosensitive element converts the optical signal into an electrical signal, which is then passed to an image signal processor (ISP) to be converted into a digital image signal. The ISP outputs the digital image signal to a digital signal processor (DSP) for processing, and the DSP converts the digital image signal into an image signal in a standard format such as ARGB, RGB, or YUV.
In some embodiments, the terminal device may further include a display screen for displaying images or interfaces; through the display screen, the rendering client 100 can present rendering results. The terminal device may also include a communication module through which the rendering client 100 interacts with the rendering server 200, for example to send a video stream to the rendering server 200. The communication module may be a mobile communication module, which can provide the rendering client 100 with 5th generation (5G), 4th generation (4G), or 3rd generation (3G) mobile communication technology so that the rendering client 100 can access the mobile communication network. The communication module may also be a wireless communication module, which can provide the rendering client 100 with wireless communication solutions such as a wireless local area network (WLAN) (for example, a wireless fidelity (Wi-Fi) network) or Bluetooth (BT), so that the rendering client 100 can access the network connected to the rendering server 200.
In some embodiments, the rendering server 200 may be configured on a remote device. The remote device may be, for example, a remote server deployed at a remote location, or one or more computing nodes in a cloud computing data center. The rendering server 200 may be one or more computing instances deployed on the remote device, where a computing instance may be one or more processes, a virtual machine (VM), or a container. The specific implementation form of the rendering server 200 is not limited in the embodiments of this application.
The remote device may provide a hardware environment for the rendering server to run. For example, the remote device may include one or more memories and one or more processors. The one or more memories are used to store program instructions (e.g., motion recognition instructions, target object rendering instructions, etc.) and data (e.g., video streams received from the rendering client 100, motion recognition models, etc.) of the rendering server 200. The one or more processors are configured to read program instructions of the rendering server 200 from the memory and execute the program instructions, thereby implementing corresponding operations of the rendering server 200, such as action recognition operations, graphics rendering operations, and the like. In some embodiments, the one or more processors may include a graphics processor. The remote device may further include a communication module, so that the rendering server 200 may perform information interaction with the rendering client 100 through the communication module, for example, receive a video stream sent by the rendering client 100. The communication module may be implemented by referring to the description of the communication module of the terminal device, which is not described herein.
The above examples describe the hardware environment of the rendering client 100 and the rendering server 200. Next, their software architecture is described by way of example.
Referring to FIG. 4, the rendering client 100 may include an acquisition module 101, an image processing module 102, and a sending module 103. The acquisition module 101 may acquire an image captured by a video camera or a still camera from that camera. More specifically, when object A is located within the shooting range of the camera, the captured image includes a picture of object A. The captured image may be in ARGB, RGB, or YUV format. In some embodiments, object A may be a person; when the person's face is within the capture range of the camera, the camera captures an image including the face. An image including a face may also be referred to as an image including a human image, or an image corresponding to the face.
In addition, the image acquired by the rendering client 100 is subsequently converted into a video frame in the video, and thus, one image acquired by the rendering client 100 may be referred to as one frame image.
In addition, in the following description, for convenience, an image that contains a picture of an object is referred to as an image including the object.
The acquisition module 101 may pass the acquired image to the image processing module 102, which performs image processing on the image to remove information that is unrelated or only weakly related to the motion of object A, reducing the data volume of the image while retaining the motion-related information of object A (for example, the pose of object A). The image processing here may also be referred to as the target operation. Specific image processing schemes are described in the method embodiments below and are not repeated here.
The image processing module 102 may pass the processed image to the sending module 103. The sending module 103 encodes the processed image as a video frame of a video stream; that is, the video stream is generated from the processed images. The sending module 103 may then send the video stream to the rendering server 200. Because the video frames are processed images, their data volume is smaller than that of unprocessed images, so the video stream encoded from processed images carries less data than one encoded from unprocessed images and places lower requirements on the network performance between the rendering client 100 and the rendering server 200.
With continued reference to FIG. 4, the rendering server 200 includes a receiving module 201 and an identification module 202. The receiving module 201 of the rendering server 200 may receive the video stream sent by the rendering client 100 and pass it to the identification module 202. The identification module 202 may identify the motion information of the object based on the video stream. As described above, the video frames in the video stream retain the motion-related information of object A, so the identification module 202 can recognize the motion of object A from that information and obtain the motion information of object A. Specific motion recognition schemes are described in the method embodiments below and are not repeated here.
In some embodiments, the rendering server 200 may include a rendering module 203. The rendering module 203 may obtain the motion information of object A identified by the identification module 202 and, based on that motion information, drive the rendering of virtual object a to obtain a rendering result. Virtual object a is a virtualized object corresponding to object A; illustratively, its structure is similar or identical to that of object A. In one example, object A may be a person and virtual object a may be a virtual digital person.
The rendering server 200 may further include a sending module 204, configured to send the rendering result generated by the rendering module 203 to the rendering client 100.
With continued reference to fig. 4, the rendering client 100 may include a display module 104 configured to receive and display a rendering result sent by the rendering server 200.
In some embodiments, the rendering server 200 may include an analysis module (not shown). The analysis module can analyze the action information of the object A identified by the identification module 202, so as to monitor the action of the object A.
The above examples introduce the hardware architecture and software architecture of the rendering client 100 and the rendering server 200. Next, the vision-based action recognition method provided in the embodiments of the present application is described by way of example in connection with the rendering client 100 and the rendering server 200.
Referring to FIG. 5, the rendering client 100 may receive an operation instruction in step 501. The operation instruction may be issued by a user. In response to the operation instruction, the rendering client 100 may start the acquisition module 101, which is initialized into a working state. The acquisition module 101 in the working state may perform step 502 to acquire an image, for example from a camera of the terminal device on which the rendering client 100 is located. When object A is located within the capture range of the camera, the captured image includes object A, and the acquisition module 101 can acquire that image from the camera. As described above, the rendering client 100 may be configured on, or be, a common terminal device (for example, a mobile phone, a personal computer, a notebook computer, a smart wearable device, a smart television, or a vehicle-mounted terminal); accordingly, the camera may be the camera of that terminal device, and the captured image may be a color image in a conventional format such as ARGB, RGB, or YUV.
Step 502 may be performed continuously; that is, the acquisition module 101 may continuously acquire images. For example, the camera of the terminal device may continuously capture images at a preset frame rate, and each captured frame may be passed to the acquisition module 101. Each time the acquisition module 101 acquires an image, the image is subjected to image processing as described in step 504 below.
Next, an example is described using an image B that includes object A.
Suppose that, in step 502, the acquisition module 101 acquires an image B including object A. Image B is then passed to the image processing module 102. In step 504, the image processing module 102 performs image processing on image B to obtain image b. The image processing on image B specifically removes the information in image B that is unrelated or only weakly related to the motion of object A and retains the motion-related information of object A, so that the processed image has a smaller data volume while the motion-related information is preserved.
The motion-related information of object A is the information used to identify the motion of object A. Taking object A as a person as an example, the motion-related information may be the information used by a human pose estimation algorithm to calculate the range of movement of skeletal key points. In general, the motion-related information of object A is the pose of object A; for a person, the pose is the body posture, that is, the shape of the body as it moves.
In some embodiments, the image processing or target operation may include gray-scale processing; that is, the image processing performed on image B may include removing the color of image B, converting it into a gray-scale map. The motion of object A is generally unrelated or only weakly related to the color of object A, or the motion recognition model is not sensitive to color; therefore, removing the color of image B reduces the data volume of image B with little or no effect on subsequent motion recognition.
In one illustrative example, let image B be an RGB image. The gray value of each pixel in image B can be calculated using formula (1), converting image B into a gray-scale map.
Grey=0.299*R+0.587*G+0.114*B (1)
Where Grey represents the gray value of the pixel, R represents the value of the pixel on the R (red) channel, G represents the value of the pixel on the G (green) channel, and B represents the value of the pixel on the B (blue) channel.
Thus, the image B can be compressed into a single-channel gray scale image, and the data volume of the image B is greatly reduced.
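As a minimal illustration of formula (1) only (the patent does not prescribe any particular language or library; Python with NumPy and the array layout below are assumptions), the gray-scale conversion could look like this:

```python
import numpy as np

def to_gray(frame_rgb: np.ndarray) -> np.ndarray:
    """Collapse an H x W x 3 RGB frame into a single-channel gray-scale map
    using formula (1): Grey = 0.299*R + 0.587*G + 0.114*B."""
    r = frame_rgb[..., 0].astype(np.float32)
    g = frame_rgb[..., 1].astype(np.float32)
    b = frame_rgb[..., 2].astype(np.float32)
    grey = 0.299 * r + 0.587 * g + 0.114 * b
    return grey.astype(np.uint8)  # one byte per pixel instead of three
```

Storing one channel instead of three reduces the raw data volume of the frame to roughly one third before it is encoded into the video stream.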
In some embodiments, the image processing or target operation may include high-pass filtering; that is, the image processing performed on image B may include removing the low-frequency information of image B and preserving its high-frequency information. An image is composed of low-frequency and high-frequency information. Low-frequency information, also called the low-frequency signal or low-frequency component, corresponds to regions where the image intensity (brightness/gray level) changes smoothly, that is, large patches of similar color. High-frequency information, also called the high-frequency signal or high-frequency component, corresponds to regions where the image intensity changes sharply, that is, the edges and contours of figures in the image. The human eye and the motion recognition model are more sensitive to the high-frequency information of an image; the high-frequency information is the key information for recognizing the pose of an object in the image, while the low-frequency information is not. Removing the low-frequency information of image B therefore greatly reduces its data volume without affecting the motion recognition of object A.
In one illustrative example, image B may be high pass filtered using a high pass filter to remove low frequency information of image B. The high-pass filter is used for retaining high-frequency information of the image and discarding low-frequency information of the image.
Image B is transformed to the frequency domain before the high-pass filtering. Specifically, for spectrum centering, the intensity (brightness/gray level) of each pixel in image B is multiplied by (-1)^(x+y), where x and y are the coordinates of the pixel in image B. Before spectrum centering, the high-frequency information lies around the edges of the spectrum; after centering, it lies at the center. A discrete Fourier transform is then performed on the spectrum-centered image to obtain the frequency information F(u, v) of the image, where u and v denote the coordinates of the spectrum-centered image B in the frequency domain. The high-frequency information of image B is then extracted by formula (2).
G(u,v)=H(u,v)F(u,v) (2)
where G(u, v) represents the high-frequency information of image B, and H(u, v) represents the high-pass filter.
In one example, the high pass filter used may be specifically an exponential high pass filter, as shown in equation (3).
where u and v denote the coordinates of the spectrum-centered image B in the frequency domain, D0 is the initialized distance parameter, and D represents the distance from the current position to the center of the frequency domain.
In other examples, the high-pass filter may be a ladder-type high-pass filter, a Butterworth filter, etc., which will not be described in detail herein.
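The following sketch shows the frequency-domain steps described above in Python with NumPy. Because equation (3) is not reproduced in this text, a Gaussian high-pass transfer function is used here as a stand-in for the exponential filter, and the cutoff parameter D0 = 30 is an arbitrary assumption:

```python
import numpy as np

def high_pass(gray: np.ndarray, d0: float = 30.0) -> np.ndarray:
    """Keep the high-frequency information (edges, contours) of a gray-scale
    image and suppress its low-frequency information."""
    h, w = gray.shape
    x, y = np.meshgrid(np.arange(w), np.arange(h))
    # Spectrum centering: multiply pixel intensities by (-1)^(x+y).
    centered = gray.astype(np.float32) * ((-1.0) ** (x + y))
    f = np.fft.fft2(centered)                 # F(u, v)
    # D: distance from each frequency position to the center of the spectrum.
    d = np.sqrt((x - w / 2) ** 2 + (y - h / 2) ** 2)
    # Gaussian high-pass H(u, v), used here in place of the exponential
    # filter of equation (3), which is not reproduced in this text.
    h_uv = 1.0 - np.exp(-(d ** 2) / (2.0 * d0 ** 2))
    g_uv = h_uv * f                           # G(u, v) = H(u, v) F(u, v), formula (2)
    # Back to the spatial domain; undo the centering factor.
    out = np.real(np.fft.ifft2(g_uv)) * ((-1.0) ** (x + y))
    return np.clip(out, 0, 255).astype(np.uint8)
```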
In some embodiments, image processing of image B may include both removing the color of image B and removing the low frequency information of image B. For example, in image processing of image B, the color of image B may be removed first, and then the low frequency information of image B may be removed.
By performing the above image processing on image B, the data volume of image B can be greatly reduced while the motion-related information of object A is retained.
The image B subjected to the above image processing is referred to as image b. With continued reference to FIG. 5, the image processing module 102 may pass image b to the sending module 103. The sending module 103 may perform step 506 to use image b as a video frame of video stream C, and in step 507 video stream C is sent to the rendering server 200, so that image b reaches the rendering server 200 through the transmission of video stream C. Steps 502-505 are performed continuously, and each time the sending module 103 receives an image, it uses the most recently received image as a video frame of video stream C so that the image is sent to the rendering server 200. In some embodiments, the sending module 103 may encode or add image b to video stream C according to a preset video streaming protocol, for example the real-time streaming protocol (RTSP) or the real-time messaging protocol (RTMP). The video streaming protocol used to encode or add image b to video stream C is not limited in the embodiments of this application.
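As an illustrative sketch only of steps 506-507 (the patent does not prescribe ffmpeg; the RTMP endpoint, frame size, and frame rate below are assumptions), the sending module's behavior could be approximated by piping processed frames into an ffmpeg process that publishes video stream C:

```python
import subprocess
import numpy as np

W, H, FPS = 640, 480, 25                                   # assumed capture parameters
RTMP_URL = "rtmp://rendering-server.example/live/streamC"  # hypothetical endpoint

# ffmpeg reads raw single-channel frames from stdin, encodes them, and
# publishes the stream over RTMP.
ffmpeg = subprocess.Popen(
    ["ffmpeg", "-f", "rawvideo", "-pix_fmt", "gray",
     "-s", f"{W}x{H}", "-r", str(FPS), "-i", "-",
     "-c:v", "libx264", "-pix_fmt", "yuv420p", "-f", "flv", RTMP_URL],
    stdin=subprocess.PIPE,
)

def send_frame(image_b: np.ndarray) -> None:
    """Append one processed frame (image b) to video stream C (steps 506-507)."""
    assert image_b.shape == (H, W) and image_b.dtype == np.uint8
    ffmpeg.stdin.write(image_b.tobytes())
```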
With continued reference to FIG. 5, the receiving module 201 of the rendering server 200 may receive the video stream C sent by the rendering client 100 and then, in step 508, pass it to the identification module 202. In some embodiments, the receiving module 201 may convert the format of the received video stream C into the data format required by the identification module 202.
The identification module 202 generally recognizes the pose of an object in individual images and derives the motion of the object from the poses in a sequence of images whose capture times are adjacent. The data it requires is therefore in image format rather than video stream or video frame format. Accordingly, in step 508, the video frames in video stream C may be converted back into images. As described above, the video frames in video stream C are converted from image b, whose data volume is smaller than that of image B, so the data volume of the images converted back from the video frames is also smaller than that of image B. Illustratively, a gray-scale map is represented by a two-dimensional matrix that records the position and gray value of each pixel in the two-dimensional coordinate system.
The identification module 202 may perform step 509 to obtain the motion information or expression information of object A based on video stream C. Specifically, the identification module 202 may input video stream C into a motion recognition model or a facial expression recognition (FER) model and obtain the motion information or expression information of object A through inference by that model.
The motion recognition model may be a trained model. In some embodiments, the motion recognition model or the facial expression recognition model may be trained with gray-scale images as the training set and validation set, which reduces the computing resources (for example, GPU resources) required for model training. In other embodiments, the motion recognition model may be trained with conventional color images (for example, RGB images) as the training set and validation set; that is, the scheme of the embodiments of this application can reuse an existing motion recognition model, which improves the applicability of the scheme.
Video stream C may include a plurality of video frames whose corresponding capture times are adjacent in sequence and which contain motion-related information of object A, such as its pose. The motion information of object A can therefore be obtained from the motion-related information contained in these video frames. Taking object A as a person and the motion-related information as a pose, the pose may be a body posture such as standing, crawling, running, jumping, or squatting. The changes of a person's pose across multiple video frames reflect the person's motion, from which the motion information of the person can be obtained. In some embodiments, the motion recognition model may use a human pose estimation algorithm that calculates the range of movement of skeletal key points from the video frames to obtain the motion information of the person.
In some embodiments, the action recognition model may be a neural network model. In one example, the motion recognition model may be specifically a three-dimensional (3D) convolutional neural network (convolutional neural networks, CNN).
In other embodiments, the action recognition model may also be another form of model. The implementation form of the motion recognition model is not particularly limited in the embodiment of the present application.
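For the 3D CNN case mentioned above, the following PyTorch sketch is for illustration only; the patent does not specify a network layout, and the layer sizes, clip length, and action class count are assumptions:

```python
import torch
import torch.nn as nn

class TinyAction3DCNN(nn.Module):
    """Minimal 3D CNN over a clip of gray frames, input shape (N, 1, T, H, W)."""
    def __init__(self, num_actions: int = 10):   # class count is an assumption
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),              # global pooling over T, H, W
        )
        self.classifier = nn.Linear(32, num_actions)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(clip).flatten(1))

# Example: action scores for a clip of 16 consecutive 112x112 gray frames
# decoded from video stream C.
scores = TinyAction3DCNN()(torch.randn(1, 1, 16, 112, 112))
```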
In some embodiments, the facial expression recognition model may be used to recognize facial expressions and obtain expression information according to states or actions of various parts in a face in a video stream.
In some embodiments, the facial expression recognition model may be used to recognize the expression of the person according to the state or the motion of each part in the face in at least one video frame in the video stream, and obtain expression information.
In some embodiments, the facial expression recognition model may be a neural network model. In one example, the facial expression recognition model is a convolutional neural network.
In other embodiments, the facial expression recognition model may also be another form of model. The implementation form of the facial expression recognition model is not particularly limited in the embodiment of the application.
In this way, the identification module 202 obtains the motion information or expression information of object A. This information can be used in many ways, for example for visual driving or video monitoring.
Next, taking the application in visual driving as an example, the use of the motion information of object A is described.
With continued reference to FIG. 5, the rendering server 200 may include a rendering module 203. The rendering module 203 may obtain the motion information of object A in step 510 and then, in step 511, render virtual object a according to that motion information. As described above, virtual object a corresponds to object A, and its structure is similar to that of object A; taking object A as a person as an example, virtual object a may be a virtual digital person. In step 511, the rendering module 203 may obtain the graphics data of virtual object a, which may include the position and orientation of the model in three-dimensional space and data such as the vertices, materials, and textures that form the model. The rendering module 203 renders virtual object a according to this graphics data and the motion information of object A to obtain a rendering result in which the motion of virtual object a is the same as or similar to the motion of object A. In one example, the rendering module 203 may employ a rendering engine to render virtual object a; reference may be made to the prior art, and the details are not repeated here.
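As a toy illustration only of how motion information could drive virtual object a before rendering (the patent does not specify the data layout; the joint set, hierarchy, and rest-pose offsets below are hypothetical), a forward-kinematics sketch might look like this:

```python
import numpy as np

# Hypothetical skeleton: each joint has a parent and a rest-pose offset (meters).
PARENTS = {"spine": None, "shoulder": "spine", "elbow": "shoulder", "wrist": "elbow"}
OFFSETS = {"spine": np.zeros(3), "shoulder": np.array([0.2, 0.4, 0.0]),
           "elbow": np.array([0.3, 0.0, 0.0]), "wrist": np.array([0.25, 0.0, 0.0])}

def rot_z(theta: float) -> np.ndarray:
    """Rotation about the z axis by theta radians."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def pose_skeleton(joint_angles: dict) -> dict:
    """Forward kinematics: turn per-joint angles (the recognized motion
    information) into world-space joint positions for posing the model."""
    world_rot, world_pos = {}, {}
    for joint, parent in PARENTS.items():      # parents listed before children
        local = rot_z(joint_angles.get(joint, 0.0))
        if parent is None:
            world_rot[joint], world_pos[joint] = local, OFFSETS[joint]
        else:
            world_rot[joint] = world_rot[parent] @ local
            world_pos[joint] = world_pos[parent] + world_rot[parent] @ OFFSETS[joint]
    return world_pos

# Motion information recognized for one frame, e.g. a raised forearm.
print(pose_skeleton({"shoulder": 0.3, "elbow": 1.1}))
```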
Next, still taking the application in visual driving as an example, the use of the expression information of object A is described.
The rendering module 203 may obtain the expression information of object A in step 510 and then, in step 511, render virtual object a according to that expression information. As described above, virtual object a corresponds to object A, and its structure is similar to that of object A; taking object A as a person as an example, virtual object a may be a virtual digital person. In step 511, the rendering module 203 may obtain the graphics data of virtual object a, which may include the position and orientation of the model in three-dimensional space and data such as the vertices, materials, and textures that form the model. The rendering module 203 renders virtual object a according to this graphics data and the expression information of object A to obtain a rendering result in which the expression of virtual object a is the same as or similar to the expression of object A.
The rendering module 203 may pass the rendering result to the sending module 204 of the rendering server 200 in step 512. The sending module 204 may send the rendering result to the rendering client 100 in step 513. The display module 104 of the rendering client 100 may perform step 514 to display the rendering result, thereby presenting the rendering result driven by the motion of object A to the user.
In the rendering scheme provided in the embodiments of this application, the rendering client performs image processing on the acquired image to remove information that is unrelated or only weakly related to motion recognition. This greatly reduces the data volume to be transmitted and the dependence on network performance without affecting motion recognition or expression recognition. In addition, because the data volume of the images used for motion or expression recognition is small, the computing resources required for recognition are also reduced.
Based on the rendering scheme described above, an embodiment of this application provides a rendering method. The rendering method corresponds to the rendering scheme described above, and the specific execution of its steps may refer to the execution of the corresponding steps in that scheme.
The rendering method may be applied to a rendering client in a terminal device, and the rendering system in which the rendering client is located further includes a rendering server. The rendering client may be implemented with reference to the description of the rendering client 100, and the rendering server may be implemented with reference to the description of the rendering server 200; the details are not repeated here.
Referring to fig. 6, the rendering method includes the following steps.
In step 601, a first frame image is acquired, the first frame image comprising a target object. The step 601 may be specifically implemented with reference to the description of the step 502 in fig. 5, which is not repeated herein.
Step 603: perform a target operation on the first frame image to obtain a second frame image, where the data volume of the second frame image is smaller than that of the first frame image. Step 603 may be implemented with reference to the description of step 504 in FIG. 5 and is not repeated here.
Step 605, the second frame image is sent to the rendering server, so that the rendering server renders the target object. The step 605 may be specifically implemented by referring to the descriptions of the steps 506-511 in fig. 5, and will not be described herein.
In some embodiments, the rendering method may be applied to a virtual digital person rendering scene, and step 601 includes: acquiring a first frame image including a human image from a camera, where the camera is used to capture an image corresponding to a human face. The camera may be a camera configured on the terminal device on which the rendering client is located, an independent camera, or a camera configured on another device. Step 601 may also be implemented with reference to the description of step 502 in FIG. 5 above.
In some embodiments, the target operation includes gray scale processing. Reference is made in particular to the implementation of the description of step 504 in fig. 5 above.
In some embodiments, the target operation includes high pass filtering. Reference is made in particular to the implementation of the description of step 504 in fig. 5 above.
In some embodiments, the rendering method further comprises: and receiving and displaying a rendering result of the second frame image sent by the rendering server. The implementation of steps 512-514 in fig. 5 may be referred to, and will not be described herein.
In some embodiments, the terminal device further includes an image processor, and step 603 includes: performing the target operation on the first frame image by the image processor. Reference may be made to the description of the rendering client 100 in FIG. 3 above; the details are not repeated here.
In the rendering method provided in the embodiments of this application, the rendering client performs image processing on the acquired image to reduce its data volume and sends the reduced image to the rendering server. This saves network bandwidth between the rendering client and the rendering server, reduces the dependence of image rendering on the network performance between them, lowers the data transmission delay, reduces the probability of frame loss, and improves the user experience.
An embodiment of this application further provides a rendering engine 700, which may be configured on a terminal device. As shown in FIG. 7, the rendering engine 700 includes the following functional modules:
an acquisition module 710, configured to acquire a first frame image, where the first frame image includes a target object;
an operation module 720, configured to perform a target operation on the first frame image, and obtain a second frame image, where a data amount of the second frame image is smaller than a data amount of the first frame image;
and a communication module 730, configured to send the second frame image to the rendering server, so that the rendering server renders the target object.
The acquisition module 710, the operation module 720, and the communication module 730 may be implemented by software or by hardware. Illustratively, the implementation of the acquisition module 710 is described next as an example; the operation module 720 and the communication module 730 may be implemented similarly, with reference to the implementation of the acquisition module 710.
When the module is implemented as a software functional unit, the acquisition module 710 may include code running on a computing instance. The computing instance may include at least one of a physical host (computing device), a virtual machine, and a container, and there may be one or more such computing instances. For example, the acquisition module 710 may include code running on multiple hosts/virtual machines/containers. The multiple hosts/virtual machines/containers used to run the code may be distributed in the same region or in different regions. Further, they may be distributed in the same availability zone (AZ) or in different AZs, where each AZ includes one data center or multiple geographically close data centers, and a region typically includes multiple AZs.
Also, multiple hosts/virtual machines/containers for running the code may be distributed in the same virtual private cloud (virtual private cloud, VPC) or in multiple VPCs. In general, one VPC is disposed in one region, and a communication gateway is disposed in each VPC for implementing inter-connection between VPCs in the same region and between VPCs in different regions.
When the module is implemented as a hardware functional unit, the acquisition module 710 may include at least one computing device, such as a server. Alternatively, the acquisition module 710 may be a device implemented using an application-specific integrated circuit (ASIC) or a programmable logic device (PLD). The PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), a generic array logic (GAL), or any combination thereof.
The multiple computing devices included in the acquisition module 710 may be distributed in the same region or may be distributed in different regions. The plurality of computing devices included in the acquisition module 710 may be distributed among the same AZ or may be distributed among different AZ. Likewise, multiple computing devices included in the acquisition module 710 may be distributed in the same VPC or may be distributed among multiple VPCs. Wherein the plurality of computing devices may be any combination of computing devices such as servers, ASIC, PLD, CPLD, FPGA, and GAL.
It should be noted that, in other embodiments, the acquisition module 710, the operation module 720, and the communication module 730 may each be used to perform any step of the rendering method shown in FIG. 6. The steps that each module is responsible for implementing may be specified as needed, with the three modules respectively implementing different steps of the rendering method shown in FIG. 6 so as to realize all functions of the rendering engine 700.
The present application also provides a computing device 800. As shown in fig. 8, the computing device 800 includes: a bus 802, a processor 804, a memory 806, and a communication interface 808. The processor 804, the memory 806, and the communication interface 808 communicate via the bus 802. The computing device 800 may be a server or a terminal device. It should be understood that the present application does not limit the number of processors or memories in the computing device 800.
The bus 802 may be a peripheral component interconnect (peripheral component interconnect, PCI) bus or an extended industry standard architecture (extended industry standard architecture, EISA) bus, among others. The bus may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one line is shown in fig. 8, but this does not mean there is only one bus or one type of bus. The bus 802 may include a pathway for transferring information among the components of the computing device 800 (e.g., the memory 806, the processor 804, and the communication interface 808).
The processor 804 may include any one or more of a central processing unit (central processing unit, CPU), a graphics processing unit (graphics processing unit, GPU), a microprocessor (microprocessor, MP), or a digital signal processor (digital signal processor, DSP).
The memory 806 may include volatile memory, such as random access memory (random access memory, RAM). The memory 806 may also include non-volatile memory, such as read-only memory (read-only memory, ROM), flash memory, a hard disk drive (hard disk drive, HDD), or a solid state drive (solid state drive, SSD).
The memory 806 stores executable program code, and the processor 804 executes the executable program code to respectively implement the functions of the aforementioned acquisition module 710, operation module 720, and communication module 730, thereby implementing the rendering method shown in fig. 6. That is, the memory 806 stores instructions for performing the rendering method shown in fig. 6.
The communication interface 808 enables communication between the computing device 800 and other devices or communication networks using a transceiver module such as, but not limited to, a network interface card or a transceiver.
The embodiment of the application also provides a computing device cluster. The cluster of computing devices includes at least one computing device. The computing device may be a server, such as a central server, an edge server, or a local server in a local data center. In some embodiments, the computing device may also be a terminal device such as a desktop, notebook, or smart phone.
As shown in fig. 9, the cluster of computing devices includes at least one computing device 900. The same instructions for performing the rendering method shown in fig. 6 may be stored in memory 906 in one or more computing devices 900 in the computing device cluster.
In some possible implementations, portions of the instructions for performing the rendering method shown in fig. 6 may also be stored separately in the memory 906 of one or more computing devices 900 in the computing device cluster. In other words, a combination of one or more computing devices 900 may collectively execute instructions for performing the rendering method shown in FIG. 6.
It should be noted that the memory 906 in different computing devices 900 in the computing device cluster may store different instructions for performing part of the functions of the rendering engine 700, respectively. That is, the instructions stored by the memory 906 in the different computing devices 900 may implement the functionality of one or more of the acquisition module 710, the operation module 720, and the communication module 730.
In some possible implementations, one or more computing devices in the computing device cluster may be connected through a network. The network may be a wide area network, a local area network, or the like. Fig. 10 shows one possible implementation. As shown in fig. 10, two computing devices 900A and 900B are connected by a network; specifically, each computing device connects to the network through its communication interface. In this possible implementation, instructions for performing the functions of the acquisition module 710 are stored in the memory 906 of the computing device 900A, while instructions for performing the functions of the operation module 720 and the communication module 730 are stored in the memory 906 of the computing device 900B.
It should be appreciated that the functionality of computing device 900A shown in fig. 10 may also be performed by multiple computing devices 900. Likewise, the functionality of computing device 900B may also be performed by multiple computing devices 900.
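For illustration only, the following minimal Python sketch shows one way the split of fig. 10 could be realized: computing device 900A runs only the acquisition module and ships each captured frame over the network to computing device 900B, which performs the target operation and forwards the reduced frame toward the rendering server. The host name, port and length-prefixed framing are assumptions of this sketch and are not specified by this application.

# Minimal sketch of the fig. 10 split (host, port and framing are illustrative only).
import socket
import struct

import cv2
import numpy as np

DEVICE_900B_ADDR = ("device-900b.example", 9000)  # hypothetical address of computing device 900B

def run_on_device_900a():
    # Computing device 900A: acquisition module only.
    capture = cv2.VideoCapture(0)
    ok, first_frame = capture.read()
    if ok:
        _, encoded = cv2.imencode(".png", first_frame)
        payload = encoded.tobytes()
        with socket.create_connection(DEVICE_900B_ADDR) as conn:
            conn.sendall(struct.pack("!I", len(payload)) + payload)  # length-prefixed frame

def run_on_device_900b():
    # Computing device 900B: operation module and communication module.
    with socket.create_server(("", DEVICE_900B_ADDR[1])) as server:
        conn, _ = server.accept()
        with conn:
            size = struct.unpack("!I", conn.recv(4))[0]  # simplified read of the 4-byte length prefix
            data = b""
            while len(data) < size:
                data += conn.recv(size - len(data))
            first_frame = cv2.imdecode(np.frombuffer(data, np.uint8), cv2.IMREAD_COLOR)
            gray = cv2.cvtColor(first_frame, cv2.COLOR_BGR2GRAY)  # operation module: gray-scale processing
            kernel = np.array([[0, -1, 0], [-1, 4, -1], [0, -1, 0]], dtype=np.float32)
            second_frame = cv2.filter2D(gray, -1, kernel)  # operation module: high-pass filtering
            _, out = cv2.imencode(".png", second_frame)  # communication module: placeholder upload to the rendering server

As noted above, the functionality of either device in this sketch could equally be spread over multiple computing devices 900.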
The embodiment of the application also provides another computing device cluster. The connection relationship between the computing devices in this computing device cluster may be similar to the connection relationship of the computing device cluster described with reference to fig. 9 and fig. 10. The difference is that the memory 906 of one or more computing devices 900 in this computing device cluster may store the same instructions for performing the rendering method shown in fig. 6.
In some possible implementations, portions of the instructions for performing the rendering method shown in fig. 6 may also be stored separately in the memory 906 of one or more computing devices 900 in the computing device cluster. In other words, a combination of one or more computing devices 900 may collectively execute instructions for performing the rendering method shown in FIG. 6.
Embodiments of the present application also provide a computer program product comprising instructions. The computer program product may be software, or a program product containing instructions, that can run on a computing device or be stored in any usable medium. When the computer program product runs on at least one computing device, the at least one computing device is caused to perform the rendering method shown in fig. 6.
Embodiments of the present application also provide a computer-readable storage medium. The computer-readable storage medium may be any usable medium that a computing device can store, or a data storage device, such as a data center, containing one or more usable media. The usable medium may be a magnetic medium (e.g., a floppy disk, a hard disk, or a magnetic tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a solid state drive), etc. The computer-readable storage medium includes instructions that instruct a computing device to perform the rendering method shown in fig. 6.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents; such modifications or substitutions do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (16)

1. A rendering method, characterized in that the method is applied to a rendering client in a terminal device, a rendering system in which the rendering client is located further comprises a rendering server, and the method comprises the following steps:
acquiring a first frame image, wherein the first frame image comprises a target object;
performing target operation on the first frame image to obtain a second frame image, wherein the data volume of the second frame image is smaller than that of the first frame image;
and sending the second frame image to the rendering server so that the rendering server renders the target object.
2. The method according to claim 1, wherein the method is applied to a virtual digital person rendering scene, and the acquiring a first frame image comprises:
acquiring the first frame image comprising a portrait from a camera, wherein the camera is used for acquiring an image corresponding to a face.
3. The method according to claim 1 or 2, wherein the target operation comprises gray-scale processing.
4. The method according to any one of claims 1 to 3, wherein the target operation comprises high-pass filtering.
5. The method according to any one of claims 1 to 4, further comprising:
receiving and displaying the rendering result of the second frame image sent by the rendering server.
6. The method according to any one of claims 1 to 5, wherein the terminal device further comprises an image processor, and wherein the performing the target operation on the first frame image comprises:
carrying out the target operation on the first frame image by using the image processor.
7. A rendering engine, configured at a terminal device, the rendering engine comprising:
the acquisition module is used for acquiring a first frame image, wherein the first frame image comprises a target object;
the operation module is used for carrying out target operation on the first frame image to obtain a second frame image, and the data volume of the second frame image is smaller than that of the first frame image;
and the communication module is used for sending the second frame image to the rendering server so that the rendering server renders the target object.
8. The rendering engine of claim 7, wherein the rendering engine is applied to a virtual digital person rendering scene, and the acquisition module is configured to: acquire the first frame image comprising a portrait from a camera, wherein the camera is used for acquiring an image corresponding to a face.
9. The rendering engine of claim 7 or 8, wherein the target operation comprises gray-scale processing.
10. The rendering engine of any one of claims 7 to 9, wherein the target operation comprises high-pass filtering.
11. The rendering engine of any of claims 7 to 10, wherein the communication module is further configured to:
receive and display the rendering result of the second frame image sent by the rendering server.
12. The rendering engine of any one of claims 7 to 11, wherein the terminal device further comprises an image processor, and the operation module is configured to: carry out the target operation on the first frame image by using the image processor.
13. A cluster of computing devices, comprising at least one computing device, each computing device comprising a processor and a memory;
the processor of the at least one computing device is configured to execute instructions stored in the memory of the at least one computing device to cause the cluster of computing devices to perform the method of any one of claims 1 to 6.
14. A computer-readable storage medium comprising computer program instructions which, when executed by a cluster of computing devices, cause the cluster of computing devices to perform the method of any one of claims 1 to 6.
15. A computer program product containing instructions that, when executed by a cluster of computing devices, cause the cluster of computing devices to perform the method of any of claims 1 to 6.
16. A rendering system comprising the rendering engine of any one of claims 7 to 12 and a rendering server.

