CN116245986A - Virtual sign language digital person driving method and device - Google Patents

Virtual sign language digital person driving method and device

Info

Publication number
CN116245986A
Authority
CN
China
Prior art keywords
sign language
motion information
action
information
action information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211712508.5A
Other languages
Chinese (zh)
Inventor
吴熙
刘佳
王路路
冉沿川
陆弘锴
王雪杨
彭钰婷
马梦瑶
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zhipu Huazhang Technology Co ltd
Original Assignee
Beijing Zhipu Huazhang Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zhipu Huazhang Technology Co ltd filed Critical Beijing Zhipu Huazhang Technology Co ltd
Priority to CN202211712508.5A
Publication of CN116245986A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00 Animation
    • G06T13/20 3D [Three Dimensional] animation
    • G06T13/40 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • G06T13/80 2D [Two Dimensional] animation, e.g. using sprites
    • G06T15/00 3D [Three Dimensional] image rendering
    • G06T15/005 General purpose rendering architectures

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Graphics (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The application provides a virtual sign language digital person driving method, relating to the field of computer technology. The method comprises the following steps: constructing a standard virtual sign language digital human skeletal system and building an action information base on that skeletal system, the action information base comprising a facial action information base and a body action information base; receiving an external instruction, converting it into a sign language word sequence, and extracting the action information corresponding to the sequence from the action information base; and rendering sign language action frames based on the action information and a preset pre-rendering frame number, then synthesizing the frames into a sign language video at a preset fps value. By adopting this scheme, the invention enables wide application of virtual sign language digital persons in different application scenarios.

Description

Virtual sign language digital person driving method and device
Technical Field
The application relates to the technical field of computers, in particular to a virtual sign language digital person driving method and device.
Background
A virtual digital person is a lifelike human figure generated with computer multimedia technology, human-computer interaction technology, and virtual reality technology; it can simulate the movements of a real person, expressing information with facial expressions, arms, and body without speaking. These characteristics allow it to serve hearing-impaired people well, and give it great application potential in live broadcasting, gaming, shopping, and tour guiding.
Generally, a virtual digital person must first be modeled and then rendered, yet most can only smile or simply stand: their actions are few, the information those actions convey is limited, and the digital person ends up as mere decoration, unable to serve the concrete needs of human society. In practical application, a passage of meaningful text is first input to the virtual digital person; a driving engine converts the text into sign language semantics and feeds them to a rendering engine, which renders and drives the virtual digital person frame by frame; the driving engine then combines the rendered images into a video at a given frame rate. At present, however, there is essentially no virtual digital person on the market capable of communicating information through sign language, nor a complete, mature driving and rendering system.
Disclosure of Invention
The present application aims to solve, at least to some extent, one of the technical problems in the related art.
Therefore, a first object of the present application is to provide a virtual sign language digital person driving method, which solves the technical problem that existing methods cannot drive a virtual digital person systematically, and enables wide application of virtual sign language digital persons in different application scenarios.
A second object of the present application is to propose a virtual sign language digital person driving device.
To achieve the above object, an embodiment of a first aspect of the present application provides a virtual sign language digital person driving method, including: constructing a standard virtual sign language digital human skeleton system and constructing an action information base based on the skeleton system, wherein the action information base comprises a face action information base and a body action information base; receiving an external instruction, converting the external instruction into a sign language word sequence, and extracting action information corresponding to the sign language word sequence from an action information base; and rendering the sign language action frame based on the action information and a preset pre-rendering frame number, and synthesizing the sign language action frame into a sign language video based on a preset fps value.
According to the virtual sign language digital person driving method of the present application, a universal virtual digital human skeletal system is first constructed. This multi-bone system is suitable for animating humans and other vertebrates, requires less memory than other animation schemes, and can be driven in a real-time rendering engine, so that animation effects can be previewed in real time and problems in them corrected promptly. In this way, the virtual sign language digital person can coordinate the movements of the face, both arms, all ten fingers, and the body simultaneously, vividly conveying semantics to the outside world.
Optionally, in one embodiment of the present application, constructing a standard virtual sign language digital human skeletal system includes:
a standard virtual sign language digital human skeletal system consisting of 119 bones, including facial bones, arm bones, and body bones, is constructed on the basis of a root bone.
Optionally, in one embodiment of the present application, the facial action information base includes eye action information, eyebrow action information, and mouth action information; the eye action information includes micro-action information for blinking, opening the eyes, closing the eyes, and squinting; the eyebrow action information includes micro-action information for raising the eyebrows; the mouth action information includes action information for the fully open, half open, closed, pouting, lip-pursing, and lip-curling states;
the body action information base includes arm action information, hand action information, and torso action information; the arm action information includes the actions of the standard general sign language as well as sign language actions from the sports field; the hand action information includes action information for all ten fingers; the torso action information includes side-to-side swaying and upright standing.
Optionally, in one embodiment of the present application, the method further includes: adjusting the pre-rendering frame number and the fps value to meet the requirements of different scenes, wherein the pre-rendering frame number controls how many action frames each action yields after rendering, and the fps value controls the playback speed of the sign language video.
To achieve the above object, a second aspect of the present invention provides a virtual sign language digital person driving apparatus, which includes a standard information construction module, an action information acquisition module, and a video generation module, wherein,
the standard information construction module is used for constructing a standard virtual sign language digital human skeleton system and constructing an action information base based on the skeleton system, wherein the action information base comprises a face action information base and a body action information base;
the action information acquisition module is used for receiving an external instruction, converting the external instruction into a sign language word sequence, and extracting action information corresponding to the sign language word sequence from the action information base;
the video generation module is used for rendering sign language action frames based on the action information and a preset pre-rendering frame number and synthesizing the sign language action frames into sign language videos based on a preset fps value.
Optionally, in one embodiment of the present application, constructing a standard virtual sign language digital human skeletal system includes:
a standard virtual sign language digital human skeletal system consisting of 119 bones, including facial bones, arm bones, and body bones, is constructed on the basis of a root bone.
Optionally, in one embodiment of the present application, the facial action information base includes eye action information, eyebrow action information, and mouth action information; the eye action information includes micro-action information for blinking, opening the eyes, closing the eyes, and squinting; the eyebrow action information includes micro-action information for raising the eyebrows; the mouth action information includes action information for the fully open, half open, closed, pouting, lip-pursing, and lip-curling states;
the body action information base includes arm action information, hand action information, and torso action information; the arm action information includes the actions of the standard general sign language as well as sign language actions from the sports field; the hand action information includes action information for all ten fingers; the torso action information includes side-to-side swaying and upright standing.
Optionally, in an embodiment of the present application, a personalization module is further included, specifically configured to: adjust the pre-rendering frame number and the fps value to meet the requirements of different scenes, wherein the pre-rendering frame number controls how many action frames each action yields after rendering, and the fps value controls the playback speed of the sign language video.
Additional aspects and advantages of the application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the application.
Drawings
The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings, in which:
fig. 1 is a schematic flow chart of a virtual sign language digital person driving method according to an embodiment of the present application;
FIG. 2 is an exemplary diagram of an action information base according to an embodiment of the present application;
FIG. 3 is a schematic flow diagram of a driving and rendering system according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of a virtual sign language digital man driving device according to an embodiment of the present application.
Detailed Description
Embodiments of the present application are described in detail below, examples of which are illustrated in the accompanying drawings, wherein the same or similar reference numerals refer to the same or similar elements or elements having the same or similar functions throughout. The embodiments described below by referring to the drawings are exemplary and intended for the purpose of explaining the present application and are not to be construed as limiting the present application.
The following describes a virtual sign language digital person driving method and device according to the embodiments of the present application with reference to the accompanying drawings.
Fig. 1 is a flow chart of a virtual sign language digital person driving method according to an embodiment of the present application.
As shown in fig. 1, the virtual sign language digital person driving method includes the steps of:
step 101, constructing a standard virtual sign language digital human skeleton system and constructing an action information base based on the skeleton system, wherein the action information base comprises a face action information base and a body action information base;
step 102, receiving an external instruction, converting the external instruction into a sign language word sequence, and extracting action information corresponding to the sign language word sequence from an action information base;
and 103, rendering a sign language action frame based on the action information and a preset pre-rendering frame number, and synthesizing the sign language action frame into a sign language video based on a preset fps value.
According to the virtual sign language digital person driving method of the present application, a universal virtual digital human skeletal system is first constructed. This multi-bone system is suitable for animating humans and other vertebrates, requires less memory than other animation schemes, and can be driven in a real-time rendering engine, so that animation effects can be previewed in real time and problems in them corrected promptly. In this way, the virtual sign language digital person can coordinate the movements of the face, both arms, all ten fingers, and the body simultaneously, vividly conveying semantics to the outside world.
Optionally, in one embodiment of the present application, constructing a standard virtual sign language digital human skeletal system includes:
a standard virtual sign language digital human skeletal system consisting of 119 bones, including facial bones, arm bones, and body bones, is constructed on the basis of a root bone.
Optionally, in one embodiment of the present application, the facial action information base includes eye action information, eyebrow action information, and mouth action information; the eye action information includes micro-action information for blinking, opening the eyes, closing the eyes, and squinting; the eyebrow action information includes micro-action information for raising the eyebrows; the mouth action information includes action information for the fully open, half open, closed, pouting, lip-pursing, and lip-curling states;
the body action information base includes arm action information, hand action information, and torso action information; the arm action information includes the actions of the standard general sign language as well as sign language actions from the sports field; the hand action information includes action information for all ten fingers; the torso action information includes side-to-side swaying and upright standing.
Optionally, in one embodiment of the present application, the method further includes: adjusting the pre-rendering frame number and the fps value to meet the requirements of different scenes, wherein the pre-rendering frame number controls how many action frames each action yields after rendering, and the fps value controls the playback speed of the sign language video.
The following describes, as a preferred embodiment, the virtual sign language digital person driving system of the embodiments of the present application. The system comprises two parts: the construction of the action information base, and the construction and application of the driving and rendering system. Specifically:
the embodiment of the application firstly provides a method for constructing an action information base of a virtual sign language digital person, which comprises the following steps:
step 1, constructing a unified virtual sign language digital human skeleton system.
The embodiment of the application designs a universal virtual digital human skeletal system: on the basis of a root bone, it constructs a virtual sign language digital human skeletal system consisting of 119 bones, covering facial bones, arm bones, hand bones, leg bones, and torso bones; the system comprises two parts, facial bones and body bones.
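For illustration only, the following Python sketch shows one way such a root-based bone hierarchy could be represented; the bone and group names are hypothetical stand-ins, and a real rig would enumerate all 119 bones.
```python
from dataclasses import dataclass, field

@dataclass
class Bone:
    """One node in the skeletal hierarchy; names and fields are illustrative."""
    name: str
    parent: "Bone | None" = None
    children: "list[Bone]" = field(default_factory=list)

    def attach(self, child: "Bone") -> "Bone":
        child.parent = self
        self.children.append(child)
        return child

# Hang the two major parts (facial bones, body bones) off a root bone,
# mirroring the split described above.
root = Bone("root")
face = root.attach(Bone("face"))
body = root.attach(Bone("body"))
for group in ("left_arm", "right_arm", "left_hand", "right_hand",
              "left_leg", "right_leg", "spine", "neck"):
    body.attach(Bone(group))

def count_bones(bone: Bone) -> int:
    """Recursively count every bone reachable from this one."""
    return 1 + sum(count_bones(c) for c in bone.children)

print(count_bones(root))  # a full rig per the patent would total 119 bones
```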
And 2, constructing an action information base based on a unified skeleton system.
The embodiment of the application provides a method for constructing an action information base for a virtual sign language digital person. The base is built mainly from the national general sign language standard vocabulary, supplemented with related sign language gestures from the sports field. It consists of a facial data information base and a body action information base.
The facial bone part covers the action states of the eyebrows, eyes, mouth, facial muscles, and other regions; the body bone part covers the arm bones, the ten finger bones, the trunk vertebral bones, and the neck bones. Such a multi-bone system is particularly suitable for animating humans and other vertebrates and requires less memory than other animation schemes; its design also allows it to be driven in a real-time rendering engine, so animation effects can be previewed in real time and problems in them corrected promptly. The virtual sign language digital person can thus coordinate the movements of the face, both arms, all ten fingers, and the body simultaneously, vividly conveying semantics to the outside world.
Likewise, with this complete skeletal system as a standard, the application can drive different avatar images without modifying the underlying data format. Because later character images are obtained by skinning this skeletal system, only the basic action data need be collected when designing the underlying data information base, and the same driving engine can drive different virtual digital character images. This makes the data and the driving system efficient and reusable, greatly reduces the complexity of large-scale business rollout, and lowers the cost of the whole system. At a later stage, only an external image model of a virtual digital person needs to be produced and skinned onto the skeletal system to be driven.
To explain the action information in step 2 more clearly, the data therein are described in detail below.
As shown in fig. 2, the motion information base 100 includes a face data information base 101 and a body motion information base 102, which cooperate to express sign language semantics.
The facial data information base 101 includes eye action information, eyebrow action information, and mouth action information. The eye action information includes micro-actions such as blinking, opening the eyes, closing the eyes, and squinting; the eyebrow action information includes the micro-action of raising the eyebrows; the mouth action information includes six kinds of action information: fully open, half open, closed, pouting, lip-pursing, and lip-curling. In sign language expression, facial expressions are matched with hand actions; authentic and clearly varied facial expressions vividly reinforce the hand actions, making the sign language simpler and easier to understand.
The body action information base 102 includes arm action information, hand action information, and torso action information. The arm action information covers a series of actions from the national standard general sign language, such as raising and lowering the arms, as well as sign language actions from the sports field; the hand action information covers all ten fingers, each of which can act independently to express its corresponding semantic information; the torso action information covers only side-to-side swaying and upright standing.
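As a concrete picture of what one entry in the action information base might hold, here is a minimal sketch; the field names are invented for illustration, since the patent does not prescribe a storage schema.
```python
from dataclasses import dataclass

@dataclass
class SignActionInfo:
    """One sign word's entry in the action information base (illustrative)."""
    word_id: str                    # id of the sign word, e.g. "travel"
    face_frames: list[list[float]]  # per-frame facial bone states (eyes, eyebrows, mouth)
    body_frames: list[list[float]]  # per-frame body bone states (arms, fingers, torso)
    prerender_ms: int               # pre-rendering time recorded at capture (see the formula below)
```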
The facial data information base and the body action information base in the embodiment of the application are sampled from interpreters holding sign language interpretation qualification certificates, using professional facial and body motion capture equipment. To make the transition between moments smoother, a motion smoothing algorithm is applied between consecutive sign language actions: it operates between any two large rigid deformations and fills the gap with interpolated data computed by mean interpolation, so that the transition between displayed sign language actions is seamless and the whole action sequence looks smooth and natural, with no visible pauses.
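The patent names the technique only as mean interpolation between two large rigid deformations; the sketch below shows one plausible reading, linearly averaging between the final pose of one action and the initial pose of the next. The flattened 119-bone pose vector is an assumption for the example.
```python
import numpy as np

def mean_interpolate(pose_a: np.ndarray, pose_b: np.ndarray, steps: int) -> np.ndarray:
    """Fill `steps` evenly spaced intermediate poses between two bone-state vectors.

    pose_a, pose_b: flattened (119 * 3,) arrays of bone coordinates.
    Returns an array of shape (steps, len(pose_a)).
    """
    t = np.linspace(0.0, 1.0, steps + 2)[1:-1]  # interior points only
    return pose_a[None, :] + t[:, None] * (pose_b - pose_a)[None, :]

end_of_prev = np.zeros(119 * 3)    # last pose of the previous sign action
start_of_next = np.ones(119 * 3)   # first pose of the next sign action
bridge = mean_interpolate(end_of_prev, start_of_next, steps=4)
print(bridge.shape)  # (4, 357): four smoothing frames bridging the two actions
```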
To drive the virtual digital person with the information in the above embodiments, the application constructs a complete, mature driving and rendering system; see fig. 3 below, which details the flow and processing of data in the system.
As can be seen from fig. 3, the entire driving and rendering system is divided into three parts: a drive engine, a local server module, and a rendering engine.
The driving engine can be divided into two modules: an instruction receiving module and a driving data analysis module. The instruction receiving module is mainly responsible for receiving and parsing external text instructions and converting them into a sign language word sequence carrying the sign language semantics. Specifically, an external user can directly input the sentence "I like travel because it is interesting" to the instruction receiving module; through parsing and transliteration, the module converts the sentence into the sign language word sequence "travel/like/this/interesting/there" with the same meaning, and then passes the sequence on to the driving data analysis module.
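To make this data flow concrete, here is a toy sketch of the conversion step; the lookup table is a hypothetical stand-in for the real parsing and transliteration logic, which the patent does not detail.
```python
# Hypothetical text-to-gloss mapping; a real module would translate
# arbitrary text using the full national general sign language lexicon.
GLOSS_RULES = {
    "I like travel because it is interesting":
        ["travel", "like", "this", "interesting", "there"],
}

def to_sign_word_sequence(text: str) -> list[str]:
    """Convert an external text instruction into a sign language word sequence."""
    try:
        return GLOSS_RULES[text]
    except KeyError:
        raise ValueError(f"no gloss mapping for: {text!r}")

print("/".join(to_sign_word_sequence("I like travel because it is interesting")))
# travel/like/this/interesting/there
```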
Given the parsed word sequence, the driving data analysis module looks up each of the five words in the action information base constructed above to obtain the corresponding facial action information and body action information; every word has entries in both the facial and the body action information bases. In this embodiment, all virtual digital persons use the same skeletal system, i.e. the same action information base can drive different digital person images. For each sign language word in the sequence, the analysis module queries the information base, converts the action information into driving data for the skeletal system according to the system's driving characteristics, and streams the driving data for successive words to the rendering engine, thereby driving the digital person to perform the specified actions. The pre-rendering time of each action can also be set individually in the driving data analysis module, so that increasing or decreasing it increases or decreases the number of frames the rendering engine produces for that action. In this example, the number of frames to render for each action is calculated in advance by the pre-rendering formula below, and the rendering engine samples that number of frames to render.
F = (T * fps) / 1000
Here T is the pre-rendering time of the action in milliseconds; every sign language action has a pre-rendering time, recorded and stored in the action information base when the action is captured with the motion capture equipment. fps is the preset frame rate value, i.e. the number of rendered action frames composited per second, and F is the resulting number of pre-rendered frames. With the pre-rendered frame count fixed, the more frames are composited per second, the shorter each action is displayed and the faster it appears to the human eye. In some situations the display time is critical: in live event coverage or live advertising, for example, sign language actions usually have to keep pace with a speaker, and the fps value must be raised so that the sign language actions express their meaning quickly.
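Read this way, the formula is easy to check numerically; the sketch below assumes T is the per-action pre-rendering time in milliseconds, as described above.
```python
def prerender_frame_count(prerender_ms: float, fps: float) -> int:
    """F = (T * fps) / 1000: number of frames the rendering engine
    samples for one action, given its pre-rendering time T (ms) and fps."""
    return round(prerender_ms * fps / 1000)

# An action captured with a 1500 ms pre-rendering time:
print(prerender_frame_count(1500, 30))  # 45 frames at 30 fps
print(prerender_frame_count(1500, 60))  # 90 frames at 60 fps (finer sampling)
```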
In contrast, in question-answering, science popularization, and similar settings where the speed of the sign language matters less, viewers prefer a smooth, fluid, finely detailed display. Here the pre-rendering frame number can be increased so that the rendering engine samples and renders an action more times; the larger number of rendered action frames captures the motion more finely, and the sign language display synthesized at the fixed frame rate is very smooth and fluent.
The rendering engine is mainly responsible for receiving the action information data transmitted from the local server and rendering the changing action states of the virtual digital person. The facial and body action data must correspond synchronously at each moment, so the rendering engine receives the two kinds of information transmitted synchronously by the driving system; facial expression and body action information complement each other, letting hearing-impaired people grasp the semantics expressed by the sign language digital person more easily and quickly. The rendering engine also stores the information related to the virtual digital human model, including facial feature information, hair information, external clothing information, and the like. It can render virtual digital persons at different resolutions, including but not limited to 720P, 630P, 2K, 4K, and 6K virtual character images, and supports more elaborate rendering settings, including but not limited to high-precision anti-aliasing and rendering virtual persons against a transparent background. In addition, the application provides figure options in different proportions: an upper-body figure and a whole-body figure can be offered as needed. The upper-body virtual digital human figure suits scenes such as films, live television broadcasts, advertisements, and event commentary; the whole-body figure suits large display venues such as exhibitions, cinemas, and museums. The external image of the virtual digital person can also be changed to meet specific requirements, yielding a high-precision, multi-scene, multi-purpose virtual sign language digital person.
The local server module is mainly responsible for storing the action information base and serving queries. It stores all sign language action information from the national general sign language standard together with standard sign language action information from the sports field, chiefly facial action information and body action information. The action information is a motion state sequence of the skeletal system of this application, recording in detail the coordinate changes of every bone across the sequence. Because the number of bones is relatively large (a single sign word mobilizes 119 bones simultaneously) and the general vocabulary is very large, the capacity needed to store the information base is huge: the facial and body action data for roughly 9000 words can reach about 20 GB, and adding new sign language words later demands still more disk capacity. The application therefore uses an LMDB database to store the action information. LMDB is a very fast database built on memory-mapped files, particularly well suited to rapid query and insertion operations over large volumes of data.
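As an illustration of this storage layer, the following sketch uses the Python lmdb binding to store and fetch one word's action record; the key scheme and JSON serialization are assumptions made for the example, not details given by the patent.
```python
import json
import lmdb  # the LMDB binding: pip install lmdb

# Open (or create) the store, reserving a generous memory map for growth.
env = lmdb.open("action_info_db", map_size=64 * 2**30)

record = {
    "face_frames": [[0.0] * 3],   # placeholder per-frame facial bone states
    "body_frames": [[0.0] * 3],   # placeholder per-frame body bone states
    "prerender_ms": 1500,
}
with env.begin(write=True) as txn:
    # Key each record by the sign word's id (assumed scheme).
    txn.put(b"word:travel", json.dumps(record).encode("utf-8"))

with env.begin() as txn:
    stored = json.loads(txn.get(b"word:travel"))
print(stored["prerender_ms"])  # 1500
```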
With the support of the rendering engine and the driving engine, the driving and rendering system constructed above can render high-quality virtual sign language digital human action frames in different styles. The number of frames the rendering engine samples for a sign language action is controlled by the preset pre-rendering frame number. Facial action data and body action data are fetched from the action information base by the id of a word and sent to the rendering engine, which renders at high precision according to the preset pre-rendering frame number to obtain the action's sequence of state frames; these frames are then composited at the preset fps value to obtain a complete video demonstration of the action, in which the virtual sign language digital person combines facial expression changes and body action changes to express the word's meaning. To suit the two kinds of scene, one sensitive to display time and one sensitive to display quality, the application offers two schemes. In the first, the pre-rendering frame number is kept moderate and the fps value is raised, so the synthesized sign language video keeps the semantic expression as complete as possible while shortening the display time of each sign word; this suits speed-critical scenes such as live video. In the second, the pre-rendering frame number is increased and the fps value kept moderate, so the semantic detail of the synthesized video is preserved to the greatest extent and the transitions between actions are fluent; this suits time-insensitive human-computer interaction such as question-answering and science popularization.
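To picture the final compositing step, here is a minimal sketch that writes rendered action frames out as a video at a preset fps, using OpenCV's VideoWriter; the dummy black frames stand in for the rendering engine's output, and the resolution and codec are arbitrary example choices.
```python
import cv2  # pip install opencv-python
import numpy as np

fps = 30
width, height = 1280, 720
writer = cv2.VideoWriter(
    "sign_language.mp4", cv2.VideoWriter_fourcc(*"mp4v"), fps, (width, height)
)
for _ in range(fps * 2):  # two seconds of placeholder frames
    frame = np.zeros((height, width, 3), dtype=np.uint8)  # a rendered frame would go here
    writer.write(frame)
writer.release()
```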
In order to achieve the above embodiment, the present application further proposes a virtual sign language digital person driving device.
Fig. 4 is a schematic structural diagram of a virtual sign language digital man driving device according to an embodiment of the present application.
As shown in fig. 4, the virtual sign language digital man driving device comprises a standard information construction module, an action information acquisition module and a video generation module, wherein,
the standard information construction module is used for constructing a standard virtual sign language digital human skeleton system and constructing an action information base based on the skeleton system, wherein the action information base comprises a face action information base and a body action information base;
the action information acquisition module is used for receiving an external instruction, converting the external instruction into a sign language word sequence, and extracting action information corresponding to the sign language word sequence from the action information base;
the video generation module is used for rendering sign language action frames based on the action information and a preset pre-rendering frame number and synthesizing the sign language action frames into sign language videos based on a preset fps value.
Optionally, in one embodiment of the present application, constructing a standard virtual sign language digital human skeletal system includes:
a standard virtual sign language digital human skeletal system consisting of 119 bones, including facial bones, arm bones, and body bones, is constructed on the basis of a root bone.
Optionally, in one embodiment of the present application, the facial action information base includes eye action information, eyebrow action information, and mouth action information; the eye action information includes micro-action information for blinking, opening the eyes, closing the eyes, and squinting; the eyebrow action information includes micro-action information for raising the eyebrows; the mouth action information includes action information for the fully open, half open, closed, pouting, lip-pursing, and lip-curling states;
the body action information base includes arm action information, hand action information, and torso action information; the arm action information includes the actions of the standard general sign language as well as sign language actions from the sports field; the hand action information includes action information for all ten fingers; the torso action information includes side-to-side swaying and upright standing.
Optionally, in an embodiment of the present application, a personalization module is further included, specifically configured to: adjust the pre-rendering frame number and the fps value to meet the requirements of different scenes, wherein the pre-rendering frame number controls how many action frames each action yields after rendering, and the fps value controls the playback speed of the sign language video.
It should be noted that the foregoing explanation of the embodiment of the virtual sign language digital person driving method is also applicable to the virtual sign language digital person driving device of the embodiment, and will not be repeated herein.
In the description of the present specification, a description referring to the terms "one embodiment," "some embodiments," "examples," "particular examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present application. In this specification, schematic representations of the above terms are not necessarily directed to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, the different embodiments or examples described in this specification and the features of the different embodiments or examples may be combined and combined by those skilled in the art without contradiction.
Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include at least one such feature. In the description of the present application, the meaning of "plurality" is at least two, such as two, three, etc., unless explicitly defined otherwise.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and additional implementations are included within the scope of the preferred embodiment of the present application in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the embodiments of the present application.
Logic and/or steps represented in the flowcharts or otherwise described herein, for example an ordered listing of executable instructions for implementing logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, a processor-containing system, or another system that can fetch the instructions from the instruction execution system, apparatus, or device and execute them. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CD-ROM). The computer-readable medium may even be paper or another suitable medium on which the program is printed, as the program can be captured electronically, for instance by optical scanning of the paper or other medium, then compiled, interpreted, or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It is to be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented by software or firmware stored in a memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, they may be implemented by any one of, or a combination of, the following techniques well known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application-specific integrated circuits having suitable combinational logic gates, programmable gate arrays (PGA), field-programmable gate arrays (FPGA), and the like.
Those of ordinary skill in the art will appreciate that all or a portion of the steps carried out in the method of the above-described embodiments may be implemented by a program to instruct related hardware, where the program may be stored in a computer readable storage medium, and where the program, when executed, includes one or a combination of the steps of the method embodiments.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing module, or each unit may exist alone physically, or two or more units may be integrated in one module. The integrated modules may be implemented in hardware or in software functional modules. The integrated modules may also be stored in a computer readable storage medium if implemented in the form of software functional modules and sold or used as a stand-alone product.
The above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, or the like. Although embodiments of the present application have been shown and described above, it will be understood that the above embodiments are illustrative and not to be construed as limiting the application, and that variations, modifications, alternatives, and variations may be made to the above embodiments by one of ordinary skill in the art within the scope of the application.

Claims (8)

1. A virtual sign language digital person driving method, comprising the steps of:
constructing a standard virtual sign language digital human skeleton system and constructing an action information base based on the skeleton system, wherein the action information base comprises a face action information base and a body action information base;
receiving an external instruction, converting the external instruction into a sign language word sequence, and extracting action information corresponding to the sign language word sequence from the action information base;
and rendering a sign language action frame based on the action information and a preset pre-rendering frame number, and synthesizing the sign language action frame into a sign language video based on a preset fps value.
2. The method of claim 1, wherein said constructing a standard virtual sign language digital human skeletal system comprises:
a standard virtual sign language digital human skeletal system consisting of 119 bones, including facial bones, arm bones, and body bones, is constructed on the basis of a root bone.
3. The method of claim 1, wherein the facial action information base comprises eye action information, eyebrow action information, and mouth action information, the eye action information comprising micro-action information for blinking, opening the eyes, closing the eyes, and squinting; the eyebrow action information comprising micro-action information for raising the eyebrows; the mouth action information comprising action information for the fully open, half open, closed, pouting, lip-pursing, and lip-curling states;
the body action information base comprises arm action information, hand action information, and torso action information, the arm action information comprising actions of the standard general sign language and sign language actions from the sports field; the hand action information comprising action information for all ten fingers; the torso action information comprising side-to-side swaying and upright standing.
4. The method as recited in claim 1, further comprising: adjusting the pre-rendering frame number and the fps value to meet the requirements of different scenes, wherein the pre-rendering frame number controls how many action frames each action yields after rendering, and the fps value controls the playback speed of the sign language video.
5. The virtual sign language digital man driving device is characterized by comprising a standard information construction module, an action information acquisition module and a video generation module, wherein,
the standard information construction module is used for constructing a standard virtual sign language digital human skeleton system and constructing an action information base based on the skeleton system, wherein the action information base comprises a face action information base and a body action information base;
the action information acquisition module is used for receiving an external instruction, converting the external instruction into a sign language word sequence, and extracting action information corresponding to the sign language word sequence from the action information base;
the video generation module is used for rendering sign language action frames based on the action information and a preset pre-rendering frame number, and synthesizing the sign language action frames into sign language videos based on a preset fps value.
6. The apparatus of claim 5, wherein said constructing a standard virtual sign language digital human skeletal system comprises:
a standard virtual sign language digital human skeletal system consisting of 119 bones, including facial bones, arm bones, and body bones, is constructed on the basis of a root bone.
7. The apparatus of claim 5, wherein the facial action information base comprises eye action information, eyebrow action information, and mouth action information, the eye action information comprising micro-action information for blinking, opening the eyes, closing the eyes, and squinting; the eyebrow action information comprising micro-action information for raising the eyebrows; the mouth action information comprising action information for the fully open, half open, closed, pouting, lip-pursing, and lip-curling states;
the body action information base comprises arm action information, hand action information, and torso action information, the arm action information comprising actions of the standard general sign language and sign language actions from the sports field; the hand action information comprising action information for all ten fingers; the torso action information comprising side-to-side swaying and upright standing.
8. The apparatus of claim 5, further comprising a personalization module configured to: adjust the pre-rendering frame number and the fps value to meet the requirements of different scenes, wherein the pre-rendering frame number controls how many action frames each action yields after rendering, and the fps value controls the playback speed of the sign language video.
CN202211712508.5A 2022-12-29 2022-12-29 Virtual sign language digital person driving method and device Pending CN116245986A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211712508.5A CN116245986A (en) 2022-12-29 2022-12-29 Virtual sign language digital person driving method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211712508.5A CN116245986A (en) 2022-12-29 2022-12-29 Virtual sign language digital person driving method and device

Publications (1)

Publication Number Publication Date
CN116245986A true CN116245986A (en) 2023-06-09

Family

ID=86632203

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211712508.5A Pending CN116245986A (en) 2022-12-29 2022-12-29 Virtual sign language digital person driving method and device

Country Status (1)

Country Link
CN (1) CN116245986A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116959119A (en) * 2023-09-12 2023-10-27 北京智谱华章科技有限公司 Sign language digital person driving method and system based on large language model

Similar Documents

Publication Publication Date Title
CN106502402B (en) A kind of Three-Dimensional Dynamic Scene Teaching method
McDonald et al. An automated technique for real-time production of lifelike animations of American Sign Language
Doenges et al. MPEG-4: Audio/video and synthetic graphics/audio for mixed media
AU2009330607B2 (en) System and methods for dynamically injecting expression information into an animated facial mesh
Miller et al. The virtual museum: Interactive 3D navigation of a multimedia database
JP2021192222A (en) Video image interactive method and apparatus, electronic device, computer readable storage medium, and computer program
US20120130717A1 (en) Real-time Animation for an Expressive Avatar
WO2021248473A1 (en) Personalized speech-to-video with three-dimensional (3d) skeleton regularization and expressive body poses
Naert et al. A survey on the animation of signing avatars: From sign representation to utterance synthesis
CN116245986A (en) Virtual sign language digital person driving method and device
Jenkinson The role of craft-based knowledge in the design of dynamic visualizations
CN117496072B (en) Three-dimensional digital person generation and interaction method and system
Martin et al. Levels of Representation in the Annotation of Emotion for the Specification of Expressivity in ECAs
Elliott et al. A framework for non-manual gestures in a synthetic signing system
CN109711335A (en) The method and device that Target Photo is driven by characteristics of human body
Purse Layered encounters: Mainstream cinema and the disaggregate digital composite
CN115379278A (en) XR technology-based immersive micro-class recording method and system
Papadogiorgaki et al. VSigns–a virtual sign synthesis web tool
Ebling et al. New Technologies in Second Language Signed Assessment
Papadogiorgaki et al. Text-to-sign language synthesis tool
CN117152843B (en) Digital person action control method and system
Maldonado et al. Previs: A person-specific realistic virtual speaker
Adamo-Villani et al. Signing Avatars
Zhang et al. A Low-Cost Virtual 2D Spokes-Character Advertising Framework
Izani et al. A study on practical approach of using motion capture and keyframe animation techniques

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination