CN115661304A - Word stock generation method based on frame interpolation, electronic device, storage medium and system - Google Patents

Word stock generation method based on frame interpolation, electronic device, storage medium and system

Info

Publication number
CN115661304A
CN115661304A
Authority
CN
China
Prior art keywords
neural network
convolutional neural
frame
character
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211244030.8A
Other languages
Chinese (zh)
Other versions
CN115661304B (en)
Inventor
Yue Qiang (岳强)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SHANGHAI YICHUANG INFORMATION TECHNOLOGY CO LTD
Beijing Hanyi Innovation Technology Co ltd
Original Assignee
SHANGHAI YICHUANG INFORMATION TECHNOLOGY CO LTD
Beijing Hanyi Innovation Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by SHANGHAI YICHUANG INFORMATION TECHNOLOGY CO LTD, Beijing Hanyi Innovation Technology Co ltd filed Critical SHANGHAI YICHUANG INFORMATION TECHNOLOGY CO LTD
Priority to CN202211244030.8A priority Critical patent/CN115661304B/en
Publication of CN115661304A publication Critical patent/CN115661304A/en
Application granted granted Critical
Publication of CN115661304B publication Critical patent/CN115661304B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Image Processing (AREA)

Abstract

The present disclosure relates to a frame-interpolation-based word stock generation method, electronic device, storage medium, and system. Video data is used as pre-training data, the problem of generating multi-weight fonts is modeled as a continuous frame interpolation problem, and the network parameters of the constructed convolutional neural network are fine-tuned with an existing multi-weight font data set, thereby generating a number of different weights between the boldest weight and the lightest weight. The method greatly shortens the time needed to produce the remaining weights; it not only improves the generation quality, but also yields fonts that are more attractive and better proportioned than those produced by the point-to-point method, with more distinct style characteristics and better readability.

Description

Word stock generation method based on frame interpolation, electronic device, storage medium and system
Technical Field
The present disclosure relates to the field of word stocks, and in particular, to a frame-interpolation-based word stock generation method, electronic device, storage medium, and system.
Background
Characters are everywhere in daily life and are one of the main tools for conveying information. To keep that information effective, characters are presented in different ways and forms on different occasions, in different contexts, and on different devices. A multi-weight word stock is a concept created to meet these different display requirements: a series of fonts that share the same style but differ in weight. The weight is the stroke thickness of a font, and the international ISO standard defines nine weight grades, W1 through W9, ranging from ultra-light to ultra-bold. A variety of weights increases a typeface's applicability: the same typeface may appear in headlines, body text, posters, and on the small displays of embedded devices, and if a single weight is used everywhere, the intended rendering effect is hard to achieve. Producing fonts in different weights is therefore essential to a typeface's applicability.
The most basic way to build a multi-weight word stock is for a designer to first design one set of fonts and then adjust the weight by editing the control points of every character, so that the weight changes without losing the style characteristics of the font and the aesthetics of the typeface. The workload of this approach, however, depends on the size of the word stock's character set and on how many weights have to be produced; making a nine-weight family usually takes at least several months.
To reduce the designers' workload and speed up the production of a multi-weight word stock, a common approach is the point-to-point method: the designers produce only the boldest and the lightest fonts, engineers then match the corresponding control points of the same character in the two font sets, and the control points are moved along the positions between each pair of corresponding points to obtain different weights. Although the point-to-point method greatly reduces the designers' workload, from nine different weights down to only two, it has serious limitations. Because each Chinese character has its own composition structure and its own spacing between components, mechanically moving control points by a fixed proportion makes the overall glyph structure look uncoordinated and changes the style of the font; the method is not universal, and parameters have to be tuned for each typeface. The point-to-point method is therefore only suitable for a few fonts with simple styles, its range of application is small, and the word stock it generates is not attractive enough.
Disclosure of Invention
The present disclosure provides a word stock generation method, an electronic device, a storage medium, and a system based on frame interpolation, so as to solve at least one of the technical problems in the background art described above.
In a preferred embodiment of the present disclosure, an embodiment of the present application provides a word stock generation method based on frame interpolation, where the method includes:
acquiring video data and storing the video data frame by frame into pictures to obtain a video frame data set consisting of the pictures;
pre-training a convolutional neural network by using the video frame data set to obtain a pre-trained convolutional neural network;
fine-tuning the network parameters of the pre-trained convolutional neural network by using a multi-weight font data set to obtain a fine-tuned convolutional neural network;
and inputting the heavy-weight font and the light-weight font of the same style into the fine-tuned convolutional neural network to obtain a multi-weight word stock of the same style.
Further, the pre-training the convolutional neural network by using the video frame data set to obtain the pre-trained convolutional neural network comprises the following steps:
splicing the pictures of the ith frame and the (i + 2) th frame of the stored video data in the channel dimension to be used as the input of a convolutional neural network;
taking the predicted video frame picture of the (i + 1) th frame as the output of the convolutional neural network;
updating network parameters of the convolutional neural network according to the loss function;
and stopping the pre-training when the pre-trained convolutional neural network is converged to obtain the pre-trained convolutional neural network.
Further, the fine-tuning of the network parameters of the pre-trained convolutional neural network using the multi-weight font data set to obtain the fine-tuned convolutional neural network includes the following steps:
acquiring a multi-weight font data set;
taking two fonts from the multi-weight font data set that are separated by one weight grade as the input of the pre-trained convolutional neural network, to generate the font of the intermediate weight;
updating the network parameters of the pre-trained convolutional neural network according to the loss function;
and stopping the fine-tuning when the fine-tuned convolutional neural network has converged, to obtain the fine-tuned convolutional neural network.
Further, the loss function uses an average absolute error loss function and a perceptual loss function simultaneously, which are calculated as follows:

L1_Loss = ||G(F_i, F_{i+2}) − F_{i+1}||_1

LPIPS_Loss = Σ_L (1/(h·w)) Σ_{h,w} ||weight_L ⊙ (φ_L(G(F_i, F_{i+2})) − φ_L(F_{i+1}))||_2^2

where L1_Loss denotes the average absolute error loss function, G denotes the constructed convolutional neural network, F_i, F_{i+1} and F_{i+2} denote the i-th, (i+1)-th and (i+2)-th video frames of the video data respectively, LPIPS_Loss denotes the perceptual loss function, weight_L denotes the coefficients applied to the layer-L network features, ⊙ denotes element-wise multiplication with the layer-L features, and φ_L(G(F_i, F_{i+2})) and φ_L(F_{i+1}) denote the feature maps, of size h × w, obtained by passing the output of the convolutional neural network G and the target video frame of the (i+1)-th frame through the L-th layer of the deep convolutional neural network VGG.
Further, the condition for convergence of the convolutional neural network is: the values of the average absolute error loss function and the perceptual loss function are calculated separately and summed, and the convolutional neural network is considered converged when the sum of the two values no longer decreases.
Further, before the convolutional neural network is pre-trained using the video frame data set, a convolutional neural network is constructed, the convolutional neural network comprising three modules:
a feature compression coding module, configured to encode an input video frame into a 4-dimensional tensor and then compress it to obtain a compressed 4-dimensional tensor; the feature compression coding module consists of a convolutional network layer, a feature normalization layer, a feature activation layer and a down-sampling network layer, wherein the convolutional network layer applies a linear transformation to the video frame to obtain convolutional features; the feature normalization layer normalizes the values of the convolutional features so that they stay within the range [-1, 1]; the feature activation layer applies a nonlinear transformation to the normalized features to obtain nonlinear features; and the down-sampling network layer down-samples the nonlinear features along the 2nd and 3rd dimensions to obtain nonlinear features of smaller size;
a feature transfer module, configured to apply multiple nonlinear transformations to the features extracted by the feature compression coding module so that the transformed features have stronger representational power; the feature transfer module is implemented by adding a self-attention mechanism to a residual layer, wherein the residual layer accelerates network convergence and stabilizes the network output, and the self-attention mechanism extracts local features, which facilitates updating the network parameters;
a feature decoding module, configured to decode the nonlinearly transformed features back into image space and transform them into an output image; the feature decoding module consists of a two-dimensional transposed-convolution network layer, a feature instance normalization layer, a feature activation layer and an up-sampling network layer, wherein the transposed-convolution network layer decodes the features, the feature instance normalization layer normalizes the feature values to keep training stable, the feature activation layer applies a nonlinear transformation to the normalized features to obtain nonlinear features, and the up-sampling network layer enlarges the feature size and restores the spatial information.
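As an illustration of how the three modules fit together, the following is a minimal PyTorch-style sketch of the generator's forward data flow; the layer counts, channel sizes and module internals are placeholder assumptions, not the exact network of this disclosure.

```python
import torch
import torch.nn as nn

class FrameInterpolationGenerator(nn.Module):
    """Skeleton of the three-module generator: feature compression coding ->
    feature transfer -> feature decoding. All sizes are illustrative assumptions."""
    def __init__(self):
        super().__init__()
        # feature compression coding module: encode + compress (downsample)
        self.encoder = nn.Sequential(
            nn.Conv2d(2, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(), nn.AvgPool2d(2))
        # feature transfer module: nonlinear transformation of the compressed features
        self.transfer = nn.Sequential(
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU())
        # feature decoding module: decode back to image space (upsample)
        self.decoder = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(64, 1, 3, padding=1))

    def forward(self, frame_i, frame_i_plus_2):
        # splice the two frames on the channel dimension (single-channel frames assumed)
        x = torch.cat([frame_i, frame_i_plus_2], dim=1)   # (B, 2, H, W): a 4-dimensional tensor
        return self.decoder(self.transfer(self.encoder(x)))
```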
In a preferred embodiment of the present disclosure, an embodiment of the present application further provides a system for generating a word stock based on frame interpolation, including:
the video frame data set generation module is used for acquiring video data and storing the video data into pictures frame by frame to obtain a video frame data set consisting of the pictures;
the pre-training module is used for pre-training the convolutional neural network by using the video frame data set to obtain the pre-trained convolutional neural network;
the fine-tuning module is used for fine-tuning the network parameters of the pre-trained convolutional neural network by using the multi-weight font data set to obtain the fine-tuned convolutional neural network;
and the multi-weight word stock generation module is used for inputting the heavy-weight font and the light-weight font of the same style into the fine-tuned convolutional neural network to obtain a multi-weight word stock of the same style.
Further, the frame-interpolation-based word stock generation system further comprises a convolutional neural network construction module, wherein the convolutional neural network is composed of three modules:
the feature compression coding module is used for encoding an input video frame into a 4-dimensional tensor and then compressing it to obtain a compressed 4-dimensional tensor; the feature compression coding module consists of a convolutional network layer, a feature normalization layer, a feature activation layer and a down-sampling network layer, wherein the convolutional network layer applies a linear transformation to the video frame to obtain convolutional features; the feature normalization layer normalizes the values of the convolutional features so that they stay within the range [-1, 1]; the feature activation layer applies a nonlinear transformation to the normalized features to obtain nonlinear features; and the down-sampling network layer down-samples the nonlinear features along the 2nd and 3rd dimensions to obtain nonlinear features of smaller size;
the feature transfer module is used for applying multiple nonlinear transformations to the features extracted by the feature compression coding module so that the transformed features have stronger representational power; the feature transfer module is implemented by adding a self-attention mechanism to a residual layer, wherein the residual layer accelerates network convergence and stabilizes the network output, and the self-attention mechanism extracts local features, which facilitates updating the network parameters;
the feature decoding module is used for decoding the nonlinearly transformed features back into image space and transforming them into an output image; the feature decoding module consists of a two-dimensional transposed-convolution network layer, a feature instance normalization layer, a feature activation layer and an up-sampling network layer, wherein the transposed-convolution network layer decodes the features, the feature instance normalization layer normalizes the feature values to keep training stable, the feature activation layer applies a nonlinear transformation to the normalized features to obtain nonlinear features, and the up-sampling network layer enlarges the feature size and restores the spatial information.
In a preferred embodiment of the present disclosure, an electronic device is further provided, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and when the processor executes the computer program, the processor implements the above-mentioned word stock generation method based on frame interpolation.
In a preferred embodiment of the present disclosure, a computer-readable storage medium is further provided, on which a computer program is stored, and the program, when executed by a processor, implements the steps of the above-mentioned word stock generation method based on frame interpolation.
The beneficial effects of this disclosure are as follows: the present disclosure uses video data as pre-training data, models the problem of multi-weight font generation as a continuous frame interpolation problem, and uses existing multi-weight font data sets to fine-tune the network parameters of the constructed convolutional neural network, thereby generating a variety of different weights between the boldest and lightest weights. The method greatly shortens the time needed to produce the remaining weights; it not only improves the generation quality, but also yields fonts that are more attractive and better proportioned than those produced by the point-to-point method, with more distinct style characteristics and better readability.
Drawings
FIG. 1 is a flow chart of a method for generating a word stock based on frame interpolation;
FIG. 2 is a block diagram of a convolutional neural network;
FIG. 3 is a block diagram of a word stock generation system based on frame interpolation;
FIG. 4 is a diagram of the generated multi-weight fonts for weights between 45 and 65.
Detailed Description
The technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are only a part of the embodiments of the present disclosure, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments disclosed herein without making any creative effort, shall fall within the protection scope of the present disclosure.
Example 1
Referring to FIG. 1, the word stock generation method based on frame interpolation provided in this exemplary embodiment of the present disclosure addresses the technical problems mentioned in the background art: video data is used as pre-training data, the generation of multi-weight fonts is modeled as a continuous frame interpolation problem, and an existing multi-weight font data set is used to fine-tune the network parameters of the constructed convolutional neural network, so as to adapt the network to font-domain data and improve the generation quality.
The implementation process of the exemplary frame interpolation-based word stock generation method comprises the following steps:
Video data is collected and stored frame by frame in picture format, with each picture named by its frame number; F_i, F_{i+1} and F_{i+2} denote the i-th, (i+1)-th and (i+2)-th video frames of the video data, respectively.
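For illustration, the frame-by-frame extraction could be done as in the following minimal sketch; OpenCV and the zero-padded file naming are assumptions, since the disclosure does not prescribe a specific tool.

```python
import os
import cv2  # OpenCV is assumed here; the disclosure does not name a specific tool

def video_to_frames(video_path: str, out_dir: str) -> int:
    """Store a video frame by frame as pictures named with the frame number."""
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    index = 0
    while True:
        ok, frame = cap.read()          # read the next frame; ok is False at the end of the video
        if not ok:
            break
        cv2.imwrite(os.path.join(out_dir, f"{index:06d}.png"), frame)
        index += 1
    cap.release()
    return index                        # number of frames written
```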
A convolutional neural network is constructed, consisting mainly of three modules: a feature compression coding module, a feature transfer module, and a feature decoding module. The module structure of the convolutional neural network is shown in FIG. 2, in which:
the feature compression coding module is used to encode an input video frame into a 4-dimensional tensor and then compress it to obtain a compressed 4-dimensional tensor. The feature compression coding module uses the network structure {Conv-BN-ReLU}×N-Downsample, where N denotes the number of times the {Conv-BN-ReLU} block is repeated before the Downsample layer; in this scheme N = [4, 8], and two feature compression coding sub-modules Enc form the feature compression coding module. Conv is a convolutional network layer that applies a linear transformation to the video frame to obtain convolutional features; BN is a feature normalization layer that normalizes the values of the convolutional features so that they stay within [-1, 1]; ReLU is a feature activation layer that applies a nonlinear transformation to the normalized features to obtain nonlinear features; Downsample is a down-sampling network layer that down-samples the nonlinear features along the 2nd and 3rd dimensions to obtain nonlinear features of smaller size, for example turning a nonlinear feature of shape (1, 1, 4, 4) into one of shape (1, 1, 2, 2) after the down-sampling layer;
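A minimal PyTorch-style sketch of one Enc sub-module follows; the channel counts and kernel sizes are assumptions, and only the {Conv-BN-ReLU}×N-Downsample pattern is taken from the description above.

```python
import torch.nn as nn

def enc_block(c_in: int, c_out: int, n: int) -> nn.Sequential:
    """One feature compression coding sub-module Enc: {Conv-BN-ReLU} x N + Downsample."""
    layers = []
    for i in range(n):
        layers += [
            nn.Conv2d(c_in if i == 0 else c_out, c_out, kernel_size=3, padding=1),  # Conv: linear transform
            nn.BatchNorm2d(c_out),                                                  # BN: feature normalization
            nn.ReLU(inplace=True),                                                  # ReLU: nonlinear activation
        ]
    layers.append(nn.AvgPool2d(kernel_size=2))  # Downsample: e.g. (1, C, 4, 4) -> (1, C, 2, 2)
    return nn.Sequential(*layers)

# Two Enc sub-modules with N = [4, 8]; input channels = 2 for two spliced single-channel frames
encoder = nn.Sequential(enc_block(2, 64, 4), enc_block(64, 128, 8))
```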
the feature transfer module is used to apply multiple nonlinear transformations to the features extracted by the feature compression coding module, so that the transformed features have stronger representational power. The feature transfer module uses residual connections: the main branch uses a self-attention structure and the residual branch is linked by a 1×1 convolution module; in this scheme, four feature transfer sub-modules res form the feature transfer module. The residual layer accelerates network convergence and stabilizes the network output; the self-attention mechanism extracts local features, which facilitates updating the network parameters;
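The sketch below shows one possible res sub-module with a self-attention main branch and a 1×1-convolution residual link; the head count and the token layout over spatial positions are assumptions.

```python
import torch
import torch.nn as nn

class ResSelfAttention(nn.Module):
    """One feature transfer sub-module res: self-attention main branch + 1x1 conv residual link."""
    def __init__(self, channels: int = 128, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.shortcut = nn.Conv2d(channels, channels, kernel_size=1)  # residual 1x1 convolution link

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)            # (b, h*w, c): one token per spatial position
        attended, _ = self.attn(tokens, tokens, tokens)  # self-attention over spatial positions
        attended = attended.transpose(1, 2).reshape(b, c, h, w)
        return attended + self.shortcut(x)               # residual sum stabilizes the output

# Four res sub-modules form the feature transfer module
feature_transfer = nn.Sequential(*[ResSelfAttention(128) for _ in range(4)])
```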
the feature decoding module is used to decode the nonlinearly transformed features back into image space and transform them into an output image. The feature decoding module is composed of {Conv2dTranspose-IN-ReLU}×M-Upsample, where M denotes the number of times the {Conv2dTranspose-IN-ReLU} block is repeated before the Upsample layer; in this scheme M = [8, 4], and two feature decoding sub-modules Dec form the feature decoding module. Conv2dTranspose is a two-dimensional transposed-convolution network layer used to decode the features; IN is a feature instance normalization layer that normalizes the feature values and keeps training stable; ReLU is a feature activation layer that applies a nonlinear transformation to the normalized features to obtain nonlinear features; Upsample is an up-sampling network layer that enlarges the feature size and restores spatial information, for example turning a feature of shape (1, 1, 2, 2) into one of shape (1, 1, 4, 4) after the up-sampling layer.
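By analogy with the Enc sketch, one Dec sub-module could look like the following; the channel counts, kernel sizes, up-sampling mode, and the final single-channel projection are assumptions.

```python
import torch.nn as nn

def dec_block(c_in: int, c_out: int, m: int) -> nn.Sequential:
    """One feature decoding sub-module Dec: {Conv2dTranspose-IN-ReLU} x M + Upsample."""
    layers = []
    for i in range(m):
        layers += [
            nn.ConvTranspose2d(c_in if i == 0 else c_out, c_out, kernel_size=3, padding=1),  # decode features
            nn.InstanceNorm2d(c_out),                                                        # IN: instance normalization
            nn.ReLU(inplace=True),
        ]
    layers.append(nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False))  # e.g. (1, C, 2, 2) -> (1, C, 4, 4)
    return nn.Sequential(*layers)

# Two Dec sub-modules with M = [8, 4], followed by a single-channel image prediction
decoder = nn.Sequential(dec_block(128, 64, 8), dec_block(64, 32, 4), nn.Conv2d(32, 1, 3, padding=1))
```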
The i-th frame and the (i+2)-th frame are spliced together along the channel dimension and used as the input of the neural network.
The function of the convolutional neural network is to predict the content of the (i+1)-th frame; the loss is calculated from the network output f_int and the (i+1)-th frame and is used to update the network parameters of the convolutional neural network.
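A single pre-training step could look like the following sketch; the stand-in network, the Adam optimiser and the learning rate are assumptions, and perceptual_loss stands for the LPIPS-style loss described next.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal stand-in for the generator G; the real encoder/transfer/decoder are sketched above.
g = nn.Sequential(nn.Conv2d(2, 64, 3, padding=1), nn.ReLU(), nn.Conv2d(64, 1, 3, padding=1))
optimizer = torch.optim.Adam(g.parameters(), lr=1e-4)  # optimiser and learning rate are assumptions

def pretrain_step(frame_i, frame_i_plus_1, frame_i_plus_2, perceptual_loss):
    """Splice frames i and i+2 on the channel dimension, predict frame i+1, update G."""
    x = torch.cat([frame_i, frame_i_plus_2], dim=1)       # (B, 2, H, W) for single-channel frames
    f_int = g(x)                                          # predicted (i+1)-th frame
    loss = F.l1_loss(f_int, frame_i_plus_1) + perceptual_loss(f_int, frame_i_plus_1)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```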
The values of the average absolute error loss function L1_Loss and the perceptual loss function LPIPS_Loss are calculated, and pre-training stops when the sum of the two values no longer decreases. L1_Loss and LPIPS_Loss are calculated as follows:

L1_Loss = ||G(F_i, F_{i+2}) − F_{i+1}||_1

LPIPS_Loss = Σ_L (1/(h·w)) Σ_{h,w} ||weight_L ⊙ (φ_L(G(F_i, F_{i+2})) − φ_L(F_{i+1}))||_2^2

where L1_Loss denotes the average absolute error loss function, G denotes the constructed convolutional neural network, F_i, F_{i+1} and F_{i+2} denote the i-th, (i+1)-th and (i+2)-th video frames of the video data respectively, LPIPS_Loss denotes the perceptual loss function, weight_L denotes the coefficients applied to the layer-L network features, ⊙ denotes element-wise multiplication with the layer-L features, and φ_L(G(F_i, F_{i+2})) and φ_L(F_{i+1}) denote the feature maps, of size h × w, obtained by passing the output of the convolutional neural network G and the target video frame of the (i+1)-th frame through the L-th layer of the deep convolutional neural network VGG. VGG is a classic convolutional neural network developed by the Department of Engineering Science at the University of Oxford and is used here as an already-trained network.
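The two losses could be implemented as in the sketch below. The L1 term follows the formula directly; the perceptual term follows the standard LPIPS formulation using a pretrained VGG16 from torchvision, and the chosen layers, per-layer coefficients, and grayscale-to-RGB handling are assumptions.

```python
import torch
from torchvision.models import vgg16

_vgg = vgg16(weights="DEFAULT").features.eval()   # pretrained VGG used as a fixed feature extractor
for p in _vgg.parameters():
    p.requires_grad_(False)
_LAYERS = (3, 8, 15, 22)                          # assumed ReLU layers used as phi_L

def _vgg_features(x: torch.Tensor):
    if x.shape[1] == 1:                           # grayscale frames: replicate to 3 channels for VGG
        x = x.repeat(1, 3, 1, 1)
    feats, h = [], x
    for i, layer in enumerate(_vgg):
        h = layer(h)
        if i in _LAYERS:
            feats.append(h)
    return feats

def l1_loss(f_int: torch.Tensor, f_target: torch.Tensor) -> torch.Tensor:
    # L1_Loss = || G(F_i, F_{i+2}) - F_{i+1} ||_1
    return (f_int - f_target).abs().mean()

def lpips_loss(f_int, f_target, layer_weights=None) -> torch.Tensor:
    # LPIPS_Loss = sum_L 1/(h*w) sum_{h,w} || weight_L (.) (phi_L(f_int) - phi_L(f_target)) ||_2^2
    loss = 0.0
    for k, (a, b) in enumerate(zip(_vgg_features(f_int), _vgg_features(f_target))):
        w = 1.0 if layer_weights is None else layer_weights[k]
        loss = loss + ((w * (a - b)) ** 2).sum(dim=1).mean()   # sum over channels, average over h*w
    return loss
```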
After the convolutional neural network has been pre-trained on video data, its network parameters are fine-tuned with a multi-weight font data set so that the data distributions become consistent. The multi-weight font data set mainly consists of multi-weight typefaces designed by type designers, such as the QiHei (旗黑) family, with 15 weights in total.
Two fonts separated by one weight grade are used as input to generate the font of the intermediate weight. The two loss functions used are the same as in the pre-training stage.
The values of the average absolute error loss L1_Loss and the perceptual loss LPIPS_Loss are calculated; when their sum no longer decreases, the pre-trained convolutional neural network has converged, and the fine-tuning stops, yielding the fine-tuned convolutional neural network.
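For example, fine-tuning batches could be built from weight triplets as in the sketch below; render_glyph is a hypothetical helper (not part of this disclosure) that rasterizes one character at a given weight into a single-channel tensor, and the pre-training step sketched above is reused unchanged.

```python
def weight_triplets(weights, chars, render_glyph):
    """Yield (lighter, middle, heavier) glyph images whose weights are one grade apart."""
    for k in range(len(weights) - 2):
        for ch in chars:
            yield (render_glyph(weights[k], ch),      # plays the role of F_i
                   render_glyph(weights[k + 1], ch),  # plays the role of F_{i+1} (the target)
                   render_glyph(weights[k + 2], ch))  # plays the role of F_{i+2}

# Fine-tuning reuses the same step as pre-training, e.g.:
# for light, mid, heavy in weight_triplets(weights, chars, render_glyph):
#     pretrain_step(light, mid, heavy, perceptual_loss=lpips_loss)
```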
The heavy-weight font and the light-weight font of the same style are input into the fine-tuned convolutional neural network to obtain a multi-weight word stock of the same style. The generated multi-weight fonts are shown in FIG. 4, with the weights from left to right being 45, 47, 49, and so on up to 65.
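At inference time, the intermediate weights could be produced as in the following sketch; generating additional weights by repeatedly interpolating between already-generated neighbours is an assumption, as the disclosure only states that the two extreme weights of one style are fed to the fine-tuned network.

```python
import torch

@torch.no_grad()
def generate_weights(g, light_glyph, heavy_glyph, rounds: int = 3):
    """Generate in-between weights by repeatedly interpolating adjacent weights with G."""
    sequence = [light_glyph, heavy_glyph]        # ordered from lightest to heaviest
    for _ in range(rounds):
        refined = [sequence[0]]
        for a, b in zip(sequence[:-1], sequence[1:]):
            mid = g(torch.cat([a, b], dim=1))    # predicted intermediate weight between a and b
            refined += [mid, b]
        sequence = refined
    return sequence                              # 2**rounds + 1 weights in total
```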
Example 2
As shown in fig. 3, an exemplary word stock generation system based on frame interpolation includes:
a video frame data set generation module for collecting video data, storing the video data frame by frame as picture, and using frame number F i 、F i+1 And F i+2 Name wherein F i 、F i+1 And F i+2 Video frames representing the ith frame, the (i + 1) th frame, and the (i + 2) th frame of video data, respectively.
And a pre-training module, used for splicing the i-th frame and the (i+2)-th frame together along the channel dimension as the input of the neural network. The function of the convolutional neural network is to predict the content of the (i+1)-th frame, and the loss is calculated from the network output f_int and the (i+1)-th frame to update the network parameters of the convolutional neural network, specifically: the values of the average absolute error loss function L1_Loss and the perceptual loss function LPIPS_Loss are calculated; when the sum of the two values no longer decreases, the convolutional neural network has converged and the pre-training stops, yielding the pre-trained convolutional neural network. L1_Loss and LPIPS_Loss are calculated as follows:

L1_Loss = ||G(F_i, F_{i+2}) − F_{i+1}||_1

LPIPS_Loss = Σ_L (1/(h·w)) Σ_{h,w} ||weight_L ⊙ (φ_L(G(F_i, F_{i+2})) − φ_L(F_{i+1}))||_2^2

where L1_Loss denotes the average absolute error loss function, G denotes the constructed convolutional neural network, F_i, F_{i+1} and F_{i+2} denote the i-th, (i+1)-th and (i+2)-th video frames of the video data respectively, LPIPS_Loss denotes the perceptual loss function, weight_L denotes the coefficients applied to the layer-L network features, ⊙ denotes element-wise multiplication with the layer-L features, and φ_L(G(F_i, F_{i+2})) and φ_L(F_{i+1}) denote the feature maps, of size h × w, obtained by passing the output of the convolutional neural network G and the target video frame of the (i+1)-th frame through the L-th layer of the deep convolutional neural network VGG. VGG is a classic convolutional neural network developed by the Department of Engineering Science at the University of Oxford and is used here as an already-trained network.
The fine-tuning module is used for fine-tuning the network parameters of the pre-trained convolutional neural network with the multi-weight font data set so that the data distributions become consistent, specifically: two fonts from the multi-weight font data set separated by one weight grade are taken as input to generate the font of the intermediate weight; the values of the average absolute error loss L1_Loss and the perceptual loss LPIPS_Loss are calculated, and when their sum no longer decreases the pre-trained convolutional neural network has converged and the fine-tuning stops, yielding the fine-tuned convolutional neural network. The multi-weight font data set mainly consists of multi-weight typefaces designed by type designers, such as the QiHei (旗黑) family, with 15 weights in total. The two loss functions used by the fine-tuning module are the same as in the pre-training module described above.
And the multi-weight word stock generation module is used for inputting the heavy-weight font and the light-weight font of the same style into the fine-tuned convolutional neural network to obtain a multi-weight word stock of the same style.
Further, the frame-interpolation-based word stock generation system further comprises a convolutional neural network, which mainly consists of three modules: a feature compression coding module, a feature transfer module, and a feature decoding module. The module structure of the convolutional neural network is shown in FIG. 2, in which:
the feature compression coding module is used to encode an input video frame into a 4-dimensional tensor and then compress it to obtain a compressed 4-dimensional tensor. The feature compression coding module uses the network structure {Conv-BN-ReLU}×N-Downsample, where N denotes the number of times the {Conv-BN-ReLU} block is repeated before the Downsample layer; in this scheme N = [4, 8], and two feature compression coding sub-modules Enc form the feature compression coding module. Conv is a convolutional network layer that applies a linear transformation to the video frame to obtain convolutional features; BN is a feature normalization layer that normalizes the values of the convolutional features so that they stay within [-1, 1]; ReLU is a feature activation layer that applies a nonlinear transformation to the normalized features to obtain nonlinear features; Downsample is a down-sampling network layer that down-samples the nonlinear features along the 2nd and 3rd dimensions to obtain nonlinear features of smaller size, for example turning a nonlinear feature of shape (1, 1, 4, 4) into one of shape (1, 1, 2, 2) after the down-sampling layer;
the feature transfer module is used to apply multiple nonlinear transformations to the features extracted by the feature compression coding module, so that the transformed features have stronger representational power. The feature transfer module uses residual connections: the main branch uses a self-attention structure and the residual branch is linked by a 1×1 convolution module; in this scheme, four feature transfer sub-modules res form the feature transfer module. The residual layer accelerates network convergence and stabilizes the network output; the self-attention mechanism extracts local features, which facilitates updating the network parameters;
the feature decoding module is used to decode the nonlinearly transformed features back into image space and transform them into an output image. The feature decoding module is composed of {Conv2dTranspose-IN-ReLU}×M-Upsample, where M denotes the number of times the {Conv2dTranspose-IN-ReLU} block is repeated before the Upsample layer; in this scheme M = [8, 4], and two feature decoding sub-modules Dec form the feature decoding module. Conv2dTranspose is a two-dimensional transposed-convolution network layer used to decode the features; IN is a feature instance normalization layer that normalizes the feature values and keeps training stable; ReLU is a feature activation layer that applies a nonlinear transformation to the normalized features to obtain nonlinear features; Upsample is an up-sampling network layer that enlarges the feature size and restores spatial information, for example turning a feature of shape (1, 1, 2, 2) into one of shape (1, 1, 4, 4) after the up-sampling layer.
Example 3
An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the frame interpolation based word stock generation method of embodiment 1 when executing the computer program.
Embodiment 3 of the present disclosure is merely an example, and should not bring any limitation to the function and the scope of use of the embodiments of the present disclosure.
The electronic device may be embodied in the form of a general purpose computing device, which may be, for example, a server device. Components of the electronic device may include, but are not limited to: at least one processor, at least one memory, and a bus connecting different system components (including the memory and the processor).
The buses include a data bus, an address bus, and a control bus.
The memory may include volatile memory, such as Random Access Memory (RAM) and/or cache memory, and may further include read-only memory (ROM).
The memory may also include program means having a set of (at least one) program modules including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
The processor executes various functional applications and data processing by executing computer programs stored in the memory.
The electronic device may also communicate with one or more external devices (e.g., keyboard, pointing device, etc.). Such communication may be through an input/output (I/O) interface. Also, the electronic device may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the internet) through a network adapter. The network adapter communicates with other modules of the electronic device over the bus. It should be appreciated that although not shown in the figures, other hardware and/or software modules may be used in conjunction with the electronic device, including but not limited to: microcode, device drivers, redundant processors, external disk drive arrays, RAID (disk array) systems, tape drives, and data backup storage systems, to name a few.
It should be noted that although several units/modules or sub-units/modules of the electronic device are mentioned in the above detailed description, this division is merely exemplary and not mandatory. Indeed, according to embodiments of the application, the features and functionality of two or more of the units/modules described above may be embodied in one unit/module. Conversely, the features and functionality of one unit/module described above may be further divided into and embodied by a plurality of units/modules.
Example 4
A computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the frame interpolation-based word stock generation method of embodiment 1.
More specific examples, among others, that the readable storage medium may employ may include, but are not limited to: a portable disk, a hard disk, random access memory, read only memory, erasable programmable read only memory, optical storage device, magnetic storage device, or any suitable combination of the foregoing.
In a possible implementation, the present disclosure may also be implemented in the form of a program product comprising program code for causing a terminal device to perform the steps of implementing the frame interpolation based word stock generation method described in embodiment 1, when the program product is run on the terminal device.
Where program code for carrying out the disclosure is written in any combination of one or more programming languages, the program code may execute entirely on the user device, partly on the user device, as a stand-alone software package, partly on the user device and partly on a remote device or entirely on the remote device.
Although embodiments of the present disclosure have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations may be made in these embodiments without departing from the principles and spirit of the disclosure, the scope of which is defined in the appended claims and their equivalents.

Claims (10)

1. A method for generating a word stock based on frame interpolation is characterized by comprising the following steps:
acquiring video data and storing the video data frame by frame into pictures to obtain a video frame data set consisting of the pictures;
pre-training a convolutional neural network by using the video frame data set to obtain a pre-trained convolutional neural network;
fine-tuning network parameters of the pre-trained convolutional neural network by using a multi-weight font data set to obtain a fine-tuned convolutional neural network;
and inputting the heavy-weight font and the light-weight font of the same style into the fine-tuned convolutional neural network to obtain a multi-weight word stock of the same style.
2. The method for generating a word stock based on frame interpolation according to claim 1, wherein the pre-training of the convolutional neural network using the video frame data set to obtain a pre-trained convolutional neural network comprises the following steps:
splicing the pictures of the ith frame and the (i + 2) th frame of the stored video data in the channel dimension to be used as the input of a convolutional neural network;
taking the predicted video frame picture of the (i + 1) th frame as the output of the convolutional neural network;
updating network parameters of the convolutional neural network according to the loss function;
and stopping the pre-training when the pre-trained convolutional neural network is converged to obtain the pre-trained convolutional neural network.
3. The method of generating a word stock based on frame interpolation of claim 1, wherein the fine-tuning of the network parameters of the pre-trained convolutional neural network using the multi-weight font data set to obtain the fine-tuned convolutional neural network comprises the steps of:
acquiring a multi-weight font data set;
taking two fonts from the multi-weight font data set that are separated by one weight grade as the input of the pre-trained convolutional neural network, to generate the font of the intermediate weight;
updating the network parameters of the pre-trained convolutional neural network according to the loss function;
and stopping fine tuning when the fine tuned convolutional neural network is converged to obtain the fine tuned convolutional neural network.
4. The frame interpolation based word stock generating method according to claim 2 or 3, wherein
the loss function uses an average absolute error loss function and a perceptual loss function simultaneously, which are calculated as follows:

L1_Loss = ||G(F_i, F_{i+2}) − F_{i+1}||_1

LPIPS_Loss = Σ_L (1/(h·w)) Σ_{h,w} ||weight_L ⊙ (φ_L(G(F_i, F_{i+2})) − φ_L(F_{i+1}))||_2^2

where L1_Loss denotes the average absolute error loss function, G denotes the constructed convolutional neural network, F_i, F_{i+1} and F_{i+2} denote the i-th, (i+1)-th and (i+2)-th video frames of the video data respectively, LPIPS_Loss denotes the perceptual loss function, weight_L denotes the coefficients applied to the layer-L network features, ⊙ denotes element-wise multiplication with the layer-L features, and φ_L(G(F_i, F_{i+2})) and φ_L(F_{i+1}) denote the feature maps, of size h × w, obtained by passing the output of the convolutional neural network G and the target video frame of the (i+1)-th frame through the L-th layer of the deep convolutional neural network VGG.
5. The method of generating a frame interpolation based word stock according to claim 4, wherein the condition for convergence of the convolutional neural network is:
respectively calculating the values of the average absolute error loss function and the perception loss function;
the two calculated values are summed and when the sum of the two values no longer falls, the convolutional neural network converges.
6. The method of generating a frame interpolation based word stock of claim 1, further comprising constructing a convolutional neural network prior to pre-training the convolutional neural network using the set of video frame data, the convolutional neural network consisting of three modules:
the feature compression coding module is used for encoding an input video frame into a 4-dimensional tensor and then compressing it to obtain a compressed 4-dimensional tensor;
the feature transfer module is implemented by adding a self-attention mechanism to a residual layer and is used for applying multiple nonlinear transformations to the features extracted by the feature compression coding module;
and the feature decoding module is used for decoding the features output by the feature transfer module into image space and converting them into an image to be output.
7. A system for generating a word stock based on frame interpolation, comprising:
the video frame data set generating module is used for acquiring video data and storing the video data into pictures frame by frame to obtain a video frame data set consisting of the pictures;
the pre-training module is used for pre-training the convolutional neural network by using the video frame data set to obtain the pre-trained convolutional neural network;
the fine-tuning module is used for fine-tuning the network parameters of the pre-trained convolutional neural network by using the multi-weight font data set to obtain the fine-tuned convolutional neural network;
and the multi-weight word stock generation module is used for inputting the heavy-weight font and the light-weight font of the same style into the fine-tuned convolutional neural network to obtain a multi-weight word stock of the same style.
8. The frame interpolation based word stock generation system of claim 7, further comprising a convolutional neural network construction module, the convolutional neural network consisting of three modules:
the feature compression coding module is used for encoding an input video frame into a 4-dimensional tensor and then compressing it to obtain a compressed 4-dimensional tensor;
the feature transfer module is implemented by adding a self-attention mechanism to a residual layer and is used for applying multiple nonlinear transformations to the features extracted by the feature compression coding module;
and the feature decoding module is used for decoding the features output by the feature transfer module into image space and converting them into an image to be output.
9. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the frame interpolation based word stock generation method of any of claims 1 to 6 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the frame interpolation based word stock generation method of any one of claims 1 to 6.
CN202211244030.8A 2022-10-11 2022-10-11 Word stock generation method based on frame interpolation, electronic equipment, storage medium and system Active CN115661304B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211244030.8A CN115661304B (en) 2022-10-11 2022-10-11 Word stock generation method based on frame interpolation, electronic equipment, storage medium and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211244030.8A CN115661304B (en) 2022-10-11 2022-10-11 Word stock generation method based on frame interpolation, electronic equipment, storage medium and system

Publications (2)

Publication Number Publication Date
CN115661304A true CN115661304A (en) 2023-01-31
CN115661304B CN115661304B (en) 2024-05-03

Family

ID=84987165

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211244030.8A Active CN115661304B (en) 2022-10-11 2022-10-11 Word stock generation method based on frame interpolation, electronic equipment, storage medium and system

Country Status (1)

Country Link
CN (1) CN115661304B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9659248B1 (en) * 2016-01-19 2017-05-23 International Business Machines Corporation Machine learning and training a computer-implemented neural network to retrieve semantically equivalent questions using hybrid in-memory representations
US20200104367A1 (en) * 2018-09-30 2020-04-02 International Business Machines Corporation Vector Representation Based on Context
WO2021237743A1 (en) * 2020-05-29 2021-12-02 京东方科技集团股份有限公司 Video frame interpolation method and apparatus, and computer-readable storage medium
KR20220032538A (en) * 2021-09-09 2022-03-15 베이징 바이두 넷컴 사이언스 테크놀로지 컴퍼니 리미티드 Training method for character generation model, character generation method, apparatus and device, and medium
CN114913533A (en) * 2022-05-11 2022-08-16 北京百度网讯科技有限公司 Method and device for changing character weight

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WEI CAO et al.: "Stacked residual recurrent neural network with word weight for text classification", Computer Science, 1 January 2017 (2017-01-01) *

Also Published As

Publication number Publication date
CN115661304B (en) 2024-05-03

Similar Documents

Publication Publication Date Title
Luo et al. End-to-end optimization of scene layout
CN113014927B (en) Image compression method and image compression device
CN110166757B (en) Method, system and storage medium for compressing data by computer
CN113888744A (en) Image semantic segmentation method based on Transformer visual upsampling module
CN110245710B (en) Training method of semantic segmentation model, semantic segmentation method and device
CN114073071B (en) Video frame inserting method and device and computer readable storage medium
CN112950471A (en) Video super-resolution processing method and device, super-resolution reconstruction model and medium
CN113793286B (en) Media image watermark removing method based on multi-order attention neural network
US20210201448A1 (en) Image filling method and apparatus, device, and storage medium
CN116645668B (en) Image generation method, device, equipment and storage medium
US20210350230A1 (en) Data dividing method and processor for convolution operation
US20220292795A1 (en) Face image processing method, electronic device, and storage medium
CN116071300A (en) Cell nucleus segmentation method based on context feature fusion and related equipment
JP2023501640A (en) POINT CLOUD PROCESSING METHOD, COMPUTER SYSTEM, PROGRAM AND COMPUTER-READABLE STORAGE MEDIUM
CN113409307A (en) Image denoising method, device and medium based on heterogeneous noise characteristics
CN112669431A (en) Image processing method, apparatus, device, storage medium, and program product
CN115661304B (en) Word stock generation method based on frame interpolation, electronic equipment, storage medium and system
US20230135109A1 (en) Method for processing signal, electronic device, and storage medium
CN111488886A (en) Panorama image significance prediction method and system with attention feature arrangement and terminal
US20230196093A1 (en) Neural network processing
CN113436292B (en) Image processing method, training method, device and equipment of image processing model
CN114998668A (en) Feature extraction method and device, storage medium and electronic equipment
CN115797171A (en) Method and device for generating composite image, electronic device and storage medium
CN114399708A (en) Video motion migration deep learning system and method
CN111915701B (en) Button image generation method and device based on artificial intelligence

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant