CN115661304B - Word stock generation method based on frame interpolation, electronic equipment, storage medium and system - Google Patents
- Publication number: CN115661304B (application CN202211244030.8A)
- Authority
- CN
- China
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Abstract
The present disclosure relates to a frame-interpolation-based word stock (font library) generation method, electronic device, storage medium and system. The multi-weight font generation problem is modeled as a continuous frame interpolation problem: video data are used as pre-training data, and the network parameters of a constructed convolutional neural network are then fine-tuned with an existing multi-weight font data set, so that a number of different weights are generated between the heaviest and the lightest weight. The method greatly shortens the time required to produce the remaining weights and improves the generation effect: the generated weights are more attractive and better proportioned than glyphs produced by the point-to-point method, show clearer style characteristics, and are easier to read.
Description
Technical Field
The disclosure relates to the field of word stock, in particular to a word stock generation method, electronic equipment, storage medium and system based on frame interpolation.
Background
Characters are everywhere in daily life and are one of the main tools for transmitting information; to remain effective, they appear in different modes and forms in different situations and on different devices. A multi-weight font library is a series of fonts that share the same style but differ in weight, intended to satisfy different display requirements. Font weight is the stroke thickness of a font; the international standard ISO defines nine weight classes, W1-W9, ranging from the lightest (ultra-light) to the heaviest (ultra-bold). A diversity of weights increases a font's applicability: a single typeface may appear in titles, body text, posters, and on the small displays of embedded devices, which is difficult to serve well with only one weight. Producing fonts of different weights is therefore important for a font's applicability.
The most basic way to make a multi-weight font library is for a designer to first design one font set and then adjust the weight by editing the control points of each character, so that the style and aesthetics of the characters are preserved while the weight changes. However, the effort this method requires depends on the character set size and on how many weights are needed; producing a nine-weight family in one style usually takes at least several months.
To reduce the designer's workload and speed up the production of a multi-weight font library, a common approach is the point-to-point method: the designer designs only the heaviest and lightest fonts, an engineer matches the corresponding control points of the same characters in the two fonts, and intermediate weights are produced by moving the control points stepwise along the line between each corresponding pair. The point-to-point method greatly reduces the workload, from designing nine different weights to designing only two, but it has serious limitations. Because the structural composition of each Chinese character and the spacing of its components differ, mechanically moving control points in proportion makes the overall glyph structure look uncoordinated and changes the style of the font; the method is not universal and requires parameters to be tuned for every font. It is therefore suitable only for a few fonts with simple styles, has a narrow range of application, and produces font libraries of insufficient aesthetic quality.
Disclosure of Invention
The present disclosure provides a method, an electronic device, a storage medium, and a system for generating a word stock based on frame interpolation, so as to solve at least one of the technical problems in the background art above.
In a preferred embodiment of the present disclosure, an embodiment of the present application provides a method for generating a word stock based on frame interpolation, where the method includes:
acquiring video data and saving it frame by frame as pictures, to obtain a video frame data set composed of the pictures;
pre-training a convolutional neural network using the video frame data set, to obtain a pre-trained convolutional neural network;
fine-tuning the network parameters of the pre-trained convolutional neural network using a multi-weight font data set, to obtain a fine-tuned convolutional neural network;
and inputting the heaviest-weight and lightest-weight fonts of a given style into the fine-tuned convolutional neural network, to obtain a multi-weight font library of the same style.
Further, pre-training the convolutional neural network using the video frame data set to obtain the pre-trained convolutional neural network comprises the following steps:
splicing the i-th and (i+2)-th video frame pictures of the stored video data in the channel dimension and using the result as the input of the convolutional neural network;
taking the predicted (i+1)-th video frame picture as the output of the convolutional neural network;
updating network parameters of the convolutional neural network according to the loss function;
Stopping pre-training when the pre-trained convolutional neural network converges, and obtaining the pre-trained convolutional neural network.
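As an illustrative sketch (not code from the patent), the frame pairing described in the steps above can be expressed as follows; the helper name `make_interpolation_samples` and the H x W x C array layout are assumptions:

```python
import numpy as np

def make_interpolation_samples(frames):
    """Build (input, target) pairs for frame-interpolation pre-training.

    The i-th and (i+2)-th frames are spliced in the channel dimension and
    used as the network input; the (i+1)-th frame is the prediction target.
    frames: list of H x W x C arrays saved frame by frame from the video.
    """
    samples = []
    for i in range(len(frames) - 2):
        x = np.concatenate([frames[i], frames[i + 2]], axis=-1)  # channel splice
        y = frames[i + 1]  # middle frame to be predicted
        samples.append((x, y))
    return samples
```

Every consecutive frame triple thus yields one training sample, so a video with n frames provides n - 2 samples.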
Further, fine-tuning the network parameters of the pre-trained convolutional neural network using the multi-weight font data set to obtain the fine-tuned convolutional neural network comprises the following steps:
acquiring a multi-weight font data set;
taking two sets of font data separated by one weight class in the multi-weight font data set as the input of the pre-trained convolutional neural network, and generating the font data of the intermediate weight;
updating the network parameters of the convolutional neural network after the pre-training according to the loss function;
stopping fine tuning when the fine-tuned convolutional neural network converges, and obtaining the fine-tuned convolutional neural network.
Further, the loss function uses the mean absolute error loss function and the perceptual loss function together; they are calculated as follows:
L1_Loss = ||G(F_i, F_{i+2}) − F_{i+1}||_1
LPIPS_Loss = Σ_l (1/(h·w)) Σ_{h,w} ||weight_l ⊙ (φ̂_{h,w}^l − φ_{h,w}^l)||_2^2
wherein L1_Loss denotes the mean absolute error loss function; G denotes the constructed convolutional neural network; F_i, F_{i+1} and F_{i+2} denote the i-th, (i+1)-th and (i+2)-th video frames of the video data, respectively; LPIPS_Loss denotes the perceptual loss function; weight_l denotes the coefficients for the network parameters of layer l; ⊙ denotes the element-wise (per-site) multiplication of the layer-l coefficients with the features; and φ̂^l and φ^l denote the layer-l feature maps, of size h × w, obtained by passing the output of the convolutional neural network G and the target (i+1)-th video frame, respectively, through the deep convolutional neural network VGG.
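The two loss terms can be sketched as follows. This is an illustrative simplification: a real LPIPS implementation extracts the feature maps φ̂ and φ with a pre-trained VGG network, whereas here they are passed in directly, and the function names are assumptions:

```python
import numpy as np

def l1_loss(pred, target):
    """Mean absolute error between the generated and the target frame."""
    return np.abs(pred - target).mean()

def lpips_loss(feats_pred, feats_target, layer_weights):
    """Perceptual loss over per-layer feature maps (phi-hat and phi above).

    feats_pred / feats_target: lists of h x w x c feature maps, one per
    network layer l; layer_weights: the per-layer coefficients weight_l.
    """
    total = 0.0
    for phi_hat, phi, w in zip(feats_pred, feats_target, layer_weights):
        h, wid = phi.shape[:2]
        diff = w * (phi_hat - phi)              # element-wise (per-site) product
        total += (diff ** 2).sum() / (h * wid)  # averaged over spatial positions
    return total
```

During training the two values are summed into a single objective, as described in the convergence condition.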
Further, the condition for convergence of the convolutional neural network is: the values of the mean absolute error loss function and of the perceptual loss function are calculated and summed; the convolutional neural network has converged when this sum no longer decreases.
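The stopping rule ("the sum no longer drops") can be sketched as an early-stopping check; the `patience` parameter is an illustrative assumption, since the patent does not specify how many epochs without improvement count as no longer decreasing:

```python
def has_converged(loss_sums, patience=3):
    """Stop when the summed loss (L1_Loss + LPIPS_Loss) no longer drops.

    loss_sums: history of per-epoch summed loss values.
    Returns True once the best earlier value has not been improved upon
    for the last `patience` epochs.
    """
    if len(loss_sums) <= patience:
        return False
    best = min(loss_sums[:-patience])
    return min(loss_sums[-patience:]) >= best
```

The same check serves both the pre-training and the fine-tuning phase, since both stop on the same summed loss.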
Further, before pre-training the convolutional neural network using the video frame data set, the convolutional neural network is constructed; it consists of three modules:
The feature compression encoding module is used for encoding an input video frame into a 4-dimensional tensor and then compressing it, so as to obtain a compressed 4-dimensional tensor. It consists of a convolutional network layer, a feature normalization layer, a feature activation layer and a downsampling network layer: the convolutional network layer applies a linear transformation to the video frame to obtain convolutional features; the feature normalization layer numerically normalizes the features obtained by the convolutional layer so that their range stays within [-1, 1]; the feature activation layer applies a nonlinear transformation to the normalized features to obtain nonlinear features; the downsampling network layer downsamples the nonlinear features in the 2nd and 3rd (spatial) dimensions to obtain nonlinear features of smaller spatial size;
The feature transfer module is used for applying several nonlinear transformations to the features extracted by the feature compression encoding module, giving the transformed features stronger representational capacity; it is implemented by adding a self-attention mechanism inside a residual layer. The residual layer accelerates network convergence and stabilizes the network output; the self-attention mechanism extracts local features, which facilitates updating the network parameters;
The feature decoding module is used for decoding the nonlinearly transformed features back into image space and transforming them into the output image. It consists of a two-dimensional transposed convolution layer, a feature instance normalization layer, a feature activation layer and an upsampling network layer: the transposed convolution layer decodes the features; the feature instance normalization layer normalizes the feature values to keep training stable; the feature activation layer applies a nonlinear transformation to the normalized features to obtain nonlinear features; the upsampling network layer enlarges the feature size and fills in spatial information.
In a preferred implementation manner of the present disclosure, the embodiment of the present application further provides a word stock generating system based on frame interpolation, including:
the video frame data set generation module is used for acquiring video data and storing the video data into pictures frame by frame to obtain a video frame data set composed of the pictures;
the pre-training module is used for pre-training the convolutional neural network by using the video frame data set to obtain a pre-trained convolutional neural network;
The fine-tuning module is used for fine-tuning the network parameters of the pre-trained convolutional neural network using the multi-weight font data set, to obtain a fine-tuned convolutional neural network;
and the multi-weight font library generation module is used for inputting the heaviest-weight and lightest-weight fonts of a given style into the fine-tuned convolutional neural network, to obtain a multi-weight font library of the same style.
Further, the word stock generation system based on frame interpolation further comprises a convolutional neural network construction module, wherein the convolutional neural network comprises three modules:
The feature compression encoding module is used for encoding an input video frame into a 4-dimensional tensor and then compressing it, so as to obtain a compressed 4-dimensional tensor. It consists of a convolutional network layer, a feature normalization layer, a feature activation layer and a downsampling network layer: the convolutional network layer applies a linear transformation to the video frame to obtain convolutional features; the feature normalization layer numerically normalizes the features obtained by the convolutional layer so that their range stays within [-1, 1]; the feature activation layer applies a nonlinear transformation to the normalized features to obtain nonlinear features; the downsampling network layer downsamples the nonlinear features in the 2nd and 3rd (spatial) dimensions to obtain nonlinear features of smaller spatial size;
The feature transfer module is used for applying several nonlinear transformations to the features extracted by the feature compression encoding module, giving the transformed features stronger representational capacity; it is implemented by adding a self-attention mechanism inside a residual layer. The residual layer accelerates network convergence and stabilizes the network output; the self-attention mechanism extracts local features, which facilitates updating the network parameters;
The feature decoding module is used for decoding the nonlinearly transformed features back into image space and transforming them into the output image. It consists of a two-dimensional transposed convolution layer, a feature instance normalization layer, a feature activation layer and an upsampling network layer: the transposed convolution layer decodes the features; the feature instance normalization layer normalizes the feature values to keep training stable; the feature activation layer applies a nonlinear transformation to the normalized features to obtain nonlinear features; the upsampling network layer enlarges the feature size and fills in spatial information.
In a preferred embodiment of the present disclosure, an electronic device is further provided, where the electronic device includes a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor implements the method for generating a word stock based on frame interpolation described above when the processor executes the computer program.
In a preferred embodiment of the present disclosure, the present disclosure further provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the frame interpolation based word stock generation method described above.
The beneficial effects of the present disclosure are as follows: the present disclosure models the multi-weight font generation problem as a continuous frame interpolation problem, using video data as pre-training data and fine-tuning the network parameters of the constructed convolutional neural network with an existing multi-weight font data set, thereby generating a variety of different weights between the heaviest and the lightest weight. This greatly shortens the time required to produce the remaining weights and improves the generation effect: compared with glyphs produced by the point-to-point method, the generated weights are more attractive and better proportioned, show clearer style characteristics, and are easier to read.
Drawings
FIG. 1 is a flow chart of a method of generating a word stock based on frame interpolation;
FIG. 2 is a block diagram of a convolutional neural network;
FIG. 3 is a diagram of the components of each module in a word stock generation system based on frame interpolation;
Fig. 4 is a diagram of the effect of the generated multi-weight fonts, with font weights between 45 and 65.
Detailed Description
The following description of the technical solutions in the embodiments of the present disclosure will be made clearly and completely with reference to the accompanying drawings in the embodiments of the present disclosure, and it is apparent that the described embodiments are only some embodiments of the present disclosure, not all embodiments. Based on the embodiments in this disclosure, all other embodiments that a person of ordinary skill in the art would obtain without making any inventive effort are within the scope of protection of this disclosure.
Example 1
Referring to fig. 1, a method for generating a font library based on frame interpolation according to an exemplary embodiment of the present disclosure addresses the technical problems mentioned in the background art. Video data are used as pre-training data, the multi-weight font generation problem is modeled as a continuous frame interpolation problem, and an existing multi-weight font data set is used to fine-tune the network parameters of the constructed convolutional neural network, so as to adapt the network to font-domain data and improve the generation effect.
The implementation process of the word stock generation method based on frame interpolation as an example comprises the following steps:
Video data are collected and saved frame by frame in a picture format, named by frame number; F_i, F_{i+1} and F_{i+2} denote the i-th, (i+1)-th and (i+2)-th video frames of the video data, respectively.
A convolutional neural network is constructed, consisting mainly of three modules: a feature compression encoding module, a feature transfer module and a feature decoding module. The module architecture of the convolutional neural network is shown in fig. 2, in which:
The feature compression encoding module encodes an input video frame into a 4-dimensional tensor and then compresses it, yielding a compressed 4-dimensional tensor. It uses the network structure {Conv-BN-ReLU}×N-Downsample, where N denotes the number of repetitions of the {Conv-BN-ReLU} block before each Downsample; in this scheme N = [4, 8], i.e. two feature compression encoding submodules Enc form the module. Conv is a convolutional network layer that applies a linear transformation to the video frame to obtain convolutional features; BN is a feature normalization layer that numerically normalizes the convolutional features so that their range stays within [-1, 1]; ReLU is a feature activation layer that applies a nonlinear transformation to the normalized features to obtain nonlinear features; Downsample is a downsampling network layer that downsamples the nonlinear features in the 2nd and 3rd (spatial) dimensions to obtain nonlinear features of smaller spatial size, e.g. a feature map of spatial size 8×8 becomes 4×4 after the downsampling layer;
The feature transfer module applies several nonlinear transformations to the features extracted by the feature compression encoding module, giving the transformed features stronger representational capacity. The module is linked by a residual structure: the main branch is a self-attention structure and the residual link is a 1×1 convolution module; in this scheme four feature transfer submodules Res form the module. The residual layer accelerates network convergence and stabilizes the network output; the self-attention mechanism extracts local features, which facilitates updating the network parameters;
The feature decoding module decodes the nonlinearly transformed features back into image space and transforms them into the output image. It is composed of {Conv2dTranspose-IN-ReLU}×M-Upsample, where M denotes the number of repetitions of the {Conv2dTranspose-IN-ReLU} block before each Upsample; in this scheme M = [8, 4], i.e. two feature decoding submodules Dec form the module. Conv2dTranspose is a two-dimensional transposed convolution layer used to decode features; IN is a feature instance normalization layer that normalizes feature values to keep training stable; ReLU is a feature activation layer that applies a nonlinear transformation to the normalized features to obtain nonlinear features; Upsample is an upsampling network layer that enlarges the feature size and fills in spatial information, e.g. a feature map of spatial size 2×2 becomes 4×4 after the upsampling layer.
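A minimal PyTorch sketch of this three-module architecture is given below, assuming the described configuration: two Enc submodules with N = [4, 8], four Res submodules, and two Dec submodules with M = [8, 4]. The channel widths, the use of average pooling and nearest-neighbor upsampling, and the exact self-attention formulation are assumptions, since the patent does not specify them:

```python
import torch
import torch.nn as nn

class Enc(nn.Module):
    """Feature compression encoding submodule: {Conv-BN-ReLU} x n, then Downsample."""
    def __init__(self, cin, cout, n):
        super().__init__()
        layers = []
        for i in range(n):
            layers += [nn.Conv2d(cin if i == 0 else cout, cout, 3, padding=1),
                       nn.BatchNorm2d(cout),
                       nn.ReLU(inplace=True)]
        layers.append(nn.AvgPool2d(2))  # halves the 2nd/3rd (spatial) dimensions
        self.body = nn.Sequential(*layers)

    def forward(self, x):
        return self.body(x)

class Res(nn.Module):
    """Feature transfer submodule: self-attention main branch, 1x1-conv residual link."""
    def __init__(self, c):
        super().__init__()
        self.q = nn.Conv2d(c, c // 8, 1)
        self.k = nn.Conv2d(c, c // 8, 1)
        self.v = nn.Conv2d(c, c, 1)
        self.skip = nn.Conv2d(c, c, 1)  # 1x1 convolution on the residual link

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.q(x).flatten(2).transpose(1, 2)   # (b, h*w, c/8)
        k = self.k(x).flatten(2)                   # (b, c/8, h*w)
        attn = torch.softmax(q @ k / q.shape[-1] ** 0.5, dim=-1)
        v = self.v(x).flatten(2).transpose(1, 2)   # (b, h*w, c)
        out = (attn @ v).transpose(1, 2).reshape(b, c, h, w)
        return out + self.skip(x)

class Dec(nn.Module):
    """Feature decoding submodule: {Conv2dTranspose-IN-ReLU} x m, then Upsample."""
    def __init__(self, cin, cout, m):
        super().__init__()
        layers = []
        for i in range(m):
            layers += [nn.ConvTranspose2d(cin if i == 0 else cout, cout, 3, padding=1),
                       nn.InstanceNorm2d(cout),
                       nn.ReLU(inplace=True)]
        layers.append(nn.Upsample(scale_factor=2))  # doubles the spatial size
        self.body = nn.Sequential(*layers)

    def forward(self, x):
        return self.body(x)

class InterpNet(nn.Module):
    """Two Enc (N = [4, 8]), four Res, two Dec (M = [8, 4]), as described above."""
    def __init__(self, c=32):
        super().__init__()
        self.enc = nn.Sequential(Enc(6, c, 4), Enc(c, 2 * c, 8))  # 6 = two RGB frames spliced
        self.transfer = nn.Sequential(*[Res(2 * c) for _ in range(4)])
        self.dec = nn.Sequential(Dec(2 * c, c, 8), Dec(c, c, 4))
        self.out = nn.Conv2d(c, 3, 3, padding=1)  # back to a 3-channel image

    def forward(self, f_i, f_i2):
        x = torch.cat([f_i, f_i2], dim=1)  # splice frames i and i+2 on channels
        return self.out(self.dec(self.transfer(self.enc(x))))
```

With two 3-channel frames of size 32×32 spliced on the channel dimension, the two Enc submodules reduce the spatial size to 8×8, the transfer module preserves it, and the two Dec submodules restore 32×32, so input and output frames have the same shape.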
The i-th frame and the (i+2)-th frame are spliced together in the channel dimension as the input of the neural network.
The convolutional neural network predicts the content of the (i+1)-th frame; the loss between the network output F_int and the (i+1)-th frame is calculated to update the network parameters.
The values of the mean absolute error loss function L1_Loss and the perceptual loss function LPIPS_Loss are calculated, and pre-training is stopped when the sum of the two values no longer decreases. L1_Loss and LPIPS_Loss are calculated as follows:
L1_Loss = ||G(F_i, F_{i+2}) − F_{i+1}||_1
LPIPS_Loss = Σ_l (1/(h·w)) Σ_{h,w} ||weight_l ⊙ (φ̂_{h,w}^l − φ_{h,w}^l)||_2^2
wherein L1_Loss denotes the mean absolute error loss function; G denotes the constructed convolutional neural network; F_i, F_{i+1} and F_{i+2} denote the i-th, (i+1)-th and (i+2)-th video frames of the video data, respectively; LPIPS_Loss denotes the perceptual loss function; weight_l denotes the coefficients for the network parameters of layer l; ⊙ denotes the element-wise (per-site) multiplication of the layer-l coefficients with the features; and φ̂^l and φ^l denote the layer-l feature maps, of size h × w, obtained by passing the output of the convolutional neural network G and the target (i+1)-th video frame, respectively, through the deep convolutional neural network VGG. VGG is a classical pre-trained convolutional neural network developed by the Visual Geometry Group in the Department of Engineering Science at the University of Oxford.
After the convolutional neural network has been pre-trained on the video data, the multi-weight font data set is used to fine-tune the network parameters of the pre-trained network, so that the data distributions become consistent. The multi-weight font data set mainly uses designer-made multi-weight fonts, for example a QiHei ("flag black") style family with 15 weights in total.
The font data of the intermediate weight are generated with two sets of font data separated by one weight class as input. The two loss functions used are the same as in the pre-training phase.
The values of the mean absolute error loss function L1_Loss and the perceptual loss function LPIPS_Loss are calculated; when the sum of the two values no longer decreases, the pre-trained convolutional neural network has converged, fine-tuning stops, and the fine-tuned convolutional neural network is obtained.
The heaviest and lightest fonts of a given style are then input into the fine-tuned convolutional neural network to obtain a multi-weight font library of that style. The effect of the generated multi-weight fonts is shown in fig. 4; from left to right, the font weights are 45, 47, 49 and 65, respectively.
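Several weights are generated between the two master fonts. One plausible way to realize this with a midpoint-predicting network (an assumption, not stated explicitly in the patent) is to apply the fine-tuned network recursively, treating each generated weight as a new endpoint:

```python
def generate_weights(light, heavy, model, depth):
    """Recursively insert intermediate weights between two master fonts.

    `model(a, b)` stands for the fine-tuned network producing the weight
    midway between a and b (just as it predicts frame i+1 from frames i
    and i+2). depth levels of recursion yield 2**depth + 1 weights.
    """
    if depth == 0:
        return [light, heavy]
    mid = model(light, heavy)
    left = generate_weights(light, mid, model, depth - 1)
    right = generate_weights(mid, heavy, model, depth - 1)
    return left[:-1] + right  # drop the duplicated shared midpoint
```

With a stand-in averaging model, two recursion levels turn the pair (lightest, heaviest) into five evenly spaced weights.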
Example 2
As shown in fig. 3, an exemplary frame interpolation-based word stock generation system includes:
The video frame data set generation module is used for collecting video data and saving it frame by frame as pictures named by frame number, where F_i, F_{i+1} and F_{i+2} denote the i-th, (i+1)-th and (i+2)-th video frames of the video data, respectively.
The pre-training module is used for splicing the i-th frame and the (i+2)-th frame together in the channel dimension as the input of the neural network. The convolutional neural network predicts the content of the (i+1)-th frame, and the loss between the network output F_int and the (i+1)-th frame is calculated to update the network parameters, specifically: the values of the mean absolute error loss function L1_Loss and the perceptual loss function LPIPS_Loss are calculated, and pre-training is stopped when the sum of the two values no longer decreases, yielding the pre-trained network. L1_Loss and LPIPS_Loss are calculated as follows:
L1_Loss = ||G(F_i, F_{i+2}) − F_{i+1}||_1
LPIPS_Loss = Σ_l (1/(h·w)) Σ_{h,w} ||weight_l ⊙ (φ̂_{h,w}^l − φ_{h,w}^l)||_2^2
wherein L1_Loss denotes the mean absolute error loss function; G denotes the constructed convolutional neural network; F_i, F_{i+1} and F_{i+2} denote the i-th, (i+1)-th and (i+2)-th video frames of the video data, respectively; LPIPS_Loss denotes the perceptual loss function; weight_l denotes the coefficients for the network parameters of layer l; ⊙ denotes the element-wise (per-site) multiplication of the layer-l coefficients with the features; and φ̂^l and φ^l denote the layer-l feature maps, of size h × w, obtained by passing the output of the convolutional neural network G and the target (i+1)-th video frame, respectively, through the deep convolutional neural network VGG. VGG is a classical pre-trained convolutional neural network developed by the Visual Geometry Group in the Department of Engineering Science at the University of Oxford.
The fine-tuning module is used for fine-tuning the network parameters of the pre-trained convolutional neural network using the multi-weight font data set, so that the data distributions become consistent, specifically: two sets of font data separated by one weight class in the multi-weight font data set are taken as input, the font data of the intermediate weight are generated, and the values of the mean absolute error loss function L1_Loss and the perceptual loss function LPIPS_Loss are calculated; when the sum of the two values no longer decreases, the pre-trained convolutional neural network has converged, fine-tuning stops, and the fine-tuned convolutional neural network is obtained. The multi-weight font data set mainly uses designer-made multi-weight fonts, for example a QiHei ("flag black") style family with 15 weights in total. The two loss functions used by the fine-tuning module are the same as in the pre-training module described above.
And the multi-weight font library generation module is used for inputting the heaviest-weight and lightest-weight fonts of a given style into the fine-tuned convolutional neural network, to obtain a multi-weight font library of the same style.
Further, the word stock generation system based on frame interpolation also comprises a convolutional neural network, wherein the convolutional neural network mainly comprises three modules: the device comprises a feature compression encoding module, a feature transfer module and a feature decoding module. The modular architecture of the convolutional neural network is shown in fig. 2, in which:
The feature compression encoding module is used for encoding an input video frame into a 4-dimensional tensor and then compressing it to obtain a compressed 4-dimensional tensor. It uses a {Conv-BN-Relu}xN-Downsample network structure, where N denotes the number of occurrences of the {Conv-BN-Relu} module before each Downsample; in this scheme N = [4, 8], i.e. two feature compression encoding submodules Enc make up the module. Here Conv denotes a convolutional network layer, which applies a linear transformation to the video frame to obtain convolutional features; BN denotes a batch normalization layer, which numerically normalizes the features from the convolutional layer so that their range stays within [-1, 1]; Relu denotes a feature activation layer, which applies a nonlinear transformation to the normalized features to obtain nonlinear features; Downsample denotes a downsampling network layer, which downsamples the nonlinear features in the 2nd and 3rd dimensions to obtain nonlinear features of smaller size, for example reducing a feature of shape (1, C, 4, 4) to (1, C, 2, 2);
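The shape bookkeeping implied by this paragraph can be sketched as follows, assuming Conv and Relu preserve the spatial size and each Downsample halves dimensions 2 and 3 of the (N, C, H, W) tensor; the function names are illustrative, not from the patent:

```python
# Shape flow through the {Conv-BN-Relu}xN-Downsample encoder: convolution
# and activation preserve shape, downsampling halves the last two dims.
def downsample_shape(shape):
    n, c, h, w = shape
    return (n, c, h // 2, w // 2)

def encoder_output_shape(shape, num_enc_submodules=2):
    # two Enc submodules -> two downsampling stages in this sketch
    for _ in range(num_enc_submodules):
        shape = downsample_shape(shape)
    return shape

print(encoder_output_shape((1, 64, 8, 8)))  # halved twice: (1, 64, 2, 2)
```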
the feature transfer module is used for applying multiple nonlinear transformations to the features extracted by the feature compression encoding module, so that the transformed features have stronger representational capacity. The module is connected through a residual structure: the main branch is a self-attention structure, and the residual branch is a 1x1 convolution module. In this scheme, four feature transfer submodules Res make up the feature transfer module. The residual connections accelerate network convergence and stabilize the network output, while the self-attention mechanism extracts local features, which facilitates updating the network parameters;
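One way to read that structure is sketched below in NumPy: a self-attention main branch added to a residual branch realized as a per-position linear map (playing the role of the 1x1 convolution). All weights are random and the block is illustrative only, not the patented network:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def feature_transfer_block(x, wq, wk, wv, w_res):
    # x: (T, C) features flattened over T spatial positions, C channels
    q, k, v = x @ wq, x @ wk, x @ wv
    attn = softmax(q @ k.T / np.sqrt(x.shape[1]))  # (T, T) attention map
    return attn @ v + x @ w_res  # self-attention branch + 1x1-conv residual

C = 8
x = rng.normal(size=(16, C))
wq, wk, wv, w_res = (rng.normal(size=(C, C)) * 0.1 for _ in range(4))
y = feature_transfer_block(x, wq, wk, wv, w_res)
assert y.shape == x.shape  # residual links require matching shapes
```

Because the output shape equals the input shape, four such submodules can be stacked directly, which is consistent with the four Res submodules described above.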
The feature decoding module is used for decoding the nonlinearly transformed features back into image space and transforming them into the output image. It consists of {Conv2dTranspose-IN-Relu}xM-Upsample, where M denotes the number of occurrences of the {Conv2dTranspose-IN-Relu} module before each Upsample; in this scheme M = [8, 4], i.e. two feature decoding submodules Dec make up the feature decoding module. Here Conv2dTranspose denotes a two-dimensional transposed convolution layer used to decode features; IN denotes an instance normalization layer, which normalizes the feature values and keeps training stable; Relu denotes a feature activation layer, which applies a nonlinear transformation to the normalized features to obtain nonlinear features; Upsample denotes an upsampling network layer, which enlarges the feature size and complements spatial information, for example enlarging a feature of shape (1, C, 2, 2) to (1, C, 4, 4) after the upsampling layer.
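A shape sketch for the decoder side, under the assumption that each Upsample doubles dimensions 2 and 3 of the (N, C, H, W) tensor while the transposed convolution and activation preserve shape; the helper names are illustrative:

```python
# Upsample doubles dims 2 and 3, restoring spatial size that compression
# removed (illustrative helpers, not patent code).
def upsample_shape(shape):
    n, c, h, w = shape
    return (n, c, h * 2, w * 2)

def decoder_output_shape(shape, num_dec_submodules=2):
    # two Dec submodules -> two upsampling stages in this sketch
    for _ in range(num_dec_submodules):
        shape = upsample_shape(shape)
    return shape

print(upsample_shape((1, 64, 2, 2)))        # -> (1, 64, 4, 4)
print(decoder_output_shape((1, 64, 2, 2)))  # two stages -> (1, 64, 8, 8)
```

Two upsampling stages undo two downsampling stages, so an encoder with two Enc submodules and a decoder with two Dec submodules return the feature map to the input frame's spatial size.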
Example 3
An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the frame-interpolation-based word stock generation method of embodiment 1 when executing the computer program.
Embodiment 3 of the present disclosure is merely an example, and should not be construed as limiting the functionality and scope of use of the embodiments of the present disclosure.
The electronic device may be in the form of a general purpose computing device, which may be a server device, for example. Components of an electronic device may include, but are not limited to: at least one processor, at least one memory, a bus connecting different system components, including the memory and the processor.
The buses include a data bus, an address bus, and a control bus.
The memory may include volatile memory such as Random Access Memory (RAM) and/or cache memory, and may further include Read Only Memory (ROM).
The memory may also include program means having a set (at least one) of program modules including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment.
The processor executes various functional applications and data processing by running computer programs stored in the memory.
The electronic device may also communicate with one or more external devices (e.g., keyboard, pointing device, etc.). Such communication may be through an input/output (I/O) interface. And, the electronic device may also communicate with one or more networks such as a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the Internet, through a network adapter. The network adapter communicates with other modules of the electronic device via a bus. It should be appreciated that although not shown, other hardware and/or software modules may be used in connection with an electronic device, including but not limited to: microcode, device drivers, redundant processors, external disk drive arrays, RAID (disk array) systems, tape drives, data backup storage systems, and the like.
It should be noted that although several units/modules or sub-units/modules of the electronic device are mentioned in the above detailed description, such a division is merely exemplary and not mandatory. Indeed, the features and functionality of two or more units/modules described above may be embodied in one unit/module in accordance with embodiments of the present application. Conversely, the features and functions of one unit/module described above may be further divided and embodied by a plurality of units/modules.
Example 4
A computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the frame interpolation-based word stock generation method in embodiment 1.
More specifically, among others, readable storage media may be employed including, but not limited to: portable disk, hard disk, random access memory, read only memory, erasable programmable read only memory, optical storage device, magnetic storage device, or any suitable combination of the foregoing.
In a possible implementation, the disclosure may also be implemented in the form of a program product comprising program code for causing a terminal device to carry out the steps of implementing the frame interpolation based word stock generation method described in embodiment 1, when the program product is run on the terminal device.
Wherein the program code for carrying out the present disclosure may be written in any combination of one or more programming languages, and the program code may execute entirely on the user device, partly on the user device as a stand-alone software package, partly on the user device and partly on a remote device, or entirely on the remote device.
Although embodiments of the present disclosure have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made therein without departing from the principles and spirit of the disclosure, the scope of which is defined in the appended claims and their equivalents.
Claims (8)
1. A word stock generation method based on frame interpolation is characterized by comprising the following steps:
Acquiring video data and storing the video data into pictures frame by frame to obtain a video frame data set consisting of the pictures;
using the video frame data set to pretrain the convolutional neural network to obtain a pretrained convolutional neural network;
The network parameters of the pretrained convolutional neural network are fine-tuned by using the multi-weight font dataset, so as to obtain a fine-tuned convolutional neural network;
inputting a heavy-weight font and a light-weight font of the same style into the fine-tuned convolutional neural network to obtain a multi-weight font library of the same style;
wherein the pre-training of the convolutional neural network by using the video frame dataset comprises the following steps:
Splicing video frame pictures of the ith frame and the (i+2) th frame of the stored video data in a channel dimension and then taking the spliced video frame pictures as input of a convolutional neural network;
taking the predicted video frame picture of the i+1st frame as the output of the convolutional neural network;
updating network parameters of the convolutional neural network according to the loss function;
stopping pre-training when the pre-trained convolutional neural network converges to obtain a pre-trained convolutional neural network;
wherein the fine-tuning of the network parameters of the pretrained convolutional neural network by using the multi-weight font dataset to obtain the fine-tuned convolutional neural network comprises the following steps:
Acquiring a multi-weight font dataset;
taking two fonts separated by one weight in the multi-weight font dataset as the input of the pretrained convolutional neural network, and generating the font data of the intermediate weight;
updating the network parameters of the convolutional neural network after the pre-training according to the loss function;
stopping fine tuning when the fine-tuned convolutional neural network converges, and obtaining the fine-tuned convolutional neural network.
2. The frame interpolation-based word stock generation method of claim 1, wherein the loss function uses both a mean absolute error loss function and a perceptual loss function, and the mean absolute error loss function and the perceptual loss function are calculated as follows:
L1_Loss=||G(Fi,Fi+2)-Fi+1||1
LPIPS_Loss=Σ_L (1/(h×w)) Σ_h,w ||weight_L ☉ (VGG_L(G(Fi,Fi+2)) - VGG_L(Fi+1))||2^2
Wherein L1_Loss represents the mean absolute error loss function, G represents the constructed convolutional neural network G, Fi, Fi+1 and Fi+2 respectively represent the i-th, (i+1)-th and (i+2)-th frames of the video data, LPIPS_Loss represents the perceptual loss function, weight_L represents the coefficient for layer L, ☉ represents element-wise multiplication of the layer-L coefficient with the features, and VGG_L(G(Fi, Fi+2)) and VGG_L(Fi+1) respectively represent the layer-L feature maps, each of dimension h × w, obtained by passing the output of the convolutional neural network G and the target video frame of the (i+1)-th frame through the deep convolutional neural network VGG.
3. The word stock generation method based on frame interpolation according to claim 2, wherein the condition for the convolutional neural network convergence is:
Calculating the average absolute error loss function and the value of the perception loss function respectively;
the two calculated values are summed and the convolutional neural network converges when the sum of the two values no longer drops.
4. The frame interpolation based word stock generation method of claim 1, further comprising constructing a convolutional neural network prior to pre-training the convolutional neural network using the video frame dataset, the convolutional neural network consisting of three modules:
The feature compression coding module is used for coding the input video frames into 4-dimensional tensors and then compressing the 4-dimensional tensors so as to obtain the compressed 4-dimensional tensors;
The feature transfer module is realized by adding a self-attention mechanism in the residual layer and is used for carrying out nonlinear transformation on the features extracted by the feature compression coding module for a plurality of times;
and the feature decoding module is used for decoding the features output by the feature transfer module into image space and transforming them into the image output.
5. A word stock generation system based on frame interpolation, comprising:
the video frame data set generation module is used for acquiring video data and storing the video data into pictures frame by frame to obtain a video frame data set composed of the pictures;
the pre-training module is used for pre-training the convolutional neural network by using the video frame data set to obtain a pre-trained convolutional neural network;
The fine tuning module is used for fine-tuning the network parameters of the pretrained convolutional neural network by using the multi-weight font dataset to obtain a fine-tuned convolutional neural network;
the multi-weight font library generation module is used for inputting a heavy-weight font and a light-weight font of the same style into the fine-tuned convolutional neural network to obtain a multi-weight font library of the same style;
the pre-training module executes the following method:
Splicing video frame pictures of the ith frame and the (i+2) th frame of the stored video data in a channel dimension and then taking the spliced video frame pictures as input of a convolutional neural network;
taking the predicted video frame picture of the i+1st frame as the output of the convolutional neural network;
updating network parameters of the convolutional neural network according to the loss function;
stopping pre-training when the pre-trained convolutional neural network converges to obtain a pre-trained convolutional neural network;
The fine tuning module performs the following method:
Acquiring a multi-weight font dataset;
taking two fonts separated by one weight in the multi-weight font dataset as the input of the pretrained convolutional neural network, and generating the font data of the intermediate weight;
updating the network parameters of the convolutional neural network after the pre-training according to the loss function;
stopping fine tuning when the fine-tuned convolutional neural network converges, and obtaining the fine-tuned convolutional neural network.
6. The frame interpolation based word stock generation system of claim 5, further comprising a convolutional neural network construction module, the convolutional neural network consisting of three modules:
The feature compression coding module is used for coding the input video frames into 4-dimensional tensors and then compressing the 4-dimensional tensors so as to obtain the compressed 4-dimensional tensors;
The feature transfer module is realized by adding a self-attention mechanism in the residual layer and is used for carrying out nonlinear transformation on the features extracted by the feature compression coding module for a plurality of times;
and the feature decoding module is used for decoding the features output by the feature transfer module into image space and converting them into the image output.
7. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the frame interpolation based word stock generation method of any one of claims 1 to 4 when the computer program is executed by the processor.
8. A computer-readable storage medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, implements the steps of the frame interpolation based word stock generation method of any one of claims 1 to 4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211244030.8A CN115661304B (en) | 2022-10-11 | 2022-10-11 | Word stock generation method based on frame interpolation, electronic equipment, storage medium and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115661304A CN115661304A (en) | 2023-01-31 |
CN115661304B true CN115661304B (en) | 2024-05-03 |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9659248B1 (en) * | 2016-01-19 | 2017-05-23 | International Business Machines Corporation | Machine learning and training a computer-implemented neural network to retrieve semantically equivalent questions using hybrid in-memory representations |
WO2021237743A1 (en) * | 2020-05-29 | 2021-12-02 | 京东方科技集团股份有限公司 | Video frame interpolation method and apparatus, and computer-readable storage medium |
KR20220032538A (en) * | 2021-09-09 | 2022-03-15 | 베이징 바이두 넷컴 사이언스 테크놀로지 컴퍼니 리미티드 | Training method for character generation model, character generation method, apparatus and device, and medium |
CN114913533A (en) * | 2022-05-11 | 2022-08-16 | 北京百度网讯科技有限公司 | Method and device for changing character weight |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11023682B2 (en) * | 2018-09-30 | 2021-06-01 | International Business Machines Corporation | Vector representation based on context |
Non-Patent Citations (1)
Wei Cao et al., "Stacked residual recurrent neural network with word weight for text classification", Computer Science, 2017-01-01, full text. *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111079532B (en) | Video content description method based on text self-encoder | |
Luo et al. | End-to-end optimization of scene layout | |
CN111695674B (en) | Federal learning method, federal learning device, federal learning computer device, and federal learning computer readable storage medium | |
CN113888744A (en) | Image semantic segmentation method based on Transformer visual upsampling module | |
CN114073071B (en) | Video frame inserting method and device and computer readable storage medium | |
CN113961736B (en) | Method, apparatus, computer device and storage medium for text generation image | |
CN116645668B (en) | Image generation method, device, equipment and storage medium | |
US20210326710A1 (en) | Neural network model compression | |
CN114638914B (en) | Image generation method, device, computer equipment and storage medium | |
CN112926344B (en) | Word vector replacement data enhancement-based machine translation model training method and device, electronic equipment and storage medium | |
Xu et al. | Singular vector sparse reconstruction for image compression | |
CN114861907A (en) | Data calculation method, device, storage medium and equipment | |
CN117788645A (en) | Acceleration method for image generation based on consumer-level display card | |
CN115661304B (en) | Word stock generation method based on frame interpolation, electronic equipment, storage medium and system | |
CN113409307A (en) | Image denoising method, device and medium based on heterogeneous noise characteristics | |
CN116403142A (en) | Video processing method, device, electronic equipment and medium | |
CN116978057A (en) | Human body posture migration method and device in image, computer equipment and storage medium | |
CN113436292B (en) | Image processing method, training method, device and equipment of image processing model | |
CN114998668A (en) | Feature extraction method and device, storage medium and electronic equipment | |
CN114399708A (en) | Video motion migration deep learning system and method | |
JP7297286B2 (en) | Optimization method, optimization program, reasoning method, and reasoning program | |
CN111444331A (en) | Content-based distributed feature extraction method, device, equipment and medium | |
Jiang et al. | Image inpainting based on cross-hierarchy global and local aware network | |
Ma et al. | CPGAN: An Efficient Architecture Designing for Text‐to‐Image Generative Adversarial Networks Based on Canonical Polyadic Decomposition | |
CN113505838B (en) | Image clustering method and device, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||