CN116778006B - Modeling method and device for picture encoder, electronic equipment and storage medium - Google Patents


Info

Publication number
CN116778006B
Authority
CN
China
Prior art keywords
picture
mask
coding
mask region
features
Prior art date
Legal status
Active
Application number
CN202310753826.4A
Other languages
Chinese (zh)
Other versions
CN116778006A (en)
Inventor
倪子涵
王健
陈金文
刘路飞
章成全
姚锟
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202310753826.4A
Publication of CN116778006A
Application granted
Publication of CN116778006B
Legal status: Active
Anticipated expiration


Abstract

The disclosure provides a modeling method, a modeling device, electronic equipment and a storage medium of a picture encoder, relates to the technical field of artificial intelligence, in particular to the technical fields of computer vision, image processing, deep learning and the like, and can be applied to scenes such as smart cities and the like. The specific implementation scheme comprises the following steps: obtaining a visible region and a mask region of a training picture, and the position of the mask region in the training picture; constructing coding features of the mask region based on the supervisory signals; modeling the picture encoder based on the visible region of the training picture, the coding features of the mask region, and the position of the mask region. The technology disclosed by the invention can effectively improve the accuracy of the modeled picture encoder.

Description

Modeling method and device for picture encoder, electronic equipment and storage medium
Technical Field
The disclosure relates to the technical field of artificial intelligence, in particular to the technical fields of computer vision, image processing, deep learning and the like, and can be applied to scenes such as smart cities and the like, in particular to a modeling method, a modeling device, electronic equipment and a storage medium of a picture encoder.
Background
With the development of artificial intelligence technology, machine intelligent processing and intelligent analysis play an increasingly important role in various fields.
For example, in the field of image processing, a pre-trained image encoder may be used to extract features of an image, and further perform processing of downstream tasks such as image classification based on the features of the image.
Disclosure of Invention
The disclosure provides a modeling method and device of a picture encoder, electronic equipment and a storage medium.
According to an aspect of the present disclosure, there is provided a modeling method of a picture encoder, including:
obtaining a visible region and a mask region of a training picture, and the position of the mask region in the training picture;
constructing coding features of the mask region based on the supervisory signals;
modeling the picture encoder based on the visible region of the training picture, the coding features of the mask region, and the position of the mask region.
According to another aspect of the present disclosure, there is provided a modeling apparatus of a picture encoder, including:
the acquisition module is used for acquiring a visible region and a mask region of the training picture and the position of the mask region in the training picture;
The construction module is used for constructing the coding features of the mask region based on the supervision signals;
and the modeling module is used for modeling the picture encoder based on the visible region of the training picture, the coding features of the mask region and the position of the mask region.
According to still another aspect of the present disclosure, there is provided an electronic apparatus including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the aspects and methods of any one of the possible implementations described above.
According to yet another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing the computer to perform the method of the aspects and any possible implementation described above.
According to yet another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method of the aspects and any one of the possible implementations described above.
According to the technology disclosed by the invention, the accuracy of the modeled picture encoder can be effectively improved.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a schematic diagram according to a first embodiment of the present disclosure;
FIG. 2 is a schematic diagram according to a second embodiment of the present disclosure;
FIG. 3 is a training schematic diagram of the picture encoder provided in this embodiment;
FIG. 4 is another training schematic diagram of the picture encoder provided in this embodiment;
FIG. 5 is a schematic diagram according to a third embodiment of the present disclosure;
FIG. 6 is yet another training schematic diagram of the picture encoder provided in this embodiment;
FIG. 7 is a further training schematic diagram of the picture encoder provided in this embodiment;
FIG. 8 is a schematic diagram according to a fourth embodiment of the present disclosure;
FIG. 9 is a schematic diagram according to a fifth embodiment of the present disclosure;
FIG. 10 is a block diagram of an electronic device for implementing the methods of embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
It will be apparent that the described embodiments are some, but not all, of the embodiments of the present disclosure. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments in this disclosure without inventive faculty, are intended to be within the scope of this disclosure.
It should be noted that, the terminal device in the embodiments of the present disclosure may include, but is not limited to, smart devices such as a mobile phone, a personal digital assistant (Personal Digital Assistant, PDA), a wireless handheld device, and a Tablet Computer (Tablet Computer); the display device may include, but is not limited to, a personal computer, a television, or the like having a display function.
In addition, the term "and/or" herein merely describes an association relationship between associated objects, indicating that three relationships may exist; for example, A and/or B may mean: A exists alone, both A and B exist, or B exists alone. In addition, the character "/" herein generally indicates that the associated objects before and after it are in an "or" relationship.
In order to improve the coding accuracy and coding efficiency of the picture encoder, a self-masking mode can be adopted to realize modeling of the picture encoder. Specifically, during modeling, a portion of the pixel blocks of the original picture may be randomly masked, dividing the picture into a visible region and a mask region. The visible region is encoded by a picture encoder to obtain the coding features of the visible region. Coding features are then configured for the mask region, and a decoder decodes and restores the picture based on the coding features of the visible region and the coding features of the mask region. Parameter adjustment can then be performed on the picture encoder and the picture decoder based on the original picture and the restored picture, so as to realize modeling of the picture encoder.
In the above modeling process of the picture encoder, the coding features of the mask region are randomly assigned, so their accuracy is poor, which in turn degrades the accuracy of the modeled picture encoder. Based on this, the present disclosure provides a modeling scheme that improves the accuracy of the picture encoder.
FIG. 1 is a schematic diagram according to a first embodiment of the present disclosure; as shown in fig. 1, the present embodiment provides a modeling method of a picture encoder, which specifically includes the following steps:
S101, acquiring a visible region and a mask region of a training picture, and the position of the mask region in the training picture;
S102, constructing coding features of the mask region based on a supervision signal;
S103, modeling the picture encoder based on the visible region of the training picture, the coding features of the mask region, and the position of the mask region.
In this embodiment, the modeling method of the picture encoder refers to a pre-training method of the picture encoder, and modeling of the picture encoder is achieved by pre-training the picture encoder.
The modeling method of the picture encoder of the present embodiment is implemented based on masking. Specifically, for any training picture, masking may be performed on a portion of its pixel blocks to obtain a mask region. For example, in a specific implementation, the training picture may be segmented into N*N pixel blocks, and a portion of the pixel blocks is randomly masked to obtain the mask region.
The visible region and the mask region each include a plurality of pixel blocks; the pixel blocks of the mask region may also be referred to as mask blocks. The mask blocks may or may not be adjacent in the training picture. In either case, after masking, the position of each mask block of the mask region in the training picture needs to be recorded, to facilitate subsequent decoding by the decoder.
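As a minimal sketch of this masking step (the 75% ratio, tensor shapes, and function names are illustrative assumptions, not values prescribed by this disclosure):

    import torch

    def random_mask(picture_patches, mask_ratio=0.75):
        # picture_patches: (num_patches, patch_dim) tensor holding the N*N pixel blocks
        num_patches = picture_patches.shape[0]
        num_masked = int(num_patches * mask_ratio)   # preset mask proportion, fixed across rounds
        perm = torch.randperm(num_patches)           # random masking: positions differ per picture
        mask_positions = perm[:num_masked]           # record each mask block's position for the decoder
        visible_positions = perm[num_masked:]
        return (picture_patches[visible_positions], picture_patches[mask_positions],
                visible_positions, mask_positions)

    patches = torch.randn(196, 768)                  # e.g. a 14*14 grid of pixel blocks
    visible, masked, vis_pos, mask_pos = random_mask(patches)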
In this embodiment, in order to improve accuracy of the modeled picture encoder, a supervisory signal may be introduced when constructing the coding features of the mask region, and the coding features of the mask region may be constructed based on the supervisory signal. And further, the picture encoder is modeled based on the visible region of the training picture, the coding features of the mask region and the position of the mask region, and the modeling process can be more effectively supervised by the introduced supervision signal, so that the accuracy of the modeled picture encoder can be effectively improved.
According to the modeling method of the picture encoder, the coding characteristics of the mask region are constructed by introducing the supervision signals, so that the modeling process can be supervised more effectively, and the accuracy of the modeled picture encoder can be improved effectively.
Based on the technical solution of the embodiment described in fig. 1, step S102 may specifically include the following steps:
(1) Acquiring coding features corresponding to the supervision signals;
(2) Constructing the coding features of the mask region based on the coding features of the supervisory signal.
Further optionally, the specific implementation of step (2) may include the following implementation manner:
in the first implementation manner, the coding features of the supervisory signals are used as the coding features of each mask block of the mask region.
This method is simple to implement: the coding features of the supervision signal are directly used as the coding features of each mask block of the mask region, which provides accurate and effective supervision for the picture decoder during decoding, so that the training process of the picture encoder can be accurately and effectively supervised and the accuracy of the modeled picture encoder improved.
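A one-line sketch of this first implementation, assuming PyTorch and illustrative values (the feature dimension 512 and the mask-block count are assumptions, not values from this disclosure):

    import torch

    supervision_feature = torch.randn(512)    # coding feature of the supervisory signal (dimension assumed)
    num_masked = 147                          # number of mask blocks in the mask region (assumed)
    # the supervisory signal's coding feature serves as the coding feature of every mask block
    mask_region_features = supervision_feature.unsqueeze(0).expand(num_masked, -1)   # (147, 512)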
In a second implementation, the currently learned dynamic coding features of each mask block in the mask region are acquired first; and constructing the coding features of the mask region based on the currently learned dynamic coding features of each mask block in the mask region and the coding features of the supervisory signals.
In the second implementation, the construction of the coding features of the mask region draws on two aspects: on the one hand, the coding features of the supervisory signal, and on the other hand, the currently learned dynamic coding features of each mask block. That is, the dynamic coding feature of each mask block is a learnable feature vector.
When training the picture encoder, the training pictures are the same size, e.g. all comprise N*N pixel blocks, and the preset mask proportion is fixed; that is, when training with different training pictures, the number of mask blocks included in the mask region is the same. However, the positions of the mask blocks included in the mask region can differ across training pictures. For example, when training based on a first training picture, the mask region includes the i-th and j-th pixel blocks of the N*N pixel blocks, with j greater than i; when training based on a second training picture, the mask region includes the m-th and n-th pixel blocks, with n greater than m, where m equals neither i nor j, and n equals neither i nor j.
Therefore, at the beginning of training of the picture encoder, a dynamic coding feature can be randomly assigned to each mask block based on the preset mask proportion. For example, suppose the mask region includes two mask blocks in each round of training. In the first round, two dynamic coding features can be generated by random assignment: the first dynamic coding feature is assigned to the mask block with the smaller sequence identifier in the training picture, namely the i-th mask block, to construct its coding feature; the second dynamic coding feature is assigned to the mask block with the larger sequence identifier, namely the j-th mask block, to construct its coding feature.
In the first training round, the two dynamic coding features can also be optimized and adjusted. In the second round, the currently learned first dynamic coding feature is assigned to the mask block with the smaller sequence identifier, the m-th mask block, to construct its coding feature, and the currently learned second dynamic coding feature is assigned to the n-th mask block with the larger sequence identifier. And so on: when the mask region contains more mask blocks, a corresponding number of dynamic coding features is configured, and in each round of training one dynamic coding feature is allocated to each mask block to construct that mask block's coding feature; details are not repeated. In addition, in each round of training, the dynamic coding features of the mask blocks can be optimized and adjusted so that they are expressed more reasonably and accurately.
In this implementation manner, by referring to these two aspects of features, the coding features of the mask region can be constructed more accurately and efficiently.
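The learnable dynamic coding features described above might be held in a small module such as the following sketch (PyTorch assumed; the fixed slot allocation by sequence-identifier rank follows the description above, but the class name and initialization are illustrative):

    import torch
    import torch.nn as nn

    class DynamicMaskFeatures(nn.Module):
        # one learnable feature vector per mask slot; the number of slots is fixed by the
        # preset mask proportion, and in each round the k-th slot is allocated to the mask
        # block with the k-th smallest sequence identifier in the current training picture
        def __init__(self, num_masked, dim):
            super().__init__()
            self.features = nn.Parameter(torch.empty(num_masked, dim))
            nn.init.normal_(self.features, std=0.02)   # random assignment at the start of training

        def forward(self):
            # optimized and adjusted in each training round together with encoder/decoder parameters
            return self.features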
Further, in the second implementation manner, the following two cases may be further included:
in the first case, the currently learned dynamic coding features of each mask block in the mask region are spliced with the coding features of the supervisory signals to serve as the coding features of each mask block in the mask region.
Case two, may include the following steps:
(a) Acquiring weights of all mask blocks in the mask region based on the positions of all mask blocks in the mask region in the training picture;
for example, in this embodiment, a central area and a non-central area of the training picture may be divided in advance, where the central area is an area close to a central point of the training picture; the non-central region is located outside the central region, surrounding other regions of the central region. In practical application, the closer to the center of the picture, the more the content contained in the picture can embody the theme of the picture. And the further from the center of the picture, the further from the subject matter of the picture it contains. Based on this, different weights may be configured for the center region and the non-center region of the training picture, such that the weight of each pixel block of the center region of the training picture is greater than the weight of each pixel block of the non-center region. For example, the weight of each pixel block of the center region of the training picture may be set to be greater than 1, while the weight of each pixel block of the non-center region is set to be less than 1.
In specific implementation, the positions of the mask blocks in the mask region in the training picture can be detected, and whether the mask blocks are in the central region or the non-central region can be further determined, so that the weights of the mask blocks in the mask region can be determined.
For another example, in this embodiment, the distance between each mask block and the center point of the training picture may also be calculated based on the position of each mask block of the mask region in the training picture; this may be taken as the distance between the center point of the mask block and the center point of the training picture. In an actual scene, the smaller this distance, the closer the mask block is to the center point of the training picture, and the larger the probability that it contains content related to the subject of the training picture. Based on this, the weight of each pixel block may be configured based on its distance to the center point of the training picture; the specific configuration is not limited, as long as it is ensured that pixel blocks closer to the center point receive higher weights and pixel blocks farther from it receive lower weights.
(b) Acquiring weighted coding features of each mask block in the mask region based on the currently learned dynamic coding features and the corresponding weights of each mask block in the mask region;
In this embodiment, the dynamic coding features of each mask block may be continuously learned and continuously changed in the modeling process of the picture encoder. In each training round, the product of the currently learned dynamic coding feature of each mask block and the corresponding weight can be taken as the weighted coding feature of the mask block. In this way, for mask blocks with high weights, the learned dynamic coding features can be strengthened in the coding process; and for mask blocks with low weight, the learned dynamic coding characteristics can be weakened in the coding process.
(c) Splicing the weighted coding features of each mask block in the mask region with the coding features of the supervisory signal, to serve as the coding features of each mask block in the mask region.
By the method, the coding features of each mask block in the mask region can be constructed more accurately and efficiently.
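A sketch of case two under stated assumptions (the inverse-distance weighting is one possible scheme satisfying the requirement above; this disclosure does not fix a specific formula, and all names are illustrative):

    import torch

    def weighted_mask_features(dynamic_feats, supervision_feat, mask_positions, grid_size):
        # dynamic_feats: (num_masked, d) currently learned dynamic coding features
        # supervision_feat: (d_s,) coding feature of the supervisory signal
        # mask_positions: (num_masked,) flat indices of the mask blocks in the N*N grid
        rows = torch.div(mask_positions, grid_size, rounding_mode="floor").float()
        cols = (mask_positions % grid_size).float()
        center = (grid_size - 1) / 2.0
        dist = ((rows - center) ** 2 + (cols - center) ** 2).sqrt()
        weights = 1.0 / (1.0 + dist)                      # step (a): closer to the center -> higher weight
        weighted = dynamic_feats * weights.unsqueeze(1)   # step (b): weighted coding features
        sup = supervision_feat.unsqueeze(0).expand(len(mask_positions), -1)
        return torch.cat([weighted, sup], dim=1)          # step (c): splice with the supervision feature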
FIG. 2 is a schematic diagram according to a second embodiment of the present disclosure; as shown in fig. 2, the present embodiment provides a modeling method of a picture encoder. In this embodiment, taking the case where the supervision signal is a text class label as an example, the method may specifically include the following steps:
S201, acquiring a visible region and a mask region of a training picture, and the position of the mask region in the training picture;
Specifically, the training picture may be subjected to mask processing according to a preset mask ratio, and divided into a visible region and a mask region.
In this embodiment, in order to improve the encoding effect of the picture encoder, the preset mask proportion may reach 50% or more, even 75%. The masking process of this embodiment may also be random, i.e., the locations of the masked pixel blocks may be random rather than fixed each time. However, across different training rounds, the preset mask proportion is fixed.
The positions of the mask areas in the training pictures comprise positions of all mask blocks in the mask areas in the training pictures.
S202, coding text class labels of training pictures by adopting a pre-trained text coder to obtain text coding features of the text class labels;
S203, acquiring the currently learned dynamic coding features of each mask block in the mask region;
S204, constructing coding features of each mask block in the mask region based on the currently learned dynamic coding features of each mask block in the mask region and the text coding features of the text class label;
in this embodiment, the text class label of the training picture is used as the supervision signal. Specifically, the first or second implementation manner in the second embodiment may be adopted to efficiently and accurately construct the coding features of each mask block in the mask area.
When, during the splicing of the first or second case, the dimension of the text coding feature of the text class label is not consistent with that of the currently learned dynamic coding feature of each mask block, the text coding feature of the text class label needs to be mapped to the same dimension as the dynamic coding feature.
Alternatively, in this embodiment, step S203 and step S204 may be omitted, and the text coding feature of the text category label may be directly used as the coding feature of each mask block.
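A sketch of step S202 and the dimension mapping above, assuming the open-source CLIP text branch from the HuggingFace transformers library as the pre-trained text encoder (this disclosure does not prescribe a particular one) and an assumed target dimension of 768:

    import torch
    import torch.nn as nn
    from transformers import CLIPTokenizer, CLIPTextModel

    tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
    text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")

    label = "golden retriever"                 # text class label of the training picture (illustrative)
    inputs = tokenizer(label, return_tensors="pt")
    with torch.no_grad():
        text_feature = text_encoder(**inputs).pooler_output.squeeze(0)   # (512,) text coding feature

    # if its dimension differs from that of the dynamic coding features, map it to match
    proj = nn.Linear(text_feature.shape[0], 768)
    text_feature_mapped = proj(text_feature)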
S205, encoding the visible region by adopting a picture encoder to obtain the encoding characteristics of the visible region;
S206, decoding by adopting a picture decoder based on the coding features of the visible region, the coding features of the mask region, and the position of the mask region, to obtain a restored picture;
as described in the above steps, the encoded features of the mask region include the encoded features of each mask block of the mask region, and the locations of the mask region include the locations of each mask block in the mask region. Correspondingly, the coding features of the visible region include coding features of each pixel block within the visible region.
Based on this, the coding features of each pixel block in the visible region and the coding features of each mask block in the mask region can be spliced according to the positions of the pixel blocks and mask blocks in the training picture, so as to obtain the global coding features of the training picture. The picture decoder is then used to decode the global coding features of the training picture and predict and output the restored picture.
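A sketch of this splicing-by-position step, assuming the mask-region features have already been mapped to the same dimension as the visible-region features (names are illustrative):

    import torch

    def assemble_global_features(visible_feats, mask_feats, visible_positions, mask_positions, num_patches):
        # place each block's coding feature at that block's position in the training picture
        global_feats = torch.zeros(num_patches, visible_feats.shape[1])
        global_feats[visible_positions] = visible_feats   # coding features of the visible region
        global_feats[mask_positions] = mask_feats         # constructed coding features of the mask region
        return global_feats   # the picture decoder decodes this to predict the restored picture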
S207, parameter adjustment is carried out on the picture encoder and the picture decoder based on the training pictures and the recovery pictures.
In particular, two ways may be included:
in a first mode, a first loss function is established based on a training picture and a recovery picture; and performing parameter adjustment on the picture encoder and the picture decoder by taking the convergence of the first loss function as an adjustment direction.
In this implementation, the first loss function is established based on the pixels of the training picture and the recovery picture, so the first loss function may be a loss function at the pixel level of the picture. Specifically, the first loss function is constructed based on pixel values for each position of the training picture and the recovery picture.
Preferably, the first loss function, which may also be referred to as an L2 loss, may be constructed as a mean square error (Mean Square Error, MSE).
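Written out, with x the training picture, x̂ the restored picture, and H, W, C its height, width, and number of channels, this pixel-level loss is the standard MSE:

    \mathcal{L}_{\text{pixel}} = \frac{1}{HWC} \sum_{h,w,c} \left( x_{h,w,c} - \hat{x}_{h,w,c} \right)^{2}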
In a second mode, a current picture encoder is adopted to respectively acquire a first coding feature of a training picture and a second coding feature of a recovery picture; establishing a second loss function based on the first encoding feature and the second encoding feature; and performing parameter adjustment on the picture encoder and the picture decoder by taking the convergence of the second loss function as an adjustment direction.
In this implementation, the second loss function is established based on the first coding feature of the training picture and the second coding feature of the restored picture, so it may be a loss function at the picture-feature level. Compared with the first mode, the second mode focuses more on the feature information, or semantic information, of the picture; and compared with the loss function at the picture pixel level, its calculation is more efficient, so the accuracy of the modeled picture encoder can be effectively improved. It is also closer to the way the encoder is used in downstream feature extraction tasks.
Similarly, the second loss function may also be constructed as an L2 loss.
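A sketch of this second mode (whether the training picture's features are detached as a fixed target is not specified by this description; detaching them is shown here as one common choice):

    import torch
    import torch.nn.functional as F

    def feature_level_loss(picture_encoder, training_picture, restored_picture):
        first_feature = picture_encoder(training_picture).detach()   # first coding feature (target)
        second_feature = picture_encoder(restored_picture)           # second coding feature
        return F.mse_loss(second_feature, first_feature)             # L2 loss at the picture-feature level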
In this embodiment, the text coding feature of the text class label of the training picture is added to the coding feature of each mask block of the mask region. When decoding and restoring, the decoder can use the text coding feature of the text class label as supervision to guide the restoration of the mask region, thereby implicitly realizing the alignment of picture features and text features, and further improving the generalization of the model by using multi-modal technology.
It should be noted that, in this embodiment, according to steps S203 and S204, the coding features of each mask block in the mask region further include the currently learned dynamic coding features of each mask block, which are feature vectors learnable during the modeling process of the picture encoder. Therefore, in step S207, while adjusting the parameters of the picture encoder and the picture decoder, the dynamic coding features of each mask block also need to be adjusted so that the loss function converges.
Of course, alternatively, if the step S203 and the step S204 are removed when the coding feature of each mask block is constructed, the text coding feature of the text class label is directly used as the coding feature of each mask block. At this time, step S207 does not need to adjust the dynamic coding characteristics of each mask block while performing parameter adjustment on the picture encoder and the picture decoder.
Based on the above steps of the present embodiment, a training schematic diagram of the picture encoder of the present embodiment provided in fig. 3 can be obtained. As shown in fig. 3, taking the example of establishing a loss function at the picture pixel level, the technical solution of the present disclosure is described.
Further, another training schematic of the picture encoder of the present embodiment provided in fig. 4 may also be obtained. As shown in fig. 4, taking the example of establishing a loss function of a picture feature level, a technical solution of the present disclosure is described.
As shown in fig. 3 and fig. 4, the text type label of the training picture is used as a supervision signal to guide the decoder to decode more accurately and effectively, so that parameters of the picture encoder and the picture decoder can be supervised effectively to adjust more accurately, and modeling accuracy of the picture encoder is improved.
As shown in fig. 3 and 4, during training, this embodiment adjusts the picture encoder, the picture decoder, and even the dynamic coding features of each mask block so that the loss function converges. However, after training, only the picture encoder is deployed online. Because supervision by the supervisory signal is added, the trained picture encoder is better at discriminating text class labels after encoding pictures, which can improve the accuracy of downstream tasks, such as picture classification, performed based on the picture encoder.
In this embodiment, taking any training picture as an example, the training process of the picture encoder and the picture decoder is described. In practical application, a large number of training pictures can be adopted according to the steps, and the picture encoder and the picture decoder are trained according to the steps of the embodiment until the loss function converges, so that parameters of the picture encoder are determined, and further modeling of the picture encoder is completed.
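Putting the pieces together, a minimal end-to-end training sketch might look as follows; the linear encoder/decoder stand-ins, dimensions, learning rate, and random placeholder pictures are all illustrative assumptions, not the architecture of this disclosure:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    num_patches, patch_dim, enc_dim, mask_ratio = 196, 768, 512, 0.75
    num_masked = int(num_patches * mask_ratio)

    picture_encoder = nn.Linear(patch_dim, enc_dim)    # stand-in for the picture encoder
    picture_decoder = nn.Linear(enc_dim, patch_dim)    # stand-in for the picture decoder
    dynamic_feats = nn.Parameter(0.02 * torch.randn(num_masked, enc_dim))
    supervision_feat = torch.randn(enc_dim)            # e.g. text coding feature of the class label
    splice_proj = nn.Linear(2 * enc_dim, enc_dim)      # maps spliced mask features back to enc_dim

    optimizer = torch.optim.AdamW(
        [dynamic_feats] + list(picture_encoder.parameters())
        + list(picture_decoder.parameters()) + list(splice_proj.parameters()), lr=1.5e-4)

    for step in range(100):                            # in practice: a large set of training pictures
        picture = torch.randn(num_patches, patch_dim)  # placeholder training picture
        perm = torch.randperm(num_patches)
        mask_pos, vis_pos = perm[:num_masked], perm[num_masked:]
        vis_feats = picture_encoder(picture[vis_pos])                     # S205: encode the visible region
        sup = supervision_feat.expand(num_masked, -1)
        mask_feats = splice_proj(torch.cat([dynamic_feats, sup], dim=1))  # S203-S204: mask-region features
        global_feats = torch.zeros(num_patches, enc_dim)
        global_feats = global_feats.index_copy(0, vis_pos, vis_feats)
        global_feats = global_feats.index_copy(0, mask_pos, mask_feats)
        restored = picture_decoder(global_feats)                          # S206: restored picture
        loss = F.mse_loss(restored, picture)                              # S207, first mode: pixel-level L2
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()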
According to the modeling method of the picture encoder of this embodiment, the text class label of the training picture is introduced as the supervision signal, and the coding features of the mask region are constructed based on the coding features of the text class label. This provides effective supervision for the decoder when restoring the picture, can implicitly realize the alignment of picture features and text features, improves the training efficiency of the picture encoder and the picture decoder, can effectively improve the accuracy of the modeled picture encoder, and provides effective support for downstream tasks realized based on the picture encoder, such as picture classification.
FIG. 5 is a schematic diagram according to a third embodiment of the present disclosure; as shown in fig. 5, the present embodiment provides a modeling method of a picture encoder. In this embodiment, taking the case where the supervision signal is the picture features of the training picture as an example, the method may specifically include the following steps:
S501, acquiring a visible region and a mask region of a training picture, and the position of the mask region in the training picture;
reference is made to the description of step S201, and the description thereof is omitted here.
S502, extracting picture features of a training picture by adopting a pre-trained picture feature extraction model;
for example, the image feature extraction model may be any pre-trained model capable of extracting image features.
Preferably, in this embodiment, the pre-trained picture feature extraction model may adopt the feature extraction unit of the picture branch in a pre-trained image-text multi-modal feature extraction model.
For example, an open-source, high-performance Contrastive Language-Image Pre-training (CLIP) model may be selected as the image-text multi-modal feature extraction model, and the feature extraction unit of its picture branch used to extract the picture features of the training picture in this step, which can effectively ensure the accuracy of the extracted picture features.
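A sketch of this picture-feature extraction, assuming the open-source CLIP checkpoint via the HuggingFace transformers library (the checkpoint name and file path are illustrative):

    import torch
    from PIL import Image
    from transformers import CLIPProcessor, CLIPModel

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    image = Image.open("training_picture.jpg")         # a training picture (path assumed)
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        # the picture branch's feature extraction yields the supervision signal
        picture_feature = model.get_image_features(**inputs).squeeze(0)   # e.g. (512,)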
S503, acquiring the currently learned dynamic coding characteristics of each mask block in the mask region;
s504, constructing coding features of each mask block in the mask region based on the currently learned dynamic coding features of each mask block in the mask region and picture features of the training pictures;
in this embodiment, the picture feature of the training picture is used as the supervision signal. Specifically, the first or second implementation manner in the second embodiment may be adopted to efficiently and accurately construct the coding features of each mask block in the mask area.
Alternatively, in this embodiment, step S503 and step S504 may be omitted, and the picture features of the training picture may be directly used as the coding features of each mask block.
S505, coding the visible region by adopting a picture coder to obtain coding characteristics of the visible region;
S506, decoding by adopting a picture decoder based on the coding features of the visible region, the coding features of the mask region, and the position of the mask region, to obtain a restored picture;
s507, parameter adjustment is carried out on the picture encoder and the picture decoder based on the training pictures and the recovery pictures.
In specific implementation, either of the two ways of step S207 in the embodiment shown in fig. 2 may be adopted, which is not repeated here.
In this embodiment, the picture features of the training picture are added to the coding features of each mask block in the mask region, and the decoder can use the picture features of the training picture as supervision to guide the recovery of the mask region when decoding and recovering, so as to effectively guide the parameters of the picture encoder and the picture decoder to effectively adjust, and improve the training efficiency.
It should be noted that, in this embodiment, according to step S503 and step S504, it can be known that the coding features of each mask block in the mask region further include the currently learned dynamic coding features of each mask block, and the currently learned dynamic coding features of each mask block are feature vectors that can be learned in the modeling process of the picture encoder. Therefore, in step S507, the parameters of the picture encoder and the picture decoder are adjusted, and the dynamic coding characteristics of each mask block need to be adjusted so that the loss function converges.
Of course, alternatively, if the step S503 and the step S504 are removed when the coding feature of each mask block is constructed, the picture feature of the training picture is directly used as the coding feature of each mask block. At this time, in step S507, the parameters of the picture encoder and the picture decoder are adjusted, and the dynamic coding features of each mask block do not need to be adjusted.
Based on the above steps of the present embodiment, a further training schematic diagram of the picture encoder of the present embodiment provided in fig. 6 can be obtained. As shown in fig. 6, taking the example of establishing a loss function at the picture pixel level, the technical solution of the present disclosure is described.
Further, a further training schematic of the picture encoder of the present embodiment provided in fig. 7 may be obtained. As shown in fig. 7, taking the example of establishing a loss function of a picture feature level, a technical solution of the present disclosure is described.
As shown in fig. 6 and fig. 7, the picture features of the training pictures are used as supervisory signals to guide the decoder to decode more accurately and effectively, so that parameters of the picture encoder and the picture decoder can be effectively supervised to adjust more accurately, and modeling accuracy of the picture encoder is improved.
As shown in fig. 6 and 7, during training, this embodiment adjusts the picture encoder, the picture decoder, and even the dynamic coding features of each mask block so that the loss function converges. However, after training, only the picture encoder is deployed online. Because supervision by the supervisory signal is added, the trained picture encoder is better at discriminating the text class labels of pictures, which can improve the accuracy of downstream tasks, such as picture classification, performed based on the picture encoder.
In this embodiment, taking any training picture as an example, the training process of the picture encoder and the picture decoder is described. In practical application, a large number of training pictures can be adopted according to the steps, and the picture encoder and the picture decoder are trained according to the steps of the embodiment until the loss function converges, so that parameters of the picture encoder are determined, and further modeling of the picture encoder is completed.
According to the modeling method of the picture encoder of this embodiment, the picture features of the training picture are introduced as the supervision signal, and the coding features of the mask region are constructed based on the picture features of the training picture. This provides effective supervision for the decoder when restoring the picture, can effectively guide the parameters of the picture encoder and the picture decoder to be adjusted effectively, and can thus effectively improve both the accuracy of the modeled picture encoder and the modeling efficiency. The picture encoder modeled by this embodiment can better understand pictures and obtain better generalization in downstream tasks.
FIG. 8 is a schematic diagram according to a fourth embodiment of the present disclosure; as shown in fig. 8, the present embodiment provides a modeling apparatus 800 of a picture encoder, including:
An obtaining module 801, configured to obtain a visible region and a mask region of a training picture, and a position of the mask region in the training picture;
a construction module 802, configured to construct the coding feature of the mask region based on the supervisory signal;
a modeling module 803 is configured to model the picture encoder based on the visible region of the training picture, the coding feature of the mask region, and the position of the mask region.
The modeling apparatus 800 of the picture encoder of this embodiment uses the above modules to realize the modeling of the picture encoder; its implementation principle and technical effect are the same as those of the related method embodiments above, whose description may be referred to for details, which are not repeated here.
FIG. 9 is a schematic diagram according to a fifth embodiment of the present disclosure; as shown in fig. 9, the present embodiment provides a modeling apparatus 900 of a picture encoder, which includes modules with the same names and functions as those shown in fig. 8: an acquisition module 901, a construction module 902, and a modeling module 903.
As shown in fig. 9, in an embodiment of the present disclosure, a construction module 902 includes:
an obtaining unit 9021, configured to obtain a coding feature corresponding to the supervisory signal;
A construction unit 9022 is configured to construct the coding feature of the mask region based on the coding feature of the supervisory signal.
Optionally, in an embodiment of the present disclosure, the building unit 9022 is configured to:
and taking the coding characteristic of the supervision signal as the coding characteristic of each mask block of the mask region.
Optionally, in an embodiment of the present disclosure, the building unit 9022 is configured to:
acquiring the currently learned dynamic coding characteristics of each mask block in the mask region;
and constructing the coding features of the mask region based on the currently learned dynamic coding features of each mask block in the mask region and the coding features of the supervisory signals.
Optionally, in an embodiment of the present disclosure, the building unit 9022 is configured to:
and splicing the currently learned dynamic coding features of each mask block in the mask region with the coding features of the supervision signals to serve as the coding features of each mask block in the mask region.
Optionally, in an embodiment of the present disclosure, the building unit 9022 is configured to:
acquiring weights of all mask blocks in the mask region based on the positions of all mask blocks in the mask region in the training picture;
Acquiring weighted coding features of each mask block in the mask region based on the currently learned dynamic coding features and the corresponding weights of each mask block in the mask region;
and splicing the weighted coding features of each mask block in the mask region with the coding features of the supervision signals to serve as the coding features of each mask block in the mask region.
Optionally, in an embodiment of the present disclosure, the obtaining unit 9021 is configured to:
and encoding the text category labels of the training pictures by adopting a pre-trained text encoder to obtain the text encoding characteristics of the text category labels.
Optionally, in an embodiment of the present disclosure, the obtaining unit 9021 is configured to:
and extracting the picture characteristics of the training picture by adopting a pre-trained picture characteristic extraction model.
Optionally, in an embodiment of the present disclosure, the pre-trained image feature extraction model employs a feature extraction unit of an image branch in the pre-trained image-text multi-modal feature extraction model.
Optionally, as shown in fig. 9, in one embodiment of the present disclosure, the modeling module 903 includes:
an encoding unit 9031, configured to encode the visible region with the picture encoder, to obtain an encoding feature of the visible region;
A decoding unit 9032, configured to decode with the picture decoder based on the coding feature of the visible region, the coding feature of the mask region, and the position of the mask region, to obtain a restored picture;
an adjusting unit 9033, configured to perform parameter adjustment on the picture encoder and the picture decoder based on the training picture and the recovery picture.
Optionally, in an embodiment of the present disclosure, the adjusting unit 9033 is configured to:
establishing a first loss function based on the training picture and the recovery picture;
and performing parameter adjustment on the picture encoder and the picture decoder by taking the convergence of the first loss function as an adjustment direction.
Optionally, in an embodiment of the present disclosure, the adjusting unit 9033 is configured to:
respectively acquiring a first coding characteristic of the training picture and a second coding characteristic of the recovery picture by adopting the current picture encoder;
establishing a second loss function based on the first encoding feature and the second encoding feature;
and performing parameter adjustment on the picture encoder and the picture decoder by taking the convergence of the second loss function as an adjustment direction.
Optionally, in an embodiment of the present disclosure, the adjusting unit 9033 is further configured to:
and adjusting the dynamic coding characteristics of each mask block in the mask region while adjusting parameters of the picture encoder and the picture decoder based on the training picture and the recovery picture.
The modeling apparatus 900 of the picture encoder of this embodiment uses the above modules to realize the modeling of the picture encoder; its implementation principle and technical effect are the same as those of the related method embodiments above, whose description may be referred to for details, which are not repeated here.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
Fig. 10 shows a schematic block diagram of an example electronic device 1000 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 10, the apparatus 1000 includes a computing unit 1001 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 1002 or a computer program loaded from a storage unit 1008 into a Random Access Memory (RAM) 1003. In the RAM 1003, various programs and data required for the operation of the device 1000 can also be stored. The computing unit 1001, the ROM 1002, and the RAM 1003 are connected to each other by a bus 1004. An input/output (I/O) interface 1005 is also connected to bus 1004.
Various components in device 1000 are connected to I/O interface 1005, including: an input unit 1006 such as a keyboard, a mouse, and the like; an output unit 1007 such as various types of displays, speakers, and the like; a storage unit 1008 such as a magnetic disk, an optical disk, or the like; and communication unit 1009 such as a network card, modem, wireless communication transceiver, etc. Communication unit 1009 allows device 1000 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunications networks.
The computing unit 1001 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 1001 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 1001 performs the various methods and processes described above, such as the above-described methods of the present disclosure. For example, in some embodiments, the above-described methods of the present disclosure may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 1008. In some embodiments, part or all of the computer program may be loaded and/or installed onto device 1000 via ROM 1002 and/or communication unit 1009. When the computer program is loaded into RAM 1003 and executed by computing unit 1001, one or more steps of the above-described methods of the present disclosure described above may be performed. Alternatively, in other embodiments, the computing unit 1001 may be configured to perform the above-described methods of the present disclosure in any other suitable manner (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems On Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor, and that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server incorporating a blockchain.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel or sequentially or in a different order, provided that the desired results of the technical solutions of the present disclosure are achieved, and are not limited herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (25)

1. A modeling method of a picture encoder, comprising:
obtaining a visible region and a mask region of a training picture, and the position of the mask region in the training picture;
constructing coding features of the mask region based on the supervisory signals;
modeling the picture encoder based on the visible region of the training picture, the encoding features of the mask region, and the position of the mask region;
Wherein constructing the encoded features of the mask region based on the supervisory signals comprises:
acquiring coding features corresponding to the supervision signals;
constructing the coding features of the mask region based on the coding features of the supervisory signals;
wherein modeling the picture encoder based on the visible region of the training picture, the coding features of the mask region, and the position of the mask region, comprises:
the picture encoder is adopted to encode the visible region, so that the encoding characteristics of the visible region are obtained;
decoding by adopting a picture decoder based on the coding features of the visible region, the coding features of the mask region and the position of the mask region to obtain a restored picture;
and carrying out parameter adjustment on the picture encoder and the picture decoder based on the training picture and the recovery picture.
2. The method of claim 1, wherein constructing the encoded features of the mask region based on the encoded features of the supervisory signals comprises:
and taking the coding characteristic of the supervision signal as the coding characteristic of each mask block of the mask region.
3. The method of claim 1, wherein constructing the encoded features of the mask region based on the encoded features of the supervisory signals comprises:
Acquiring the currently learned dynamic coding characteristics of each mask block in the mask region;
and constructing the coding features of the mask region based on the currently learned dynamic coding features of each mask block in the mask region and the coding features of the supervisory signals.
4. A method according to claim 3, wherein constructing the encoded features of the mask region based on the currently learned dynamic encoded features of each mask block in the mask region and the encoded features of the supervisory signal comprises:
and splicing the currently learned dynamic coding features of each mask block in the mask region with the coding features of the supervision signals to serve as the coding features of each mask block in the mask region.
5. The method according to claim 3, wherein constructing the coding features of the mask region based on the currently learned dynamic coding features of each mask block in the mask region and the coding features of the supervisory signals comprises:
acquiring weights of all mask blocks in the mask region based on the positions of all mask blocks in the mask region in the training picture;
acquiring weighted coding features of each mask block in the mask region based on the currently learned dynamic coding features and the corresponding weights of each mask block in the mask region; and
concatenating the weighted coding features of each mask block in the mask region with the coding features of the supervisory signals to serve as the coding features of each mask block in the mask region.
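Claims 3-5 instead combine a learnable per-block dynamic feature with the supervisory feature; claim 5 additionally weights each block according to its position before the concatenation. The claims do not say how a weight is derived from a position, so the sketch below assumes a small learned map over normalized (x, y) coordinates, and reads "splicing" as channel-wise concatenation.

```python
# Sketch of the claim-5 mask-feature construction; the position-to-weight
# mapping and all dimensions are assumptions, not fixed by the claims.
import torch
import torch.nn as nn

class MaskFeatureBuilder(nn.Module):
    def __init__(self):
        super().__init__()
        # Assumed: weights come from a learned map over block coordinates.
        self.pos_to_weight = nn.Sequential(nn.Linear(2, 1), nn.Sigmoid())

    def forward(self, dyn_feats, positions, sup_feat):
        # dyn_feats: (N_mask, D_dyn) currently learned dynamic features
        # positions: (N_mask, 2)     normalized block positions in the picture
        # sup_feat:  (D_sup,)        coding feature of the supervisory signal
        weights = self.pos_to_weight(positions)        # (N_mask, 1)
        weighted = dyn_feats * weights                 # weighted dynamic features
        sup = sup_feat.expand(dyn_feats.size(0), -1)   # one signal for all blocks
        # Concatenate weighted dynamic features with the supervisory feature.
        return torch.cat([weighted, sup], dim=-1)      # (N_mask, D_dyn + D_sup)
```

The output of `forward` plays the role of the mask-region coding features handed to the picture decoder in claim 1.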
6. The method according to any one of claims 1-5, wherein acquiring the coding features corresponding to the supervisory signals comprises:
encoding text category labels of the training picture with a pre-trained text encoder to obtain text coding features of the text category labels.
7. The method according to any one of claims 1-5, wherein acquiring the coding features corresponding to the supervisory signals comprises:
extracting picture features of the training picture with a pre-trained picture feature extraction model.
8. The method according to claim 7, wherein the pre-trained picture feature extraction model employs a feature extraction unit of a picture branch in a pre-trained image-text multimodal feature extraction model.
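Claims 6-8 allow two sources for the supervisory coding features: a pre-trained text encoder applied to the picture's text category label, or the picture branch of a pre-trained image-text multimodal model. No concrete model is named in the claims; the sketch below uses OpenAI's CLIP (the open-source `clip` package) purely as an assumed stand-in for both options.

```python
# Two assumed ways to obtain supervisory coding features (claims 6-8),
# using CLIP as a stand-in for the unnamed pre-trained models.
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

with torch.no_grad():
    # Claim 6: encode the text category label of the training picture.
    tokens = clip.tokenize(["a photo of a cat"]).to(device)   # assumed label
    text_sup_feat = model.encode_text(tokens)                 # (1, 512)

    # Claims 7-8: extract picture features with the picture branch of an
    # image-text multimodal model.
    image = preprocess(Image.open("train.jpg")).unsqueeze(0).to(device)
    image_sup_feat = model.encode_image(image)                # (1, 512)
```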
9. The method of claim 1, wherein adjusting parameters of the picture encoder and the picture decoder based on the training picture and the restored picture comprises:
establishing a first loss function based on the training picture and the restored picture; and
adjusting parameters of the picture encoder and the picture decoder by taking convergence of the first loss function as the adjustment direction.
10. The method of claim 1, wherein adjusting parameters of the picture encoder and the picture decoder based on the training picture and the restored picture comprises:
acquiring, with the current picture encoder, a first coding feature of the training picture and a second coding feature of the restored picture, respectively;
establishing a second loss function based on the first coding feature and the second coding feature; and
adjusting parameters of the picture encoder and the picture decoder by taking convergence of the second loss function as the adjustment direction.
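Claims 9 and 10 set out two complementary objectives: a first loss directly between the training picture and the restored picture, and a second loss between their coding features as produced by the current picture encoder. The claims do not fix the loss form; mean-squared error is assumed here.

```python
# Sketch of the two loss functions of claims 9-10 (MSE is an assumed choice).
import torch.nn.functional as F

def first_loss(training_picture, restored_picture):
    # Claim 9: loss built directly on the two pictures.
    return F.mse_loss(restored_picture, training_picture)

def second_loss(encoder, training_patches, restored_patches):
    # Claim 10: re-encode both pictures with the *current* picture encoder
    # and penalize the distance between the two coding features.
    first_feat = encoder(training_patches)
    second_feat = encoder(restored_patches)
    return F.mse_loss(second_feat, first_feat)
```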
11. The method of any one of claims 1, 9-10, further comprising:
adjusting the dynamic coding features of each mask block in the mask region while adjusting parameters of the picture encoder and the picture decoder based on the training picture and the restored picture.
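Claim 11 has the dynamic coding features optimized jointly with the encoder and decoder. In PyTorch terms this only requires registering them as a learnable parameter with the same optimizer; the module stand-ins, sizes, and optimizer settings below are assumptions.

```python
# Claim 11 sketch: dynamic coding features updated by the same optimizer
# that adjusts the picture encoder and decoder.
import torch
import torch.nn as nn

encoder, decoder = nn.Identity(), nn.Identity()          # stand-in modules
num_mask_blocks, dyn_dim = 49, 256                       # assumed sizes
dyn_feats = nn.Parameter(torch.zeros(num_mask_blocks, dyn_dim))

optimizer = torch.optim.AdamW(
    list(encoder.parameters()) + list(decoder.parameters()) + [dyn_feats],
    lr=1.5e-4,                                           # assumed learning rate
)
```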
12. A modeling apparatus of a picture encoder, comprising:
an acquisition module configured to obtain a visible region and a mask region of a training picture, and the position of the mask region in the training picture;
a construction module configured to construct coding features of the mask region based on supervisory signals; and
a modeling module configured to model the picture encoder based on the visible region of the training picture, the coding features of the mask region, and the position of the mask region;
wherein the construction module comprises:
an acquisition unit configured to acquire coding features corresponding to the supervisory signals; and
a construction unit configured to construct the coding features of the mask region based on the coding features of the supervisory signals;
wherein the modeling module comprises:
a coding unit configured to encode the visible region with the picture encoder to obtain coding features of the visible region;
a decoding unit configured to decode, with a picture decoder, based on the coding features of the visible region, the coding features of the mask region, and the position of the mask region, to obtain a restored picture; and
an adjusting unit configured to adjust parameters of the picture encoder and the picture decoder based on the training picture and the restored picture.
13. The apparatus of claim 12, wherein the construction unit is configured to:
take the coding features of the supervisory signals as the coding features of each mask block of the mask region.
14. The apparatus of claim 12, wherein the construction unit is configured to:
acquire currently learned dynamic coding features of each mask block in the mask region; and
construct the coding features of the mask region based on the currently learned dynamic coding features of each mask block in the mask region and the coding features of the supervisory signals.
15. The apparatus of claim 14, wherein the construction unit is configured to:
concatenate the currently learned dynamic coding features of each mask block in the mask region with the coding features of the supervisory signals to serve as the coding features of each mask block in the mask region.
16. The apparatus of claim 14, wherein the construction unit is configured to:
acquire weights of all mask blocks in the mask region based on the positions of all mask blocks in the mask region in the training picture;
acquire weighted coding features of each mask block in the mask region based on the currently learned dynamic coding features and the corresponding weights of each mask block in the mask region; and
concatenate the weighted coding features of each mask block in the mask region with the coding features of the supervisory signals to serve as the coding features of each mask block in the mask region.
17. The apparatus according to any one of claims 12-16, wherein the acquisition unit is configured to:
encode text category labels of the training picture with a pre-trained text encoder to obtain text coding features of the text category labels.
18. The apparatus according to any one of claims 12-16, wherein the acquisition unit is configured to:
extract picture features of the training picture with a pre-trained picture feature extraction model.
19. The apparatus of claim 18, wherein the pre-trained picture feature extraction model employs a feature extraction unit of a picture branch in a pre-trained image-text multimodal feature extraction model.
20. The apparatus of claim 12, wherein the adjusting unit is configured to:
establish a first loss function based on the training picture and the restored picture; and
adjust parameters of the picture encoder and the picture decoder by taking convergence of the first loss function as the adjustment direction.
21. The apparatus of claim 12, wherein the adjusting unit is configured to:
acquire, with the current picture encoder, a first coding feature of the training picture and a second coding feature of the restored picture, respectively;
establish a second loss function based on the first coding feature and the second coding feature; and
adjust parameters of the picture encoder and the picture decoder by taking convergence of the second loss function as the adjustment direction.
22. The apparatus of any one of claims 12, 20-21, wherein the adjusting unit is further configured to:
adjust the dynamic coding features of each mask block in the mask region while adjusting parameters of the picture encoder and the picture decoder based on the training picture and the restored picture.
23. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-11.
24. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-11.
25. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any of claims 1-11.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310753826.4A CN116778006B (en) 2023-06-25 2023-06-25 Modeling method and device for picture encoder, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN116778006A (en) 2023-09-19
CN116778006B (en) 2024-04-02

Family

ID=88012980

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021139294A1 (en) * 2020-01-07 2021-07-15 腾讯科技(深圳)有限公司 Method and apparatus for training speech separation model, storage medium, and computer device
CN114677536A (en) * 2022-03-02 2022-06-28 北京医准智能科技有限公司 Pre-training method and device based on Transformer structure
CN114926338A (en) * 2022-05-25 2022-08-19 上海商汤智能科技有限公司 Model training method and device, electronic equipment and storage medium
CN115035093A (en) * 2022-07-01 2022-09-09 深圳市大数据研究院 Brain tumor self-supervision pre-training method and device based on attention symmetric self-coding
CN115690793A (en) * 2023-01-03 2023-02-03 北京百度网讯科技有限公司 Character recognition model, and recognition method, device, equipment and medium thereof
CN116152568A (en) * 2023-03-06 2023-05-23 中国人民解放军总医院第一医学中心 Pre-training and fine-tuning method of gastric cancer classification model based on contrast learning
CN116206175A (en) * 2023-02-07 2023-06-02 浙江网商银行股份有限公司 Pre-training method, determining method, device and product of scene analysis model

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20210043995A (en) * 2019-10-14 2021-04-22 삼성전자주식회사 Model training method and apparatus, and sequence recognition method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant