CN116012662A - Feature encoding and decoding method, and method, device and medium for training encoder and decoder - Google Patents

Feature encoding and decoding method, and method, device and medium for training encoder and decoder

Info

Publication number
CN116012662A
Authority
CN
China
Prior art keywords
feature
features
reconstruction
loss
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211463056.1A
Other languages
Chinese (zh)
Inventor
施晓迪
林聚财
江东
粘春湄
殷俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Dahua Technology Co Ltd
Original Assignee
Zhejiang Dahua Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Dahua Technology Co Ltd filed Critical Zhejiang Dahua Technology Co Ltd
Priority to CN202211463056.1A priority Critical patent/CN116012662A/en
Publication of CN116012662A publication Critical patent/CN116012662A/en
Pending legal-status Critical Current

Landscapes

  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The application discloses a feature encoding and decoding method, a method for training an encoder and a decoder, a device, and a medium. The method includes: extracting features of a target image with a feature extraction module to obtain original features; and compressing the original features with an encoder to obtain compressed features of the target image, where the compressed features are provided to a task end so that the task end can reconstruct features with a decoder and perform a back-end visual task based on the reconstructed features. The network parameters in the encoder are obtained based on a reconstruction loss and at least one association loss, where the association loss reflects the accuracy of the task result produced by the back-end visual task. Adjusting the encoder with the reconstruction loss and at least one association loss makes it better suited to encoding accurately for the back-end visual task, thereby improving the accuracy of image compression.

Description

Feature encoding and decoding method, and method, device and medium for training encoder and decoder
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a feature encoding and decoding method, a method for training an encoder and a decoder, a device, and a medium.
Background
In general, when image data is transmitted, it is usually compression-encoded before transmission, or features are first extracted from the image and then compression-encoded for transmission. The recipient decodes and reconstructs the received data to obtain the image data or its features, which can be used to perform machine vision tasks such as image recognition and image classification.
The applicant has found, in the course of long-term development, that the reconstructed features obtained by existing compression and reconstruction of image data are not accurate enough, so machine vision tasks performed with the reconstructed data are also not accurate enough; for example, recognition or classification cannot be carried out accurately.
Disclosure of Invention
The technical problem mainly solved by the present application is to provide a feature encoding and decoding method, a method for training an encoder and a decoder, a device, and a medium that can improve the accuracy of image compression.
In order to solve the above technical problem, one technical solution adopted by the present application is to provide a feature encoding and decoding method, the method comprising: extracting features of a target image with a feature extraction module to obtain original features; and compressing the original features with an encoder to obtain compressed features of the target image, where the compressed features are provided to a task end so that the task end can reconstruct features with a decoder and perform a back-end visual task based on the reconstructed features. The network parameters in the encoder are obtained by adjustment based on a reconstruction loss and at least one association loss during training, the reconstruction loss represents the loss produced after the features of a sample image are compressed by the encoder and reconstructed by the decoder, and the association loss reflects the accuracy of the task result obtained by performing the back-end visual task with the features of the sample image reconstructed by the decoder.
Wherein, before compressing the original feature with the encoder to obtain the compressed feature of the target image, the method further comprises: the method comprises the steps that a feature extraction module, an encoder, a decoder and a visual task module are sequentially connected to form a multi-task network system, and the visual task module is used for executing a rear-end visual task; inputting the sample image into a multi-task network system for image processing, and acquiring reconstruction loss and at least one association loss corresponding to a back-end visual task by utilizing image processing data generated after the sample image passes through a decoder in the multi-task network system; network parameters in the encoder and decoder are adjusted based on the reconstruction loss and at least one associated loss.
The method for obtaining the reconstruction loss and at least one association loss corresponding to the back-end visual task by inputting the sample image into the multi-task network system for image processing and utilizing the image processing data generated after the decoder in the multi-task network system comprises the following steps: extracting features of the sample image by using a feature extraction module to obtain original sample features; compressing the original sample characteristics by using an encoder to obtain compressed sample characteristics; reconstructing the compressed sample features by using a decoder to obtain reconstructed sample features; obtaining reconstruction loss based on the original sample characteristics and first reconstruction characteristics, wherein the first reconstruction characteristics are reconstructed sample characteristics or are obtained by carrying out preset treatment on the reconstructed sample characteristics; and acquiring at least one of a first association loss and a second association loss corresponding to the back-end visual task, wherein the first association loss is obtained based on a first reconstruction feature and a second reconstruction feature of a reference image related to the sample image, the first reconstruction feature and the second reconstruction feature are features obtained by respectively carrying out the same processing on the sample image and the reference image by using a multi-task network system, and the second association loss is obtained based on a visual processing result output by a visual task module.
Wherein obtaining at least one of a first association loss and a second association loss corresponding to the back-end visual task comprises: in response to the visual task module not being a learnable network, obtaining a first association loss; in response to the visual task module being a learnable network, obtaining a first association loss and/or a second association loss; the step of obtaining the first association loss comprises the following steps: obtaining a first association loss by using the similarity between the first reconstruction feature and the second reconstruction feature of each reference image, wherein the reference image comprises a positive sample and a negative sample of the sample image; the step of obtaining the second association loss includes: and obtaining a second association loss by utilizing a visual processing result output by the visual task module and an actual visual result marked by the sample image, wherein the visual processing result is obtained by the visual task module performing a rear-end visual task based on the first reconstruction feature.
Wherein, before compressing the original sample feature with the encoder to obtain a compressed sample feature, the method further comprises: taking the original sample characteristic as a first characteristic to be processed; preprocessing the first feature to be processed to obtain a preprocessing feature, wherein the preprocessing is used for adjusting data distribution in the first feature to be processed, and the preprocessing feature is used for inputting an encoder for compression.
Wherein, before obtaining the reconstruction loss based on the original sample features and the first reconstruction feature, the method further comprises: taking the reconstructed sample features as a second feature to be processed; and performing inverse preprocessing, corresponding to the preprocessing, on the second feature to be processed to obtain an inverse-preprocessed feature, where the inverse-preprocessed feature is the first reconstruction feature.
Wherein, before compressing the original feature with the encoder to obtain the compressed feature of the target image, the method further comprises: taking the original characteristic as a first characteristic to be processed; preprocessing the first feature to be processed to obtain a preprocessing feature, wherein the preprocessing is used for adjusting data distribution in the first feature to be processed, and the preprocessing feature is used for inputting an encoder for compression.
Preprocessing the first feature to be processed to obtain the preprocessed feature includes: performing data normalization on the first feature to be processed to obtain the preprocessed feature; or learning a scaling factor and a shifting factor with a preset network, and performing spatial feature transformation on the first feature to be processed based on the scaling factor and the shifting factor to obtain the preprocessed feature.
Compressing the original features with the encoder to obtain the compressed features of the target image includes: performing dimension-reduction transformation on the original features with a dimension transformation network to obtain dimension-reduction features; and performing quantization processing on the dimension-reduction features to obtain the compressed features.
Performing dimension-reduction transformation on the original features with the dimension transformation network to obtain the dimension-reduction features includes: obtaining a feature to be input from the original features; performing the following first iteration process at least once to obtain a target attention feature: performing self-attention processing on a first input feature with a first self-attention network of the dimension transformation network, where the first input feature of the first round of the first iteration process is the feature to be input, and the first input feature of each subsequent round is the self-attention processing result of the previous round; and performing the following second iteration process at least once to obtain the dimension-reduction features: performing dimension-reduction processing on a second input feature to obtain a dimension-reduction-processed feature, and performing self-attention processing on the dimension-reduction-processed feature with a second self-attention network of the dimension transformation network, where the second input feature of the first round of the second iteration process is the target attention feature, and the second input feature of each subsequent round is the self-attention processing result of the previous round.
Obtaining the feature to be input from the original features includes: downsampling the original features to obtain downsampled features; partitioning the downsampled features into blocks to obtain a plurality of feature blocks; and linearly embedding the plurality of feature blocks to obtain the feature to be input. Performing dimension-reduction processing on the second input feature to obtain the dimension-reduction-processed feature includes: dividing the plurality of feature blocks in the second input feature into a plurality of parts; and forming each part into a new feature block, the plurality of new feature blocks forming the dimension-reduction-processed feature.
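As a non-limiting illustration of the dimension-reduction processing just described, the following Python sketch merges neighbouring feature blocks into new feature blocks; the 2×2 grouping, the array layout and the function name are assumptions made for illustration only:

```python
import numpy as np

def merge_patches(tokens, h, w, group=2):
    """Illustrative dimension-reduction step: split the feature blocks into groups of
    neighbouring blocks and concatenate each group into one new feature block.
    Assumes tokens are laid out as an h x w grid and h, w are divisible by `group`."""
    c = tokens.shape[-1]
    grid = tokens.reshape(h, w, c)
    # gather each group x group neighbourhood of feature blocks
    merged = grid.reshape(h // group, group, w // group, group, c)
    merged = merged.transpose(0, 2, 1, 3, 4)
    # each neighbourhood becomes one new feature block with group*group*c channels
    return merged.reshape(-1, group * group * c)

# example: an 8 x 8 grid of feature blocks with 96 channels -> 16 new blocks with 384 channels
tokens = np.random.randn(64, 96)
print(merge_patches(tokens, 8, 8).shape)  # (16, 384)
```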
Performing quantization processing on the dimension-reduction features to obtain the compressed features includes: determining a plurality of feature-value intervals based on the distribution of feature values in the dimension-reduction features; dividing the feature values in the dimension-reduction features into a plurality of feature groups using the feature-value intervals; and quantizing each feature group with different quantization parameters to obtain the compressed features.
Determining the plurality of feature-value intervals based on the distribution of feature values in the dimension-reduction features includes: computing statistics over the feature values in the dimension-reduction features to obtain a central-tendency value of the features; multiplying the central-tendency value by different ratios to obtain a plurality of interval boundary values; and obtaining the plurality of feature-value intervals from the interval boundary values. Quantizing each feature group with different quantization parameters to obtain the compressed features includes: for each feature group, normalizing each feature value in the group to obtain a normalization result for each feature value; quantizing the normalization result of each feature value in the group with the quantization parameter corresponding to the group to obtain a quantization result for each feature value; and obtaining the compressed features from the quantization results of the feature values in each group.
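As a non-limiting illustration, the grouped quantization described above can be sketched in Python as follows; the ratios used for the interval boundary values, the per-group bit widths and the min-max normalization are illustrative assumptions rather than prescribed choices:

```python
import numpy as np

def group_quantize(feat, ratios=(0.5, 1.0, 2.0), bits=(4, 6, 8)):
    """Illustrative grouped quantization: interval boundary values are multiples of a
    central-tendency statistic (the mean absolute value here); every group is
    normalized to [0, 1] and quantized with its own number of levels."""
    center = np.mean(np.abs(feat))                    # feature central-tendency value
    bounds = [center * r for r in ratios] + [np.inf]  # interval boundary values
    groups = np.digitize(np.abs(feat), bounds)        # group index of every feature value
    quantized = np.zeros_like(feat)
    for g in range(len(bounds)):
        mask = groups == g
        if not mask.any():
            continue
        vals = feat[mask]
        lo, hi = vals.min(), vals.max()
        norm = (vals - lo) / (hi - lo + 1e-8)          # per-group normalization
        levels = 2 ** bits[min(g, len(bits) - 1)] - 1  # group-specific quantization parameter
        quantized[mask] = np.round(norm * levels)
    return quantized, groups  # grouping information is sent to the decoding end

q, grouping = group_quantize(np.random.randn(1024).astype(np.float32))
```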
In order to solve the above technical problem, one technical solution adopted by the present application is to provide a feature encoding and decoding method, the method comprising: receiving compressed features of a target image obtained by an encoder; and reconstructing the compressed features with a decoder to obtain reconstructed features of the target image, where the reconstructed features are used to carry out a back-end visual task. The network parameters in the decoder are obtained by adjustment based on a reconstruction loss and at least one association loss during training, the reconstruction loss represents the loss produced after the features of a sample image are compressed by the encoder and reconstructed by the decoder, and the association loss reflects the accuracy of the task result obtained by performing the back-end visual task with the features of the sample image reconstructed by the decoder.
Wherein, before reconstructing the compressed features with the decoder to obtain reconstructed features of the target image, the method further comprises: the multi-task network system is formed by sequentially connecting a feature extraction module, an encoder, a decoder and a visual task module, wherein the feature extraction module is used for extracting and obtaining features input to the encoder, and the visual task module is used for executing a rear-end visual task; inputting the sample image into a multi-task network system for image processing, and acquiring reconstruction loss and at least one association loss corresponding to a back-end visual task by utilizing image processing data generated after the sample image passes through a decoder in the multi-task network system; network parameters in the encoder and decoder are adjusted based on the reconstruction loss and at least one associated loss.
The method for obtaining the reconstruction loss and at least one association loss corresponding to the back-end visual task by inputting the sample image into the multi-task network system for image processing and utilizing the image processing data generated after the decoder in the multi-task network system comprises the following steps: extracting features of the sample image by using a feature extraction module to obtain original sample features; compressing the original sample characteristics by using an encoder to obtain compressed sample characteristics; reconstructing the compressed sample features by using a decoder to obtain reconstructed sample features; obtaining reconstruction loss based on the original sample characteristics and first reconstruction characteristics, wherein the first reconstruction characteristics are reconstructed sample characteristics or are obtained by carrying out preset treatment on the reconstructed sample characteristics; and acquiring at least one of a first association loss and a second association loss corresponding to the back-end visual task, wherein the first association loss is obtained based on a first reconstruction feature and a second reconstruction feature of a reference image related to the sample image, the first reconstruction feature and the second reconstruction feature are features obtained by respectively carrying out the same processing on the sample image and the reference image by using a multi-task network system, and the second association loss is obtained based on a visual processing result output by the visual task module.
Wherein, before compressing the original sample feature with the encoder to obtain a compressed sample feature, the method further comprises: taking the original sample characteristic as a first characteristic to be processed; preprocessing the first feature to be processed to obtain a preprocessing feature, wherein the preprocessing is used for adjusting data distribution in the first feature to be processed, and the preprocessing feature is used for inputting an encoder for compression.
Wherein, before obtaining the reconstruction loss based on the original sample features and the first reconstruction feature, the method further comprises: taking the reconstructed sample features as a second feature to be processed; and performing inverse preprocessing, corresponding to the preprocessing, on the second feature to be processed to obtain an inverse-preprocessed feature, where the inverse-preprocessed feature is the first reconstruction feature.
Reconstructing the compressed features with the decoder to obtain the reconstructed features of the target image includes: performing up-dimension transformation on the compressed features with an inverse dimension transformation network to obtain up-dimensioned features; and performing inverse quantization processing on the up-dimensioned features to obtain the reconstructed features.
Performing up-dimension transformation on the compressed features with the inverse dimension transformation network to obtain the up-dimensioned features includes: performing up-dimension processing on the compressed features to obtain initial up-dimensioned features; linearly mapping the initial up-dimensioned features to obtain mapped features; and up-sampling the mapped features to obtain the up-dimensioned features. Performing inverse quantization processing on the up-dimensioned features to obtain the reconstructed features includes: receiving grouping information for each feature value in the compressed features sent by the encoding end; dividing the compressed features into a plurality of feature groups based on the grouping information; and performing inverse quantization processing on each feature group with different quantization parameters to obtain the up-dimensioned features.
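Correspondingly, a minimal sketch of the grouped inverse quantization at the decoding end follows, assuming the per-group quantization parameters are transmitted as (min, max, levels) tuples alongside the grouping information (an assumed form of the side information):

```python
import numpy as np

def group_dequantize(quantized, groups, group_params):
    """Illustrative grouped inverse quantization at the decoding end: every feature
    group is restored with its own parameters, here a (min, max, levels) tuple per
    group, which is an assumed form of the transmitted side information."""
    restored = np.zeros_like(quantized, dtype=np.float32)
    for g, (lo, hi, levels) in enumerate(group_params):
        mask = groups == g
        # undo the per-group quantization and normalization
        restored[mask] = quantized[mask] / levels * (hi - lo) + lo
    return restored
```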
In order to solve the above technical problem, one technical solution adopted by the present application is to provide a method for training an encoder and a decoder, the method comprising: extracting features of a sample image with a feature extraction module to obtain original sample features; compressing the original sample features with an encoder to obtain compressed sample features; reconstructing the compressed sample features with a decoder to obtain reconstructed sample features; obtaining a reconstruction loss based on the original sample features and decoded reconstruction features, where the decoded reconstruction features are the reconstructed sample features or are obtained by performing preset processing on the reconstructed sample features; obtaining at least one association loss corresponding to a back-end visual task based on at least one of the decoded reconstruction features and a visual processing result output by a visual task module, where the visual processing result is obtained by the visual task module performing the back-end visual task based on the decoded reconstruction features; and adjusting network parameters in the encoder and the decoder based on the reconstruction loss and the at least one association loss.
In order to solve the above technical problem, another technical solution adopted by the present application is to provide an electronic device comprising a memory and a processor coupled to each other, the processor being configured to execute program instructions stored in the memory to implement any of the above feature encoding and decoding methods or methods for training an encoder and a decoder.
In order to solve the above technical problem, another technical solution adopted by the present application is to provide a computer-readable storage medium having stored thereon program instructions which, when executed by a processor, implement any of the above feature encoding and decoding methods or methods for training an encoder and a decoder.
According to the above solution, the feature extraction module is used to extract features of the target image to obtain original features, and the encoder is used to compress the original features to obtain compressed features of the target image, where the compressed features are provided to a task end so that the task end can reconstruct features with a decoder and perform a back-end visual task based on the reconstructed features. The network parameters in the encoder are obtained by adjustment based on a reconstruction loss and at least one association loss during training; this adjustment makes the encoder better suited to encoding accurately for the back-end visual task, which improves the accuracy of image compression and further improves the accuracy of the reconstructed features and of the back-end visual task.
Drawings
FIG. 1 is a flow chart of an embodiment of a feature encoding and decoding method of the present application;
FIG. 2 is a flow chart of another embodiment of the step S110 of the present application;
FIG. 3 is a flow chart illustrating another embodiment of the step S120 of the present application;
FIG. 4 is a flow chart of another embodiment of the feature codec method of the present application;
FIG. 5 is a flow chart of a further embodiment of the feature codec method of the present application;
FIG. 6 is a flowchart illustrating another embodiment of step S520 of the present application;
FIG. 7 is a schematic diagram of an embodiment of a self-attention network in accordance with the present application;
FIG. 8 is a schematic diagram of an embodiment of a dimension transformation network and an inverse dimension transformation network in the present application;
FIG. 9 is a flow chart of another embodiment of step S530 of the present application;
FIG. 10 is a schematic diagram illustrating the architecture of one embodiment of a multi-tasking network system in the present application;
FIG. 11 is a flow chart of yet another embodiment of a feature codec method of the present application;
FIG. 12 is a flowchart illustrating another embodiment of step S1120 of the present application;
FIG. 13 is a flow chart of an embodiment of a method of training a codec of the present application;
FIG. 14 is a schematic diagram of a frame of an embodiment of an electronic device of the present application;
FIG. 15 is a schematic diagram of a framework of one embodiment of a computer readable storage medium of the present application.
Detailed Description
In order to make the objects, technical solutions and effects of the present application clearer and more specific, the present application will be further described in detail below with reference to the accompanying drawings and examples. In the following description, for purposes of explanation and not limitation, specific details are set forth such as the particular system architecture, interfaces, techniques, etc., in order to provide a thorough understanding of the present application.
The term "and/or" is herein merely an association relationship describing an associated object, meaning that there may be three relationships, e.g., a and/or B, may represent: a exists alone, A and B exist together, and B exists alone. In addition, the character "/" herein generally indicates that the front and rear associated objects are an "or" relationship. Further, "a plurality" herein means two or more than two. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality, for example, including at least one of A, B, C, and may mean including any one or more elements selected from the group consisting of A, B and C.
It should be noted that, the feature encoding and decoding method in the present application relates to two electronic devices, namely, an encoding-end electronic device and a task-end electronic device, and the two electronic devices can communicate with each other. The encoding-side electronic device may be configured to perform a feature encoding-decoding method including the feature extraction and encoding-compression related steps, and the task-side electronic device may be configured to perform a feature encoding-decoding method including the feature reconstruction related steps, and may further include the back-end visual task related steps. Either the encoding-side electronic device or the task-side electronic device may be used to perform the training-related steps, or the training-related steps may also be performed with other electronic devices. In a specific application scenario, the encoding-side electronic device performs relevant steps of feature extraction and encoding compression to obtain compressed features, the compressed features are sent to the task-side electronic device, and the task-side electronic device receives the compressed features and performs relevant steps of feature reconstruction and back-end visual tasks.
Referring to fig. 1, fig. 1 is a flow chart illustrating an embodiment of a feature encoding and decoding method of the present application. In this embodiment, the execution body is taken to be the encoding-side electronic device as an example; the encoding-side electronic device can be used to complete the training of the encoder before performing the related steps of feature extraction and encoding compression. Specifically, the method may include the following steps for training the network in the encoder:
step S110: the sample image is input into a multitasking network system for image processing.
The multi-task network system can be formed by sequentially connecting a feature extraction module, an encoder, a decoder and a visual task module. The visual task module is used for executing the back-end visual task.
During the training process, a multi-task network system may be run in the training device to image process the sample images to obtain reconstruction losses and at least one correlation loss for adjusting parameters of the network comprised by the encoder and decoder.
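As a non-limiting illustration, the multi-task network system can be sketched as the following sequential composition, where the concrete sub-networks are placeholders supplied by the implementer:

```python
import torch.nn as nn

class MultiTaskSystem(nn.Module):
    """Feature extraction module -> encoder -> decoder -> visual task module,
    connected in sequence; the concrete sub-networks are placeholders."""
    def __init__(self, extractor, encoder, decoder, task_module):
        super().__init__()
        self.extractor, self.encoder = extractor, encoder
        self.decoder, self.task_module = decoder, task_module

    def forward(self, image):
        original = self.extractor(image)          # original sample features
        compressed = self.encoder(original)       # compressed sample features
        reconstructed = self.decoder(compressed)  # reconstructed sample features
        result = self.task_module(reconstructed)  # back-end visual task output
        return original, reconstructed, result
```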
Step S120: the reconstruction loss and at least one association loss corresponding to the back-end visual task are obtained using image processing data generated after passing through the decoder in the multitasking network system.
The image processing data generated after being processed by the feature extraction module, the encoder and the decoder in the multitasking network system can be utilized.
The reconstruction loss represents the loss produced after the features of the sample image are compressed by the encoder and reconstructed by the decoder, and the association loss reflects the accuracy of the task result obtained by performing the back-end visual task with the features of the sample image reconstructed by the decoder. Further, this accuracy can be reflected through at least one way of obtaining the association loss.
Step S130: parameters of a network comprised by the encoder and the decoder are adjusted based on the reconstruction loss and at least one associated loss.
The network parameters in the encoder are obtained by adjusting the reconstruction loss and at least one association loss in the training process, and the encoder can be more suitable for the accurate encoding of the back-end visual task through the adjustment of the reconstruction loss and the at least one association loss so as to improve the accuracy of compression characteristics, and further improve the accuracy of the reconstruction characteristics and the accuracy of the back-end visual task.
After the training of the encoder is completed, the encoding-end electronic device can utilize the feature extraction module to extract the features of the target image to obtain original features, utilize the trained encoder to compress the original features to obtain compressed features of the target image, provide the compressed features for the task-end electronic device to reconstruct the features by utilizing the decoder, and execute the back-end visual task based on the reconstructed features.
In the training process, the encoder and the decoder are trained simultaneously, so the above steps can be regarded as the training step for the encoder and the training step for the decoder. The trained encoder and decoder may be run by the encoding side electronics and the task side electronics, respectively.
The decoder network is adjusted through the reconstruction loss and at least one association loss, so that the decoder can be more suitable for the back-end visual task to accurately decode, the accuracy of the reconstruction characteristics is improved, and the accuracy of the back-end visual task can be further improved.
In some embodiments, after obtaining the reconstruction loss and the at least one association loss, the at least one association loss may also be used to adjust network parameters in a learnable network of the multi-tasking network system corresponding to the association loss.
Referring to fig. 2, fig. 2 is a flowchart illustrating another embodiment of step S110 of the present application. Specifically, step S110 may include the steps of:
step S211: and carrying out feature extraction on the sample image by utilizing a feature extraction module to obtain original sample features.
Step S212: and compressing the original sample characteristics by using an encoder to obtain compressed sample characteristics.
Step S213: and reconstructing the compressed sample characteristics by using a decoder to obtain reconstructed sample characteristics.
The reconstructed sample features may be used directly as the first reconstruction feature, or the first reconstruction feature may be obtained by performing preset processing on the reconstructed sample features. The first reconstruction feature can then be input into the visual task module to carry out step S214.
Step S214: and performing a back-end visual task based on the first reconstruction feature by using a visual task module to obtain a visual processing result.
In some embodiments, before compressing the original sample features with the encoder to obtain compressed sample features, the method may further comprise: and taking the original sample characteristic as a first characteristic to be processed, preprocessing the first characteristic to be processed to obtain a preprocessed characteristic, wherein the preprocessing is used for adjusting the data distribution in the first characteristic to be processed, and the preprocessed characteristic can be used for inputting an encoder for compression.
Preprocessing adjusts the data distribution of the features input to the encoder so that it is consistent before the subsequent encoding and decoding; making the feature representation distribution tend to be consistent (on the same order of magnitude) reduces the difficulty of compression and can improve the generalization of the codec model.
Further, preprocessing may be implemented by conventional algorithms, such as data normalization, etc., or may be implemented by a learning network, such as spatial feature transformation using scaling factors and shifting factors, etc.
If preprocessing is implemented using a conventional algorithm, the method may further include, before obtaining the reconstruction loss based on the original sample features and the first reconstruction feature: taking the reconstructed sample features as a second feature to be processed, and performing inverse preprocessing, corresponding to the preprocessing, on the second feature to be processed to obtain an inverse-preprocessed feature, where the inverse-preprocessed feature is the first reconstruction feature.
It will be appreciated that if preprocessing is implemented using conventional algorithms, some preprocessing related parameters may be used in the preprocessing process, and these related parameters may be transmitted to the decoder along with the compressed sample characteristics for use in the inverse preprocessing process after the decoder outputs reconstructed sample characteristics.
Illustratively, preprocessing the first feature to be processed to obtain a preprocessed feature may be achieved by: and carrying out data normalization on the first feature to be processed to obtain a preprocessing feature.
In a specific application scenario, data normalization may specifically include transforming the first feature to be processed into a standardized distribution with a mean of 0 and a variance of 1, reducing the influence of overall brightness and highlighting individual differences:
input' = (input - mean)/std
where input' represents the normalized first feature to be processed, input represents the first feature to be processed, mean represents the mean of the first feature to be processed, and std represents the standard deviation of the first feature to be processed. The normalized first feature to be processed is then divided by the norm to further adjust the data distribution:
out = input'/norm
where out denotes the preprocessed feature and norm denotes the norm of the first feature to be processed, which may include, but is not limited to, the 1-norm, 2-norm, p-norm, etc. The mean, standard deviation and norm values can then be sent, together with the compressed sample features, to the electronic device performing the reconstruction as preprocessing-related parameters, for use in inverse preprocessing.
In a specific application scenario, the inverse preprocessing may be implemented by the following formula:
out1=(input1*norm)*std+mean
Where input1 represents the second feature to be processed and out1 represents the inverse pre-processing feature.
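A minimal Python sketch of the above normalization-based preprocessing and its inverse follows; the small epsilon terms and the choice of norm order p are illustrative assumptions:

```python
import torch

def preprocess(feat, p=2):
    """Normalization-based preprocessing following the formulas above; mean, std and
    norm are returned as the side information needed for inverse preprocessing."""
    mean, std = feat.mean(), feat.std()
    normed = (feat - mean) / (std + 1e-8)
    norm = normed.norm(p=p)  # 1-norm, 2-norm, p-norm, ... (p = 2 is an assumption)
    return normed / (norm + 1e-8), (mean, std, norm)

def inverse_preprocess(feat, params):
    """out1 = (input1 * norm) * std + mean, applied after the decoder output."""
    mean, std, norm = params
    return (feat * (norm + 1e-8)) * (std + 1e-8) + mean
```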
If a learnable network is used to implement the preprocessing, the inverse of the preprocessing operation can be learned by the network included in the decoder. In that case the reconstructed sample features output by the decoder can be used directly as the first reconstruction feature without additional inverse preprocessing: after the decoder outputs the reconstructed sample features (the first reconstruction feature), the reconstruction loss can be obtained directly based on the original sample features and the first reconstruction feature, and the first reconstruction feature can also be used to calculate the association loss.
Illustratively, preprocessing the first feature to be processed to obtain a preprocessed feature may be achieved by: and obtaining a scaling factor and a shifting factor by utilizing preset network learning, and performing spatial feature transformation on the first feature to be processed based on the scaling factor and the shifting factor to obtain a preprocessing feature.
In a specific application scenario, a scaling factor and a shifting factor are learned using a lightweight network, respectively:
scale = W_1(input)
shift = W_2(input)
where input represents the first feature to be processed, and scale and shift represent the scaling factor and shifting factor obtained through network learning, respectively. W_1 and W_2 each represent a neural network, which may include, but is not limited to, a convolutional network, a fully-connected network, a network combining convolution and full connection, etc. Spatial feature transformation is then performed on the first feature to be processed using the scaling factor and the shifting factor to obtain the preprocessed feature:
out=scale*input+shift
where out represents a preprocessing feature.
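As a non-limiting illustration, the learned preprocessing described above can be sketched as follows, where 1×1 convolutions are an assumed choice for the lightweight networks W1 and W2:

```python
import torch.nn as nn

class SFTPreprocess(nn.Module):
    """Learned preprocessing: two lightweight networks predict the scaling and shifting
    factors, then out = scale * input + shift. 1x1 convolutions are an assumed choice
    for the lightweight networks W1 and W2."""
    def __init__(self, channels):
        super().__init__()
        self.w1 = nn.Conv2d(channels, channels, kernel_size=1)  # learns the scaling factor
        self.w2 = nn.Conv2d(channels, channels, kernel_size=1)  # learns the shifting factor

    def forward(self, feat):
        scale = self.w1(feat)
        shift = self.w2(feat)
        return scale * feat + shift
```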
Referring to fig. 3, fig. 3 is a flowchart illustrating another embodiment of step S120 of the present application. Specifically, step S120 may include the steps of:
step S321: based on the original sample feature and the first reconstruction feature, a reconstruction loss is obtained.
The original sample characteristics are input into the encoder for compression and subsequent reconstruction, the first reconstruction characteristics are reconstruction results, and losses in the compression reconstruction process can be obtained by comparing the original sample characteristics with the first reconstruction characteristics so as to be used for adjusting parameters in a network contained in the encoder and the decoder, so that the encoder and the decoder can accurately compress and reconstruct, and the accuracy of reconstruction is improved.
Further, when comparing the original sample feature with the first reconstruction feature, the two can first be normalized (for example with sigmoid or softmax) or standardized, and the reconstruction loss is then calculated from the difference between them; the reconstruction loss may be computed using the L2 norm, the L1 norm, MS-SSIM, etc.
In a specific application scenario, the original sample feature is denoted by X1, the first reconstruction feature is denoted by X2, and the normalization is performed by using sigmoid:
X2_1 = Sigmoid(X2)
X1_1 = Sigmoid(X1)
The reconstruction loss is then calculated over the normalized original sample feature and first reconstruction feature using the L2 norm:
loss_1 = (1/(H×W)) · Σ_{i=1..H} Σ_{j=1..W} ( X1_1(i,j) - X2_1(i,j) )^2
where loss_1 represents reconstruction loss, and H and W represent the height and width of the feature vector, respectively.
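A minimal sketch of this reconstruction loss, assuming the features are held as tensors and averaging over the H×W elements:

```python
import torch

def reconstruction_loss(x1, x2):
    """loss_1 from the formulas above: sigmoid normalization of the original sample
    feature x1 and the first reconstruction feature x2, followed by the mean squared
    (L2) error over the H x W feature map."""
    x1_n, x2_n = torch.sigmoid(x1), torch.sigmoid(x2)
    return ((x1_n - x2_n) ** 2).mean()
```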
Step S322: at least one of a first association loss and a second association loss corresponding to the back-end visual task is acquired.
The order of execution of the above steps S321 and S322 is not limited, and both may be executed simultaneously or either may be executed first. The first association loss is obtained based on a first reconstruction feature and a second reconstruction feature of a reference image related to the sample image, wherein the first reconstruction feature and the second reconstruction feature are features obtained by respectively carrying out the same processing on the sample image and the reference image by using a multi-task network system. The second association loss is based on the visual processing result output by the visual task module. The back-end visual task may be one of image detection, image recognition, image classification, image re-recognition, and the like.
It should be noted that the visual task module may be composed of at least one of a learnable network and a non-learnable network. Depending on whether the visual task module is learnable or not, different correlation losses may be employed to reflect the accuracy of the task results obtained from performing the back-end visual task using the characteristics of the sample image reconstructed by the decoder.
Specifically, in response to the visual task module being a non-learnable network, obtaining a first association loss; in response to the visual task module being a learnable network, the first association loss and/or the second association loss is obtained.
In some embodiments, the visual task module includes a first branch and a second branch, where the first branch is a non-learnable network and the second branch is a learnable network, and each branch can independently perform the back-end visual task. Similarly, in response to the first branch being a non-learnable network, the first association loss is obtained; in response to the second branch being a learnable network, the first association loss and/or the second association loss is obtained.
In some embodiments, during training, if the visual task module is a non-learnable network, step S214 may be omitted when the sample image is input into the multi-task network system for image processing, without affecting the steps of obtaining the reconstruction loss and the first association loss corresponding to the back-end visual task and adjusting the parameters. The way the association loss is calculated can vary with the specific content of the back-end visual task, for example contrastive loss, triplet loss, quadruplet loss, etc.
Specifically, the step of obtaining the first association loss may include: and obtaining a first association loss by using the similarity between the first reconstruction feature and the second reconstruction feature of each reference image, wherein the reference image comprises positive samples and negative samples of the sample image.
In a specific application scenario, the first association loss is used to reflect the accuracy of the task result obtained by performing the back-end visual task with the decoder-reconstructed features of the sample image. For a re-identification task, the first association loss may take the form of a triplet loss. The images input into the multi-task network system for processing include the sample image (the anchor, denoted X_anc) and reference images corresponding to the sample image; the reference images include a positive sample (denoted X_pos) and a negative sample (denoted X_neg) of the sample image, where the positive sample belongs to the same class as the sample image and the negative sample belongs to a different class. The sample image, the positive sample and the negative sample are feature-extracted, encoded and decoded to obtain the first reconstruction feature of the sample image (X2_anc) and the second reconstruction features of the positive and negative samples (X2_pos and X2_neg, respectively). The first association loss in triplet form may be calculated by the following formula:
loss_2 = max( d(X2_anc, X2_pos) - d(X2_anc, X2_neg) + α, 0 )
where loss_2 represents the first association loss, d(·,·) denotes the distance between two features, and α represents the minimum margin required between the distance from the first reconstruction feature of the sample image to the second reconstruction feature of the positive sample and the distance from the first reconstruction feature of the sample image to the second reconstruction feature of the negative sample.
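A minimal sketch of the first association loss in triplet form, assuming batched feature vectors and an illustrative margin value:

```python
import torch
import torch.nn.functional as F

def first_association_loss(x2_anc, x2_pos, x2_neg, alpha=0.3):
    """Triplet form of the first association loss: the anchor-positive distance should
    be at least `alpha` smaller than the anchor-negative distance; alpha = 0.3 is an
    assumed margin and the features are assumed to be batched vectors."""
    d_pos = F.pairwise_distance(x2_anc, x2_pos)  # distance to the positive sample
    d_neg = F.pairwise_distance(x2_anc, x2_neg)  # distance to the negative sample
    return torch.clamp(d_pos - d_neg + alpha, min=0).mean()
```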
In particular, the step of obtaining the second association loss may comprise: and obtaining a second association loss by utilizing the visual processing result output by the visual task module and the actual visual result marked by the sample image. The manner in which the second association loss is calculated may also be selected based on the type of visual task.
In a specific application scenario, the visual task module is a learnable network used to perform an image classification task. The sample image can be annotated with an actual visual result, specifically the image class to which the sample image belongs. The visual task module performs the back-end visual task based on the first reconstruction feature to obtain a visual processing result, which represents the classification prediction made for the sample image, and the second association loss can be obtained based on the difference between the visual processing result and the actual visual result.
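As a non-limiting illustration, for a classification-type back-end visual task the second association loss can be sketched as follows; cross-entropy is an assumed choice of classification loss:

```python
import torch.nn.functional as F

def second_association_loss(prediction, label):
    """Second association loss for a classification-type back-end visual task: the
    visual processing result (class prediction logits) is compared with the annotated
    actual visual result (class label); cross-entropy is an assumed choice of loss."""
    return F.cross_entropy(prediction, label)
```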
It should be noted that, in the case that the visual task module is a learnable network, the first association loss may be selected to be used, or the first association loss and the second association loss may be selected to be combined, so as to reflect the accuracy of the task result obtained by performing the back-end visual task by using the characteristics reconstructed by the decoder of the sample image.
In some embodiments, the total optimization function is obtained as a weighted sum of the reconstruction loss and at least one association loss, where the weight of each loss is a hyper-parameter that can be set according to actual needs:
loss = a1*loss_1 + a2*loss_2 + a3*loss_3
where loss_3 represents the second associated loss, and a1, a2, and a3 are weights of the reconstruction loss, the first associated loss, and the second associated loss, respectively.
In some embodiments, the multi-task network system may also output the accuracy of the back-end visual task. This is described using the re-identification task as an example. The purpose of the re-identification task is to compute similarities (for example Euclidean distance, cosine distance, etc.) between the first reconstruction feature of the sample image (called the query, which carries an id) and the feature library of the original image gallery (called the gallery, whose feature vectors also carry ids). The feature library contains a feature vector for each image in the image gallery, N of which have the same id as the query feature vector. The computed similarities are sorted, the features with the top M (M >= N) similarities are taken, and their ids are checked against the id of the first reconstruction feature; if they are completely consistent, the accuracy is 100%.
The first reconstruction feature of the sample image is X_query, with id I_query; the feature library is X_gallery, and the number of elements in each feature vector is L. The method is as follows:
The cosine distance between the first reconstruction feature X_query and every feature vector in the feature library is calculated:
a_i = 1 - ( Σ_{l=1..L} X_query(l) · X_gallery_i(l) ) / ( ||X_query|| · ||X_gallery_i|| )
where a_i represents the cosine distance between the first reconstruction feature X_query and the i-th feature vector X_gallery_i in the feature library, and is used to measure their similarity.
The cosine distances calculated by the above formula are sorted from small to large, and the M feature vectors with the smallest cosine distances are taken from the feature library.
The id of the first reconstruction feature X_query is then compared with the ids of these M feature vectors. If the number of matching ids is N1, the recognition accuracy is:
accuracy = N1 / N
if N1 is equal to N, the recognition rate is 100%. The recognition accuracy can reflect the accuracy of the task result obtained by the back-end visual task and can be used for reference of a user.
Referring to fig. 4, fig. 4 is a flowchart illustrating another embodiment of the feature encoding and decoding method of the present application. In this embodiment, the electronic device may be configured to perform the relevant training steps of the encoder, such as step S110-step S130 in the previous embodiments. After training is completed, the electronic device may then process the target image using the feature extraction module and the encoder. Specifically, the method may comprise the steps of:
step S410: and extracting the characteristics of the target image by utilizing a characteristic extraction module to obtain original characteristics.
The feature extraction module may be a feature extraction network or a feature extraction algorithm. The feature extraction network may include, but is not limited to, ResNet, EfficientNet, ShuffleNet, MobileNet, etc. Feature extraction algorithms may include, but are not limited to, the SIFT operator, the HOG algorithm, the Haar algorithm, etc.
In some embodiments, before performing step S420, the method may further include: and preprocessing the first feature to be processed by taking the original feature as the first feature to be processed to obtain a preprocessed feature, wherein the preprocessing is used for adjusting the data distribution in the first feature to be processed, and the preprocessed feature is used for being input into the encoder for compression. For a specific description of the preprocessing reference is made to the relevant content in the previous embodiments.
It should be noted that if preprocessing is implemented by using a conventional algorithm, some preprocessing related parameters may be used in the preprocessing process, and the related parameters may be transmitted to the task-side electronic device together with the compressed features, so as to be used in the inverse preprocessing process after the reconstructed sample features are output by the decoder.
Step S420: and compressing the original characteristics by using an encoder to obtain the compressed characteristics of the target image.
Wherein the network parameters in the encoder are adjusted during the training based on the reconstruction loss and the at least one correlation loss. The reconstruction loss represents the loss generated after the characteristics of the sample image are compressed by the encoder and reconstructed by the decoder, and the correlation loss is used for reflecting the accuracy of task results obtained by performing the back-end visual task on the characteristics of the sample image reconstructed by the decoder. The compressed features may be used to provide to the task side electronics to cause the task side electronics to perform feature reconstruction using the decoder and perform back-end visual tasks based on the reconstructed features.
It should be noted that, the electronic device in this embodiment may communicate with the task-side electronic device. If the preprocessing is implemented by adopting a conventional algorithm, some preprocessing related parameters may be used in the preprocessing process, and the preprocessing related parameters may be transmitted to the task-side electronic device together with the compression characteristics, so that the task-side electronic device is used for completing the inverse preprocessing operation.
Referring to fig. 5, fig. 5 is a flowchart illustrating a feature encoding and decoding method according to another embodiment of the present application. In this embodiment, the electronic device may be configured to perform the relevant training steps of the encoder, such as step S110-step S130 in the previous embodiments. After training is completed, the electronic device may then process the target image using the feature extraction module and the encoder. Specifically, the method may comprise the steps of:
Step S510: and extracting the characteristics of the target image by utilizing the characteristic extraction network to obtain original characteristics.
Step S510 may refer to the related description of feature extraction in the foregoing embodiments, which is not described herein.
Step S520: and performing dimension reduction transformation on the original characteristics by using a dimension transformation network to obtain dimension reduction characteristics.
Step S530: and carrying out quantization treatment on the dimension reduction characteristics to obtain compression characteristics.
Wherein the encoder comprises a dimension transformation network and a quantization module, the dimension transformation network may be used to implement step S520, and the quantization module may be used to implement step S530. Step S420 may be implemented by step S520 and step S530. Both the dimension reduction transformation and the quantization processing belong to means for compression.
Referring to fig. 6, fig. 6 is a flowchart illustrating another embodiment of step S520 of the present application. Specifically, step S520 may include the steps of:
step S621: and obtaining the feature to be input by using the original feature.
In particular, deriving the feature to be input using the original feature may include the steps of: downsampling the original features to obtain downsampled features; partitioning the downsampled features to obtain a plurality of feature blocks; and linearly embedding the plurality of feature blocks to obtain the feature to be input.
Further, the original features are downsampled, namely the original features are subjected to dimension reduction, so that downsampled features with smaller space dimensions are obtained, wherein the downsampling can be realized through convolution, full connection, bicubic downsampling and other methods. The plurality of feature blocks are respectively linearly embedded, and the linear embedding may include: the feature blocks are flattened into line vectors and mapped to specific dimension sizes using a learnable network layer to obtain features to be input.
In a specific application scenario, the original feature has dimensions H×W×C, and downsampling it produces a downsampled feature with dimensions (H/4)×(W/4)×C. The downsampled feature is partitioned into blocks, each feature block (patch) having dimensions P×P×C, giving num_patch = (H/4P)×(W/4P) feature blocks in total. Each feature block is then linearly mapped using a fully-connected layer (or convolution), finally outputting num_patch = (H/4P)×(W/4P) block embedding vectors of dimension C1, i.e., the output dimensions are [num_patch, channel] = [H/4P × W/4P, C1]; at this point, the feature map composed of all feature blocks has dimensions [H/4P, W/4P, C1].
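As a non-limiting illustration of the downsampling, block partitioning and linear embedding described above, assuming a stride-4 convolution for downsampling and a P×P convolution for the per-block linear mapping:

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Illustrative 'downsample + partition + linear embedding' front end: a stride-4
    convolution reduces H x W x C to (H/4) x (W/4) x C, then P x P blocks are flattened
    and linearly mapped to C1 channels, giving [H/4P x W/4P, C1] block embedding vectors."""
    def __init__(self, c, c1, p=4):
        super().__init__()
        self.down = nn.Conv2d(c, c, kernel_size=4, stride=4)    # downsampling
        self.embed = nn.Conv2d(c, c1, kernel_size=p, stride=p)  # per-block linear mapping

    def forward(self, x):                    # x: [B, C, H, W]
        x = self.embed(self.down(x))         # [B, C1, H/4P, W/4P]
        return x.flatten(2).transpose(1, 2)  # [B, num_patch, C1]
```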
Step S622 represents performing at least the first iterative process to obtain the target attention feature. The first input features of the first iteration process are features to be input, and the first input features of the non-first iteration process are self-attention processing results of the last first iteration process.
Step S622: the first input feature is self-attentive processed with a first self-attentive network of the dimension-transformation network.
The dimension transformation network comprises a plurality of first iteration processing modules which are sequentially connected, the first iteration processing modules comprise a first self-attention network, one first iteration processing module is used for realizing first iteration processing once, and the number of the first iteration processing depends on the number of the first iteration processing modules in the dimension transformation network.
Step S623 represents performing the second iterative process at least once to obtain the dimension-reduction feature. The second input feature of the initial second iterative process is the target attention feature, and the second input feature of each subsequent second iterative process is the self-attention processing result of the previous second iterative process.
Step S623: and performing dimension reduction processing on the second input feature to obtain dimension reduction processing features, and performing self-attention processing on the dimension reduction processing features by using a second self-attention network of the dimension conversion network.
The dimension transformation network comprises a plurality of second iteration processing modules which are sequentially connected, the second iteration processing modules comprise dimension reduction processing modules and a second self-attention network which are sequentially connected, one second iteration processing module is used for realizing one second iteration processing, and the number of the second iteration processing depends on the number of the second iteration processing modules in the dimension transformation network.
It should be noted that the first self-attention network/second self-attention network in the dimension transformation network may be a Swin Transformer network, a Vision Transformer (ViT) network, or the like.
In the processes of dimension transformation and inverse dimension transformation, the encoder/decoder uses a transformation network built on self-attention networks; based on self-attention processing, long-range dependencies in the spatial domain of the data can be fully exploited during dimension reduction, thereby improving encoding/decoding performance.
Referring to fig. 7, fig. 7 is a schematic structural diagram of an embodiment of a self-attention network in the present application.
Fig. 7(a) shows a schematic structural diagram of the Swin Transformer module, and fig. 7(b) shows a schematic structural diagram of the Vision Transformer (ViT) module, which includes L units. The first self-attention network may be the Swin Transformer module or the Vision Transformer (ViT) module given in fig. 7, and the second self-attention network may likewise be the Swin Transformer module or the Vision Transformer (ViT) module given in fig. 7.
Referring to fig. 8, fig. 8 is a schematic structural diagram of an embodiment of a dimension transformation network and an inverse dimension transformation network in the present application.
In fig. 8, (a) is a dimension transformation network. Fig. 8 (b) shows an inverse dimensional transform network. The dimension transformation network comprises a downsampling module, a feature block dividing module, a linear embedding module, M first iteration processing modules and N second iteration processing modules. And inputting original features, and processing the original features by a downsampling module, a feature block dividing module and a linear embedding module to obtain features to be input. And inputting the feature to be input into M first iteration processing modules, and executing M first iteration processing to obtain the target attention feature. The target attention feature is input into N second iteration processing modules, and N second iteration processing is executed to obtain the dimension reduction feature.
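The composition of the dimension transformation network in fig. 8(a) can be sketched as follows; PatchEmbed, attn_block_fn and merge_fn are illustrative placeholders (for example, the patch embedding sketched earlier, a Swin Transformer/ViT block, and the dimension-reduction step), and the sketch ignores that the channel dimension grows after each merge, which a real implementation would size per stage.

```python
import torch.nn as nn

class DimTransformNet(nn.Module):
    """Hypothetical composition of the dimension transformation network of fig. 8(a)."""
    def __init__(self, patch_embed, attn_block_fn, merge_fn, M=2, N=3):
        super().__init__()
        self.patch_embed = patch_embed                       # downsample + partition + linear embed
        # M first iteration processing modules: self-attention only.
        self.first_stage = nn.ModuleList(attn_block_fn() for _ in range(M))
        # N second iteration processing modules: dimension reduction then self-attention.
        self.second_stage = nn.ModuleList(
            nn.Sequential(merge_fn(), attn_block_fn()) for _ in range(N)
        )

    def forward(self, x):
        tokens = self.patch_embed(x)              # feature to be input
        for block in self.first_stage:            # M first iteration processes
            tokens = block(tokens)
        for stage in self.second_stage:           # N second iteration processes
            tokens = stage(tokens)
        return tokens                             # dimension-reduction feature
```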
In a specific application scenario, the dimension transformation network includes 2 first iteration processing modules and 3 second iteration processing modules; in addition, the first self-attention network/second self-attention network may be a Vision Transformer (ViT) module, where one Vision Transformer (ViT) module includes 3 units, each unit being as shown in fig. 7(b).
Specifically, performing the dimension reduction processing on the second input feature to obtain the dimension-reduction processing feature may include: dividing the plurality of feature blocks in the second input feature into several groups; combining each group into a new feature block, and forming the dimension-reduction processing feature from the plurality of new feature blocks.
In a specific application scenario, the self-attention network does not change the feature dimension, and the second input feature includes num_patch feature blocks. The feature blocks are divided into four parts and spliced. The dimension thus changes from [num_patch, channel] ([H/4P×W/4P, C1]) to [num_patch/4, 4, channel] ([H/8P×W/8P, 4, C1]), which can be further reshaped to [num_patch/4, 4×C1] ([H/8P×W/8P, 4C1]). In this reshaping process, the number of feature blocks changes from num_patch to num_patch/4, i.e. is reduced to a quarter of the original, and the dimension of the feature map composed of all feature blocks changes from [H/4P, W/4P, C1] to [H/8P, W/8P, 4C1], realizing the reduction of the spatial dimension.
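A minimal sketch of this block-grouping step is shown below; the class name PatchMerge and the assumption that the feature blocks lie on a square grid with an even side length are illustrative simplifications.

```python
import math
import torch
import torch.nn as nn

class PatchMerge(nn.Module):
    """Hypothetical dimension-reduction step: group four neighbouring feature blocks into one.

    [num_patch, C1] -> [num_patch/4, 4*C1], i.e. a feature map of
    [H/4P, W/4P, C1] becomes [H/8P, W/8P, 4*C1].
    """
    def forward(self, x):                         # x: [B, num_patch, C1]
        B, n, C = x.shape
        h = w = math.isqrt(n)                     # square-grid assumption
        x = x.view(B, h, w, C)
        # Concatenate the four blocks of each 2x2 neighbourhood channel-wise.
        parts = [x[:, 0::2, 0::2, :], x[:, 1::2, 0::2, :],
                 x[:, 0::2, 1::2, :], x[:, 1::2, 1::2, :]]
        x = torch.cat(parts, dim=-1)              # [B, h/2, w/2, 4*C1]
        return x.view(B, (h // 2) * (w // 2), 4 * C)
```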
Referring to fig. 9, fig. 9 is a flowchart illustrating another embodiment of step S530 of the present application. Specifically, step S530 may include the steps of:
step S931: and determining a plurality of characteristic value intervals based on the characteristic value distribution condition in the dimension reduction characteristic.
Specifically, the feature values in the dimension-reduction feature are counted to obtain a central tendency characterization value of the dimension-reduction feature, the central tendency characterization value is multiplied by different ratios to obtain a plurality of interval boundary values, and the plurality of feature value intervals are obtained from these interval boundary values.
Step S932: and dividing the feature values in the dimension reduction feature into a plurality of feature groups by using a plurality of feature value intervals.
Step S933: and respectively carrying out quantization processing on each group of characteristic groups by utilizing different quantization parameters so as to obtain compression characteristics.
Specifically, for each feature group, normalizing each feature value in the feature group to obtain a normalized result of each feature value, quantizing the normalized result of each feature value in the feature group by using a quantization parameter corresponding to the feature group to obtain a quantized result of each feature value in the feature group, and obtaining a compressed feature by using the quantized result of each feature value in each feature value group.
In a specific application scenario, the central tendency characterization value may be the feature mean; in other embodiments, it may also be the median or the like. Specifically, for the dimension-reduction feature X of dimension [M, N], the mean of the feature values is calculated as:

X_mean = (1/(M×N)) × Σ_i Σ_j X(i, j)

A plurality of interval boundary values are obtained by multiplying the central tendency characterization value by different ratios, and the number of interval boundary values corresponds to the number of groups of feature values. Taking 3 groups as an example, the two interval boundary values may be obtained as:

a1 = r1 × X_mean
a2 = r2 × X_mean

where r1 and r2 (r1 < r2) are the preset ratios, and a1 and a2 are the two interval boundary values based on which the feature values in the dimension-reduction feature can be divided into three feature groups. The quantization parameters used for the quantization process are different for each feature group, and the quantization parameter represents the level of the quantization process.
When a feature value in X is smaller than or equal to a1, it is classified into class 0 as the first group; when it is larger than a1 and smaller than a2, it is classified into class 1 as the second group; when it is larger than or equal to a2, it is classified into class 2 as the third group. Classes 0, 1 and 2 are quantized with 1, 2 and 4 bits, respectively. Taking class 2 as an example, with quantization bit K=4 (the quantization parameter) and the data of this class recorded as X2, the data X2 is normalized with a sigmoid function and then quantized to obtain the quantization result:

X2_quan = round(X2_norm × 2^K − 0.5), where X2_norm = sigmoid(X2)

The compression feature is obtained based on the quantization results of the three groups of feature values.
In the process of inverse quantization, quantization parameters are used, and the quantization parameters may be transmitted to the decoder together with the compression characteristics. In particular, during transmission, quantization parameters and compression characteristics exist in the form of integer types.
In this scheme, a clustering-style quantization method is adopted: data in different value ranges are quantized with different numbers of bits, which improves quantization precision and guarantees the subsequent reconstruction performance.
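The grouped quantization described above may be sketched as follows; the boundary ratios 0.5 and 1.5 are placeholder assumptions (the present application only requires multiplying the mean by different ratios), and the function name group_quantize is illustrative.

```python
import numpy as np

def group_quantize(x, ratios=(0.5, 1.5), bits=(1, 2, 4)):
    """Hypothetical sketch of the grouped quantization described above.

    Returns per-element quantized values and the group index of each element
    (the grouping information to be transmitted with the compressed feature).
    """
    mean = x.mean()
    a1, a2 = ratios[0] * mean, ratios[1] * mean
    # Group 0: x <= a1, group 1: a1 < x < a2, group 2: x >= a2.
    group = np.full(x.shape, 1, dtype=np.int64)
    group[x <= a1] = 0
    group[x >= a2] = 2
    quantized = np.empty_like(x)
    for g, k in enumerate(bits):
        mask = group == g
        normalized = 1.0 / (1.0 + np.exp(-x[mask]))              # sigmoid normalization
        quantized[mask] = np.round(normalized * (2 ** k) - 0.5)  # K-bit quantization
    return quantized, group
```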
Referring to fig. 10, fig. 10 is a schematic structural diagram of an embodiment of a multi-task network system in the present application.
The feature extraction module may be a learnable network or a conventional algorithm; in the figure the preprocessing is implemented with a conventional algorithm, so an inverse preprocessing module is included after the decoder. Of the multi-task losses shown in the figure, one may be taken as the reconstruction loss, the multi-task loss_2 may be taken as the first association loss, and the multi-task loss_3 may be taken as the second association loss, i.e. the actual visual loss, wherein the actual visual loss is obtained based on the visual processing result output by the visual task module and the sample image label contained in the original image data.
In the training process, the first association loss is used for adjusting the network parameters of the learnable networks in the feature extraction module, the encoder and the decoder corresponding to it. The second association loss is used for adjusting the network parameters of the learnable networks in the feature extraction module, the encoder, the decoder and the visual task module corresponding to it.
Referring to fig. 11, fig. 11 is a flowchart illustrating a feature encoding and decoding method according to another embodiment of the present application. In this embodiment, the execution body is a task-side electronic device, and may be used to execute relevant steps including feature reconstruction and a back-end visual task, and further, before executing relevant steps including feature reconstruction and a back-end visual task, training of the decoder may be completed. The relevant training steps of the decoder may be referred to the relevant description in the previous embodiments.
It should be noted that, because the encoder and decoder are trained together, either of the two electronic devices may serve as the training device to perform the training-related steps; after training, one electronic device (the encoding end) performs the steps of feature extraction and compression encoding using the feature extraction module and the trained encoder, and the other electronic device (the task end) performs the steps of feature reconstruction and the back-end visual task using the trained decoder and the visual task module. In this embodiment, the task end is taken as the training device as an example. Specifically, the method may comprise the following steps:
step S1110: a compression characteristic of a target image obtained with an encoder is received.
The task side electronics may communicate with the electronics running the encoder to obtain the compression characteristics of the target image obtained with the encoder.
In a specific application scenario, the electronic device running the encoder may be a camera or other devices connected to the camera, the target image may be an image collected by the camera, and the task-side electronic device may be a server, configured to reconstruct compression characteristics of the target image sent by the camera, so as to implement a back-end visual task.
It should be noted that, before step S1110, the device may perform the training-related steps, for example step S110 to step S130 in the foregoing embodiments; the detailed description may refer to the related content in the foregoing embodiments and is not repeated herein. During training, the network in the decoder is adjusted based on the reconstruction loss and at least one association loss. The reconstruction loss represents the loss generated after the features of the sample image are compressed by the encoder and reconstructed by the decoder, and the association loss is used for reflecting the accuracy of the task result obtained by performing the back-end visual task with the features of the sample image reconstructed by the decoder.
Step S1120: and reconstructing the compression characteristic by using a decoder to obtain the reconstruction characteristic of the target image.
The reconstruction feature of the target image may be used to implement a back-end visual task, and in particular, a visual task module may be used to process the reconstruction feature to obtain a visual processing result about the target image.
In some embodiments, if the encoding end adopts a conventional algorithm to implement preprocessing, the method further includes taking the reconstructed feature of the target image as a second feature to be processed, and performing inverse preprocessing on the second feature to obtain an inverse preprocessing feature, where the inverse preprocessing feature may be used for inputting a visual task module to obtain a visual processing result about the target image. The description of the inverse preprocessing may refer to the relevant content in the foregoing embodiments, and will not be described herein.
In some embodiments, the encoding side electronic device may communicate with the task side electronic device, and the encoding side electronic device may send a code stream file to the task side electronic device, where the code stream file may include compression features, and in some cases, preprocessing related parameters and/or quantization parameters. The task-side electronic device may obtain the compression characteristics from the code stream file, and pre-process related parameters and/or quantization parameters.
Referring to fig. 12, fig. 12 is a flowchart illustrating another embodiment of step S1120 of the present application.
Specifically, step S1120 may include the steps of:
step S1221: and performing dimension-lifting transformation on the compression characteristics by using a dimension-lifting transformation network to obtain dimension-lifting characteristics.
The decoder may include an inverse dimension transformation network, which may be used to implement step S1221, and an inverse quantization module, which may be used to implement step S1222. The inverse dimension transformation network realizes the inverse processing of the dimension transformation network, and the inverse quantization module realizes the inverse quantization processing. The dimension of the dimension-increasing feature output by the inverse dimension transformation network is consistent with the dimension of the original feature input to the dimension transformation network.
Further, step S1221 may include the steps of: and carrying out dimension lifting processing on the compressed features to obtain initial dimension lifting features, carrying out linear mapping on the initial dimension lifting features to obtain mapping features, and carrying out up-sampling on the mapping features to obtain dimension lifting features.
Please refer to fig. 8(b), which shows the inverse dimension transformation network. The inverse dimension transformation network includes N inverse second iteration processing modules, M inverse first iteration processing modules, a linear mapping module, a block splicing module, and an upsampling module. Specifically, one inverse second iteration processing module includes a dimension-increasing processing module and a third self-attention network, and one inverse first iteration processing module includes a fourth self-attention network. The compressed features are input into the N inverse second iteration processing modules for dimension-increasing processing and self-attention processing; the result is input into the M inverse first iteration processing modules for self-attention processing, and the initial dimension-increasing feature is output. The initial dimension-increasing feature is input into the linear mapping module, which linearly maps it to obtain the mapping feature. Block splicing is performed on the mapping feature to obtain the spliced mapping feature, which is input into the upsampling module to obtain the dimension-increasing feature.
The linear mapping, block splicing and up-sampling performed by the inverse dimension transformation network can be regarded as the inverse operation of linear embedding, feature blocking and down-sampling performed by the dimension transformation network.
In a specific application scenario, the third self-attention network/fourth self-attention network may be a Swin Transformer module or a Vision Transformer (ViT) module. The step of expanding the input dimension by k times may include: increasing the second dimension of the input by k times with a fully connected or convolutional layer, i.e. [A, B] becomes [A, k×B]; then performing a dimension recombination operation to obtain the new dimension [k×A, B], which corresponds to increasing the number of feature blocks by a factor of k. The input dimension of the linear mapping module is [P1, P1], and the output dimension after mapping is [P1, P1, C]. Block splicing splices a plurality of feature blocks to obtain a feature map. The feature map is dimension-expanded by upsampling so that the dimension of the dimension-increasing feature is consistent with the dimension of the original feature input to the dimension transformation network.
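A minimal sketch of the dimension-expansion step of the inverse second iteration module is given below; the class name PatchExpand and the choice of a fully connected layer (rather than convolution) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PatchExpand(nn.Module):
    """Hypothetical dimension-expansion step: [A, B] -> [A, k*B] -> [k*A, B].

    Expands the token dimension by a factor k with a fully connected layer,
    then recombines so the number of feature blocks grows by k.
    """
    def __init__(self, channels, k=4):
        super().__init__()
        self.k = k
        self.expand = nn.Linear(channels, k * channels)

    def forward(self, x):                          # x: [b, n, channels]
        b, n, c = x.shape
        x = self.expand(x)                         # [b, n, k*channels]
        # Dimension recombination: k times more feature blocks, same channel size.
        return x.view(b, n * self.k, c)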
Step S1222: and performing inverse quantization processing on the dimension-lifting characteristic to obtain a reconstruction characteristic.
Wherein the inverse quantization process is an inverse process of the quantization process. Further, step S1222 may include the following steps: and receiving grouping information of each characteristic value in the compression characteristics sent by the encoding end, dividing the compression characteristics into a plurality of characteristic groups based on the grouping information, and respectively performing inverse quantization processing on each group of characteristic groups by utilizing different quantization parameters so as to obtain the dimension-increasing characteristics.
The grouping information can be used to identify the quantization parameter of each feature group, i.e. the grouping information contains the quantization parameter. Taking class 2 as an example, the data read from the code stream file is recorded as X2_quan and the quantization bit is K=4; the inverse quantization result is then:
X2_dequan = (X2_quan + 0.5) / 2^K
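A sketch of the corresponding grouped inverse quantization is given below; the inverse-sigmoid step is an assumption added to undo the sigmoid normalization used by the encoder, since the present application only gives the dequantization formula above.

```python
import numpy as np

def group_dequantize(quantized, group, bits=(1, 2, 4)):
    """Hypothetical inverse of the grouped quantization sketched earlier."""
    restored = np.empty_like(quantized, dtype=np.float64)
    for g, k in enumerate(bits):
        mask = group == g
        normalized = (quantized[mask] + 0.5) / (2 ** k)          # (X_quan + 0.5) / 2^K
        normalized = np.clip(normalized, 1e-6, 1 - 1e-6)
        restored[mask] = np.log(normalized / (1 - normalized))   # assumed inverse sigmoid
    return restored
```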
Referring to fig. 13, fig. 13 is a flowchart illustrating an embodiment of a method for training a codec according to the present application. Specifically, the method may comprise the steps of:
step S1310: and carrying out feature extraction on the sample image by utilizing a feature extraction module to obtain original sample features.
Step S1320: and compressing the original sample characteristics by using an encoder to obtain compressed sample characteristics.
Step S1330: and reconstructing the compressed sample characteristics by using a decoder to obtain reconstructed sample characteristics.
Step S1340: based on the original sample features and the decoded reconstruction features, a reconstruction penalty is obtained.
The decoding reconstruction features are reconstructed sample features or are obtained by performing preset processing on the reconstructed sample features.
Step S1350: and obtaining at least one association loss corresponding to the back-end visual task based on at least one of the decoded reconstruction feature and the visual processing result output by the visual task module.
Here, step S1340 and step S1350 may be performed after step S1330, and both may be performed simultaneously, or either may be performed first.
Step S1360: network parameters in the encoder and decoder are adjusted based on the reconstruction loss and at least one associated loss.
The specific descriptions of step S1310 to step S1360 may refer to the relevant content in the foregoing embodiments, and are not described herein.
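A hypothetical single training step corresponding to steps S1310 to S1360 might look as follows; the concrete loss functions (MSE and cross entropy) and the weight w1 are illustrative assumptions, the only requirement being a reconstruction loss combined with at least one association loss reflecting the back-end task accuracy.

```python
import torch

def train_step(feature_extractor, encoder, decoder, task_head,
               sample_image, label, optimizer, w1=1.0):
    """Hypothetical joint training step for the encoder and decoder."""
    original = feature_extractor(sample_image)        # S1310: original sample features
    compressed = encoder(original)                    # S1320: compressed sample features
    reconstructed = decoder(compressed)               # S1330: reconstructed sample features

    recon_loss = torch.nn.functional.mse_loss(reconstructed, original)   # S1340
    task_out = task_head(reconstructed)               # back-end visual task on reconstruction
    assoc_loss = torch.nn.functional.cross_entropy(task_out, label)      # S1350

    total = recon_loss + w1 * assoc_loss              # S1360: adjust by both losses
    optimizer.zero_grad()
    total.backward()
    optimizer.step()
    return total.item()
```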
Referring to fig. 14, fig. 14 is a schematic diagram of a frame of an embodiment of the electronic device of the present application.
In this embodiment, the electronic device 140 includes a memory 141 and a processor 142, the memory 141 being coupled to the processor 142. Specifically, the components of the electronic device 140 may be coupled together by a bus, or the processor 142 may be separately coupled to each of the other components of the electronic device 140. The electronic device 140 may be any device having processing capabilities, such as a computer, a tablet, or a mobile phone.
The memory 141 is used for storing program data executed by the processor 142 and data generated during processing by the processor 142, for example the reconstruction loss, the first association loss, and the second association loss. The memory 141 includes a non-volatile storage portion for storing the above program data.
The processor 142 controls the operation of the electronic device 140, and the processor 142 may also be referred to as a CPU (Central Processing Unit ). The processor 142 may be an integrated circuit chip having signal processing capabilities. Processor 142 may also be a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. In addition, the processor 142 may be commonly implemented by a plurality of constituent circuit chips.
Processor 142 is operative to execute instructions to implement any of the feature codec methods or codec training methods described above by invoking program data stored in memory 141.
Referring to fig. 15, fig. 15 is a schematic diagram illustrating a framework of an embodiment of a computer readable storage medium according to the present application.
In this embodiment, the computer readable storage medium 150 stores program data 151 executable by a processor to implement any of the above-described feature codec methods or codec training methods.
The computer readable storage medium 150 may be a medium capable of storing program data, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk; it may also be a server storing the program data, which may send the stored program data to other devices for running or may run the stored program data itself.
In some embodiments, the computer readable storage medium 150 may also be a memory as shown in FIG. 14.
The foregoing description is only of embodiments of the present application, and is not intended to limit the scope of the patent application, and all equivalent structures or equivalent processes using the descriptions and the contents of the present application or other related technical fields are included in the scope of the patent application.

Claims (21)

1. A method for feature encoding and decoding, comprising:
Extracting features of the target image by using a feature extraction module to obtain original features;
compressing the original features by using an encoder to obtain compressed features of the target image, wherein the compressed features are used for being provided for a task end to reconstruct features by using a decoder, and executing a back-end visual task based on the reconstructed features;
the network parameters in the encoder are adjusted in the training process based on a reconstruction loss and at least one association loss, wherein the reconstruction loss represents the loss generated after the characteristics of a sample image are compressed by the encoder and reconstructed by the decoder, and the association loss is used for reflecting the accuracy of a task result obtained by performing the back-end visual task by using the characteristics of the sample image reconstructed by the decoder.
2. The method of claim 1, wherein prior to compressing the original features with an encoder to obtain compressed features of the target image, the method further comprises:
the multi-task network system is formed by sequentially connecting the feature extraction module, the encoder, the decoder and the visual task module, wherein the visual task module is used for executing the rear-end visual task;
Inputting the sample image into the multi-task network system for image processing, and acquiring the reconstruction loss and at least one association loss corresponding to the back-end visual task by utilizing image processing data generated after the decoder in the multi-task network system;
network parameters in the encoder and decoder are adjusted based on the reconstruction loss, and the at least one associated loss.
3. The method according to claim 2, wherein the inputting the sample image into the multi-tasking network system for image processing and obtaining the reconstruction loss and at least one association loss corresponding to the back-end visual task using image processing data generated after passing through the decoder in the multi-tasking network system comprises:
extracting features of the sample image by using the feature extraction module to obtain original sample features;
compressing the original sample characteristics by using the encoder to obtain compressed sample characteristics;
reconstructing the compressed sample features by using the decoder to obtain reconstructed sample features;
obtaining the reconstruction loss based on the original sample characteristics and first reconstruction characteristics, wherein the first reconstruction characteristics are obtained by carrying out preset processing on the reconstruction sample characteristics or the reconstruction sample characteristics; and
At least one of a first association loss and a second association loss corresponding to the back-end visual task is acquired, the first association loss is obtained based on the first reconstruction feature and a second reconstruction feature of a reference image related to the sample image, the first reconstruction feature and the second reconstruction feature are features obtained by respectively performing the same processing on the sample image and the reference image by using the multi-task network system, and the second association loss is obtained based on a visual processing result output by the visual task module.
4. The method of claim 3, wherein the obtaining at least one of a first association loss and a second association loss corresponding to the back-end visual task comprises:
responsive to the visual task module not being a learnable network, obtaining the first association loss;
responsive to the visual task module being a learnable network, obtaining the first association loss and/or the second association loss;
the step of obtaining the first association loss includes:
obtaining the first association loss by using the similarity between the first reconstruction feature and the second reconstruction feature of each reference image, wherein the reference image comprises a positive sample and a negative sample of the sample image;
The step of obtaining the second association loss includes:
and obtaining the second association loss by utilizing a visual processing result output by the visual task module and an actual visual result marked by the sample image, wherein the visual processing result is obtained by the visual task module performing the rear-end visual task based on the first reconstruction feature.
5. A method according to claim 3, wherein prior to said compressing the original sample features with the encoder to obtain compressed sample features, the method further comprises:
taking the original sample characteristic as a first characteristic to be processed;
preprocessing the first feature to be processed to obtain a preprocessed feature, wherein the preprocessing is used for adjusting data distribution in the first feature to be processed, and the preprocessed feature is used for inputting the encoder for compression;
and/or, before the obtaining the reconstruction loss based on the original sample feature and the first reconstruction feature, the method further comprises:
taking the reconstructed sample characteristic as a second to-be-processed characteristic;
and performing inverse pretreatment corresponding to the pretreatment on the second feature to be treated to obtain an inverse pretreatment feature, wherein the inverse pretreatment feature is the first reconstruction feature.
6. The method of claim 1, wherein prior to compressing the original features with an encoder to obtain compressed features of the target image, the method further comprises:
taking the original feature as a first feature to be processed;
preprocessing the first feature to be processed to obtain a preprocessed feature, wherein the preprocessing is used for adjusting data distribution in the first feature to be processed, and the preprocessed feature is used for being input into the encoder for compression.
7. The method according to claim 5 or 6, wherein the preprocessing the first feature to be processed to obtain a preprocessed feature comprises:
carrying out data normalization on the first feature to be processed to obtain the preprocessing feature; or,
obtaining a scaling factor and a shifting factor by utilizing preset network learning, and carrying out spatial feature transformation on the first feature to be processed based on the scaling factor and the shifting factor to obtain the preprocessing feature.
8. The method of claim 1, wherein compressing the original feature with an encoder results in a compressed feature of the target image, comprising:
Performing dimension reduction transformation on the original features by using a dimension transformation network to obtain dimension reduction features;
and carrying out quantization treatment on the dimension reduction feature to obtain the compression feature.
9. The method of claim 8, wherein performing the dimension reduction on the original feature using the dimension transformation network to obtain a dimension reduction feature comprises:
obtaining the feature to be input by utilizing the original feature;
at least one of the following first iterative processes is performed to obtain a target attention feature: performing self-attention processing on a first input feature by using a first self-attention network of the dimension transformation network, wherein the first input feature of the initial first iterative process is the feature to be input, and the first input feature of each non-initial first iterative process is the self-attention processing result of the previous first iterative process;
at least one of the following second iterative processes is performed to obtain the dimension-reduction feature: performing dimension reduction processing on a second input feature to obtain a dimension-reduction processing feature, and performing self-attention processing on the dimension-reduction processing feature by using a second self-attention network of the dimension transformation network, wherein the second input feature of the initial second iterative process is the target attention feature, and the second input feature of each non-initial second iterative process is the self-attention processing result of the previous second iterative process.
10. The method of claim 9, wherein the obtaining the feature to be input using the original feature comprises:
downsampling the original features to obtain downsampled features;
partitioning the downsampled features to obtain a plurality of feature blocks;
linearly embedding the plurality of feature blocks to obtain the feature to be input;
the step of performing the dimension reduction processing on the second input feature to obtain a dimension reduction processing feature comprises the following steps:
dividing the plurality of feature blocks in the second input feature into a plurality of shares;
and respectively forming each feature block into a new feature block, and forming the dimension reduction processing feature by utilizing a plurality of new feature blocks.
11. The method of claim 8, wherein said quantizing said dimension-reduction feature to obtain said compressed feature comprises:
determining a plurality of characteristic value intervals based on the characteristic value distribution conditions in the dimension reduction characteristics;
dividing the characteristic values in the dimension reduction characteristic into a plurality of characteristic groups by utilizing the plurality of characteristic value intervals;
and respectively carrying out quantization processing on each group of the characteristic groups by utilizing different quantization parameters so as to obtain the compression characteristic.
12. The method of claim 11, wherein the determining a number of feature value intervals based on the feature value distribution in the dimension-reduction feature comprises:
counting the feature values in the dimension reduction features to obtain feature concentration trend characterization values of the dimension reduction features;
multiplying the characteristic central tendency characterization value by different ratios to obtain a plurality of interval boundary values;
obtaining the plurality of characteristic value intervals by utilizing the plurality of interval boundary values;
said quantizing each of said feature sets with different quantization parameters to obtain said compressed features, respectively, comprising:
for each group of the feature groups, carrying out normalization processing on each feature value in the feature groups to obtain a normalization result of each feature value;
carrying out quantization processing on the normalization result of each characteristic value in the characteristic group by utilizing the quantization parameter corresponding to the characteristic group to obtain a quantization result of each characteristic value in the characteristic group;
and obtaining the compression characteristic by using the quantization result of each characteristic value in each characteristic group.
13. A method for feature encoding and decoding, comprising:
receiving a compression characteristic of the target image obtained by the encoder;
Reconstructing the compression characteristic by using a decoder to obtain a reconstruction characteristic of the target image, wherein the reconstruction characteristic is used for realizing a rear-end visual task;
the network in the decoder is adjusted based on a reconstruction loss and at least one association loss in a training process, wherein the reconstruction loss represents loss generated after characteristics of a sample image are compressed by the encoder and reconstructed by the decoder, and the association loss is used for reflecting accuracy of a task result obtained by performing the back-end visual task by using the characteristics of the sample image reconstructed by the decoder.
14. The method of claim 13, wherein prior to reconstructing the compressed features with a decoder, the method further comprises:
the multi-task network system is formed by sequentially connecting a feature extraction module, an encoder, a decoder and a visual task module, wherein the feature extraction module is used for extracting and obtaining features input to the encoder, and the visual task module is used for executing the rear-end visual task;
inputting the sample image into the multi-task network system for image processing, and acquiring the reconstruction loss and at least one association loss corresponding to the back-end visual task by utilizing image processing data generated after the decoder in the multi-task network system;
Network parameters in the encoder and decoder are adjusted based on the reconstruction loss, and the at least one associated loss.
15. The method of claim 14, wherein inputting the sample image into the multi-tasking network system for image processing and obtaining the reconstruction loss and at least one association loss corresponding to the back-end visual task using image processing data generated in the multi-tasking network system after passing through the decoder, comprises:
extracting features of the sample image by using the feature extraction module to obtain original sample features;
compressing the original sample characteristics by using the encoder to obtain compressed sample characteristics;
reconstructing the compressed sample features by using the decoder to obtain reconstructed sample features;
obtaining the reconstruction loss based on the original sample features and a first reconstruction feature, wherein the first reconstruction feature is the reconstructed sample feature or is obtained by performing preset processing on the reconstructed sample feature; and
at least one of a first association loss and a second association loss corresponding to the back-end visual task is acquired, the first association loss is obtained based on the first reconstruction feature and a second reconstruction feature of a reference image related to the sample image, the first reconstruction feature and the second reconstruction feature are features obtained by respectively performing the same processing on the sample image and the reference image by using the multi-task network system, and the second association loss is obtained based on a visual processing result output by the visual task module.
16. The method of claim 15, wherein prior to said compressing the original sample features with the encoder to obtain compressed sample features, the method further comprises:
taking the original sample characteristic as a first characteristic to be processed;
preprocessing the first feature to be processed to obtain a preprocessed feature, wherein the preprocessing is used for adjusting data distribution in the first feature to be processed, and the preprocessed feature is used for inputting the encoder for compression;
and/or, before the obtaining the reconstruction loss based on the original sample feature and the first reconstruction feature, the method further comprises:
taking the reconstructed sample characteristic as a second to-be-processed characteristic;
and performing inverse pretreatment corresponding to the pretreatment on the second feature to be treated to obtain an inverse pretreatment feature, wherein the inverse pretreatment feature is the first reconstruction feature.
17. The method of claim 13, wherein reconstructing the compressed features using a decoder results in reconstructed features of the target image, comprising:
performing dimension-lifting transformation on the compression characteristic by using a dimension-lifting transformation network to obtain a dimension-lifting characteristic;
And performing inverse quantization processing on the dimension-increasing feature to obtain the reconstruction feature.
18. The method of claim 17, wherein said performing an up-scaling transformation on said compressed feature using an inverse dimensional transformation network to obtain an up-scaling feature comprises:
performing dimension lifting processing on the compressed features to obtain initial dimension lifting features;
performing linear mapping on the initial dimension-increasing characteristics to obtain mapping characteristics;
upsampling the mapping feature to obtain the dimension-increasing feature;
the inverse quantization processing is performed on the dimension-increasing feature to obtain the reconstruction feature, which comprises the following steps:
receiving packet information of each characteristic value in the compression characteristics sent by an encoding end;
dividing the compressed features into feature groups based on the grouping information;
and respectively performing inverse quantization processing on each group of feature groups by utilizing different quantization parameters so as to obtain the dimension-increasing features.
19. A method of training a codec, comprising:
extracting features of the sample image by using a feature extraction module to obtain original sample features;
compressing the original sample characteristics by using an encoder to obtain compressed sample characteristics;
Reconstructing the compressed sample features by using a decoder to obtain reconstructed sample features;
obtaining the reconstruction loss based on the original sample features and a decoding reconstruction feature, wherein the decoding reconstruction feature is the reconstructed sample feature or is obtained by performing preset processing on the reconstructed sample feature;
obtaining at least one association loss corresponding to a back-end visual task based on at least one of the decoding reconstruction features and visual processing results output by a visual task module, wherein the visual processing results are obtained by the visual task module performing the back-end visual task based on the decoding reconstruction features;
network parameters in the encoder and decoder are adjusted based on the reconstruction loss, and the at least one associated loss.
20. An electronic device comprising a memory and a processor coupled to each other for executing program instructions stored in the memory to implement the feature codec method of any one of claims 1 to 12, the feature codec method of any one of claims 13-18, or the codec training method of claim 19.
21. A computer readable storage medium having stored thereon program instructions, which when executed by a processor, implement the feature codec method of any one of claims 1 to 12, the feature codec method of any one of claims 13-18, or the codec training method of claim 19.