CN114067009A - Image processing method and device based on Transformer model - Google Patents


Info

Publication number
CN114067009A
CN114067009A
Authority
CN
China
Prior art keywords
frequency domain
image data
component
domain information
minimum coding
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111232630.8A
Other languages
Chinese (zh)
Inventor
徐�明
何潇
刘强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen ZNV Technology Co Ltd
Original Assignee
Shenzhen ZNV Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen ZNV Technology Co Ltd filed Critical Shenzhen ZNV Technology Co Ltd
Priority to CN202111232630.8A priority Critical patent/CN114067009A/en
Publication of CN114067009A publication Critical patent/CN114067009A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T9/00Image coding
    • G06T9/007Transform coding, e.g. discrete cosine transform

Landscapes

  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Discrete Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)
  • Compression Of Band Width Or Redundancy In Fax (AREA)

Abstract

The invention provides an image processing method and device based on a Transformer model. The method comprises the following steps: acquiring JPEG image data; performing entropy decoding and inverse quantization on the JPEG image data to obtain the frequency domain information corresponding to the JPEG image data; constructing an input sequence that meets the input requirements of a Transformer model from the frequency domain information corresponding to the JPEG image data; and inputting the input sequence into the Transformer model for training and/or inference of the Transformer model. By only partially decoding the JPEG image data, the image decoding time is effectively shortened, so that the time consumed by Transformer model training or inference is reduced and efficiency is improved.

Description

Image processing method and device based on Transformer model
Technical Field
The embodiment of the invention relates to the technical field of image processing, in particular to an image processing method and device based on a Transformer model.
Background
With the development of image sensors and display technologies, from high definition to the ultra high definition of 4K and 8K, image resolution keeps increasing and the data volume grows accordingly. When image processing is carried out based on a traditional convolutional neural network model or recurrent neural network model, directly inputting high-resolution original image data into the model means that the image processing process cannot proceed normally, because limited computing resources cannot load such a large data volume; downsampling the original image data reduces the data volume, but loses the detail features of the image and defeats the purpose of increasing image resolution. It can be seen that traditional neural network models cannot meet the processing requirements of high-resolution images, so the Transformer model, which takes a sequence as input data, has been introduced into the field of computer vision for image processing.
The Joint Photographic Experts Group (JPEG) standard for continuous-tone still image compression is widely used for its excellent compression quality. At present, when a JPEG image is processed based on a Transformer model, image data stored in JPEG format must first be completely decoded to the pixel domain, and this decoding process takes a long time, making training and inference of the Transformer model time-consuming and inefficient.
Disclosure of Invention
The embodiment of the invention provides an image processing method and device based on a Transformer model, which are used to solve the problems of long training and inference time and low efficiency in the conventional Transformer-based image processing method.
In a first aspect, an embodiment of the present invention provides an image processing method based on a Transformer model, including:
acquiring JPEG image data;
performing entropy decoding and inverse quantization processing on JPEG image data to obtain frequency domain information corresponding to the JPEG image data, wherein the frequency domain information comprises Discrete Cosine Transform (DCT) coefficients of a Y component, a U component and a V component;
constructing an input sequence which meets the input requirement of a Transformer model according to frequency domain information corresponding to JPEG image data, wherein the Transformer model is constructed based on an attention mechanism;
the input sequence is input into the Transformer model for training and/or inference of the Transformer model.
In one embodiment, constructing an input sequence meeting the input requirement of a Transformer model according to frequency domain information corresponding to JPEG image data includes:
acquiring frequency domain information corresponding to each minimum coding unit from frequency domain information corresponding to JPEG image data;
expanding the frequency domain information corresponding to the minimum coding unit according to the Y component, the U component and the V component in sequence to form a frequency domain feature vector corresponding to the minimum coding unit;
generating a position feature vector corresponding to the minimum coding unit according to the position of the minimum coding unit in the JPEG image data, wherein the dimension of the position feature vector is the same as that of the frequency domain feature vector;
and fusing the frequency domain feature vectors and the position feature vectors corresponding to the minimum coding units, and arranging them in a first preset order to form an input sequence meeting the input requirement of a Transformer model.
In one embodiment, the size of the minimum coding unit is 16 × 16; the frequency domain information corresponding to the minimum coding unit includes frequency domain information of 4 data units on the Y component, 1 data unit on the U component, and 1 data unit on the V component, where the size of a data unit is 8 × 8; and the frequency domain information corresponding to the minimum coding unit is expanded in the order of the Y component, the U component, and the V component to form a 384-dimensional frequency domain feature vector corresponding to the minimum coding unit.
In one embodiment, constructing an input sequence meeting the input requirement of a Transformer model according to frequency domain information corresponding to JPEG image data includes:
combining a plurality of adjacent minimum coding units to form a maximum combination unit;
acquiring frequency domain information corresponding to each maximum combination unit from frequency domain information corresponding to JPEG image data;
generating a frequency domain feature vector corresponding to the maximum combination unit according to the frequency domain feature vectors corresponding to the minimum coding units included in the maximum combination unit, wherein the frequency domain feature vector corresponding to the minimum coding unit is formed by expanding frequency domain information corresponding to the minimum coding unit according to the sequence of a Y component, a U component and a V component;
generating a position feature vector corresponding to the maximum combination unit according to the position of the maximum combination unit in JPEG image data, wherein the dimension of the position feature vector is the same as that of the frequency domain feature vector;
and fusing the frequency domain feature vectors and the position feature vectors corresponding to the maximum combination units, and arranging them in a second preset order to form an input sequence meeting the input requirement of the Transformer model.
In one embodiment, the minimum coding units included in the maximum combination unit are distributed in a square shape in the JPEG image data.
In one embodiment, the JPEG image data includes high definition image data in JPEG format, 4K image data, and 8K image data.
In one embodiment, the Transformer model is used for at least one of image semantic segmentation, target recognition, target detection, image classification, and target tracking of JPEG image data.
In a second aspect, an embodiment of the present invention provides an image processing apparatus based on a Transformer model, including:
the acquiring module is used for acquiring JPEG image data;
the decoding module is used for carrying out entropy decoding and inverse quantization processing on the JPEG image data to obtain frequency domain information corresponding to the JPEG image data, wherein the frequency domain information comprises Discrete Cosine Transform (DCT) coefficients of a Y component, a U component and a V component;
the building module is used for building an input sequence which meets the input requirement of a Transformer model according to frequency domain information corresponding to JPEG image data, and the Transformer model is built on the basis of an attention mechanism;
and the processing module is used for inputting the input sequence into the Transformer model for training and/or inference of the Transformer model.
In a third aspect, an embodiment of the present invention provides an image processing apparatus, including:
at least one processor and memory;
the memory stores computer-executable instructions;
the at least one processor executes the computer-executable instructions stored by the memory to cause the at least one processor to perform the image processing method based on the Transformer model according to any of the first aspects.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the image processing method based on the Transformer model according to any one of the first aspect.
According to the image processing method and device based on the Transformer model provided by the embodiment of the invention, entropy decoding and inverse quantization are performed on JPEG image data to obtain the corresponding frequency domain information, and an input sequence meeting the Transformer model input requirement is constructed from that frequency domain information in order to train and/or perform inference with the Transformer model. Because the JPEG image data is only partially decoded, the inverse DCT (discrete cosine transform) and color space transformation steps, which consume the most time and computing resources in the normal JPEG decoding process, are skipped, effectively shortening the image decoding time, so the time consumed by Transformer model training or inference is reduced and efficiency is improved.
Drawings
FIG. 1 is a schematic structural diagram of a Transformer model;
FIG. 2 is a schematic diagram of image processing based on a Transformer model;
FIG. 3 is a schematic diagram of JPEG encoding and decoding;
FIG. 4 is a flowchart of an image processing method based on a Transformer model according to an embodiment of the present invention;
FIG. 5 is a diagram of a minimum coding unit and a maximum combination unit according to an embodiment of the present invention;
FIG. 6 is a flowchart of an image processing method based on a Transformer model according to another embodiment of the present invention;
FIG. 7 is a flowchart of an image processing method based on a Transformer model according to another embodiment of the present invention;
FIG. 8 is a schematic structural diagram of an image processing apparatus based on a Transformer model according to an embodiment of the present invention;
FIG. 9 is a schematic structural diagram of an image processing apparatus according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the following detailed description and accompanying drawings, wherein like elements in different embodiments are given like reference numerals. In the following description, numerous details are set forth to provide a better understanding of the present application. However, those skilled in the art will readily recognize that some of these features may be omitted, or replaced with other elements, materials, or methods, in different instances. In some instances, certain operations related to the present application are not shown or described in detail in order to avoid obscuring the core of the present application with excessive description; a detailed description of these operations is unnecessary for those skilled in the art, who can fully understand them from the description in the specification and the general knowledge in the art.
Furthermore, the features, operations, or characteristics described in the specification may be combined in any suitable manner to form various embodiments. Likewise, the various steps or actions in the method descriptions may be reordered or combined in ways that will be apparent to one of ordinary skill in the art. Thus, the various sequences in the specification and drawings are for the purpose of describing certain embodiments only and do not imply a required order unless it is otherwise stated that a certain order must be followed.
The numbering of components, e.g., "first", "second", etc., is used herein only to distinguish the described objects and carries no sequential or technical meaning. The terms "connected" and "coupled", when used in this application, include both direct and indirect connections (couplings) unless otherwise indicated.
With the development of image sensors and display technologies, high-resolution images are finding ever richer applications in industries such as intelligent security, city management, and the industrial internet of things. Especially in recent years, with the rapid development of 5G and 8K technologies, 8K offers ultra high definition, high frame rate, and wide dynamic range, while 5G offers high bandwidth, low latency, and wide-coverage connectivity, so artificial intelligence applications based on ultra-high-definition video are being integrated with the internet of things; examples in wide use include building-violation detection and industrial flaw detection based on ultra-high-definition drone images, and ultra-high-definition intelligent cameras.
Traditional neural network models, such as convolutional neural network models and recurrent neural network models, are mainly applied in the spatial domain, that is, they directly process Red-Green-Blue (RGB) pixels. In practical applications, particularly when processing high-definition video and image data, the data must be downsampled to the predetermined input size of the neural network model. From high definition to the ultra-high-definition images of 4K and 8K, the per-frame resolution is 1920 × 1080 (about 2.07 million pixels), 3840 × 2160 (about 8.29 million pixels), and 7680 × 4320 (about 33.18 million pixels), respectively. Using such images with a traditional neural network model requires scaling and downsampling them; although the downsampling operation reduces the amount of computation and the required communication bandwidth, it discards the abundant detail information in the image, so the significance of ultra high definition is lost. Even for an ultra-high-definition 8K image, excessive downsampling makes it impossible to fully analyze the rich semantic information it contains, and tasks such as small-target detection and fine-grained recognition in particular cannot be performed. If ultra-high-definition images are directly input into a traditional neural network model for training, model training cannot proceed normally because limited computing resources cannot load the huge data volume; for example, the NVIDIA Graphics Processing Units (GPUs) commonly used for model training have 16 GB or 32 GB of video memory, and a single server cannot meet the training requirements of ultra-high-definition images.
To overcome the above problems of traditional neural network models in processing image data, the Transformer model has been introduced into the field of computer vision. Its network structure, shown in fig. 1, discards the conventional Convolutional Neural Network (CNN) and Recurrent Neural Network (RNN) structures; the entire network is built from attention mechanisms. By introducing a multi-head attention mechanism and removing operations such as downsampling and convolution, the Transformer model can make up for the limitations of traditional neural network models to a certain extent. The Transformer model requires a sequence as input, so the simplest approach is to divide the picture into pixel blocks, apply a linear transformation to each block and flatten it into a sequence element, and finally add the position coding information of each pixel block within the picture to form an input sequence, which is input into the Transformer model; the specific process is illustrated in fig. 2. It should be noted that the Transformer model referred to in the present application is not limited to the network structure shown in fig. 1 and also covers various modified Transformer models.
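The pixel-block construction just described can be sketched in a few lines of NumPy. This is an illustrative outline, not code from the application; the 16 × 16 block size and the random test image are assumptions, and the learned linear projection and position coding that would follow are only mentioned in comments:

```python
import numpy as np

def patchify(image, p=16):
    """Split an H x W x C picture into flattened p x p pixel blocks, one row per block."""
    h, w, c = image.shape
    assert h % p == 0 and w % p == 0, "sketch assumes dimensions divisible by p"
    blocks = image.reshape(h // p, p, w // p, p, c).swapaxes(1, 2)
    return blocks.reshape(-1, p * p * c)      # [num_blocks, p*p*C]

# A learned projection would then map each block to the model dimension,
# and position coding would be added before the sequence enters the Transformer.
rng = np.random.default_rng(0)
img = rng.random((64, 64, 3))
seq = patchify(img)                           # 16 blocks of 16*16*3 = 768 values each
```

Each row of `seq` is one pixel block read in raster order, which is what the position coding later disambiguates.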
JPEG is widely used for its excellent quality, and a large amount of image data is currently stored in the JPEG format. The JPEG encoding and decoding processes are shown in fig. 3. As shown there, the JPEG encoding process may sequentially include the following five steps: 1. color space transformation (from the RGB color space to the YUV color space); 2. downsampling; 3. DCT transformation; 4. DCT coefficient quantization; 5. entropy coding. The JPEG decoding process is the inverse of the encoding process and may sequentially include the following five steps: 1. entropy decoding; 2. inverse quantization of DCT coefficients; 3. inverse DCT transformation; 4. upsampling; 5. color space transformation (from the YUV color space to the RGB color space).
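The DCT stages above can be illustrated with the orthonormal 8 × 8 DCT-II basis used by JPEG. The NumPy sketch below is not from the application; it shows the forward transform (encoding step 3) and the inverse transform (decoding step 3) that the proposed method skips:

```python
import numpy as np

def dct_matrix(n=8):
    """Orthonormal DCT-II basis matrix for n x n blocks (n = 8 in JPEG)."""
    k = np.arange(n)
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    c[0, :] = np.sqrt(1.0 / n)               # DC row has its own normalization
    return c

def forward_dct(block):
    """Encoding step 3: transform an 8 x 8 pixel block to the frequency domain."""
    c = dct_matrix(block.shape[0])
    return c @ block @ c.T

def inverse_dct(coeffs):
    """Decoding step 3: the costly step the method of this application avoids."""
    c = dct_matrix(coeffs.shape[0])
    return c.T @ coeffs @ c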
If the approach shown in fig. 2 is used to train the Transformer model or perform inference with it, the JPEG image data must first be sequentially subjected to entropy decoding, DCT coefficient inverse quantization, inverse DCT transformation, upsampling, and color space transformation to obtain a picture in the RGB color space, which is then segmented in the spatial domain into a plurality of pixel blocks. In the JPEG decoding process, the inverse DCT transformation and the color space transformation not only take time but also need a large amount of computing resources, making training and inference of the Transformer model time-consuming and inefficient. To solve this problem, on the basis of deep research into the Transformer model and JPEG coding, the present application provides a JPEG image processing method based on the Transformer model: frequency domain information is extracted directly during the JPEG image data decoding process, and the Transformer model is trained and used for inference efficiently based on that frequency domain information. This is described in detail below through specific examples.
Fig. 4 is a flowchart of an image processing method based on a Transformer model according to an embodiment of the present invention. As shown in fig. 4, the method may include:
s401, JPEG image data is obtained.
In the present embodiment, the JPEG image data can be read, for example, from a storage device. The JPEG image data in this embodiment may be high-resolution image data, and may include, for example, high-definition image data in JPEG format, 4K image data, and 8K image data.
S402, carrying out entropy decoding and inverse quantization processing on the JPEG image data to obtain frequency domain information corresponding to the JPEG image data, wherein the frequency domain information comprises Discrete Cosine Transform (DCT) coefficients of a Y component, a U component and a V component.
The JPEG encoding process may include the following five steps. 1) Color space transformation: convert the color space of the image from RGB to YUV. 2) Downsampling: downsample the spatial resolution of the chrominance channels (U and V), typically to half the resolution of the luminance channel Y, i.e., downsample U and V such that the length and width of the U and V components are half those of the Y component. 3) DCT transformation: following the zigzag coding order, apply the DCT to the Y, U, and V components respectively to obtain frequency domain information. 4) Quantization: quantize the amplitudes of the frequency components non-linearly; because human vision is more sensitive to small changes of large-area color or brightness than to the intensity of high-frequency brightness changes, the high-frequency components are stored with lower precision than the low-frequency components. 5) Entropy coding: further reduce the size of the data with lossless compression algorithms such as Huffman coding. Accordingly, in this embodiment, after the JPEG image data is subjected to entropy decoding and inverse quantization, the frequency domain information corresponding to the JPEG image data is obtained, and this frequency domain information includes the DCT coefficients of the Y component, the U component, and the V component.
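Inverse quantization (the second and final decoding step performed in S402) is element-wise multiplication of the entropy-decoded integer coefficient levels by the quantization table used at encode time. A minimal NumPy sketch, not from the application, with an arbitrary flat table chosen for illustration (real JPEG files carry their quantization tables in the bitstream):

```python
import numpy as np

def quantize(dct_coeffs, qtable):
    """Encoding step 4: store high-frequency amplitudes more coarsely."""
    return np.rint(dct_coeffs / qtable).astype(np.int32)

def dequantize(levels, qtable):
    """Decoding step 2: recover approximate DCT coefficients (frequency domain)."""
    return levels * qtable

qtable = np.full((8, 8), 16)                   # illustrative table, not a standard one
coeffs = np.linspace(-500.0, 500.0, 64).reshape(8, 8)
recovered = dequantize(quantize(coeffs, qtable), qtable)
# each recovered coefficient is within half a quantization step of the original
```

The rounding in `quantize` is the lossy part of JPEG; `dequantize` itself loses nothing further, so the frequency domain information obtained in S402 is as accurate as the stored file allows.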
S403, constructing an input sequence which meets the input requirement of a Transformer model according to frequency domain information corresponding to JPEG image data, wherein the Transformer model is constructed based on an attention mechanism.
In this embodiment, after the frequency domain information corresponding to the JPEG image data is obtained, the input sequence may be constructed directly in the frequency domain from frequency domain data blocks. For example, the frequency domain information corresponding to the JPEG image data can be divided into a plurality of frequency domain data blocks; each block is then linearly transformed and flattened into a sequence element; and finally, the position coding information of each frequency domain data block is added to construct an input sequence meeting the input requirement of the Transformer model.
S404, inputting the input sequence into a Transformer model for training and/or reasoning the Transformer model.
In this embodiment, after the input sequence is constructed, it may be input into a Transformer model to train and/or perform inference with the Transformer model. The Transformer model in this embodiment may be used to perform at least one of image semantic segmentation, target recognition, target detection, image classification, and target tracking on the JPEG image data.
According to the image processing method based on the Transformer model provided by this embodiment, entropy decoding and inverse quantization are performed on JPEG image data to obtain the corresponding frequency domain information, and an input sequence meeting the Transformer model input requirement is constructed from that information to train and/or perform inference with the Transformer model. Because the JPEG image data is only partially decoded, the inverse DCT (discrete cosine transform) and color space transformation steps, which consume the most time and computing resources in the normal JPEG decoding process, are skipped, effectively shortening the image decoding time, so the time consumed by Transformer model training or inference is reduced and efficiency is improved. After entropy decoding and inverse quantization, the image is already in the frequency domain and no information has been lost; therefore, when the method provided by this embodiment is used to process high-resolution image data, the abundant detail features in the image are effectively retained, which helps improve accuracy in semantic segmentation, small-target detection, and fine-grained recognition.
Based on the above embodiment, input sequences meeting the Transformer model input requirement will be constructed based on the minimum coding unit and the maximum combination unit, respectively; both units are illustrated in fig. 5. As shown in fig. 5, a Minimum Coded Unit (MCU) is a square block of 16 × 16 pixels in the original image data, and one MCU corresponds to a 16 × 16-pixel Y component, an 8 × 8-pixel U component, and an 8 × 8-pixel V component in the YUV color space. For image data stored in the JPEG format, the frequency domain information generated after entropy decoding and inverse quantization is stored per MCU, so constructing the input sequence based on the minimum coding unit makes full use of this characteristic of the JPEG decoding process to further improve efficiency. For ultra-high-definition image data such as 4K and 8K, to further improve processing efficiency, a plurality of adjacent MCUs may be combined to form a maximum combination unit (LCU); one LCU may include, for example, a row of adjacent MCUs, a column of adjacent MCUs, or a plurality of adjacent MCUs distributed in a rectangle. To match the JPEG decoding process, the plurality of minimum coding units included in a maximum combination unit may be distributed in a square in the JPEG image data; as shown in fig. 5, one LCU may include 2 × 2 MCUs.
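The MCU grid of an image, and the grouping of 2 × 2 adjacent MCUs into LCUs as in fig. 5, can be counted as follows. This is an illustrative sketch, not code from the application; the 2 × 2 grouping is just the fig. 5 example:

```python
def mcu_grid(width, height, mcu=16):
    """Number of MCU columns and rows covering the image (MCUs are 16 x 16 pixels)."""
    cols = -(-width // mcu)          # ceiling division
    rows = -(-height // mcu)
    return cols, rows

def lcu_grid(width, height, side=2, mcu=16):
    """Group side x side adjacent MCUs into one LCU (side = 2 gives 2 x 2 MCUs)."""
    cols, rows = mcu_grid(width, height, mcu)
    return -(-cols // side), -(-rows // side)

# 4K example: 3840 x 2160 gives a 240 x 135 grid, i.e. 32400 MCUs,
# which the 2 x 2 grouping shortens to a 120 x 68 grid of LCUs.
```

Grouping MCUs into LCUs shortens the input sequence by roughly the square of the grouping side, which is the efficiency gain motivating the LCU construction for ultra-high-definition images.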
A specific implementation of constructing the input sequence based on the minimum coding unit can refer to fig. 6. As shown in fig. 6, in the image processing method based on a Transformer model provided in this embodiment, on the basis of the method shown in fig. 4, constructing an input sequence meeting the Transformer model input requirement according to the frequency domain information corresponding to the JPEG image data may specifically include:
s601, acquiring frequency domain information corresponding to each minimum coding unit from the frequency domain information corresponding to the JPEG image data.
In this embodiment, the size of the MCU is 16 × 16, and the MCU may be further divided into Data Units (DUs), where a DU is an 8 × 8 data block subjected to the DCT transformation; one MCU includes 6 DUs, of which 4 DUs belong to the Y component and one DU each to the U component and the V component. That is, the frequency domain information corresponding to the minimum coding unit includes frequency domain information of 4 data units on the Y component, 1 data unit on the U component, and 1 data unit on the V component, and the size of a data unit is 8 × 8.
For image data stored in the JPEG format, entropy decoding (e.g., Huffman decoding) and inverse quantization produce DCT frequency domain information in non-overlapping 8 × 8 blocks, stored per MCU, with one MCU containing 6 blocks of 8 × 8 DCT frequency domain information. Therefore, the frequency domain information corresponding to each minimum coding unit can be acquired quickly and efficiently from the frequency domain information corresponding to the JPEG image data.
And S602, expanding the frequency domain information corresponding to the minimum coding unit in the order of the Y component, the U component, and the V component to form the frequency domain feature vector corresponding to the minimum coding unit.
The frequency domain information of the 6 8 × 8 DCT blocks contained in each MCU is expanded in the order of the Y component, the U component, and the V component to form the 384-dimensional frequency domain feature vector corresponding to the minimum coding unit.
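This flattening step can be sketched directly. The NumPy outline below is illustrative rather than code from the application; it assumes each MCU's frequency domain information arrives as four Y data units followed by one U and one V data unit:

```python
import numpy as np

def mcu_feature_vector(y_dus, u_du, v_du):
    """Flatten one MCU's DCT data in Y, U, V order into a 384-dim feature vector.

    y_dus has shape (4, 8, 8); u_du and v_du each have shape (8, 8).
    """
    parts = [du.ravel() for du in y_dus] + [u_du.ravel(), v_du.ravel()]
    return np.concatenate(parts)          # 6 DUs * 64 coefficients = 384 values

rng = np.random.default_rng(1)            # random stand-ins for real DCT coefficients
vec = mcu_feature_vector(rng.random((4, 8, 8)), rng.random((8, 8)), rng.random((8, 8)))
```

Keeping a fixed Y-then-U-then-V order means every position in the 384-dimensional vector always holds the same (component, frequency) coefficient, which the Transformer can exploit.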
And S603, generating a position feature vector corresponding to the minimum coding unit according to the position of the minimum coding unit in the JPEG image data, wherein the dimension of the position feature vector is the same as that of the frequency domain feature vector.
Position coding is performed on the position of the minimum coding unit in the JPEG image data to generate the position feature vector corresponding to the minimum coding unit, which describes the position information of the MCU in the JPEG image. This embodiment does not limit the specific implementation of the position coding. To facilitate fusing the frequency domain feature vector with the position feature vector, the two vectors have the same dimension in this embodiment.
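Since the embodiment leaves the position coding scheme open, one common choice (an assumption here, not mandated by the application) is the sinusoidal encoding from the original Transformer, producing a vector with the same 384 dimensions as the frequency domain feature vector:

```python
import numpy as np

def sinusoidal_position_vector(pos, dim=384):
    """Sinusoidal position feature vector for the MCU at sequence index `pos`."""
    i = np.arange(dim // 2)
    angles = pos / np.power(10000.0, 2 * i / dim)
    vec = np.empty(dim)
    vec[0::2] = np.sin(angles)            # even dimensions: sine
    vec[1::2] = np.cos(angles)            # odd dimensions: cosine
    return vec
```

A learned position embedding table of shape [num_MCUs, 384] would serve equally well under the embodiment's wording.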
S604, fusing the frequency domain feature vectors and the position feature vectors corresponding to the minimum coding units, and arranging them in a first preset order to form an input sequence meeting the input requirement of the Transformer model.
The frequency domain feature vector and the position feature vector corresponding to each minimum coding unit are fused, for example by element-wise addition or weighted summation, and the results are then arranged in a first preset order, for example the zigzag ("Z") scanning order of the MCUs.
The following describes the method provided by this embodiment in detail, taking high-definition (1080p) image data (resolution 1920 × 1080) as an example:
the JPEG image of 1080p does not need complete JPEG decoding, corresponding to a YUV color space, 8100 MCU coding information is generated in the JPEG decoding process, wherein one MCU generates 6 frequency domain information of 8 multiplied by 8 DCT.
In the JPEG decoding process, the frequency domain coefficients of the 6 DUs generated by entropy decoding and inverse quantization of each MCU are flattened in the order of the YUV color components to form a 384-dimensional frequency domain feature vector, and the frequency domain feature vectors of the 8100 MCUs form a frequency domain sequence of shape [8100, 384]; that is, the frequency domain feature vector of one MCU corresponds to one row vector of the frequency domain sequence.
Position encoding is performed on the position, in the JPEG image, of the 16×16 pixel block corresponding to each MCU to form a 384-dimensional position feature vector, and the position feature vectors of the 8100 MCUs form a position encoding sequence of shape [8100, 384]. That is, the position feature vector of one MCU corresponds to one row vector of the position encoding sequence. The [8100, 384] frequency domain sequence and the [8100, 384] position encoding sequence are added to form an [8100, 384] input sequence, which is finally input into the Transformer model to complete model training or inference.
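Since the embodiment does not fix a particular position-encoding scheme, the sketch below uses a standard sinusoidal encoding over the MCU scan index as one assumed choice, matched to 384 dimensions so that fusion by element-wise addition is possible:

```python
import numpy as np

# Hypothetical sketch: sinusoidal position encoding over the MCU scan index.
# The scheme is an assumed example; the embodiment leaves it unspecified.
def sinusoidal_position_encoding(num_positions, dim):
    pos = np.arange(num_positions)[:, None]            # (N, 1)
    i = np.arange(dim // 2)[None, :]                   # (1, dim/2)
    angles = pos / np.power(10000.0, 2 * i / dim)
    pe = np.zeros((num_positions, dim))
    pe[:, 0::2] = np.sin(angles)                       # even dims: sine
    pe[:, 1::2] = np.cos(angles)                       # odd dims: cosine
    return pe

freq_seq = np.random.randn(8100, 384)                  # stand-in frequency sequence
pos_seq = sinusoidal_position_encoding(8100, 384)      # (8100, 384)
input_seq = freq_seq + pos_seq                         # Transformer input
print(input_seq.shape)  # (8100, 384)
```

Because both sequences share the shape [8100, 384], the addition is a plain element-wise sum, exactly as the worked example requires.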
Based on the above embodiment, the Transformer-model-based image processing method constructs an input sequence meeting the Transformer model's input requirements with the MCU as the minimum unit, making full use of the fact that frequency domain information is stored per MCU during JPEG decoding, which helps further shorten the training or inference time of the Transformer model and improve efficiency.
When a 4K (resolution 3840 × 2160) ultra-high-definition JPEG image is processed, the coding information of 32400 MCUs is generated during JPEG decoding; constructing the input sequence per MCU would make the sequence too long and reduce the training or inference efficiency of the Transformer model. To adapt to tasks that process ultra-high-definition JPEG images, this embodiment provides a method for constructing the input sequence based on a maximum combination unit; a specific implementation is shown in fig. 7. As shown in fig. 7, in the Transformer-model-based image processing method provided in this embodiment, on the basis of the method shown in fig. 4, constructing an input sequence meeting the Transformer model's input requirements according to the frequency domain information corresponding to the JPEG image data may specifically include:
S701: combine a plurality of adjacent minimum coding units to form a maximum combination unit.
In this embodiment, the number of minimum coding units included in a maximum combination unit may be adjusted according to the image resolution: the higher the resolution, the more minimum coding units a maximum combination unit contains. That is, the number of minimum coding units in a maximum combination unit is positively correlated with the image resolution, so as to adapt to ultra-high-definition processing tasks. In an alternative embodiment, the minimum coding units included in a maximum combination unit are distributed in a square in the JPEG image data, so as to fully exploit the image correlation between adjacent MCUs.
S702, acquiring frequency domain information corresponding to each maximum combination unit from the frequency domain information corresponding to the JPEG image data.
The frequency domain information corresponding to the JPEG image data is stored per MCU, so the frequency domain information corresponding to a maximum combination unit can be determined simply by acquiring the frequency domain information of each minimum coding unit it includes.
S703: generate the frequency domain feature vector corresponding to each maximum combination unit according to the frequency domain feature vectors of the minimum coding units it includes, where the frequency domain feature vector of a minimum coding unit is formed by expanding its frequency domain information in the order of the Y component, the U component, and the V component.
S704: generate the position feature vector corresponding to each maximum combination unit according to its position in the JPEG image data, where the position feature vector has the same dimension as the frequency domain feature vector.
S705: fuse the frequency domain feature vector and the position feature vector corresponding to each maximum combination unit, and arrange the fused vectors in a second preset order to form an input sequence meeting the input requirements of the Transformer model.
The method provided by this embodiment is described in detail below, taking 4K ultra-high-definition image data (resolution 3840 × 2160, about 8.29 million pixels) as an example:
A 4K JPEG image likewise does not require complete JPEG decoding. In the YUV color space, the coding information of 32400 (8100 × 4) MCUs is generated during the JPEG decoding process, where each MCU produces six 8×8 blocks of DCT frequency domain information.
In the JPEG decoding process, the frequency domain coefficients of the six DUs generated by entropy decoding and inverse quantization of each MCU are flattened in the order of the Y, U, and V color components to form a 384-dimensional frequency domain feature vector. Given the resolution of the 4K image, the LCU in this embodiment comprises 2 × 2 adjacent MCUs, so the frequency domain feature vector of an LCU has 1536 (384 × 4) dimensions, and the 8100 LCUs form a frequency domain sequence of shape [8100, 1536].
Position encoding is performed on the position, in the JPEG image, of the 32×32 pixel block corresponding to each LCU (2 × 2 MCUs of 16 × 16 pixels) to form a 1536-dimensional position feature vector, and the position feature vectors of the 8100 LCUs form a position encoding sequence of shape [8100, 1536]. The frequency domain sequence and the position encoding sequence are then added to form an [8100, 1536] input sequence, which is finally input into the Transformer model to complete model training or inference.
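The 2 × 2 grouping of MCU feature vectors into LCU feature vectors can be sketched as follows; the even 180 × 180 MCU grid is an assumption chosen for illustration so that 32400 MCUs reduce to exactly 8100 LCUs, and the concatenation order within an LCU (top-left, top-right, bottom-left, bottom-right) is likewise an assumed convention:

```python
import numpy as np

# Hypothetical sketch: group adjacent 2x2 MCUs into one LCU by concatenating
# their 384-dim frequency feature vectors into a 1536-dim LCU vector.
def mcus_to_lcus(mcu_seq, grid_w, grid_h):
    # mcu_seq: (grid_h * grid_w, 384), row-major over the MCU grid
    grid = mcu_seq.reshape(grid_h, grid_w, 384)
    # split the grid into 2x2 neighborhoods
    grid = grid.reshape(grid_h // 2, 2, grid_w // 2, 2, 384)
    grid = grid.transpose(0, 2, 1, 3, 4)   # (h/2, w/2, 2, 2, 384)
    # concatenate the 4 MCU vectors of each neighborhood along the feature axis
    return grid.reshape(-1, 4 * 384)       # (h*w/4, 1536)

mcu_seq = np.random.randn(180 * 180, 384)  # assumed even 180x180 MCU grid
lcu_seq = mcus_to_lcus(mcu_seq, 180, 180)
print(lcu_seq.shape)  # (8100, 1536)
```

Each row of `lcu_seq` holds the concatenated frequency features of one square 2 × 2 block of MCUs, so adjacent MCUs stay together in one sequence element, as motivated above.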
In the image processing method based on the Transformer model provided in this embodiment, on the basis of the above embodiment, a maximum combination unit is formed by combining a plurality of adjacent minimum coding units, and an input sequence conforming to the input requirement of the Transformer model is further constructed with an LCU as a minimum unit, which is more suitable for processing ultra-high definition image data.
Fig. 8 is a schematic structural diagram of an image processing apparatus based on a Transformer model according to an embodiment of the present invention. As shown in fig. 8, the Transformer-model-based image processing apparatus 80 provided in this embodiment may include: an acquisition module 801, a decoding module 802, a construction module 803, and a processing module 804.
The acquisition module 801 is configured to acquire JPEG image data;
the decoding module 802 is configured to perform entropy decoding and inverse quantization on the JPEG image data to obtain the frequency domain information corresponding to the JPEG image data, where the frequency domain information includes discrete cosine transform (DCT) coefficients of the Y component, the U component, and the V component;
the construction module 803 is configured to construct an input sequence meeting the input requirements of a Transformer model according to the frequency domain information corresponding to the JPEG image data, where the Transformer model is built on an attention mechanism;
and the processing module 804 is configured to input the input sequence into the Transformer model for training and/or inference of the Transformer model.
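A hypothetical wiring of the four modules listed above might look like the sketch below; all class, method, and parameter names are illustrative, not taken from the patent:

```python
# Hypothetical sketch of the acquisition -> decoding -> construction ->
# processing pipeline of Fig. 8, with each module passed in as a callable.
class TransformerImageProcessor:
    def __init__(self, acquire, decode, build, model):
        self.acquire = acquire    # acquisition module 801
        self.decode = decode      # decoding module 802 (entropy decode + dequantize)
        self.build = build        # construction module 803 (build input sequence)
        self.model = model        # processing module 804 (Transformer)

    def run(self, source):
        jpeg_data = self.acquire(source)
        freq_info = self.decode(jpeg_data)   # DCT coefficients of Y/U/V
        input_seq = self.build(freq_info)    # sequence fed to the model
        return self.model(input_seq)         # training or inference step

# Usage with stand-in identity callables:
proc = TransformerImageProcessor(
    acquire=lambda s: s,
    decode=lambda d: d,
    build=lambda f: f,
    model=lambda x: x,
)
print(proc.run("image.jpg"))  # image.jpg
```

The point of the sketch is only the data flow between the four modules; real implementations of each callable would follow the method embodiments above.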
The Transformer-model-based image processing apparatus provided in this embodiment may be used to execute the technical solution of the method embodiment corresponding to fig. 4; the implementation principle and technical effect are similar and are not described here again.
In an optional implementation, the construction module 803 is specifically configured to: acquire the frequency domain information corresponding to each minimum coding unit from the frequency domain information corresponding to the JPEG image data; expand the frequency domain information corresponding to the minimum coding unit in the order of the Y component, the U component, and the V component to form the frequency domain feature vector corresponding to the minimum coding unit; generate the position feature vector corresponding to the minimum coding unit according to its position in the JPEG image data, where the position feature vector has the same dimension as the frequency domain feature vector; and fuse the frequency domain feature vector and the position feature vector corresponding to each minimum coding unit and arrange the fused vectors in a first preset order to form an input sequence meeting the input requirements of the Transformer model.
Optionally, the size of the minimum coding unit is 16 × 16, the frequency domain information corresponding to the minimum coding unit includes frequency domain information of 4 data units on the Y component, frequency domain information of 1 data unit on the U component, and frequency domain information of 1 data unit on the V component, the size of the data unit is 8 × 8, and the frequency domain information corresponding to the minimum coding unit is expanded according to the Y component, the U component, and the V component in sequence to form a 384-dimensional frequency domain feature vector corresponding to the minimum coding unit.
In an optional implementation, the construction module 803 is specifically configured to: combine a plurality of adjacent minimum coding units to form a maximum combination unit; acquire the frequency domain information corresponding to each maximum combination unit from the frequency domain information corresponding to the JPEG image data; generate the frequency domain feature vector corresponding to the maximum combination unit according to the frequency domain feature vectors of the minimum coding units it includes, where the frequency domain feature vector of a minimum coding unit is formed by expanding its frequency domain information in the order of the Y component, the U component, and the V component; generate the position feature vector corresponding to the maximum combination unit according to its position in the JPEG image data, where the position feature vector has the same dimension as the frequency domain feature vector; and fuse the frequency domain feature vector and the position feature vector corresponding to each maximum combination unit and arrange the fused vectors in a second preset order to form an input sequence meeting the input requirements of the Transformer model.
Optionally, the minimum coding units included in the maximum combination unit are distributed in a square shape in the JPEG image data.
Optionally, the JPEG image data includes high definition image data in JPEG format, 4K image data, and 8K image data.
Optionally, the Transformer model is used for performing at least one of image semantic segmentation, target recognition, target detection, image classification, and target tracking on the JPEG image data.
Fig. 9 is a schematic structural diagram of an image processing apparatus according to an embodiment of the present invention; fig. 9 is only an illustration, and the embodiment of the present invention is not limited thereto. As shown in fig. 9, the image processing apparatus 90 provided in this embodiment may include: a memory 901, a processor 902, and a bus 903. The bus 903 is used for connections between these elements.
The memory 901 stores a computer program that, when executed by the processor 902, implements the technical solution of the Transformer-model-based image processing method provided by any of the above method embodiments.
The memory 901 and the processor 902 are electrically connected, directly or indirectly, to enable data transmission or interaction. For example, these components may be electrically connected to one another via one or more communication buses or signal lines, such as the bus 903. The memory 901 stores a computer program implementing the Transformer-model-based image processing method, including at least one software functional module that may be stored in the memory 901 in the form of software or firmware; the processor 902 performs various functional applications and data processing by running the software programs and modules stored in the memory 901.
The memory 901 may be, but is not limited to, a Random Access Memory (RAM), a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), and the like. The memory 901 is used for storing programs, and the processor 902 executes the programs after receiving execution instructions. Further, the software programs and modules in the memory 901 may also include an operating system, which may include various software components and/or drivers for managing system tasks (e.g., memory management, storage device control, power management, etc.) and may communicate with various hardware or software components to provide an operating environment for other software components.
The processor 902 may be an integrated circuit chip with signal processing capability. The processor 902 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and so on, and may implement or execute the various methods, steps, and logic blocks disclosed in the embodiments of the present invention. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor. It will be appreciated that the configuration of fig. 9 is merely illustrative; the apparatus may include more or fewer components than shown in fig. 9, or have a different configuration. The components shown in fig. 9 may be implemented in hardware and/or software.
Reference is made herein to various exemplary embodiments. However, those skilled in the art will recognize that changes and modifications may be made to the exemplary embodiments without departing from the scope hereof. For example, the various operational steps, as well as the components used to perform the operational steps, may be implemented in differing ways depending upon the particular application or consideration of any number of cost functions associated with operation of the system (e.g., one or more steps may be deleted, modified or incorporated into other steps).
Additionally, as will be appreciated by one skilled in the art, the principles herein may be reflected in a computer program product on a computer-readable storage medium that is pre-loaded with computer-readable program code. Any tangible, non-transitory computer-readable storage medium may be used, including magnetic storage devices (hard disks, floppy disks, etc.), optical storage devices (CD-ROMs, DVDs, Blu-ray discs, etc.), flash memory, and/or the like. These computer program instructions may be loaded onto a general-purpose computer, special-purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions executed on the computer or other programmable data processing apparatus create means for implementing the functions specified. These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including means for implementing the function specified. The computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer-implemented process, such that the instructions executed on the computer or other programmable apparatus provide steps for implementing the functions specified.
The present invention has been described in terms of specific examples, which are provided to aid understanding of the invention and are not intended to be limiting. A person skilled in the art may make several simple deductions, modifications, or substitutions based on the idea of the invention.

Claims (10)

1. An image processing method based on a Transformer model is characterized by comprising the following steps:
acquiring JPEG image data;
entropy decoding and inverse quantization processing are carried out on the JPEG image data to obtain frequency domain information corresponding to the JPEG image data, wherein the frequency domain information comprises Discrete Cosine Transform (DCT) coefficients of a Y component, a U component and a V component;
constructing an input sequence which meets the input requirement of a Transformer model according to frequency domain information corresponding to the JPEG image data, wherein the Transformer model is constructed based on an attention mechanism;
inputting the input sequence into the Transformer model for training and/or reasoning the Transformer model.
2. The method of claim 1, wherein constructing an input sequence conforming to Transformer model input requirements from frequency domain information corresponding to the JPEG image data comprises:
acquiring frequency domain information corresponding to each minimum coding unit from the frequency domain information corresponding to the JPEG image data;
expanding the frequency domain information corresponding to the minimum coding unit according to the Y component, the U component and the V component in sequence to form a frequency domain feature vector corresponding to the minimum coding unit;
generating a position feature vector corresponding to the minimum coding unit according to the position of the minimum coding unit in the JPEG image data, wherein the dimension of the position feature vector is the same as that of the frequency domain feature vector;
and fusing the frequency domain characteristic vectors and the position characteristic vectors corresponding to the minimum coding units, and arranging according to a first preset sequence to form an input sequence meeting the input requirement of the Transformer model.
3. The method of claim 2, wherein the size of the minimum coding unit is 16×16, the frequency domain information corresponding to the minimum coding unit includes frequency domain information of 4 data units on the Y component, frequency domain information of 1 data unit on the U component, and frequency domain information of 1 data unit on the V component, and the size of the data unit is 8×8, and the frequency domain information corresponding to the minimum coding unit is expanded in the order of the Y component, the U component, and the V component to form a 384-dimensional frequency domain feature vector corresponding to the minimum coding unit.
4. The method of claim 1, wherein constructing an input sequence conforming to Transformer model input requirements from frequency domain information corresponding to the JPEG image data comprises:
combining a plurality of adjacent minimum coding units to form a maximum combination unit;
acquiring frequency domain information corresponding to each maximum combination unit from frequency domain information corresponding to the JPEG image data;
generating a frequency domain feature vector corresponding to the maximum combination unit according to the frequency domain feature vector corresponding to each minimum coding unit included in the maximum combination unit, wherein the frequency domain feature vector corresponding to the minimum coding unit is formed by expanding the frequency domain information corresponding to the minimum coding unit according to the Y component, the U component and the V component in sequence;
generating a position feature vector corresponding to the maximum combination unit according to the position of the maximum combination unit in the JPEG image data, wherein the dimension of the position feature vector is the same as that of the frequency domain feature vector;
and fusing the frequency domain characteristic vectors and the position characteristic vectors corresponding to the maximum combination units, and arranging according to a second preset sequence to form an input sequence meeting the input requirement of the Transformer model.
5. The method of claim 4, wherein the plurality of the minimum coding units included in the maximum combination unit are distributed in a square shape in the JPEG image data.
6. The method of claim 1, wherein the JPEG image data comprises JPEG-formatted high definition image data, 4K image data, and 8K image data.
7. The method of any of claims 1-6, wherein the Transformer model is used to perform at least one of image semantic segmentation, target recognition, target detection, image classification, and target tracking on the JPEG image data.
8. An image processing apparatus based on a Transformer model, comprising:
the acquiring module is used for acquiring JPEG image data;
the decoding module is used for carrying out entropy decoding and inverse quantization processing on the JPEG image data to obtain frequency domain information corresponding to the JPEG image data, wherein the frequency domain information comprises Discrete Cosine Transform (DCT) coefficients of a Y component, a U component and a V component;
the building module is used for building an input sequence which meets the input requirement of a Transformer model according to the frequency domain information corresponding to the JPEG image data, and the Transformer model is built on the basis of an attention mechanism;
and the processing module is used for inputting the input sequence into the Transformer model and training and/or reasoning the Transformer model.
9. An image processing apparatus characterized by comprising: at least one processor and memory;
the memory is used for storing programs;
the at least one processor is configured to implement the Transformer model-based image processing method according to any one of claims 1 to 7 by executing the program stored in the memory.
10. A computer-readable storage medium, characterized in that the medium has stored thereon a program executable by a processor to implement the Transformer model-based image processing method according to any one of claims 1 to 7.
CN202111232630.8A 2021-10-22 2021-10-22 Image processing method and device based on Transformer model Pending CN114067009A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111232630.8A CN114067009A (en) 2021-10-22 2021-10-22 Image processing method and device based on Transformer model


Publications (1)

Publication Number Publication Date
CN114067009A true CN114067009A (en) 2022-02-18

Family

ID=80235330

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111232630.8A Pending CN114067009A (en) 2021-10-22 2021-10-22 Image processing method and device based on Transformer model

Country Status (1)

Country Link
CN (1) CN114067009A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114554226A (en) * 2022-02-25 2022-05-27 北京市商汤科技开发有限公司 Image processing method and device, electronic equipment and storage medium
CN116758462A (en) * 2023-08-22 2023-09-15 江西师范大学 Emotion polarity analysis method and device, electronic equipment and storage medium



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination