WO2022183346A1 - Encoding method, decoding method, device, and storage medium for feature data - Google Patents
- Publication number
- WO2022183346A1 (international application PCT/CN2021/078550, CN2021078550W)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- feature data
- feature
- channel
- splicing
- multiple channels
- Prior art date
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/10—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
- H04N19/169—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
- H04N19/17—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object
- H04N19/172—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object the region being a picture, frame or field
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/10—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
- H04N19/102—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
- H04N19/103—Selection of coding mode or of prediction mode
- H04N19/107—Selection of coding mode or of prediction mode between spatial and temporal predictive coding, e.g. picture refresh
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/10—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
- H04N19/102—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
- H04N19/119—Adaptive subdivision aspects, e.g. subdivision of a picture into rectangular or non-rectangular coding blocks
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/10—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
- H04N19/134—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
- H04N19/136—Incoming video signal characteristics or properties
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/50—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
- H04N19/503—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving temporal prediction
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/50—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
- H04N19/593—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving spatial prediction techniques
Definitions
- the embodiments of the present disclosure relate to encoding and decoding technologies in the field of communications, and in particular, to an encoding method, a decoding method, an encoder, a decoder, and a storage medium for feature data.
- the feature map encoding and decoding process includes three main modules: pre-quantization/de-pre-quantization, repackaging/de-repackaging, and traditional video encoding/decoding.
- the pre-quantized and repacked feature map array data is sent to the traditional video encoder in the form of luminance chrominance (YUV) video data for compression encoding, and the code stream generated by the traditional video encoder is included in the feature map data stream.
- there are multiple modes for repacking/de-repacking: stacking the feature maps in a default or specified order, or tiling the feature maps in a default or specified order.
- in the existing feature data processing method, the multi-channel data of a feature are tiled into one image in a single list order, with the data of different channels immediately adjacent to each other. When the tiled image is encoded, the block division operation places data from multiple channels into the same coding unit. Because the data of different channels are discontinuous, the correlation between different channels' data within the same coding unit is poor, so the efficiency of the existing feature data processing method cannot be effectively utilized.
- the embodiments of the present disclosure provide an encoding method, a decoding method, an encoder, a decoder, and a storage medium for feature data.
- By sorting the feature data of all channels, the correlation between adjacent channels in the spatiotemporal domain after sorting is relatively large, so that feature data channels with higher similarity in adjacent regions can be referenced in subsequent coding, thereby improving the coding efficiency of feature data.
- an embodiment of the present disclosure provides a method for encoding feature data, including:
- the target feature frame sequence is encoded to generate a code stream.
- an embodiment of the present disclosure also provides a method for decoding feature data, including:
- Reverse sorting is performed on the reconstructed feature frame sequence to obtain reconstructed feature data of multiple channels.
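The reverse sorting step amounts to applying the inverse of the channel permutation used at the encoding end. A minimal sketch in Python (the `(C, H, W)` array layout and the convention that `order[i]` names the original channel placed at position `i` are illustrative assumptions, not syntax defined by this disclosure):

```python
import numpy as np

def reverse_sort(sorted_chans: np.ndarray, order: list) -> np.ndarray:
    """Restore the original channel order of reconstructed feature data.

    sorted_chans: reconstructed channel feature data, shape (C, H, W),
    in the sorted order produced at the encoding end.
    order: order[i] is the original channel index placed at position i.
    """
    restored = np.empty_like(sorted_chans)
    for pos, original_idx in enumerate(order):
        restored[original_idx] = sorted_chans[pos]
    return restored
```

Transmitting `order` (or its inverse) as side information is what lets the decoder undo the sorting.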
- an embodiment of the present disclosure provides an encoder, the encoder includes a first obtaining unit, a first processing unit, and an encoding unit; wherein,
- the first obtaining unit is configured to obtain feature data of multiple channels corresponding to the image to be processed;
- the first processing unit is configured to determine the feature data of the reference channel in the feature data of the multiple channels;
- the first processing unit is configured to take the feature data of the reference channel as the sorting start object, and sort the feature data of the multiple channels in an order of decreasing similarity between the feature data of the multiple channels, to obtain the sorted feature data of the multiple channels;
- the first processing unit is configured to splice the sorted feature data of the multiple channels to obtain a target feature frame sequence;
- the encoding unit is configured to encode the target feature frame sequence to generate a code stream.
- an embodiment of the present disclosure provides a decoder, the decoder includes a decoding unit and a second processing unit; wherein,
- the decoding unit is configured to parse the code stream to obtain the reconstructed feature frame sequence
- the second processing unit is configured to reversely sort the reconstructed sequence of feature frames to obtain reconstructed feature data of multiple channels.
- an encoder including:
- the first memory for storing a computer program executable on the first processor
- the first processor is configured to execute the encoding method according to the first aspect when running the computer program.
- an embodiment of the present disclosure further provides a decoder, including:
- the second memory for storing a computer program executable on the second processor
- the second processor is configured to execute the decoding method according to the second aspect when running the computer program.
- an embodiment of the present disclosure provides a computer-readable storage medium storing executable encoding instructions for causing the first processor to execute the encoding method described in the first aspect.
- an embodiment of the present disclosure provides a computer-readable storage medium storing executable decoding instructions for causing the second processor to execute the decoding method described in the second aspect.
- Embodiments of the present disclosure provide a feature data encoding method, decoding method, encoder, decoder, and storage medium, wherein the feature data encoding method obtains feature data of multiple channels corresponding to an image to be processed;
- determines the feature data of the reference channel in the feature data of the multiple channels; takes the feature data of the reference channel as the sorting start object; sorts the feature data of the multiple channels in order of decreasing similarity between them to obtain the sorted feature data of the multiple channels; splices the sorted feature data of the multiple channels to obtain a target feature frame sequence; and encodes the target feature frame sequence to generate a code stream;
- the feature data of one channel is used as the benchmark, that is, the feature data of the reference channel is determined; the feature data of all channels are then arranged in descending order of similarity with the feature data of the reference channel. In this way, the correlation between the feature data of adjacent channels in the spatiotemporal domain after sorting is relatively large.
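The similarity metric and the exact sorting procedure are not fixed by the text above. One plausible reading is a greedy chain that starts from the reference channel and repeatedly appends the most similar remaining channel, with MSE standing in for similarity (both choices are assumptions for illustration, not the disclosure's definition):

```python
import numpy as np

def sort_channels(feat: np.ndarray, ref: int = 0):
    """Greedily order (C, H, W) channel feature data by decreasing similarity.

    Starting from the reference channel `ref`, repeatedly append the
    remaining channel with the smallest MSE (read here as highest
    similarity) to the channel placed last.
    """
    order = [ref]
    remaining = set(range(feat.shape[0])) - {ref}
    while remaining:
        last = feat[order[-1]]
        nxt = min(remaining,
                  key=lambda c: float(np.mean((feat[c] - last) ** 2)))
        order.append(nxt)
        remaining.remove(nxt)
    return feat[order], order
```

The returned `order` list is the side information a decoder would need to reverse the sorting.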
- FIG. 1 is a schematic diagram of a "pre-analysis and recompression" framework provided by an embodiment of the present disclosure
- FIG. 2 is a schematic diagram of an encoding process in a related art provided by an embodiment of the present disclosure
- FIG. 3 is a schematic diagram of an encoding process in a related art provided by an embodiment of the present disclosure
- FIG. 4 is a schematic diagram of spatiotemporal splicing in the related art provided by an embodiment of the present disclosure
- FIG. 5 is a schematic flowchart 1 of an exemplary method for encoding feature data according to an embodiment of the present disclosure
- FIG. 6 is a second schematic flowchart of an exemplary method for encoding feature data according to an embodiment of the present disclosure
- FIG. 7 is a third schematic flowchart of an exemplary method for encoding feature data provided by an embodiment of the present disclosure.
- FIG. 8 is a fourth schematic flowchart of an exemplary method for encoding feature data according to an embodiment of the present disclosure
- FIG. 9 is a schematic flowchart 5 of an exemplary method for encoding feature data according to an embodiment of the present disclosure.
- FIG. 10 is a sixth schematic flowchart of an exemplary method for encoding feature data according to an embodiment of the present disclosure
- FIG. 11 is a schematic diagram of raster scan stitching provided by an embodiment of the present disclosure.
- FIG. 12 is a schematic diagram of zigzag scanning and splicing provided by an embodiment of the present disclosure
- FIG. 13 is a seventh schematic flowchart of an exemplary method for encoding feature data according to an embodiment of the present disclosure
- FIG. 14 is a schematic diagram of filling between feature data of adjacent channels in a space provided by an embodiment of the present disclosure
- FIG. 15 is a schematic flowchart 1 of an exemplary method for decoding feature data according to an embodiment of the present disclosure
- FIG. 16 is a second schematic flowchart of an exemplary method for decoding feature data according to an embodiment of the present disclosure
- FIG. 17 is a third schematic flowchart of an exemplary method for decoding feature data according to an embodiment of the present disclosure.
- FIG. 19 is a schematic structural diagram of a decoder according to an embodiment of the present disclosure.
- FIG. 20 is a schematic structural diagram of an encoder according to an embodiment of the present disclosure.
- FIG. 21 is a schematic structural diagram of a decoder according to an embodiment of the present disclosure.
- The terms "first/second/third" in the embodiments of the present disclosure are only used to distinguish similar objects and do not represent a specific ordering of objects. It can be understood that, where permitted, "first/second/third" may be interchanged in a specific order or sequence, so that the embodiments of the disclosure described herein can be practiced in sequences other than those illustrated or described herein.
- the three-dimensional feature data tensor (3D Feature Data Tensor) includes the number of channels (Channel, C), height (Height, H), and width (Width, W).
- Feature data refers to the output data of the intermediate layer of a neural network.
- videos and images are also used to analyze and understand the semantic information in them.
- the traditional direct compression and coding of images is converted to the compression and coding of feature data output by the middle layer of the intelligent analysis task network.
- End-side devices such as cameras first use the task network to pre-analyze the raw video and image data collected or input, extract enough feature data for cloud analysis, and compress, encode and transmit these feature data.
- the cloud device After the cloud device receives the corresponding code stream, it reconstructs the corresponding feature data according to the syntax information of the code stream, and inputs it into the specific task network for further analysis.
- This "pre-analysis and recompression" framework is shown in Figure 1. Under this framework, a large amount of feature data is transmitted between the end-side device and the cloud device. The purpose of feature data compression is to compress and encode, in a recoverable way, the feature data extracted by the existing task network, for further intelligent analysis processing in the cloud.
- at least one neural network is included, for example task network A, task network B, and task network C; these neural networks may be the same or different. Taking a neural network with 10 layers as an example, because the local computing power of the image acquisition device is insufficient, only the first 5 layers can be executed on it.
- the image acquisition device processes the original feature data to obtain feature data that meets the data input conditions of the encoding device; further, the image acquisition device sends the feature data that meets the input conditions to the encoding device, and the encoding device encodes this feature data and writes it into the code stream.
- the encoding device sends the code stream to the feature decoding device, where the feature decoding device may be set in a cloud device such as a cloud server. That is to say, after the end-side device obtains the code stream, it will be handed over to the cloud server for processing.
- the cloud server decodes and reconstructs the code stream through the feature decoding device to obtain the reconstructed feature data; finally, the cloud server inputs the reconstructed feature data corresponding to each channel to the sixth layer of the neural network and executes layers six through ten to obtain the recognition result.
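As a toy illustration of this split inference (the 10-layer network, the layer functions, and the clean hand-off of features are all simplified assumptions; a real pipeline inserts quantization, splicing, encoding, and decoding between the two halves):

```python
import numpy as np

# Toy stand-in for a 10-layer task network: each "layer" is a simple
# function, so the device/cloud split can be shown end to end.
layers = [lambda x, i=i: np.tanh(x + 0.1 * i) for i in range(10)]

def run_layers(x, start, stop):
    """Run layers[start:stop] in sequence."""
    for layer in layers[start:stop]:
        x = layer(x)
    return x

x = np.ones((3, 4, 4), dtype=np.float32)  # image to be processed (C, H, W)
feat = run_layers(x, 0, 5)                # end-side device: layers 1-5
# ... feat would be sorted, spliced, encoded, transmitted, and decoded ...
result = run_layers(feat, 5, 10)          # cloud device: layers 6-10
```

With lossless transmission, `result` matches running all ten layers in one place; lossy feature coding trades some of this fidelity for bit rate.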
- the Moving Picture Experts Group (MPEG) established the Video Coding for Machines (VCM) standard working group at its 127th meeting in July 2019 to study technology in this area, aiming to define a code stream for compressed video, or for feature information extracted from video, such that the same code stream can be used to perform multiple intelligent analysis tasks without significantly reducing their performance; the decompressed information is more friendly to intelligent analysis tasks, and at the same bit rate the performance loss of intelligent analysis tasks is smaller.
- the VCM standard working group has designed the corresponding potential coding flow chart, as shown in Figure 2, in order to improve the coding efficiency of video and images under intelligent analysis tasks.
- the video and image can directly pass through the video and image encoder optimized for the task, or use network pre-analysis to extract feature data and encode it, and then input the decoded feature data into the subsequent network for further analysis.
- to reuse existing video and image coding standards to compress the extracted feature data, it is necessary to perform fixed-point processing on the floating-point feature data and, at the same time, convert it into an input suitable for existing coding and decoding standards. For example, multi-channel feature data is spliced into a single-frame or multi-frame YUV-format feature frame sequence and input into a video encoder for compression encoding.
- the feature data compression technology output by the middle layer of the task network is worthy of in-depth study in the encoding and decoding process.
- the coding efficiency of the feature data output by different levels of some commonly used task networks in lossless compression and lossy compression is studied.
- one study uses the reference software of the video coding standard H.265/HEVC to compress and encode the feature data.
- the applicant finds that the signal fidelity of the feature data does not differ much over a large code rate range, but when the code rate is lower than a certain threshold, the signal fidelity of the feature data decreases sharply.
- the research utilizes existing video coding standards to compress feature data lossy, and by introducing lossy compression into network training, a strategy to improve task accuracy during lossy compression is proposed.
- task accuracy is used as the evaluation index, and in some cases the compressed feature data can achieve higher performance than the original data. Corresponding evaluation indicators are therefore established for three tasks: image classification, image retrieval, and image recognition. Since the task network may be over-fitted or under-fitted after training, task performance may exceed the target performance when the feature data rate is high; a set of evaluation indicators is established for each task. An appropriate code rate interval can then be selected, so that the coding efficiency of the feature data can be measured without considering cases where the code rate is too high or the task performance is too low.
- neural networks can also be used to reduce the dimension of feature data to achieve the goal of compressing the amount of data.
- first, the feature data compression technologies studied so far target only special application scenarios, that is, the few task networks whose intermediate-layer data volume is smaller than the original input; for the feature data output by most other large-scale task networks, the effect is poor;
- second, these research methods only consider the data volume after feature data compression and do not consider task quality; for example, when neural networks are used to reduce the dimension of feature data, it is difficult to achieve high precision;
- third, when traditional video and image coding technology is used to compress feature data, the differences between feature data and traditional video and images are not considered, so existing video and image coding technology cannot be used efficiently to achieve higher coding efficiency.
- the feature data encoding and decoding process is shown in FIG. 3; it includes three main modules: pre-quantization/de-pre-quantization, repackaging/de-repackaging, and traditional video encoding/decoding.
- the specific module contents are as follows.
- Pre-quantization/de-pre-quantization: when the target input feature map is of floating-point type, the feature map needs to be pre-quantized to convert it into integer data that meets the input requirements of traditional video encoders.
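A uniform pre-quantization/de-pre-quantization pair can be sketched as follows (the bit depth, uniform scaling, and min/max side information are illustrative assumptions; the actual quantization scheme is not specified here):

```python
import numpy as np

def pre_quantize(feat: np.ndarray, bit_depth: int = 10):
    """Uniformly quantize float feature data to unsigned integers.

    Returns the integer data plus the (min, max) side information
    that de-pre-quantization needs.
    """
    fmin, fmax = float(feat.min()), float(feat.max())
    scale = (2 ** bit_depth - 1) / (fmax - fmin) if fmax > fmin else 1.0
    q = np.round((feat - fmin) * scale).astype(np.uint16)
    return q, fmin, fmax

def de_pre_quantize(q: np.ndarray, fmin: float, fmax: float,
                    bit_depth: int = 10) -> np.ndarray:
    """Invert pre_quantize up to the quantization step."""
    scale = (2 ** bit_depth - 1) / (fmax - fmin) if fmax > fmin else 1.0
    return q.astype(np.float32) / scale + fmin
```

The round trip is lossy only up to half a quantization step, which is what makes the integer data usable by an integer-input video encoder.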
- the repackaging module transforms the three-dimensional array of target feature maps into YUV format information that meets the input requirements of traditional video encoders. At the same time, by changing the combination method of the feature maps, the coding efficiency of the feature map data of the traditional video encoder is improved.
- each channel of the feature map corresponds to a frame in the input data of a traditional video encoder.
- the height and width of the feature map are padded to meet the input requirements of the video encoder.
- the feature map channel order is stored in a repacking order list (repack_order_list), whose content can default to the default order array (for example, [0, 1, 2, 3, ...]).
- however, the order of the feature channels is not optimally arranged according to the correlation between them, nor is it designed with guidance from the reference relationship in the video codec, so the coding efficiency of the stacked feature channels is not high.
- Feature maps are tiled in default order or specified order:
- multiple channels of feature maps are tiled and spliced into a two-dimensional array as a frame in the input data of a traditional video encoder.
- the height and width of the concatenated array are padded to meet the input requirements of traditional video encoders.
- the splicing order is the channel order of the target feature map.
- tiling proceeds first along the array width direction and then along the height direction; after the current frame is full, the next frame is created and tiling continues until all channels of the feature map have been tiled.
- the splicing order is likewise stored in the repack_order_list, whose content can default to the default order array (for example, [0, 1, 2, 3, ...]).
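The width-first tiling described above can be sketched as follows (the `rows × cols` grid of channel slots per frame and zero padding of the last frame are illustrative assumptions):

```python
import numpy as np

def tile_channels(feat: np.ndarray, cols: int, rows: int) -> np.ndarray:
    """Tile C channel maps of shape (h, w) into frames of shape
    (rows*h, cols*w), filling width-first then height; a new frame
    is started when the current one is full."""
    C, h, w = feat.shape
    per_frame = rows * cols
    n_frames = -(-C // per_frame)  # ceil division
    frames = np.zeros((n_frames, rows * h, cols * w), dtype=feat.dtype)
    for c in range(C):
        f, pos = divmod(c, per_frame)
        r, col = divmod(pos, cols)
        frames[f, r * h:(r + 1) * h, col * w:(col + 1) * w] = feat[c]
    return frames
```

Unfilled slots in the last frame stay zero, mirroring the padding needed to meet the video encoder's input requirements.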
- in this mode, the multi-channel data of a feature are tiled into one image in a single list order, with the data of different channels immediately adjacent to each other. When existing codec methods are used to encode the tiled image, the block division operation places data from multiple channels into the same coding unit. Because the data of different channels are discontinuous, the correlation between different channels' data within the same coding unit is poor, so the efficiency of the existing coding and decoding methods cannot be effectively utilized, and the compression effect on feature data is not good enough.
- the pre-quantized and repackaged feature map array data is sent to the traditional video encoder in the form of YUV video data for compression encoding, and the code stream generated by the traditional video encoder is included in the feature map data stream.
- the feature map array is input in YUV4:0:0 format; for AVS3 video encoder, the feature map array is input in YUV4:2:0 format.
- MPEG Immersive Video there is a technology that re-expresses and rearranges the image content captured by each camera at the same time, so that the visual information can be expressed efficiently and effectively.
- Specifically, in Moving Picture Experts Group immersive video, multiple cameras are placed with a certain positional relationship in the scene to be shot; these cameras are also called reference viewpoints. There is a certain visual redundancy between the content captured by the reference viewpoints. Therefore, the images of all reference viewpoints are re-expressed and reorganized at the encoding end to remove the inter-viewpoint visual redundancy; at the decoding end, the re-expressed and reorganized information is parsed and restored.
- the way to re-express the image of the reference viewpoint is to cut out rectangular-shaped sub-block images (Patch) of different sizes from the image of the reference viewpoint. After cutting out all necessary sub-block images, sort these sub-block images from large to small.
- the sub-block images are placed one by one on an image with a larger resolution to be filled, and the image to be filled is called an Atlas.
- the top-left pixel of each sub-block image must fall on the top-left pixel of an 8×8 image block of the block grid in the image to be filled.
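A simplified sketch of this packing, placing larger patches first with top-left corners snapped to the 8×8 block grid (the first-fit search and the `(width, height)` patch representation are assumptions for illustration, not the MIV algorithm itself):

```python
def pack_patches(patches, atlas_w, atlas_h, block=8):
    """First-fit packing of (w, h) patches, largest area first, with each
    top-left corner aligned to the block grid of the atlas.

    Returns (patch_index, x, y) placements in pixel coordinates.
    """
    grid_w, grid_h = atlas_w // block, atlas_h // block
    occupied = [[False] * grid_w for _ in range(grid_h)]
    placements = []
    for idx, (w, h) in sorted(enumerate(patches),
                              key=lambda p: -(p[1][0] * p[1][1])):
        bw, bh = -(-w // block), -(-h // block)  # blocks covered (ceil)
        placed = False
        for by in range(grid_h - bh + 1):
            for bx in range(grid_w - bw + 1):
                if all(not occupied[by + j][bx + i]
                       for j in range(bh) for i in range(bw)):
                    for j in range(bh):
                        for i in range(bw):
                            occupied[by + j][bx + i] = True
                    placements.append((idx, bx * block, by * block))
                    placed = True
                    break
            if placed:
                break
    return placements
```

The block-grid alignment is what lets the sub-block image information list describe each patch position compactly.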
- at the decoding end, the sub-block images placed in the atlas are processed according to the placement order recorded in the sub-block image information list, and rendered one by one to synthesize an image of the viewer's viewpoint.
- however, the Moving Picture Experts Group immersive video scheme for re-expressing and rearranging visual information only places sub-block images in order of decreasing area. The texture similarity and spatial-position similarity between sub-blocks are not considered, so when the reorganized atlas images are sent to traditional video codecs, the efficiency of existing encoding and decoding methods cannot be fully exploited.
- there is also a spatiotemporal splicing method for feature data based on similarity measurement, applied to the multi-channel intermediate-layer output of the Visual Geometry Group (VGG) network and the residual network (ResNet) under image recognition tasks.
- the feature data is compressed and encoded by multiplexing the existing video coding standard H.265/HEVC, and the coding efficiency can be improved by an average of 2.27% compared with the spatial arrangement method.
- in this method, the feature data output from a specific layer is first spliced into two frames in channel order; the mean square error (MSE) is used to measure the similarity between the two frames; the feature data channels of the two frames are then exchanged iteratively, recalculating the similarity after each exchange, until an arrangement that maximizes the inter-frame similarity is obtained; finally, the list mapping the original channel order to the new channel order is transmitted to the decoding end.
- at the decoding end, the original feature data arrangement is recovered using the transmitted list mapping the original channel order to the new channel order, and then input to the subsequent task network to continue inference analysis.
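The two-frame iterative channel exchange can be sketched as greedy hill climbing, with lower inter-frame MSE read as higher similarity (the stopping rule, the per-channel data layout, and the swap order are assumptions, not details given by the related art):

```python
import numpy as np

def frame_mse(a: np.ndarray, b: np.ndarray) -> float:
    """Mean squared error between two spliced frames."""
    return float(np.mean((a - b) ** 2))

def improve_by_swaps(chans_a, chans_b, max_rounds=100):
    """Try swapping each channel of frame A with each channel of frame B;
    keep a swap only if it lowers the inter-frame MSE. Repeat until no
    swap helps (or max_rounds is hit)."""
    a = [c.copy() for c in chans_a]
    b = [c.copy() for c in chans_b]
    best = frame_mse(np.stack(a), np.stack(b))
    for _ in range(max_rounds):
        improved = False
        for i in range(len(a)):
            for j in range(len(b)):
                a[i], b[j] = b[j], a[i]
                cur = frame_mse(np.stack(a), np.stack(b))
                if cur < best:
                    best, improved = cur, True
                else:
                    a[i], b[j] = b[j], a[i]  # revert the swap
        if not improved:
            break
    return a, b, best
```

Note that this only optimizes the pairing between the two frames, which is exactly the limitation the next paragraph criticizes: correlation within a frame and across more than two frames is never examined.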
- the similarity is maximized by exchanging the feature data channels between the two frames.
- the correlation between the feature data channels within the same frame is not considered, and the multi-frame arrangement method prevents the feature data from making full use of the correlation between different channels during encoding to achieve the best coding efficiency.
- the present disclosure provides a technology for sorting, splicing, encoding and decoding in the spatiotemporal domain.
- the basic idea of this technology is: in the preprocessing stage, the multi-channel feature data output by the middle layer of the neural network is sorted, and the feature data of each channel is spliced, in the sorted order and in a specific way in the temporal and spatial domains, into a multi-frame feature frame sequence.
- the feature frame sequence is encoded with the optimized inter-frame reference structure, and the key information of the preprocessing is encoded to obtain the final code stream.
- the decoding stage from the received code stream, the reconstructed feature frame sequence and the reconstructed preprocessing key information are obtained by parsing.
- the post-processing stage according to the reconstructed preprocessing key information, the reconstructed feature frame sequence is post-processed to obtain the reconstructed feature data, and the reconstructed feature data is used for the subsequent network for further task reasoning analysis.
- An embodiment of the present disclosure provides a method for encoding feature data, which is applied to an encoder; with reference to FIG. 5 , the method includes the following steps:
- Step 501 Acquire feature data of multiple channels corresponding to the image to be processed.
- step 501 to obtain feature data of multiple channels corresponding to the image to be processed may be implemented by the following steps: obtaining the image to be processed; and extracting features from the image to be processed through a neural network model to obtain feature data of multiple channels.
- the encoder after acquiring the to-be-processed image, the encoder inputs the to-be-processed image into the neural network model, and then acquires the feature data of each channel output by the middle layer of the neural network model.
- each channel of the image is a feature map of the image: one channel is the detection of a certain feature, and the magnitude of a value in the channel is the strength of the response to that feature.
- Step 502 Determine the feature data of the reference channel among the feature data of the multiple channels.
- the feature data of the reference channel may be the feature data of any channel among the feature data of multiple channels.
- the purpose of determining the feature data of the reference channel is to determine a sorting start object when sorting the feature data of multiple channels subsequently.
- Step 503 Take the feature data of the reference channel as the sorting start object, sort the feature data of the multiple channels in the order of decreasing similarity between the feature data of the multiple channels, and obtain the sorted feature data of the multiple channels .
- when the feature data of the reference channel is determined, it is used as the starting object for sorting, and the feature data of the multiple channels are sorted in the order of decreasing similarity between the feature data of the multiple channels; that is, the feature data of all channels are sorted from largest to smallest similarity with the feature data of the reference channel, obtaining the sorted feature data of the multiple channels. It should be noted that, after sorting, the correlation of feature data between adjacent channels in the spatiotemporal domain is relatively large.
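- the sorting described above can be sketched as a greedy procedure; the following is an illustrative sketch (function and variable names are assumptions, not from the disclosure), using MSE as the similarity measure:

```python
import numpy as np

def sort_channels_by_similarity(features, ref_idx):
    """Greedy sort: start from the reference channel, repeatedly append the
    remaining channel most similar (smallest MSE) to the current last channel.
    A sketch of the sorting in step 503; names are illustrative."""
    C = features.shape[0]
    order = [ref_idx]
    remaining = set(range(C)) - {ref_idx}
    while remaining:
        cur = features[order[-1]]
        # smaller MSE = larger similarity
        nxt = min(remaining, key=lambda c: float(np.mean((features[c] - cur) ** 2)))
        order.append(nxt)
        remaining.remove(nxt)
    return order  # encoding channel order -> original channel index

# Example: 4 channels of 2x2 feature data
feats = np.array([[[0, 0], [0, 0]],
                  [[9, 9], [9, 9]],
                  [[1, 1], [1, 1]],
                  [[8, 8], [8, 8]]], dtype=float)
order = sort_channels_by_similarity(feats, ref_idx=0)
```

the resulting list also serves directly as the sorted list channel_idx described later in this document.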
- Step 504 splicing the sorted feature data of multiple channels to obtain a target feature frame sequence.
- the feature data of the multiple channels are sorted according to similarity and then arranged, in the sorted order, in the temporal domain and the spatial domain; alternatively, the target feature frame sequence can be arranged in the spatial domain only. In this way, feature data with high similarity in adjacent regions can be referred to during subsequent coding, thereby improving the coding efficiency of the feature data.
- Step 505 Encode the target feature frame sequence to generate a code stream.
- if the splicing is performed first in the temporal domain and then in the spatial domain, the feature data can be better encoded using inter-frame coding techniques; if the splicing is performed first in the spatial domain and then in the temporal domain, the feature data can be better encoded using intra-frame coding techniques. In this way, techniques in existing video coding standards can be reused to encode the feature data efficiently.
- in the encoding method of feature data provided by the embodiment of the present disclosure, feature data of multiple channels corresponding to the image to be processed is acquired; the feature data of a reference channel is determined among the feature data of the multiple channels; with the feature data of the reference channel as the sorting start object, the feature data of the multiple channels are sorted in the order of decreasing similarity to obtain the sorted feature data of the multiple channels; the sorted feature data of the multiple channels are spliced to obtain a target feature frame sequence; and the target feature frame sequence is encoded to generate a code stream. That is, in the present disclosure, when feature data of multiple channels is obtained, the feature data of one channel is used as a benchmark, i.e., the feature data of the reference channel is determined, and the feature data of all channels are sorted in order of similarity with the feature data of the reference channel; in this way, the correlation between the feature data of adjacent channels in the spatiotemporal domain after sorting is large, so that feature data channels with higher similarity in adjacent regions can be referred to in subsequent encoding.
- An embodiment of the present disclosure provides a method for encoding feature data, which is applied to an encoder; with reference to FIG. 6 , the method includes the following steps:
- Step 601 Acquire feature data of multiple channels corresponding to the image to be processed.
- Step 602 When the accumulated sum of the feature data values in the feature data of the multiple channels satisfies the target threshold, determine that the feature data of the channel corresponding to the accumulated sum is the feature data of the reference channel.
- the cumulative sum of the characteristic data values satisfying the target threshold includes: the cumulative sum of the characteristic data values is the largest, or the cumulative sum of the characteristic data values is the smallest.
- the coding efficiency can be improved.
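- the reference-channel selection of step 602 can be sketched as follows (a minimal illustration; function and variable names are assumptions, not from the disclosure):

```python
import numpy as np

def pick_reference_channel(features, use_max=True):
    """Select the reference channel as the one whose feature data values have
    the largest (or smallest) cumulative sum, as described in step 602."""
    sums = features.reshape(features.shape[0], -1).sum(axis=1)
    return int(np.argmax(sums) if use_max else np.argmin(sums))

feats = np.array([[[1.0, 2.0]], [[5.0, 5.0]], [[0.0, 1.0]]])
ref = pick_reference_channel(feats)                      # largest cumulative sum
ref_min = pick_reference_channel(feats, use_max=False)   # smallest cumulative sum
```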
- Step 603 taking the feature data of the reference channel as the sorting start object, sort the feature data of the multiple channels in the order of decreasing similarity between the feature data of the multiple channels, and obtain the sorted feature data of the multiple channels .
- the similarity between the feature data of the remaining channels and the feature data of the current channel can be determined based on an iterative algorithm, using the sum of absolute differences (Sum of Absolute Difference, SAD) and/or the mean squared error (Mean Squared Error, MSE); the feature data of the channel with the largest similarity is then selected as the feature data of the next channel after sorting.
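- the two similarity measures mentioned above can be written out directly (an illustrative sketch; smaller SAD or MSE means greater similarity):

```python
import numpy as np

def sad(a, b):
    """Sum of Absolute Differences between two channels' feature data."""
    return float(np.abs(a - b).sum())

def mse(a, b):
    """Mean Squared Error between two channels' feature data."""
    return float(np.mean((a - b) ** 2))

a = np.array([[1.0, 2.0], [3.0, 4.0]])
b = np.array([[1.0, 1.0], [3.0, 6.0]])
# sad: |0| + |1| + |0| + |2| = 3.0 ; mse: (0 + 1 + 0 + 4) / 4 = 1.25
```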
- Step 604 Obtain the channel sequence correspondence between the original channel sequence of the feature data of the multiple channels in the image to be processed and the sequence of the encoded channels in the sorted feature data of the multiple channels.
- the encoding channel order refers to the channel order of the sorted feature data of each channel.
- the sorted channel order is called the encoding channel order.
- the channel order correspondence before and after sorting can be stored in the form of a sorted list channel_idx.
- Sorted lists can exist in various forms, including but not limited to: one-dimensional lists, two-dimensional lists, and three-dimensional lists.
- the original channel sequence is the Xth channel
- the corresponding encoding channel sequence is the Ith channel
- X may be the original (pre-sorting) channel order corresponding to the feature data of the I-th channel after sorting.
- the correspondence between the original channel order and the encoding channel order includes: when the number of time-domain splicing frames is at least two, the original channel order is the X-th channel, and the corresponding encoding channel order is the I-th channel of the N-th frame.
- X may be the original channel order corresponding to the feature data of the I-th channel of the N-th frame after sorting.
- the correspondence between the original channel order and the encoding channel order includes: when the number of time-domain splicing frames is at least two, the original channel order is the X-th channel, and the corresponding encoding channel order is the I-th channel of the M-th region of the N-th frame.
- X may be the original channel order corresponding to the feature data of the I-th channel in the M-th region of the N-th frame after sorting.
- Step 605 splicing the sorted feature data of multiple channels to obtain a target feature frame sequence.
- the sorted feature data are spliced in the time-space domain according to a specific splicing method into a target feature frame sequence whose number of time-domain splicing frames is frame_count.
- the number of time-domain splicing frames is the number of frames, set by the encoder, obtained after splicing the sorted feature data of the multiple channels in the time domain.
- the encoder can flexibly set the number of time-domain splicing frames according to actual needs.
- after splicing, each frame holds row rows and col columns of channel feature data; if the number of channels C of the feature data is less than row × col × frame_count, the vacant feature data channel positions in the last frame can be filled so that a complete frame is available for encoding.
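- the relationship between the channel count, the per-frame layout, and the positions to be filled can be sketched as follows (an illustrative sketch; names are assumptions):

```python
import math

def splicing_layout(C, rows, cols):
    """Given C channels arranged rows x cols per frame, compute the number of
    time-domain splicing frames and how many vacant channel slots must be
    filled in the last frame, per the description above."""
    per_frame = rows * cols
    frame_count = math.ceil(C / per_frame)
    vacant = frame_count * per_frame - C
    return frame_count, vacant

# 256 channels, 2x2 channels per frame -> 64 frames, nothing to fill
# 250 channels, 4x4 channels per frame -> 16 frames, 6 vacant slots filled
frame_count, vacant = splicing_layout(250, 4, 4)
```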
- Step 606 Encode the target feature frame sequence, generate a code stream, and write the channel sequence correspondence into the code stream.
- An embodiment of the present disclosure provides a method for encoding feature data, which is applied to an encoder; with reference to FIG. 7 , the method includes the following steps:
- Step 701 Acquire feature data of multiple channels corresponding to the image to be processed.
- Step 702 Determine the feature data of the reference channel among the feature data of the multiple channels.
- Step 703 Using the feature data of the reference channel as the sorting start object, sort the feature data of the multiple channels in the order of decreasing similarity between the feature data of the multiple channels, and obtain the sorted feature data of the multiple channels .
- Step 704 determining that the number of spliced frames in the time domain is greater than one frame, and splicing the sorted feature data according to the splicing strategy in the temporal and spatial domain to obtain a target feature frame sequence.
- step 704, in which it is determined that the number of time-domain splicing frames is greater than one and the sorted feature data is spliced in the temporal and spatial domains according to the splicing strategy to obtain the target feature frame sequence, may be implemented through the following steps:
- Step 801 It is determined that the number of spliced frames in the time domain is greater than one frame, and the sorted feature data is spliced according to the splicing strategy in the temporal and spatial domain to obtain the spliced feature data.
- Step 802 Determine the product of the number of rows of the spliced feature data, the number of columns of the spliced feature data, and the number of time-domain spliced frames.
- Step 803 Determine that the number of channels of the feature data of the multiple channels is less than the product, and fill in the area lacking the feature data channel in the spliced frame to obtain the target feature frame sequence.
- the region lacking the feature data channel in the spliced frame is filled, that is, the region lacking the feature data channel in the spliced feature frame sequence is filled, so as to improve the coding efficiency.
- the region lacking the feature data channel may be the region in the last frame in the spliced feature frame sequence.
- the region lacking the feature data channel may also be a region in at least one frame different from the last frame in the spliced feature frame sequence.
- alternatively, step 704, in which it is determined that the number of time-domain splicing frames is greater than one and the sorted feature data is spliced in the temporal and spatial domains according to the splicing strategy to obtain the target feature frame sequence, may be implemented through the following steps:
- Step 901 Determine that the number of spliced frames in the time domain is greater than one frame, and perform splicing at the same position of different frames in the time domain according to the splicing strategy of the time domain first and then the spatial domain in the temporal domain according to the raster scanning order.
- Step 902 Perform splicing at adjacent positions in the spatial domain according to the raster scanning order, or stitching at adjacent positions in the spatial domain according to the zigzag scanning order.
- splicing is performed first in the time domain and then in the spatial domain, so that the feature data can be encoded using the inter-frame coding technology, so that the technology in the existing video coding standards can be reused to efficiently encode the feature data.
- alternatively, step 704, in which it is determined that the number of time-domain splicing frames is greater than one and the sorted feature data is spliced in the temporal and spatial domains according to the splicing strategy to obtain the target feature frame sequence, may be implemented through the following steps:
- Step 1001 Determine that the number of time-domain splicing frames is greater than one frame, and, according to the splicing strategy of spatial domain first and then temporal domain, perform splicing at adjacent positions in the spatial domain in raster scanning order, or at adjacent positions in the spatial domain in zigzag scanning order.
- Step 1002 In the time domain, splicing is performed at the same position of different frames according to the raster scanning sequence.
- splicing is performed first in the spatial domain and then in the time domain, so that the feature data can be encoded using the intra-frame coding technology better, so that the technology in the existing video coding standards can be reused to efficiently encode the feature data.
- Step 705 Determine the number of splicing frames in the time domain as one frame, and splicing the sorted feature data in the spatial domain according to the splicing strategy to obtain a target feature frame sequence.
- Step 706 Encode the target feature frame sequence to generate a code stream.
- Step 707 Write the time-domain splicing frame number, the number of channels corresponding to the feature data of multiple channels, the height of the feature data of a single channel, and the width of the feature data of a single channel into the code stream.
- raster-scan splicing is further described below. Taking splicing into a video sequence with a total of 4 frames as an example, a schematic diagram of raster-scan splicing is shown in Figure 11.
- the placement orders of the sorted feature data include but are not limited to:
- first, splicing is performed at the same position in different frames in raster scanning order in the temporal domain; secondly, splicing is performed at adjacent positions in raster scanning order in the spatial domain;
- first, splicing is performed at adjacent positions in raster scanning order in the spatial domain; secondly, splicing is performed at the same position in different frames in raster scanning order in the temporal domain.
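- the two raster-scan placement orders above can be sketched as a mapping from the sorted channel index to a (frame, row, col) slot (an illustrative sketch; names are assumptions, not from the disclosure):

```python
def raster_position(i, rows, cols, frame_count, time_first=True):
    """Map the i-th channel in the sorted order to a (frame, row, col) slot.

    time_first=True : fill the same spatial slot across frames first
                      (temporal domain first, then spatial domain).
    time_first=False: fill a frame in raster order, then move to the next frame
                      (spatial domain first, then temporal domain)."""
    if time_first:
        frame = i % frame_count
        slot = i // frame_count
    else:
        frame = i // (rows * cols)
        slot = i % (rows * cols)
    return frame, slot // cols, slot % cols

# 4 frames, 2x2 slots per frame:
# time-first: channels 0..3 occupy slot (0, 0) of frames 0..3
assert [raster_position(i, 2, 2, 4, True) for i in range(4)] == [
    (0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 0, 0)]
# space-first: channels 0..3 fill frame 0 in raster order
assert [raster_position(i, 2, 2, 4, False) for i in range(4)] == [
    (0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1)]
```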
- zigzag-scan splicing is further explained below. Taking splicing into a video sequence with a total of 4 frames as an example, a schematic diagram of zigzag splicing is shown in Figure 12. The placement orders of the sorted feature data include but are not limited to:
- first, splicing is performed at the same position in different frames in raster scanning order in the temporal domain; secondly, splicing is performed at adjacent positions in zigzag scanning order in the spatial domain;
- first, splicing is performed at adjacent positions in zigzag scanning order in the spatial domain; secondly, splicing is performed at the same position in different frames in raster scanning order in the temporal domain.
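- a zigzag spatial scan traverses the grid of channel slots along anti-diagonals with alternating direction; the sketch below generates such an order (an assumption: the exact direction convention of Figure 12 may differ):

```python
def zigzag_order(rows, cols):
    """Return the (row, col) slots of a rows x cols grid in zigzag scan order:
    anti-diagonals are traversed with alternating direction."""
    slots = []
    for d in range(rows + cols - 1):
        diag = [(r, d - r) for r in range(rows) if 0 <= d - r < cols]
        slots.extend(diag if d % 2 else diag[::-1])
    return slots

# 2x2 grid: (0,0), (0,1), (1,0), (1,1)
# 3x3 grid starts (0,0), (0,1), (1,0), (2,0), (1,1), (0,2), ...
order_3x3 = zigzag_order(3, 3)
```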
- the additional information, also referred to as feature data spatiotemporal arrangement information, includes: the number of channels C of the feature data, the height h of the feature data of a single channel, the width w of the feature data of a single channel, the sorted list channel_idx, and the number of time-domain splicing frames frame_count.
- An embodiment of the present disclosure provides a method for encoding feature data, which is applied to an encoder; with reference to FIG. 13 , the method includes the following steps:
- Step 1101 Acquire feature data of multiple channels corresponding to the image to be processed.
- Step 1102 Determine the feature data of the reference channel among the feature data of the multiple channels.
- Step 1103 taking the feature data of the reference channel as the sorting starting object, sort the feature data of the multiple channels in the order of decreasing similarity between the feature data of the multiple channels, and obtain the sorted feature data of the multiple channels .
- Step 1104 Splice the sorted feature data in the spatial domain according to the strategy of filling first and then splicing.
- step 1104, splicing the sorted feature data in the spatial domain according to the strategy of filling first and then splicing, can be achieved by the following steps: filling each piece of sorted feature data in the spatial domain, and splicing the filled feature data in the spatial domain; wherein there is a gap between the filled feature data of adjacent channels.
- filling each piece of sorted feature data in the spatial domain includes: filling between the feature data of adjacent channels to ensure that there is a gap between the filled feature data of adjacent channels.
- the size of the gap between the feature data of adjacent channels may be the same.
- in the figure, the distance between each small box and its corresponding dashed box is the same.
- filling is performed between the feature data of adjacent channels, which reduces the mutual influence of values between different channels and improves the signal fidelity at the channel boundaries.
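- the fill-first-then-splice strategy can be sketched as padding each channel with a border before placing the channels side by side (an illustrative sketch; names and the padding value are assumptions):

```python
import numpy as np

def pad_and_splice_row(channels, p=1, pad_value=0.0):
    """Pad each channel with a border of width p, then splice the padded
    channels side by side: adjacent channels end up separated by a gap of
    2*p padded samples, reducing cross-channel interference at boundaries."""
    padded = [np.pad(ch, p, constant_values=pad_value) for ch in channels]
    return np.hstack(padded)

chans = [np.ones((2, 2)), 2 * np.ones((2, 2))]
row = pad_and_splice_row(chans, p=1)
# each 2x2 channel becomes 4x4 after padding; two of them give a 4x8 row
```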
- Step 1105 Encode the target feature frame sequence to generate a code stream.
- Step 1106 Write the number of time-domain splicing frames, the height of the filled feature data, and the width of the filled feature data into the code stream, and write the number of channels corresponding to the feature data of the multiple channels, the height of the feature data of a single channel, and the width of the feature data of a single channel into the code stream.
- the spatiotemporal arrangement information of the feature data further includes: the height of the filled feature data and the width of the filled feature data.
- the scheme of step 1104, splicing the sorted feature data in the spatial domain according to the strategy of filling first and then splicing, is also applicable to step 901, step 1001, and step 705;
- for step 901, it is determined that the number of time-domain splicing frames is greater than one frame; according to the splicing strategy of temporal domain first and then spatial domain, splicing is performed at the same position of different frames in raster scanning order in the temporal domain; then, based on the strategy of filling first and then splicing, each piece of sorted feature data is filled in the spatial domain, and the filled feature data is spliced at adjacent positions in raster scanning order in the spatial domain, or at adjacent positions in zigzag scanning order in the spatial domain.
- for step 1001, it is determined that the number of time-domain splicing frames is greater than one frame; based on the strategy of filling first and then splicing, each piece of sorted feature data is filled in the spatial domain, and the filled feature data is spliced at adjacent positions in raster scanning order in the spatial domain, or at adjacent positions in zigzag scanning order in the spatial domain; then, in the temporal domain, splicing is performed at the same position of different frames in raster scanning order.
- for step 705, it is determined that the number of time-domain splicing frames is one frame; based on the strategy of filling first and then splicing, each piece of sorted feature data is filled in the spatial domain, and after filling, the sorted feature data is spliced in the spatial domain according to the splicing strategy to obtain the target feature frame sequence.
- the feature data spatiotemporal arrangement information may be recorded in supplemental enhancement information, e.g., the Supplemental Enhancement Information (SEI) of the existing video coding standards H.265/HEVC and H.266/VVC, or the Extension Data of the AVS standard.
- in sei_rbsp() of the existing video coding standards AVC/HEVC/VVC/EVC, a new SEI category, namely the feature data quantization SEI message, is added to sei_payload() of sei_message(); payloadType can be defined as any SEI number not yet used by other SEI messages, such as 183.
- the syntax structure is shown in Table 1.
- when the sorted list is a one-dimensional sorted list, its syntax structure is:
- the syntax elements can be encoded with different efficient entropy coding methods, where the syntax elements are:
- feature_channel_count: describes the number of channels of the feature data;
- feature_frame_count: describes the number of frames after feature data splicing;
- feature_single_channel_height: describes the height of the feature data of a single channel;
- feature_single_channel_width: describes the width of the feature data of a single channel;
- channel_idx[I]: describes the original channel order corresponding to the feature data of the I-th channel after sorting.
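- as a hedged illustration of how these syntax elements travel together, the sketch below serializes them with fixed 16-bit fields (an assumption for simplicity; as noted above, a real codec would entropy-code them):

```python
import struct

def pack_arrangement_sei(C, frame_count, h, w, channel_idx):
    """Pack the feature-data spatiotemporal arrangement syntax elements into
    bytes: channel count, frame count, single-channel height and width, and
    the one-dimensional sorted list channel_idx. Illustrative sketch only."""
    payload = struct.pack(">HHHH", C, frame_count, h, w)
    payload += b"".join(struct.pack(">H", i) for i in channel_idx)
    return payload

def unpack_arrangement_sei(payload):
    """Inverse of pack_arrangement_sei, as a decoder-side sketch."""
    C, frame_count, h, w = struct.unpack(">HHHH", payload[:8])
    idx = [struct.unpack(">H", payload[8 + 2*k: 10 + 2*k])[0] for k in range(C)]
    return C, frame_count, h, w, idx

data = pack_arrangement_sei(4, 2, 16, 16, [0, 2, 3, 1])
```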
- An embodiment of the present disclosure provides a method for decoding feature data, which is applied to a decoder; with reference to FIG. 15 , the method includes the following steps:
- Step 1201 Parse the code stream to obtain the reconstructed feature frame sequence.
- Step 1202 Reversely sort the reconstructed feature frame sequence to obtain feature data of the reconstructed multiple channels.
- the decoding method provided by the embodiment of the present disclosure obtains the reconstructed feature frame sequence by parsing the code stream, and reversely sorts the reconstructed feature frame sequence to obtain the reconstructed feature data of the multiple channels, so that the feature data of the multiple channels before the spatiotemporal arrangement can be accurately recovered and used by the subsequent network for further task inference analysis.
- An embodiment of the present disclosure provides a method for decoding feature data, which is applied to a decoder; with reference to FIG. 16 , the method includes the following steps:
- Step 1301 Parse the code stream to obtain the reconstructed feature frame sequence, the channel sequence correspondence, the number of channels, the number of time-domain spliced frames, the height of the feature data of a single channel, and the width of the feature data of a single channel.
- Step 1302 Determine the location of the feature data of each channel in the reconstructed feature frame sequence based on the number of channels, the number of time-domain splicing frames, the height of the feature data of a single channel, and the width of the feature data of a single channel.
- Step 1303 based on the channel sequence correspondence, determine the original channel sequence of the feature data at different positions in the reconstructed feature frame sequence.
- Step 1304 based on the original channel sequence, reversely sort the feature data at different positions in the reconstructed feature frame sequence to obtain the feature data of the reconstructed multiple channels.
- the decoding end after decoding the reconstructed feature frame sequence and the reconstructed feature data spatiotemporal arrangement information, performs a spatiotemporal inverse arrangement operation on the reconstructed feature frame sequence to obtain the reconstructed feature data.
- the steps are as follows:
- based on the number of channels C, the number of time-domain splicing frames frame_count, and the height h and width w of the feature data of a single channel, determine the location of the feature data of each channel in the feature frame sequence;
- An embodiment of the present disclosure provides a method for decoding feature data, which is applied to a decoder; with reference to FIG. 17 , the method includes the following steps:
- Step 1401 Parse the code stream to obtain the reconstructed feature frame sequence, the channel order correspondence, the number of channels, the number of time-domain splicing frames, the height of the filled feature data, the width of the filled feature data, the height of the feature data of a single channel, and the width of the feature data of a single channel.
- Step 1402 Determine the location of the feature data of each channel in the reconstructed feature frame sequence based on the number of channels, the number of time-domain splicing frames, the height of the filled feature data, the width of the filled feature data, the height of the feature data of a single channel, and the width of the feature data of a single channel.
- Step 1403 based on the channel sequence correspondence, determine the original channel sequence of the feature data at different positions in the reconstructed feature frame sequence.
- Step 1404 based on the original channel sequence, reversely sort the feature data at different positions in the reconstructed feature frame sequence to obtain the reconstructed feature data of multiple channels.
- the present disclosure has at least the following beneficial effects: based on the information redundancy between the different channels of the multi-channel feature data output by the intermediate layer of the neural network, all channels of the multi-channel feature data are sorted according to similarity and then arranged into a feature frame sequence in the temporal and spatial domains, so that feature data channels with high similarity in adjacent regions can be referred to during encoding, improving the coding efficiency of the feature data. If the splicing is performed first in the temporal domain and then in the spatial domain, inter-frame coding techniques can be used to encode the feature data, so that techniques in existing video coding standards can be reused to encode the feature data efficiently.
- the present disclosure arranges the multi-channel feature data output by the middle layer of the neural network into a feature frame sequence for encoding. Since the correlation between adjacent channels in the temporal and spatial domains after arrangement is relatively large, the present disclosure can better utilize existing intra-frame prediction and inter-frame prediction, further improving the coding efficiency of the feature data. In order to restore the multi-channel feature data to its pre-arrangement form after decoding, the spatiotemporal arrangement information of the feature data needs to be recorded in the code stream.
- FIG. 18 is a schematic diagram of the composition and structure of an encoding device provided by an embodiment of the present disclosure.
- the encoding device 150 includes a first obtaining unit 1501, a first processing unit 1502, and an encoding unit 1503, wherein:
- the first obtaining unit 1501 is configured to obtain feature data of multiple channels corresponding to the image to be processed
- a first processing unit 1502 configured to determine the feature data of the reference channel in the feature data of the multiple channels
- the first processing unit 1502 is configured to take the feature data of the reference channel as the sorting start object, sort the feature data of the multiple channels in the order of decreasing similarity between the feature data of the multiple channels, and obtain the sorted feature data of the multiple channels;
- the first processing unit 1502 is configured to splicing the feature data of the sorted multiple channels to obtain a target feature frame sequence
- the encoding unit 1503 is configured to encode the target feature frame sequence to generate a code stream.
- the first processing unit 1502 is configured to, when the accumulated sum of the feature data values in the feature data of multiple channels satisfies the target threshold, determine that the feature data of the channel corresponding to the accumulated sum is the feature data of the reference channel .
- the cumulative sum of the characteristic data values satisfying the target threshold includes: the cumulative sum of the characteristic data values is the largest, or the cumulative sum of the characteristic data values is the smallest.
- the first obtaining unit 1501 is configured to obtain the correspondence between the original channel order of the feature data of the multiple channels in the image to be processed and the encoding channel order in the sorted feature data of the multiple channels;
- the encoding unit 1503 is configured to write the channel sequence correspondence into the code stream.
- the channel sequence correspondence includes:
- the original channel sequence is the Xth channel
- the corresponding encoding channel sequence is the Ith channel
- the original channel sequence is the Xth channel
- the corresponding encoding channel order is the I-th channel of the N-th frame.
- the first processing unit 1502 is configured to determine that the number of spliced frames in the time domain is greater than one frame, and splices the sorted feature data according to the splicing strategy in the spatiotemporal domain to obtain the target feature frame sequence.
- the first processing unit 1502 is configured to determine that the number of spliced frames in the time domain is greater than one frame, and according to the splicing strategy in the temporal and spatial domains, splicing the sorted feature data to obtain the spliced feature data;
- the number of channels of the feature data of the multiple channels is less than the product, and the area lacking the feature data channel in the spliced frame is filled to obtain the target feature frame sequence.
- the first processing unit 1502 is configured to determine that the number of temporally spliced frames is greater than one and, following a temporal-first-then-spatial splicing strategy in the spatiotemporal domain, splice at the same position of different frames in raster scan order in the temporal domain; and splice at adjacent positions in raster scan order in the spatial domain, or at adjacent positions in zigzag scan order in the spatial domain.
- the first processing unit 1502 is configured to determine that the number of temporally spliced frames is greater than one and, following a spatial-first-then-temporal splicing strategy in the spatiotemporal domain, splice at adjacent positions in raster scan order in the spatial domain, or at adjacent positions in zigzag scan order in the spatial domain; and splice at the same position of different frames in raster scan order in the temporal domain.
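A rough sketch of the two splicing orders described above, assuming each channel is an H×W block placed into a rows×cols grid per frame. The function names, the zero-initialized output, and the omission of the zigzag variant are illustrative assumptions.

```python
import numpy as np

def splice_spatial_first(features, rows, cols, frames):
    """Spatial-first-then-temporal: fill one frame's rows x cols grid in
    raster scan order, then move on to the next frame."""
    c, h, w = features.shape
    out = np.zeros((frames, rows * h, cols * w), dtype=features.dtype)
    for idx in range(c):
        n, rest = divmod(idx, rows * cols)   # which frame, then which grid slot
        r, col = divmod(rest, cols)          # raster order inside the frame
        out[n, r * h:(r + 1) * h, col * w:(col + 1) * w] = features[idx]
    return out

def splice_temporal_first(features, rows, cols, frames):
    """Temporal-first-then-spatial: place channels at the same grid position
    across all frames before advancing to the next raster position."""
    c, h, w = features.shape
    out = np.zeros((frames, rows * h, cols * w), dtype=features.dtype)
    for idx in range(c):
        rest, n = divmod(idx, frames)        # which grid slot, then which frame
        r, col = divmod(rest, cols)
        out[n, r * h:(r + 1) * h, col * w:(col + 1) * w] = features[idx]
    return out

chans = np.arange(1.0, 5.0).reshape(4, 1, 1)           # 4 channels of 1x1
print(splice_spatial_first(chans, 1, 2, 2).tolist())   # [[[1.0, 2.0]], [[3.0, 4.0]]]
print(splice_temporal_first(chans, 1, 2, 2).tolist())  # [[[1.0, 3.0]], [[2.0, 4.0]]]
```

The tiny example makes the difference visible: spatial-first fills frame 0 completely before frame 1, while temporal-first alternates channels between the two frames at each grid position.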
- the first processing unit 1502 is configured to determine that the number of temporally spliced frames is one, and splice the sorted channel feature data in the spatial domain according to the splicing strategy to obtain the target feature frame sequence.
- the first processing unit 1502 is configured to splice the sorted channel feature data in the spatial domain according to a fill-first-then-splice strategy.
- the first processing unit 1502 is configured to fill each piece of sorted feature data in the spatial domain and splice the filled feature data in the spatial domain; wherein there are gaps between the filled feature data of adjacent channels.
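The fill-first-then-splice strategy might look like the following sketch, where a hypothetical `pad` border around each channel creates the gaps between adjacent channels. The zero padding value and the function name are assumptions; the disclosure only requires that filled neighbours be separated by gaps.

```python
import numpy as np

def pad_then_splice(features, pad, rows, cols):
    """Fill-first-then-splice: pad every sorted channel on all four sides,
    then place the padded channels into the frame grid, so adjacent
    channels end up separated by gaps of 2*pad samples."""
    c, h, w = features.shape
    ph, pw = h + 2 * pad, w + 2 * pad        # filled height / width
    frame = np.zeros((rows * ph, cols * pw), dtype=features.dtype)
    for idx in range(min(c, rows * cols)):
        r, col = divmod(idx, cols)           # raster position in the grid
        frame[r * ph + pad:r * ph + pad + h,
              col * pw + pad:col * pw + pad + w] = features[idx]
    return frame, ph, pw                     # ph, pw are written to the stream

frame, ph, pw = pad_then_splice(np.array([[[7.0]], [[9.0]]]), 1, 1, 2)
print(ph, pw)  # 3 3
# frame is 3x6: zeros everywhere except frame[1][1] == 7.0 and frame[1][4] == 9.0
```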
- the encoding unit 1503 is configured to write the height of the filled feature data and the width of the filled feature data into the code stream; write the number of channels corresponding to the feature data of the multiple channels, the height of the feature data of a single channel, and the width of the feature data of a single channel into the code stream; and write the number of temporally spliced frames into the code stream.
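Writing this side information (channel count, single-channel height and width, number of temporally spliced frames) could be sketched as a fixed-layout header. The actual bitstream syntax is defined by the codec, so this 16-byte big-endian layout is purely an assumption for illustration.

```python
import struct

def write_header(channels, height, width, frames):
    # Hypothetical header: four unsigned 32-bit big-endian fields.
    return struct.pack(">IIII", channels, height, width, frames)

def read_header(buf):
    # Decoder side: parse the same four fields back.
    return struct.unpack(">IIII", buf[:16])

hdr = write_header(256, 16, 16, 4)
print(len(hdr))          # 16
print(read_header(hdr))  # (256, 16, 16, 4)
```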
- the first obtaining unit 1501 is configured to obtain an image to be processed;
- the first processing unit 1502 is configured to perform feature extraction on the image to be processed through a neural network model to obtain feature data of multiple channels.
- FIG. 19 is a schematic diagram of the composition and structure of a decoding device provided by an embodiment of the present disclosure.
- the decoding device 160 includes a decoding unit 1601 and a second processing unit 1602, wherein:
- the decoding unit 1601 is configured to parse the code stream to obtain the reconstructed feature frame sequence;
- the second processing unit 1602 is configured to reversely sort the reconstructed sequence of feature frames to obtain reconstructed feature data of multiple channels.
- the decoding unit 1601 is configured to parse the code stream to obtain the channel order correspondence, the number of channels, the number of temporally spliced frames, the height of the feature data of a single channel, and the width of the feature data of a single channel.
- the second processing unit 1602 is configured to determine the position of the feature data of each channel in the reconstructed feature frame sequence based on the number of channels, the number of temporally spliced frames, the height of the feature data of a single channel, and the width of the feature data of a single channel; determine, based on the channel order correspondence, the original channel order of the feature data at different positions in the reconstructed feature frame sequence; and inversely sort the feature data at different positions in the reconstructed feature frame sequence based on the original channel order to obtain the reconstructed feature data of the multiple channels.
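The decoder-side inverse sorting can be sketched as reading each channel back from its encoded (frame, position) slot. The `mapping` format (original channel index → (frame index, raster position)) and the function name are illustrative assumptions.

```python
import numpy as np

def inverse_sort(rec_frames, mapping, h, w):
    """Undo the encoder-side reordering: read each channel's h x w block
    back from its (frame, position) slot and return the channels in the
    original channel order.

    mapping[x] = (n, i): original channel x sits at raster position i of
    the n-th reconstructed spliced frame."""
    cols = rec_frames[0].shape[1] // w       # grid columns per frame
    restored = []
    for x in sorted(mapping):                # iterate in original channel order
        n, i = mapping[x]
        r, c = divmod(i, cols)
        restored.append(rec_frames[n][r * h:(r + 1) * h, c * w:(c + 1) * w])
    return restored

rec = [np.array([[5.0, 6.0]])]               # one frame holding two 1x1 channels
out = inverse_sort(rec, {0: (0, 1), 1: (0, 0)}, 1, 1)
print([ch.tolist() for ch in out])           # [[[6.0]], [[5.0]]]
```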
- the decoding unit 1601 is configured to parse the code stream to obtain the height of the filled feature data and the width of the filled feature data;
- the second processing unit 1602 is configured to determine the position of the feature data of each channel in the reconstructed feature frame sequence based on the number of channels, the number of temporally spliced frames, the height of the filled feature data, the width of the filled feature data, the height of the feature data of a single channel, and the width of the feature data of a single channel.
- FIG. 20 is a schematic diagram of the composition and structure of an encoding device provided by an embodiment of the present disclosure.
- the encoding device 170 (the encoding device 170 in FIG. 20 corresponds to the encoding device 150 in FIG. 18) includes a first memory 1701 and a first processor 1702, wherein:
- the first processor 1702 is configured to implement the encoding method provided by the embodiments of the present disclosure when executing the encoding instructions stored in the first memory 1701.
- the first processor 1702 may be implemented by software, hardware, firmware, or a combination thereof, and may use circuits, single or multiple application-specific integrated circuits (ASICs), single or multiple general-purpose integrated circuits, single or multiple microprocessors, single or multiple programmable logic devices, or a combination of the foregoing circuits or devices, or other suitable circuits or devices, so that the processor can perform the corresponding steps of the foregoing encoding methods.
- FIG. 21 is a schematic structural diagram of a decoding device provided by an embodiment of the present disclosure.
- the decoding device 180 (the decoding device 180 in FIG. 21 corresponds to the decoding device 160 in FIG. 19) includes a second memory 1801 and a second processor 1802, wherein:
- the second processor 1802 is configured to implement the decoding method provided by the embodiment of the present disclosure when executing the decoding instruction stored in the second memory 1801.
- the second processor 1802 may be implemented by software, hardware, firmware, or a combination thereof, and may use circuits, single or multiple application-specific integrated circuits (ASICs), single or multiple general-purpose integrated circuits, single or multiple microprocessors, single or multiple programmable logic devices, or a combination of the foregoing circuits or devices, or other suitable circuits or devices, so that the processor can perform the corresponding steps of the foregoing decoding methods.
- Each component in the embodiments of the present disclosure may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
- the above-mentioned integrated units can be implemented in the form of hardware, or can be implemented in the form of software function modules.
- if the integrated unit is implemented in the form of a software function module and is not sold or used as an independent product, it may be stored in a computer-readable storage medium.
- based on this understanding, the technical solution of this embodiment, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions to make a computer device (which may be a personal computer, a cloud server, or a network device, etc.) or a processor execute all or part of the steps of the method in this embodiment.
- the aforementioned storage media include: ferromagnetic random access memory (FRAM), read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), flash memory, magnetic surface memory, optical disc, compact disc read-only memory (CD-ROM), and various other media that can store program codes; the embodiments of the present disclosure are not limited in this regard.
- Embodiments of the present disclosure further provide a computer-readable storage medium storing executable encoding instructions for causing a first processor to execute the encoding method provided by the embodiments of the present disclosure.
- Embodiments of the present disclosure further provide a computer-readable storage medium storing executable decoding instructions for causing a second processor to execute the decoding method provided by the embodiments of the present disclosure.
- the embodiments of the present disclosure provide an encoding method, a decoding method, an encoder, a decoder, and a storage medium for feature data.
- the feature data of a reference channel among the feature data of multiple channels is determined; taking the feature data of the reference channel as the starting object for sorting, the feature data of the multiple channels are sorted in order of decreasing similarity between the feature data of the multiple channels, obtaining sorted feature data of the multiple channels.
- the feature data of one channel is used as the benchmark, that is, the feature data of the reference channel is determined; the feature data of all channels are then sorted in order of decreasing similarity relative to the feature data of the reference channel. After sorting, the correlation between the feature data of adjacent channels in the spatiotemporal domain is relatively high, so that subsequent coding can refer to feature data channels with higher similarity in adjacent regions, thereby improving the coding efficiency of the feature data.
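The decreasing-similarity ordering can be sketched as a greedy chain starting from the reference channel. The mean-absolute-difference similarity measure and the function name are assumptions for illustration; the disclosure does not fix a specific similarity metric.

```python
import numpy as np

def sort_by_similarity(features, ref):
    """Greedy ordering: start from the reference channel and repeatedly
    append the remaining channel most similar to the previously chosen one,
    so neighbouring channels stay highly correlated after splicing.
    Similarity here = negative mean absolute difference (assumed metric)."""
    remaining = list(range(features.shape[0]))
    order = [ref]
    remaining.remove(ref)
    while remaining:
        last = features[order[-1]]
        nxt = min(remaining, key=lambda ch: np.abs(features[ch] - last).mean())
        order.append(nxt)
        remaining.remove(nxt)
    return order   # order[i] = original index of the i-th encoding channel

chans = np.array([[[0.0]], [[10.0]], [[1.0]]])
print(sort_by_similarity(chans, 0))  # [0, 2, 1] -- channel 2 is closest to channel 0
```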
Landscapes
- Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Theoretical Computer Science (AREA)
- Signal Processing (AREA)
- Physics & Mathematics (AREA)
- Evolutionary Computation (AREA)
- General Physics & Mathematics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- General Health & Medical Sciences (AREA)
- Software Systems (AREA)
- Computing Systems (AREA)
- Artificial Intelligence (AREA)
- Health & Medical Sciences (AREA)
- Databases & Information Systems (AREA)
- Medical Informatics (AREA)
- Biomedical Technology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Molecular Biology (AREA)
- General Engineering & Computer Science (AREA)
- Mathematical Physics (AREA)
- Compression Or Coding Systems Of Tv Signals (AREA)
Abstract
Description
Claims (25)
- A method for encoding feature data, comprising: obtaining feature data of multiple channels corresponding to an image to be processed; determining feature data of a reference channel among the feature data of the multiple channels; taking the feature data of the reference channel as the starting object for sorting, sorting the feature data of the multiple channels in order of decreasing similarity between the feature data of the multiple channels to obtain sorted feature data of the multiple channels; splicing the sorted feature data of the multiple channels to obtain a target feature frame sequence; and encoding the target feature frame sequence to generate a code stream.
- The method according to claim 1, wherein determining the feature data of the reference channel among the feature data of the multiple channels comprises: when a cumulative sum of feature data values in the feature data of the multiple channels satisfies a target threshold, determining the feature data of the channel corresponding to the cumulative sum as the feature data of the reference channel.
- The method according to claim 2, wherein the cumulative sum of the feature data values satisfying the target threshold comprises: the cumulative sum of the feature data values being the largest, or the cumulative sum of the feature data values being the smallest.
- The method according to claim 1, wherein after taking the feature data of the reference channel as the starting object for sorting, sorting the feature data of the multiple channels in order of decreasing similarity between the feature data of the multiple channels, and obtaining the sorted feature data of the multiple channels, the method comprises: obtaining a channel order correspondence between the original channel order of the feature data of the multiple channels in the image to be processed and the encoding channel order in the sorted feature data of the multiple channels; and writing the channel order correspondence into the code stream.
- The method according to claim 4, wherein the channel order correspondence comprises: when the number of temporally spliced frames is one, the X-th channel in the original channel order corresponding to the I-th channel in the encoding channel order; when the number of temporally spliced frames is at least two, the X-th channel in the original channel order corresponding to the I-th channel of the N-th frame in the encoding channel order.
- The method according to claim 1, wherein splicing the sorted feature data to obtain the target feature frame sequence comprises: determining that the number of temporally spliced frames is greater than one, and splicing the sorted feature data in the spatiotemporal domain according to a splicing strategy to obtain the target feature frame sequence.
- The method according to claim 6, wherein determining that the number of temporally spliced frames is greater than one and splicing the sorted feature data in the spatiotemporal domain according to the splicing strategy to obtain the target feature frame sequence comprises: determining that the number of temporally spliced frames is greater than one, and splicing the sorted feature data in the spatiotemporal domain according to the splicing strategy to obtain spliced feature data; determining a product of the number of rows of the spliced feature data, the number of columns of the spliced feature data, and the number of temporally spliced frames; and determining that the number of channels of the feature data of the multiple channels is less than the product, and filling regions of the spliced frames that lack feature data channels to obtain the target feature frame sequence.
- The method according to claim 6, wherein determining that the number of temporally spliced frames is greater than one and splicing the sorted feature data in the spatiotemporal domain according to the splicing strategy comprises: determining that the number of temporally spliced frames is greater than one, and, according to a temporal-first-then-spatial splicing strategy in the spatiotemporal domain, splicing at the same position of different frames in raster scan order in the temporal domain; and splicing at adjacent positions in raster scan order in the spatial domain, or splicing at adjacent positions in zigzag scan order in the spatial domain.
- The method according to claim 6, wherein determining that the number of temporally spliced frames is greater than one and splicing the sorted feature data in the spatiotemporal domain according to the splicing strategy comprises: determining that the number of temporally spliced frames is greater than one, and, according to a spatial-first-then-temporal splicing strategy in the spatiotemporal domain, splicing at adjacent positions in raster scan order in the spatial domain, or splicing at adjacent positions in zigzag scan order in the spatial domain; and splicing at the same position of different frames in raster scan order in the temporal domain.
- The method according to claim 1, wherein splicing the sorted feature data to obtain the target feature frame sequence comprises: determining that the number of temporally spliced frames is one, and splicing the sorted feature data in the spatial domain according to a splicing strategy to obtain the target feature frame sequence.
- The method according to claim 1, wherein splicing the sorted feature data comprises: splicing the sorted feature data in the spatial domain according to a fill-first-then-splice strategy.
- The method according to claim 11, wherein splicing the sorted feature data in the spatial domain according to the fill-first-then-splice strategy comprises: filling each piece of sorted feature data in the spatial domain, and splicing the filled feature data in the spatial domain; wherein there are gaps between the filled feature data of adjacent channels.
- The method according to claim 12, wherein after splicing the filled feature data in the spatial domain, the method comprises: writing the height of the filled feature data and the width of the filled feature data into the code stream.
- The method according to claim 1, further comprising: writing the number of channels corresponding to the feature data of the multiple channels, the height of the feature data of a single channel, and the width of the feature data of a single channel into the code stream.
- The method according to claim 5, further comprising: writing the number of temporally spliced frames into the code stream.
- The method according to claim 1, further comprising: obtaining the image to be processed; and performing feature extraction on the image to be processed through a neural network model to obtain the feature data of the multiple channels.
- A method for decoding feature data, comprising: parsing a code stream to obtain a reconstructed feature frame sequence; and inversely sorting the reconstructed feature frame sequence to obtain reconstructed feature data of multiple channels.
- The method according to claim 17, further comprising: parsing the code stream to obtain a channel order correspondence, a number of channels, a number of temporally spliced frames, a height of the feature data of a single channel, and a width of the feature data of a single channel; and determining, based on the number of channels, the number of temporally spliced frames, the height of the feature data of a single channel, and the width of the feature data of a single channel, the position of the feature data of each channel in the reconstructed feature frame sequence; correspondingly, inversely sorting the reconstructed feature frame sequence to obtain the reconstructed feature data of the multiple channels comprises: determining, based on the channel order correspondence, the original channel order of the feature data at different positions in the reconstructed feature frame sequence; and inversely sorting, based on the original channel order, the feature data at different positions in the reconstructed feature frame sequence to obtain the reconstructed feature data of the multiple channels.
- The method according to claim 18, further comprising: parsing the code stream to obtain the height of the filled feature data and the width of the filled feature data; correspondingly, determining the position of the feature data of each channel in the reconstructed feature frame sequence based on the number of channels, the number of temporally spliced frames, the height of the feature data of a single channel, and the width of the feature data of a single channel comprises: determining the position of the feature data of each channel in the reconstructed feature frame sequence based on the number of channels, the number of temporally spliced frames, the height of the filled feature data, the width of the filled feature data, the height of the feature data of a single channel, and the width of the feature data of a single channel.
- An encoder, comprising a first obtaining unit, a first processing unit, and an encoding unit; wherein the first obtaining unit is configured to obtain feature data of multiple channels corresponding to an image to be processed; the first processing unit is configured to determine feature data of a reference channel among the feature data of the multiple channels; the first processing unit is configured to take the feature data of the reference channel as the starting object for sorting and sort the feature data of the multiple channels in order of decreasing similarity between the feature data of the multiple channels to obtain sorted feature data of the multiple channels; the first processing unit is configured to splice the sorted feature data of the multiple channels to obtain a target feature frame sequence; and the encoding unit is configured to encode the target feature frame sequence to generate a code stream.
- An encoder, comprising a first memory and a first processor; wherein the first memory is configured to store a computer program executable on the first processor, and the first processor is configured to perform the method of any one of claims 1 to 16 when running the computer program.
- A decoder, comprising a decoding unit and a second processing unit; wherein the decoding unit is configured to parse a code stream to obtain a reconstructed feature frame sequence, and the second processing unit is configured to inversely sort the reconstructed feature frame sequence to obtain reconstructed feature data of multiple channels.
- A decoder, comprising a second memory and a second processor; wherein the second memory is configured to store a computer program executable on the second processor, and the second processor is configured to perform the method of any one of claims 17 to 19 when running the computer program.
- A computer-readable storage medium storing executable encoding instructions which, when executed by a first processor, implement the method of any one of claims 1 to 16.
- A computer-readable storage medium storing executable decoding instructions which, when executed by a second processor, implement the method of any one of claims 17 to 19.
Priority Applications (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2021/078550 WO2022183346A1 (zh) | 2021-03-01 | 2021-03-01 | 特征数据的编码方法、解码方法、设备及存储介质 |
CN202180094401.6A CN116868570A (zh) | 2021-03-01 | 2021-03-01 | 特征数据的编码方法、解码方法、设备及存储介质 |
EP21928437.9A EP4304176A1 (en) | 2021-03-01 | 2021-03-01 | Feature data encoding method, feature data decoding method, devices, and storage medium |
US18/458,937 US20230412820A1 (en) | 2021-03-01 | 2023-08-30 | Methods for encoding and decoding feature data, and decoder |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2021/078550 WO2022183346A1 (zh) | 2021-03-01 | 2021-03-01 | 特征数据的编码方法、解码方法、设备及存储介质 |
Related Child Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US18/458,937 Continuation US20230412820A1 (en) | 2021-03-01 | 2023-08-30 | Methods for encoding and decoding feature data, and decoder |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2022183346A1 true WO2022183346A1 (zh) | 2022-09-09 |
Family
ID=83153804
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2021/078550 WO2022183346A1 (zh) | 2021-03-01 | 2021-03-01 | 特征数据的编码方法、解码方法、设备及存储介质 |
Country Status (4)
Country | Link |
---|---|
US (1) | US20230412820A1 (zh) |
EP (1) | EP4304176A1 (zh) |
CN (1) | CN116868570A (zh) |
WO (1) | WO2022183346A1 (zh) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109254946A (zh) * | 2018-08-31 | 2019-01-22 | 郑州云海信息技术有限公司 | 图像特征提取方法、装置、设备及可读存储介质 |
CN110494892A (zh) * | 2017-05-31 | 2019-11-22 | 三星电子株式会社 | 用于处理多通道特征图图像的方法和装置 |
WO2021011315A1 (en) * | 2019-07-15 | 2021-01-21 | Facebook Technologies, Llc | System and method for shift-based information mixing across channels for shufflenet-like neural networks |
-
2021
- 2021-03-01 CN CN202180094401.6A patent/CN116868570A/zh active Pending
- 2021-03-01 EP EP21928437.9A patent/EP4304176A1/en active Pending
- 2021-03-01 WO PCT/CN2021/078550 patent/WO2022183346A1/zh active Application Filing
-
2023
- 2023-08-30 US US18/458,937 patent/US20230412820A1/en active Pending
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110494892A (zh) * | 2017-05-31 | 2019-11-22 | 三星电子株式会社 | 用于处理多通道特征图图像的方法和装置 |
CN109254946A (zh) * | 2018-08-31 | 2019-01-22 | 郑州云海信息技术有限公司 | 图像特征提取方法、装置、设备及可读存储介质 |
WO2021011315A1 (en) * | 2019-07-15 | 2021-01-21 | Facebook Technologies, Llc | System and method for shift-based information mixing across channels for shufflenet-like neural networks |
Also Published As
Publication number | Publication date |
---|---|
US20230412820A1 (en) | 2023-12-21 |
CN116868570A (zh) | 2023-10-10 |
EP4304176A1 (en) | 2024-01-10 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US20210203997A1 (en) | Hybrid video and feature coding and decoding | |
CN103220528B (zh) | 通过使用大型变换单元编码和解码图像的方法和设备 | |
US9083947B2 (en) | Video encoder, video decoder, method for video encoding and method for video decoding, separately for each colour plane | |
CN110691250B (zh) | 结合块匹配和串匹配的图像压缩装置 | |
CN103621096A (zh) | 用于使用自适应滤波对图像进行编码和解码的方法和设备 | |
US20160050440A1 (en) | Low-complexity depth map encoder with quad-tree partitioned compressed sensing | |
US20130163676A1 (en) | Methods and apparatus for decoding video signals using motion compensated example-based super-resolution for video compression | |
WO2020001325A1 (zh) | 一种图像编码方法、解码方法、编码器、解码器及存储介质 | |
US11838519B2 (en) | Image encoding/decoding method and apparatus for signaling image feature information, and method for transmitting bitstream | |
CN104704826A (zh) | 两步量化和编码方法和装置 | |
JPWO2006035883A1 (ja) | 画像処理装置、画像処理方法、および画像処理プログラム | |
US20230396787A1 (en) | Video compression method and apparatus, computer device, and storage medium | |
Zhu et al. | Video coding with spatio-temporal texture synthesis and edge-based inpainting | |
US20230388490A1 (en) | Encoding method, decoding method, and device | |
WO2022183346A1 (zh) | 特征数据的编码方法、解码方法、设备及存储介质 | |
RU2766557C1 (ru) | Устройство обработки изображений и способ выполнения эффективного удаления блочности | |
CN114846789B (zh) | 用于指示条带的图像分割信息的解码器及对应方法 | |
Misra et al. | Video feature compression for machine tasks | |
CN1672420A (zh) | 压缩包括交替镜头的视频序列的数字数据的方法 | |
US20230412817A1 (en) | Encoding method, decoding method, and decoder | |
WO2022073159A1 (zh) | 特征数据的编解码方法、装置、设备及存储介质 | |
RU2787812C2 (ru) | Способ и аппаратура предсказания видеоизображений | |
RU2779474C1 (ru) | Устройство обработки изображений и способ выполнения эффективного удаления блочности | |
CN116708787A (zh) | 编解码方法和装置 | |
CN115604486A (zh) | 视频图像的编解码方法及装置 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 21928437 Country of ref document: EP Kind code of ref document: A1 |
|
WWE | Wipo information: entry into national phase |
Ref document number: 202180094401.6 Country of ref document: CN |
|
WWE | Wipo information: entry into national phase |
Ref document number: 2021928437 Country of ref document: EP |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
ENP | Entry into the national phase |
Ref document number: 2021928437 Country of ref document: EP Effective date: 20231002 |