WO2022183346A1 - Encoding method, decoding method, device and storage medium for feature data - Google Patents

Encoding method, decoding method, device and storage medium for feature data

Info

Publication number
WO2022183346A1
WO2022183346A1 (PCT/CN2021/078550)
Authority
WO
WIPO (PCT)
Prior art keywords
feature data
feature
channel
splicing
multiple channels
Prior art date
Application number
PCT/CN2021/078550
Other languages
English (en)
French (fr)
Inventor
虞露
邵宇超
潘雅庆
于化龙
戴震宇
Original Assignee
Zhejiang University
Guangdong OPPO Mobile Telecommunications Corp., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University and Guangdong OPPO Mobile Telecommunications Corp., Ltd.
Priority to PCT/CN2021/078550 priority Critical patent/WO2022183346A1/zh
Priority to CN202180094401.6A priority patent/CN116868570A/zh
Priority to EP21928437.9A priority patent/EP4304176A1/en
Publication of WO2022183346A1 publication Critical patent/WO2022183346A1/zh
Priority to US18/458,937 priority patent/US20230412820A1/en

Links

Images

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N 19/10 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N 19/169 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
    • H04N 19/17 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object
    • H04N 19/172 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object the region being a picture, frame or field
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N 19/10 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N 19/102 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N 19/103 Selection of coding mode or of prediction mode
    • H04N 19/107 Selection of coding mode or of prediction mode between spatial and temporal predictive coding, e.g. picture refresh
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N 19/10 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N 19/102 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N 19/119 Adaptive subdivision aspects, e.g. subdivision of a picture into rectangular or non-rectangular coding blocks
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N 19/10 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N 19/134 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
    • H04N 19/136 Incoming video signal characteristics or properties
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N 19/50 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N 19/503 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving temporal prediction
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N 19/50 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N 19/593 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving spatial prediction techniques

Definitions

  • the embodiments of the present disclosure relate to encoding and decoding technologies in the field of communications, and in particular, to an encoding method, a decoding method, an encoder, a decoder, and a storage medium for feature data.
  • the feature map encoding and decoding process includes three main modules: pre-quantization/de-pre-quantization, repackaging/de-repackaging, and traditional video encoding/decoding.
  • the pre-quantized and repacked feature map array data is sent to the traditional video encoder in the form of luminance chrominance (YUV) video data for compression encoding, and the code stream generated by the traditional video encoder is included in the feature map data stream.
  • There are multiple modes for repackaging/de-repackaging: stacking the feature maps in a default or specified order, or tiling them in a default or specified order.
  • In the related art, the multi-channel feature data is tiled into an image in a single list order, and the data of the channels are closely adjacent. As a result, when the tiled image is encoded with the existing feature data processing method, the block division operation places data from multiple channels into the same coding unit; because the data of different channels is discontinuous, the correlation of the data of different channels in the same coding unit is poor, so the efficiency of the existing feature data processing method cannot be exploited effectively.
  • the embodiments of the present disclosure provide an encoding method, a decoding method, an encoder, a decoder, and a storage medium for feature data.
  • By sorting the feature data of all channels, the correlation between adjacent channels in the time-space domain after sorting is relatively large, so that feature data channels with higher similarity in adjacent regions can be referenced during subsequent coding, thereby improving the coding efficiency of the feature data.
  • an embodiment of the present disclosure provides a method for encoding feature data, including: acquiring feature data of multiple channels corresponding to an image to be processed; determining feature data of a reference channel among the feature data of the multiple channels; taking the feature data of the reference channel as the sorting start object and sorting the feature data of the multiple channels in order of decreasing similarity between the feature data of the multiple channels to obtain sorted feature data of the multiple channels; splicing the sorted feature data of the multiple channels to obtain a target feature frame sequence; and encoding the target feature frame sequence to generate a code stream.
  • an embodiment of the present disclosure also provides a method for decoding feature data, including: parsing a code stream to obtain a reconstructed feature frame sequence; and performing reverse sorting on the reconstructed feature frame sequence to obtain reconstructed feature data of multiple channels.
  • an embodiment of the present disclosure provides an encoder, the encoder includes a first obtaining unit, a first processing unit, and an encoding unit; wherein,
  • the first obtaining unit is configured to obtain feature data of multiple channels corresponding to the image to be processed
  • the first processing unit is configured to determine the feature data of the reference channel in the feature data of the multiple channels;
  • the first processing unit is configured to take the feature data of the reference channel as the sorting start object and sort the feature data of the multiple channels in order of decreasing similarity between the feature data of the multiple channels, to obtain the sorted feature data of the multiple channels;
  • the first processing unit is configured to splice the sorted feature data of the multiple channels to obtain a target feature frame sequence;
  • the encoding unit is configured to encode the target feature frame sequence to generate a code stream.
  • an embodiment of the present disclosure provides a decoder, the decoder includes a decoding unit and a second processing unit; wherein,
  • the decoding unit is configured to parse the code stream to obtain the reconstructed feature frame sequence
  • the second processing unit is configured to reversely sort the reconstructed sequence of feature frames to obtain reconstructed feature data of multiple channels.
  • an embodiment of the present disclosure further provides an encoder, including:
  • a first memory for storing a computer program executable on the first processor;
  • the first processor is configured to execute the encoding method according to the first aspect when running the computer program.
  • an embodiment of the present disclosure further provides a decoder, including:
  • a second memory for storing a computer program executable on the second processor;
  • the second processor is configured to execute the decoding method according to the second aspect when running the computer program.
  • an embodiment of the present disclosure provides a computer-readable storage medium storing executable encoding instructions for causing the first processor to execute the encoding method described in the first aspect.
  • an embodiment of the present disclosure provides a computer-readable storage medium storing executable decoding instructions for causing the second processor to execute the decoding method described in the second aspect.
  • Embodiments of the present disclosure provide a feature data encoding method, decoding method, encoder, decoder, and storage medium. The encoding method acquires feature data of multiple channels corresponding to an image to be processed; determines the feature data of a reference channel among the feature data of the multiple channels; takes the feature data of the reference channel as the sorting start object and sorts the feature data of the multiple channels in order of decreasing similarity between the feature data of the multiple channels to obtain sorted feature data of the multiple channels; splices the sorted feature data of the multiple channels to obtain a target feature frame sequence; and encodes the target feature frame sequence to generate a code stream. That is, with the feature data of one channel taken as the benchmark, i.e., the feature data of the reference channel, the feature data of all channels is sorted in descending order of similarity to the feature data of the reference channel. In this way, the correlation between the feature data of adjacent channels in the space-time domain after sorting is relatively large, so that feature data channels with higher similarity in adjacent regions can be referenced during subsequent coding, thereby improving the coding efficiency of the feature data.
  • FIG. 1 is a schematic diagram of a "pre-analysis and recompression" framework provided by an embodiment of the present disclosure
  • FIG. 2 is a schematic diagram of an encoding process in a related art provided by an embodiment of the present disclosure
  • FIG. 3 is a schematic diagram of an encoding process in a related art provided by an embodiment of the present disclosure
  • FIG. 4 is a schematic diagram of spatiotemporal splicing in the related art provided by an embodiment of the present disclosure
  • FIG. 5 is a first schematic flowchart of an exemplary method for encoding feature data according to an embodiment of the present disclosure
  • FIG. 6 is a second schematic flowchart of an exemplary method for encoding feature data according to an embodiment of the present disclosure
  • FIG. 7 is a third schematic flowchart of an exemplary method for encoding feature data provided by an embodiment of the present disclosure.
  • FIG. 8 is a fourth schematic flowchart of an exemplary method for encoding feature data according to an embodiment of the present disclosure
  • FIG. 9 is a fifth schematic flowchart of an exemplary method for encoding feature data according to an embodiment of the present disclosure.
  • FIG. 10 is a sixth schematic flowchart of an exemplary method for encoding feature data according to an embodiment of the present disclosure
  • FIG. 11 is a schematic diagram of raster scan stitching provided by an embodiment of the present disclosure.
  • FIG. 12 is a schematic diagram of zigzag scanning and splicing provided by an embodiment of the present disclosure
  • FIG. 13 is a seventh schematic flowchart of an exemplary method for encoding feature data according to an embodiment of the present disclosure
  • FIG. 14 is a schematic diagram of filling between feature data of adjacent channels in a space provided by an embodiment of the present disclosure
  • FIG. 15 is a first schematic flowchart of an exemplary method for decoding feature data according to an embodiment of the present disclosure
  • FIG. 16 is a second schematic flowchart of an exemplary method for decoding feature data according to an embodiment of the present disclosure
  • FIG. 17 is a third schematic flowchart of an exemplary method for decoding feature data according to an embodiment of the present disclosure.
  • FIG. 18 is a schematic diagram of the composition and structure of an encoding device according to an embodiment of the present disclosure.
  • FIG. 19 is a schematic structural diagram of a decoder according to an embodiment of the present disclosure.
  • FIG. 20 is a schematic structural diagram of an encoder according to an embodiment of the present disclosure.
  • FIG. 21 is a schematic structural diagram of a decoder according to an embodiment of the present disclosure.
  • The terms "first", "second", and "third" in the embodiments of the present disclosure are only used to distinguish similar objects and do not represent a specific ordering of objects. It can be understood that, where permitted, "first", "second", and "third" may be interchanged in specific order or sequence, so that the embodiments of the disclosure described herein can be implemented in an order other than that illustrated or described herein.
  • the three-dimensional feature data tensor (3D Feature Data Tensor) includes the number of channels (Channel, C), height (Height, H), and width (Width, W).
  • Feature data refers to the output data of an intermediate layer of a neural network.
  • Besides being viewed, videos and images are also used for analyzing and understanding the semantic information in them.
  • the traditional direct compression and coding of images is converted to the compression and coding of feature data output by the middle layer of the intelligent analysis task network.
  • End-side devices such as cameras first use the task network to pre-analyze the raw video and image data collected or input, extract enough feature data for cloud analysis, and compress, encode and transmit these feature data.
  • After the cloud device receives the corresponding code stream, it reconstructs the corresponding feature data according to the syntax information of the code stream and inputs it into the specific task network for further analysis.
  • This "pre-analysis and recompression" framework is shown in Figure 1. Under this framework, there is a large amount of characteristic data transmission between the terminal-side device and the cloud device. The purpose of characteristic data compression is to compress and encode the characteristic data extracted from the existing task network in a recoverable way for the cloud to further intelligent analytical processing.
  • the neural network includes at least one task network.
  • For example, the neural network includes task network A, task network B, and task network C; these task networks may be the same or different. Taking a neural network with 10 layers as an example, because the local computing power of the image acquisition device is insufficient, only the first 5 layers can be executed locally.
  • The image acquisition device processes the original feature data to obtain feature data that meets the input conditions of the encoding device; further, the image acquisition device sends the feature data that meets the input conditions to the encoding device, and the encoding device encodes this feature data and writes it into the code stream.
  • the encoding device sends the code stream to the feature decoding device, where the feature decoding device may be set in a cloud device such as a cloud server. That is to say, after the end-side device obtains the code stream, it will be handed over to the cloud server for processing.
  • The cloud server decodes and reconstructs the code stream through the feature decoding device to obtain the reconstructed feature data; finally, the cloud server inputs the reconstructed feature data corresponding to each channel into the sixth layer of the neural network and executes through to the tenth layer to obtain the recognition result.
  • The Moving Picture Experts Group (MPEG) established the Video Coding for Machines (VCM) standard working group at its 127th meeting in July 2019 to study technology in this area, aiming to define a code stream for compressed video, or for feature information extracted from video, such that the same code stream can serve multiple intelligent analysis tasks without significantly reducing the performance of intelligent task analysis, the decompressed information is more friendly to intelligent analysis tasks, and the performance loss of intelligent analysis tasks is smaller at the same bit rate.
  • In order to improve the coding efficiency of video and images under intelligent analysis tasks, the VCM standard working group has designed a corresponding potential coding flowchart, as shown in FIG. 2.
  • the video and image can directly pass through the video and image encoder optimized for the task, or use network pre-analysis to extract feature data and encode it, and then input the decoded feature data into the subsequent network for further analysis.
  • To reuse existing video and image coding standards to compress the extracted feature data, it is necessary to perform fixed-point processing on the feature data represented in floating point and, at the same time, convert it into an input suitable for the existing coding and decoding standards. For example, multi-channel feature data is spliced into a single-frame or multi-frame YUV-format feature frame sequence and input into a video encoder for compression encoding.
  • Therefore, compression technology for the feature data output by the middle layer of the task network in the encoding and decoding process is worthy of in-depth study.
  • The coding efficiency of the feature data output by different levels of some commonly used task networks under lossless and lossy compression has been studied, using the reference software of the video coding standard H.265/HEVC to compress and encode the feature data.
  • The applicant believes that the signal fidelity of the feature data does not differ much over a large code rate range, and that when the code rate falls below a certain threshold, the signal fidelity of the feature data decreases sharply.
  • The research uses existing video coding standards to compress feature data lossily, and by introducing lossy compression into network training, proposes a strategy to improve task accuracy under lossy compression.
  • Task accuracy is used as the evaluation index, and in some cases the compressed feature data can achieve higher performance than the target data; a corresponding set of evaluation indicators is therefore established for each of the three tasks of image classification, image retrieval, and image recognition. Since the task network may be over-fitted or under-fitted after training, the task performance may exceed the target performance when the feature data rate is high, and the indicators adapt poorly at extreme rates; an appropriate code rate interval is therefore selected, that is, the coding efficiency of the feature data is measured without considering cases where the code rate is too high or the task performance is too low.
  • neural networks can also be used to reduce the dimension of feature data to achieve the goal of compressing the amount of data.
  • First, the feature data compression technology studied applies only to special application scenarios, that is, only to the few task networks whose intermediate-layer data volume is smaller than the target input; its performance on the feature data output by most other large-scale task networks is poor.
  • Second, the research method considers only the data volume after feature data compression and not the task quality; for example, when neural networks are used to reduce the dimension of the feature data, it is difficult to achieve high precision.
  • Third, traditional video and image coding technology is combined to compress the feature data without considering how feature data differs from traditional video and images, so the existing video and image coding technology cannot be used efficiently to achieve higher coding efficiency.
  • The feature data encoding and decoding process is shown in FIG. 3, and includes three main modules: pre-quantization/inverse pre-quantization; repackaging/de-repackaging; and traditional video encoding/decoding.
  • the specific module contents are as follows.
  • Pre-quantization/inverse pre-quantization: when the target input feature map is of floating-point type, the feature map needs to be pre-quantized to convert it into integer data that meets the input requirements of traditional video encoders.
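  • As an illustration of this module, the following is a minimal sketch of a uniform min-max pre-quantizer in Python, assuming an 8-bit integer target; the concrete quantizer, bit depth, and parameter signalling used in practice are not specified by this description, so all names here are illustrative.

```python
import numpy as np

def pre_quantize(feature_map: np.ndarray, bit_depth: int = 8):
    """Uniformly map a floating-point feature map to integers in [0, 2**bit_depth - 1]."""
    f_min, f_max = float(feature_map.min()), float(feature_map.max())
    scale = (2 ** bit_depth - 1) / (f_max - f_min) if f_max > f_min else 1.0
    quantized = np.round((feature_map - f_min) * scale).astype(np.uint16)
    # f_min and scale must accompany the data so the decoder can invert the mapping.
    return quantized, f_min, scale

def inverse_pre_quantize(quantized: np.ndarray, f_min: float, scale: float) -> np.ndarray:
    """Approximately recover the floating-point feature values."""
    return quantized.astype(np.float32) / scale + f_min
```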
  • the repackaging module transforms the three-dimensional array of target feature maps into YUV format information that meets the input requirements of traditional video encoders. At the same time, by changing the combination method of the feature maps, the coding efficiency of the feature map data of the traditional video encoder is improved.
  • Feature maps are stacked in default order or specified order: each channel of the feature map corresponds to a frame in the input data of a traditional video encoder.
  • the height and width of the feature map are padded to meet the input requirements of the video encoder.
  • the feature map channel order is stored in the repacking order list (repack_order_list) about the feature map, where the content in the repack_order_list can default to the default order array (for example, [0, 1, 2, 3...]).
  • However, the order of the feature channels is not arranged optimally according to the correlation between feature channels, nor is it guided and designed by the reference relationship in the video codec, so the coding efficiency between the stacked feature channels is not high.
  • Feature maps are tiled in default order or specified order:
  • multiple channels of feature maps are tiled and spliced into a two-dimensional array as a frame in the input data of a traditional video encoder.
  • the height and width of the concatenated array are padded to meet the input requirements of traditional video encoders.
  • the splicing order is the channel order of the target feature map.
  • Tiling proceeds first along the array width direction and then along the height direction; after the current frame is full, the next frame is created and tiling continues until all channels of the feature map are tiled.
  • The content of the repack_order_list can default to the default order array (for example, [0, 1, 2, 3, ...]).
  • In this mode, the multi-channel feature data is tiled into an image in a single list order, and the data of the channels are closely adjacent. As a result, when the tiled image is encoded with existing codec methods, the block division operation places data from multiple channels into the same coding unit; because the data of different channels is discontinuous, the correlation of the data of different channels in the same coding unit is poor, so the efficiency of existing coding and decoding methods cannot be exploited effectively, and the compression effect on the feature data is not good enough.
  • the pre-quantized and repackaged feature map array data is sent to the traditional video encoder in the form of YUV video data for compression encoding, and the code stream generated by the traditional video encoder is included in the feature map data stream.
  • the feature map array is input in YUV4:0:0 format; for AVS3 video encoder, the feature map array is input in YUV4:2:0 format.
  • In MPEG Immersive Video, there is a technology that re-expresses and rearranges the image content captured by each camera at the same moment, so that the visual information can be expressed efficiently and effectively.
  • Specifically, in Moving Picture Experts Group immersive video, multiple cameras are placed in a certain positional relationship in the scene to be shot, and these cameras are also called reference viewpoints. There is a certain visual redundancy between the content captured by the reference viewpoints. Therefore, the images of all reference viewpoints need to be re-expressed and reorganized at the encoding end to remove the visual redundancy between viewpoints; at the decoding end, the re-expressed and reorganized information is parsed and restored.
  • The way to re-express the image of a reference viewpoint is to cut out rectangular sub-block images (patches) of different sizes from the image of the reference viewpoint; after all necessary sub-block images are cut out, these sub-block images are sorted from large to small.
  • the sub-block images are placed one by one on an image with a larger resolution to be filled, and the image to be filled is called an Atlas.
  • the upper-left pixel of each sub-block image must fall on the upper-left pixel of an 8×8 image block in the image to be filled.
  • At the decoding end, the pixels inside the sub-block images placed in the atlas are rendered one by one, according to the placement order recorded in the sub-block image information list, to synthesize an image of the viewer's viewpoint.
  • However, the re-expression and rearrangement scheme of visual information in Moving Picture Experts Group immersive video only places the sub-blocks in order according to the strategy of sorting the sub-block image areas from large to small; the texture similarity and spatial position similarity between sub-blocks are not considered, so the reorganized atlas images cannot give full play to the efficiency of existing encoding and decoding methods when sent to traditional video codecs.
  • A spatiotemporal splicing method of feature data based on similarity measurement has also been studied, applied to the multi-channel output of intermediate layers of the Visual Geometry Group (VGG) network and the residual ResNet network under the image recognition task.
  • the feature data is compressed and encoded by multiplexing the existing video coding standard H.265/HEVC, and the coding efficiency can be improved by an average of 2.27% compared with the spatial arrangement method.
  • In that method, the feature data output from a specific level is first spliced into two frames in channel order, the mean square error (MSE) is used to measure the similarity between the two frames, and the feature data channels of the two frames are exchanged iteratively while the similarity between the two frames is recalculated; finally, an arrangement with the largest similarity between the two frames is obtained, and the list mapping the target channel order to the new channel order is transmitted to the decoding end.
  • At the decoding end, the feature data arrangement of the target is recovered by using the list mapping the target channel order to the new channel order, and then input to the subsequent task network to continue inference analysis.
  • However, the similarity is maximized only by exchanging feature data channels between the two frames; the correlation between the feature data channels within the same frame is not considered, nor is the arrangement for more than two frames, so the feature data cannot make full use of the correlation between different channels during encoding to achieve the best coding efficiency.
  • the present disclosure provides a technology for sorting, splicing, encoding and decoding in the spatiotemporal domain.
  • The basic idea of this technology is: in the preprocessing stage, the multi-channel feature data output by the middle layer of the neural network is sorted, and the feature data of the channels is spliced, in the sorted order and in a specific way in the time and space domains, into a multi-frame feature frame sequence.
  • the feature frame sequence is encoded with the optimized inter-frame reference structure, and the key information of the preprocessing is encoded to obtain the final code stream.
  • In the decoding stage, the reconstructed feature frame sequence and the reconstructed preprocessing key information are obtained by parsing the received code stream.
  • In the post-processing stage, the reconstructed feature frame sequence is post-processed according to the reconstructed preprocessing key information to obtain the reconstructed feature data, which is used by the subsequent network for further task inference analysis.
  • An embodiment of the present disclosure provides a method for encoding feature data, which is applied to an encoder; with reference to FIG. 5 , the method includes the following steps:
  • Step 501 Acquire feature data of multiple channels corresponding to the image to be processed.
  • step 501 to obtain feature data of multiple channels corresponding to the image to be processed may be implemented by the following steps: obtaining the image to be processed; and extracting features from the image to be processed through a neural network model to obtain feature data of multiple channels.
  • the encoder after acquiring the to-be-processed image, the encoder inputs the to-be-processed image into the neural network model, and then acquires the feature data of each channel output by the middle layer of the neural network model.
  • Each channel of the image corresponds to a feature map of the image: one channel detects a certain feature, and the magnitude of a value in the channel reflects the strength of the response to that feature.
  • Step 502 Determine the feature data of the reference channel among the feature data of the multiple channels.
  • the feature data of the reference channel may be the feature data of any channel among the feature data of multiple channels.
  • the purpose of determining the feature data of the reference channel is to determine a sorting start object when sorting the feature data of multiple channels subsequently.
  • Step 503 Take the feature data of the reference channel as the sorting start object, sort the feature data of the multiple channels in the order of decreasing similarity between the feature data of the multiple channels, and obtain the sorted feature data of the multiple channels .
  • In some embodiments, when the feature data of the reference channel has been determined, it is used as the starting object for sorting, and the feature data of the multiple channels is sorted in order of decreasing similarity; that is, the feature data of all channels is sorted from largest to smallest similarity relative to the feature data of the reference channel, obtaining the sorted feature data of the channels. It should be noted that the correlation of the feature data between adjacent channels in the spatiotemporal domain after sorting is relatively large.
  • Step 504 splicing the sorted feature data of multiple channels to obtain a target feature frame sequence.
  • In this way, the feature data of the multiple channels is sorted according to similarity, and the target feature frame sequence is then arranged in the time domain and the space domain (or in the spatial domain alone) according to the sorted order, so that feature data with high similarity in adjacent regions can be referenced during subsequent coding, thereby improving the coding efficiency of the feature data.
  • Step 505 Encode the target feature frame sequence to generate a code stream.
  • If the splicing is performed first in the time domain and then in the spatial domain, the inter-frame coding technology can be better used to encode the feature data; if the splicing is performed first in the spatial domain and then in the time domain, the intra-frame coding technology can be better used to encode the feature data, so that techniques in existing video coding standards can be reused to code efficiently.
  • The encoding method of feature data provided by the embodiment of the present disclosure acquires feature data of multiple channels corresponding to the image to be processed; determines the feature data of a reference channel among the feature data of the multiple channels; takes the feature data of the reference channel as the starting object for sorting and sorts the feature data of the multiple channels in order of decreasing similarity to obtain the sorted feature data of the multiple channels; splices the sorted feature data of the multiple channels to obtain a target feature frame sequence; and encodes the target feature frame sequence to generate a code stream. That is, in the present disclosure, when feature data of multiple channels is obtained, the feature data of one channel is used as the benchmark, that is, the feature data of the reference channel is determined, and the feature data of all channels is sorted in descending order of similarity to the feature data of the reference channel. In this way, the correlation between the feature data of adjacent channels in the space-time domain after sorting is large, so that feature data channels with higher similarity in adjacent regions can be referenced during subsequent coding.
  • An embodiment of the present disclosure provides a method for encoding feature data, which is applied to an encoder; with reference to FIG. 6 , the method includes the following steps:
  • Step 601 Acquire feature data of multiple channels corresponding to the image to be processed.
  • Step 602 When the accumulated sum of the feature data values in the feature data of the multiple channels satisfies the target threshold, determine that the feature data of the channel corresponding to the accumulated sum is the feature data of the reference channel.
  • the cumulative sum of the characteristic data values satisfying the target threshold includes: the cumulative sum of the characteristic data values is the largest, or the cumulative sum of the characteristic data values is the smallest.
  • In this way, the coding efficiency can be improved.
  • Step 603 taking the feature data of the reference channel as the sorting start object, sort the feature data of the multiple channels in the order of decreasing similarity between the feature data of the multiple channels, and obtain the sorted feature data of the multiple channels .
  • In some embodiments, the similarity between the feature data of the remaining channels and the feature data of the current channel can be determined based on an iterative algorithm using the sum of absolute differences (Sum of Absolute Differences, SAD) and/or the mean squared error (Mean Squared Error, MSE); the feature data of the channel with the largest similarity is then selected as the feature data of the next channel after sorting.
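  • As a concrete illustration, the following is a minimal sketch of the reference-channel selection and greedy similarity ordering described above, assuming a C×H×W numpy tensor, the largest accumulated sum as the reference criterion, and SAD as the similarity measure (smaller SAD meaning larger similarity); all names are illustrative.

```python
import numpy as np

def sort_channels(features: np.ndarray):
    """Greedily order the channels of a C x H x W feature tensor by similarity.

    The reference channel (here, the channel with the largest accumulated sum)
    starts the order; each subsequent channel is the remaining channel with the
    smallest SAD to the current channel. Returns the sorted tensor and the list
    channel_idx, where channel_idx[i] is the original index of the i-th sorted
    channel.
    """
    c = features.shape[0]
    remaining = set(range(c))
    current = int(np.argmax(features.reshape(c, -1).sum(axis=1)))  # reference channel
    channel_idx = [current]
    remaining.remove(current)
    while remaining:
        # SAD between the current channel and every remaining channel.
        sad = {j: float(np.abs(features[current] - features[j]).sum()) for j in remaining}
        current = min(sad, key=sad.get)  # largest similarity = smallest SAD
        channel_idx.append(current)
        remaining.remove(current)
    return features[channel_idx], channel_idx
```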
  • Step 604 Obtain the channel sequence correspondence between the original channel sequence of the feature data of the multiple channels in the image to be processed and the sequence of the encoded channels in the sorted feature data of the multiple channels.
  • the encoding channel order refers to the channel order of the sorted feature data of each channel.
  • the sorted channel order is called the encoding channel order.
  • the channel order correspondence before and after sorting can be stored in the form of a sorted list channel_idx.
  • Sorted lists can exist in various forms, including but not limited to: one-dimensional lists, two-dimensional lists, and three-dimensional lists.
  • For example, when the number of time-domain splicing frames is one frame, if the original channel order is the X-th channel and the corresponding encoding channel order is the I-th channel, then X is the original channel order corresponding to the feature data of the I-th channel after sorting (a one-dimensional list).
  • The correspondence between the original channel order and the encoding channel order also includes: when the number of time-domain splicing frames is at least two frames, the original channel order is the X-th channel, and the corresponding encoding channel order is the I-th channel of the N-th frame; here, X is the original channel order corresponding to the feature data of the I-th channel of the N-th frame after sorting (a two-dimensional list).
  • The correspondence between the original channel order and the encoding channel order further includes: when the number of time-domain splicing frames is at least two frames and each frame is divided into regions, the original channel order is the X-th channel, and the corresponding encoding channel order is the I-th channel in the M-th region of the N-th frame; here, X is the original channel order corresponding to the feature data of the I-th channel in the M-th region of the N-th frame after sorting (a three-dimensional list).
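  • As a hypothetical illustration of the three list forms (the index values below are invented):

```python
# One-dimensional list (a single spliced frame):
#   channel_idx[I] = X            -> the I-th sorted channel came from original channel X
channel_idx_1d = [7, 3, 0, 5]
# Two-dimensional list (frame_count >= 2):
#   channel_idx[N][I] = X         -> the I-th channel of the N-th frame came from channel X
channel_idx_2d = [[7, 3], [0, 5]]
# Three-dimensional list (frames further divided into regions):
#   channel_idx[N][M][I] = X      -> the I-th channel of the M-th region of the N-th frame
channel_idx_3d = [[[7], [3]], [[0], [5]]]
```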
  • Step 605 splicing the sorted feature data of multiple channels to obtain a target feature frame sequence.
  • In some embodiments, the sorted feature data is spliced in the time-space domain according to a specific splicing method into a target feature frame sequence whose number of time-domain splicing frames is frame_count.
  • the number of frames spliced in the time domain is the number of frames obtained after splicing the feature data of the sorted multiple channels in the time domain set by the encoder.
  • the encoder can flexibly set the number of time-domain splicing frames according to actual needs.
  • Taking the spliced feature data as row rows and col columns of channels per frame, if the number of channels C of the feature data is less than row × col × frame_count, the vacant feature data channel positions in the last frame can be filled so that a complete frame is available for encoding.
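  • As a sketch of this condition, with row, col, and frame_count as above, the number of vacant channel positions to fill can be computed as follows (names are illustrative):

```python
def vacant_channel_slots(c: int, row: int, col: int, frame_count: int) -> int:
    """Number of empty channel positions left to fill in the spliced frames."""
    slots = row * col * frame_count  # total channel positions across all frames
    assert c <= slots, "frame_count is too small to hold all channels"
    return slots - c

# For example, 64 channels in a 5 x 5 grid over 3 frames leave 75 - 64 = 11 positions to fill.
assert vacant_channel_slots(64, 5, 5, 3) == 11
```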
  • Step 606 Encode the target feature frame sequence, generate a code stream, and write the channel sequence correspondence into the code stream.
  • An embodiment of the present disclosure provides a method for encoding feature data, which is applied to an encoder; with reference to FIG. 7 , the method includes the following steps:
  • Step 701 Acquire feature data of multiple channels corresponding to the image to be processed.
  • Step 702 Determine the feature data of the reference channel among the feature data of the multiple channels.
  • Step 703 Using the feature data of the reference channel as the sorting start object, sort the feature data of the multiple channels in the order of decreasing similarity between the feature data of the multiple channels, and obtain the sorted feature data of the multiple channels .
  • Step 704 Determine that the number of spliced frames in the time domain is greater than one frame, and splice the sorted feature data according to the splicing strategy in the temporal and spatial domains to obtain a target feature frame sequence.
  • In some embodiments, step 704 (determining that the number of spliced frames in the time domain is greater than one frame, and splicing the sorted feature data according to the splicing strategy in the temporal and spatial domains to obtain the target feature frame sequence) may be implemented by the following steps:
  • Step 801 It is determined that the number of spliced frames in the time domain is greater than one frame, and the sorted feature data is spliced according to the splicing strategy in the temporal and spatial domain to obtain the spliced feature data.
  • Step 802 Determine the product of the number of rows of the spliced feature data, the number of columns of the spliced feature data, and the number of time-domain spliced frames.
  • Step 803 Determine that the number of channels of the feature data of the multiple channels is less than the product, and fill in the area lacking the feature data channel in the spliced frame to obtain the target feature frame sequence.
  • the region lacking the feature data channel in the spliced frame is filled, that is, the region lacking the feature data channel in the spliced feature frame sequence is filled, so as to improve the coding efficiency.
  • the region lacking the feature data channel may be the region in the last frame in the spliced feature frame sequence.
  • the region lacking the feature data channel may also be a region in at least one frame different from the last frame in the spliced feature frame sequence.
  • In some embodiments, step 704 may also be implemented by the following steps:
  • Step 901 Determine that the number of spliced frames in the time domain is greater than one frame, and, following the strategy of splicing first in the time domain and then in the spatial domain, perform splicing in the time domain at the same position of different frames according to the raster scanning order.
  • Step 902 Perform splicing at adjacent positions in the spatial domain according to the raster scanning order, or stitching at adjacent positions in the spatial domain according to the zigzag scanning order.
  • splicing is performed first in the time domain and then in the spatial domain, so that the feature data can be encoded using the inter-frame coding technology, so that the technology in the existing video coding standards can be reused to efficiently encode the feature data.
  • In some embodiments, step 704 may further be implemented by the following steps:
  • Step 1001 Determine that the number of splicing frames in the time domain is greater than one frame, and, following the strategy of splicing first in the spatial domain and then in the time domain, perform splicing at adjacent positions in the spatial domain according to the raster scanning order, or at adjacent positions in the spatial domain according to the zigzag scanning order.
  • Step 1002 In the time domain, splicing is performed at the same position of different frames according to the raster scanning sequence.
  • splicing is performed first in the spatial domain and then in the time domain, so that the feature data can be encoded using the intra-frame coding technology better, so that the technology in the existing video coding standards can be reused to efficiently encode the feature data.
  • Step 705 Determine the number of splicing frames in the time domain as one frame, and splicing the sorted feature data in the spatial domain according to the splicing strategy to obtain a target feature frame sequence.
  • Step 706 Encode the target feature frame sequence to generate a code stream.
  • Step 707 Write the time-domain splicing frame number, the number of channels corresponding to the feature data of multiple channels, the height of the feature data of a single channel, and the width of the feature data of a single channel into the code stream.
  • the raster scan splicing will be further described. Taking the splicing into a video sequence with a total number of 4 frames as an example, the schematic diagram of raster scan splicing is shown in Figure 11.
  • the splicing orders for the sorted feature data include but are not limited to:
  • stitching is performed at the same position in different frames according to the raster scanning order in the time domain, and secondly, stitching is performed at adjacent positions according to the raster scanning order in the spatial domain;
  • stitching is performed at adjacent positions in the spatial domain according to the raster scanning order, and secondly, stitching is performed at the same position in different frames according to the raster scanning order in the time domain.
  • Next, the zigzag scanning splicing is further explained. Taking splicing into a video sequence with a total of 4 frames as an example, a schematic diagram of zigzag splicing is shown in FIG. 12. The splicing orders for the sorted feature data include but are not limited to:
  • stitching is performed at the same position in different frames according to the raster scanning order in the time domain, and secondly, stitching is performed at adjacent positions according to the zigzag scanning order in the spatial domain;
  • splicing is performed at adjacent positions according to the zigzag scanning order, and in the time domain, splicing is performed at the same position in different frames according to the raster scanning order.
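  • The following sketch illustrates one way to map the i-th sorted channel to a (frame, row, column) position under the "time domain first, then spatial domain" strategy; the zigzag order is implemented here as a diagonal scan, which is an assumption about the exact pattern shown in FIG. 12, and all names are illustrative.

```python
def raster_positions(rows: int, cols: int):
    """Grid positions in raster scan order (row by row, left to right)."""
    return [(r, c) for r in range(rows) for c in range(cols)]

def zigzag_positions(rows: int, cols: int):
    """Grid positions in a diagonal zigzag scan order."""
    order = []
    for s in range(rows + cols - 1):
        diag = [(r, s - r) for r in range(rows) if 0 <= s - r < cols]
        order.extend(diag if s % 2 else diag[::-1])
    return order

def place_time_first(num_channels: int, frame_count: int, spatial_order):
    """Map sorted-channel index i -> (frame, row, col).

    Time domain first: consecutive sorted channels occupy the same grid
    position in consecutive frames; once every frame holds that position,
    placement advances to the next position in spatial_order.
    """
    assert num_channels <= frame_count * len(spatial_order)
    mapping = {}
    for i in range(num_channels):
        frame = i % frame_count
        row, col = spatial_order[i // frame_count]
        mapping[i] = (frame, row, col)
    return mapping

# e.g. 64 sorted channels into 4 frames of a 4 x 4 grid, zigzag in the spatial domain:
placement = place_time_first(64, 4, zigzag_positions(4, 4))
```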
  • The additional information is also referred to as feature data spatiotemporal arrangement information: the number of channels C of the feature data, the height h of the feature data of a single channel, the width w of the feature data of a single channel, the sorted list channel_idx, and the number frame_count of time-domain splicing frames.
  • An embodiment of the present disclosure provides a method for encoding feature data, which is applied to an encoder; with reference to FIG. 13 , the method includes the following steps:
  • Step 1101 Acquire feature data of multiple channels corresponding to the image to be processed.
  • Step 1102 Determine the feature data of the reference channel among the feature data of the multiple channels.
  • Step 1103 taking the feature data of the reference channel as the sorting starting object, sort the feature data of the multiple channels in the order of decreasing similarity between the feature data of the multiple channels, and obtain the sorted feature data of the multiple channels .
  • Step 1104 Splice the sorted feature data in the spatial domain according to the strategy of filling first and then splicing.
  • In some embodiments, step 1104 (splicing the sorted feature data in the spatial domain according to the strategy of filling first and then splicing) can be achieved by the following steps: fill each piece of sorted feature data in the spatial domain, and splice the filled feature data in the spatial domain; wherein there is a gap between the feature data of adjacent channels after filling.
  • In some embodiments, filling each piece of sorted feature data in the spatial domain includes: filling between the feature data of adjacent channels to ensure that there is a gap between the filled feature data of adjacent channels.
  • the size of the gap between the feature data of adjacent channels may be the same.
  • the distance between each small box and each dashed box is the same.
  • In this way, the space between the feature data of adjacent channels is filled, which reduces the mutual influence of values between different channels and improves the signal fidelity at the channel boundaries.
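  • A minimal sketch of the fill-then-splice strategy follows, assuming an even gap and a constant fill value; the actual fill values and gap size are not fixed by this description.

```python
import numpy as np

def splice_with_gaps(channels: np.ndarray, rows: int, cols: int, gap: int, fill: int = 0) -> np.ndarray:
    """Pad each sorted channel on all sides, then tile the padded channels
    row-major into one frame, so that the feature data of adjacent channels
    is separated by `gap` filled samples (gap is assumed to be even)."""
    c, h, w = channels.shape
    half = gap // 2
    ph, pw = h + 2 * half, w + 2 * half  # height/width of the filled feature data
    frame = np.full((rows * ph, cols * pw), fill, dtype=channels.dtype)
    for i in range(min(c, rows * cols)):
        r, k = divmod(i, cols)
        frame[r * ph:(r + 1) * ph, k * pw:(k + 1) * pw] = np.pad(channels[i], half, constant_values=fill)
    return frame
```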
  • Step 1105 Encode the target feature frame sequence to generate a code stream.
  • Step 1106 Write the number of time-domain splicing frames, the height of the filled feature data, and the width of the filled feature data into the code stream, and also write the number of channels corresponding to the feature data of the multiple channels, the height of the feature data of a single channel, and the width of the feature data of a single channel into the code stream.
  • the spatiotemporal arrangement information of the feature data further includes: the height of the filled feature data and the width of the filled feature data.
  • The scheme of step 1104, splicing the sorted feature data according to the strategy of filling first and then splicing in the spatial domain, is also applicable to step 901, step 1001, and step 705.
  • That is, for step 901, it is determined that the number of spliced frames in the time domain is greater than one frame, and, following the strategy of splicing first in the time domain and then in the spatial domain, splicing is performed in the time domain at the same position of different frames according to the raster scan order; based on the strategy of filling first and then splicing, each piece of sorted feature data is filled in the spatial domain, and the filled feature data is spliced at adjacent positions in the spatial domain according to the raster scan order, or at adjacent positions in the spatial domain according to the zigzag scanning order.
  • For step 1001, it is determined that the number of spliced frames in the time domain is greater than one frame; based on the strategy of filling first and then splicing, each piece of sorted feature data is filled in the spatial domain, the filled feature data is spliced at adjacent positions in the spatial domain according to the raster scan order, or at adjacent positions in the spatial domain according to the zigzag scanning order, and splicing is then performed in the time domain at the same position of different frames according to the raster scan order.
  • For step 705, it is determined that the number of spliced frames in the time domain is one frame; based on the strategy of filling first and then splicing, each piece of sorted feature data is filled in the spatial domain, and the filled feature data is then spliced according to the splicing strategy in the spatial domain to obtain the target feature frame sequence.
  • The feature data spatiotemporal arrangement information may be recorded in supplemental enhancement information (for example, the Supplemental Enhancement Information (SEI) of the existing video coding standards H.265/HEVC and H.266/VVC, or the Extension Data of the AVS standard).
  • For the sei_rbsp() of the existing video coding standards AVC/HEVC/VVC/EVC, a new SEI category, namely the feature data quantization SEI message, is added to the sei_payload() of sei_message(), and payloadType can be defined as any SEI number that has not yet been used, such as 183.
  • the syntax structure is shown in Table 1.
  • When the sorted list is a one-dimensional sorted list, its syntax structure is as follows:
  • The syntax elements can be encoded with different efficient entropy coding methods, where the syntax elements are:
  • feature_channel_count: describes the number of channels of the feature data;
  • feature_frame_count: describes the number of frames after feature data splicing;
  • feature_single_channel_height: describes the height of the feature data of a single channel;
  • feature_single_channel_width: describes the width of the feature data of a single channel;
  • channel_idx[I]: describes the original channel order corresponding to the feature data of the I-th channel after sorting.
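  • As a sketch, the additional information could be carried as a record holding exactly the syntax elements listed above; the entropy coding of these elements and the layout of Table 1 are not reproduced here.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class FeatureArrangementInfo:
    """Feature data spatiotemporal arrangement information (element names from the text)."""
    feature_channel_count: int          # number of channels C
    feature_frame_count: int            # number of frames after splicing (frame_count)
    feature_single_channel_height: int  # height h of a single channel
    feature_single_channel_width: int   # width w of a single channel
    channel_idx: List[int] = field(default_factory=list)  # sorted index -> original channel
```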
  • An embodiment of the present disclosure provides a method for decoding feature data, which is applied to a decoder; with reference to FIG. 15 , the method includes the following steps:
  • Step 1201 Parse the code stream to obtain the reconstructed feature frame sequence.
  • Step 1202 Reversely sort the reconstructed feature frame sequence to obtain feature data of the reconstructed multiple channels.
  • The decoding method provided by the embodiment of the present disclosure parses the code stream to obtain the reconstructed feature frame sequence and reversely sorts the reconstructed feature frame sequence to obtain the reconstructed feature data of multiple channels, so that the feature data of the multiple channels before the spatiotemporal arrangement can be accurately recovered and used by the subsequent network for further task inference analysis.
  • An embodiment of the present disclosure provides a method for decoding feature data, which is applied to a decoder; with reference to FIG. 16 , the method includes the following steps:
  • Step 1301 Parse the code stream to obtain the reconstructed feature frame sequence, the channel sequence correspondence, the number of channels, the number of time-domain spliced frames, the height of the feature data of a single channel, and the width of the feature data of a single channel.
  • Step 1302 Determine the location of the feature data of each channel in the reconstructed feature frame sequence based on the number of channels, the number of time-domain splicing frames, the height of the feature data of a single channel, and the width of the feature data of a single channel.
  • Step 1303 based on the channel sequence correspondence, determine the original channel sequence of the feature data at different positions in the reconstructed feature frame sequence.
  • Step 1304 based on the original channel sequence, reversely sort the feature data at different positions in the reconstructed feature frame sequence to obtain the feature data of the reconstructed multiple channels.
  • After decoding the reconstructed feature frame sequence and the reconstructed feature data spatiotemporal arrangement information, the decoding end performs a spatiotemporal inverse arrangement operation on the reconstructed feature frame sequence to obtain the reconstructed feature data.
  • The steps are as follows:
  • based on the number of channels C, the number frame_count of time-domain splicing frames, the height h of the feature data of a single channel, and the width w of the feature data of a single channel, determine the location of the feature data of each channel in the feature frame sequence;
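  • The following is a minimal sketch of this inverse arrangement, assuming a one-dimensional channel_idx, no inter-channel filling, and splicing that proceeds spatially in raster order within each frame and then on to the next frame; all names are illustrative.

```python
import numpy as np

def inverse_arrange(frames: np.ndarray, channel_idx, c: int, h: int, w: int) -> np.ndarray:
    """Recover c x h x w reconstructed feature data from an F x H x W frame sequence.

    channel_idx[i] is the original channel of the i-th spliced channel; any
    grid positions beyond the first c are treated as padding and skipped.
    """
    _, H, W = frames.shape
    rows, cols = H // h, W // w
    recon = np.zeros((c, h, w), dtype=frames.dtype)
    i = 0
    for frame in frames:
        for r in range(rows):
            for k in range(cols):
                if i >= c:  # remaining positions are padding
                    return recon
                recon[channel_idx[i]] = frame[r * h:(r + 1) * h, k * w:(k + 1) * w]
                i += 1
    return recon
```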
  • An embodiment of the present disclosure provides a method for decoding feature data, which is applied to a decoder; with reference to FIG. 17 , the method includes the following steps:
  • Step 1401 Parse the code stream to obtain the reconstructed feature frame sequence, the channel sequence correspondence, the number of channels, the number of time-domain splicing frames, the height of the filled feature data, the width of the filled feature data, the height of the feature data of a single channel, and the width of the feature data of a single channel.
  • Step 1402 Determine the reconstructed feature frame based on the number of channels, the number of time-domain stitching frames, the height of the filled feature data, the width of the filled feature data, the height of the feature data of a single channel, and the width of the feature data of a single channel The location of the feature data for each channel in the sequence.
  • Step 1403 based on the channel sequence correspondence, determine the original channel sequence of the feature data at different positions in the reconstructed feature frame sequence.
  • Step 1404 based on the original channel sequence, reversely sort the feature data at different positions in the reconstructed feature frame sequence to obtain the reconstructed feature data of multiple channels.
  • The present disclosure has at least the following beneficial effects: based on the information redundancy between different channels of the multi-channel feature data output by the intermediate layer of the neural network, all channels of the multi-channel feature data are sorted according to similarity and then arranged, in the sorted order, into a feature frame sequence in the time domain and the space domain, so that feature data channels with high similarity in adjacent regions can be referred to during encoding, improving the coding efficiency of the feature data. If the splicing is performed first in the time domain and then in the space domain, inter-frame coding techniques can be better used to encode the feature data; if it is performed first in the space domain and then in the time domain, intra-frame coding techniques can be better used; in either case, techniques in existing video coding standards can be reused to encode the feature data efficiently.
  • In other words, in order to efficiently reuse techniques in existing video coding standards to encode the multi-channel feature data output by the intermediate layer of the neural network, the present disclosure sorts all channels of the feature data according to similarity and arranges them into a feature frame sequence in the time domain and the space domain. Since the correlation between adjacent channels in the time domain and the space domain after the arrangement is relatively large, the present disclosure can better utilize existing intra-frame prediction and inter-frame prediction, further improving the coding efficiency of the feature data. In order to restore the multi-channel feature data before the spatiotemporal arrangement after decoding, the spatiotemporal arrangement information of the feature data needs to be recorded in the code stream.
  • FIG. 18 is a schematic diagram of the composition and structure of an encoding device provided by an embodiment of the present disclosure.
  • the encoding device 150 includes a first obtaining unit 1501, a first processing unit 1502, and an encoding unit 1503, wherein:
  • the first obtaining unit 1501 is configured to obtain feature data of multiple channels corresponding to an image to be processed;
  • the first processing unit 1502 is configured to determine feature data of a reference channel in the feature data of the multiple channels;
  • the first processing unit 1502 is configured to take the feature data of the reference channel as the starting object for sorting and sort the feature data of the multiple channels in descending order of similarity between the feature data of the multiple channels, to obtain sorted feature data of the multiple channels;
  • the first processing unit 1502 is configured to splice the sorted feature data of the multiple channels to obtain a target feature frame sequence;
  • the encoding unit 1503 is configured to encode the target feature frame sequence to generate a code stream.
  • In other embodiments of the present disclosure, the first processing unit 1502 is configured to, when the accumulated sum of the feature data values in the feature data of the multiple channels satisfies a target threshold, determine the feature data of the channel corresponding to the accumulated sum as the feature data of the reference channel.
  • In other embodiments of the present disclosure, the accumulated sum of the feature data values satisfying the target threshold includes: the accumulated sum of the feature data values being the largest, or the accumulated sum of the feature data values being the smallest.
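For concreteness, a minimal sketch of this selection rule, under the assumption that the feature data is held as a (C, h, w) array; pick_reference_channel is a hypothetical name:

    import numpy as np

    def pick_reference_channel(feat, use_max=True):
        """Return the index of the channel whose accumulated sum of feature
        data values is the largest (or, with use_max=False, the smallest);
        this channel serves as the starting object for sorting."""
        sums = feat.reshape(feat.shape[0], -1).sum(axis=1)
        return int(np.argmax(sums) if use_max else np.argmin(sums))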
  • In other embodiments of the present disclosure, the first obtaining unit 1501 is configured to obtain the channel sequence correspondence between the original channel order of the feature data of the multiple channels in the image to be processed and the encoding channel order in the sorted feature data of the multiple channels;
  • the encoding unit 1503 is configured to write the channel sequence correspondence into the code stream.
  • In other embodiments of the present disclosure, the channel sequence correspondence includes: when the number of time-domain splicing frames is one frame, the original channel order being the Xth channel and the corresponding encoding channel order being the Ith channel; and when the number of time-domain splicing frames is at least two frames, the original channel order being the Xth channel and the corresponding encoding channel order being the Ith channel of the Nth frame.
  • In other embodiments of the present disclosure, the first processing unit 1502 is configured to determine that the number of time-domain splicing frames is greater than one frame, and splice the sorted feature data in the spatiotemporal domain according to the splicing strategy to obtain the target feature frame sequence.
  • In other embodiments of the present disclosure, the first processing unit 1502 is configured to: determine that the number of time-domain splicing frames is greater than one frame, and splice the sorted feature data in the spatiotemporal domain according to the splicing strategy to obtain spliced feature data; determine the product of the number of rows of the spliced feature data, the number of columns of the spliced feature data, and the number of time-domain splicing frames; and determine that the number of channels of the feature data of the multiple channels is less than the product, and fill the regions lacking feature data channels in the spliced frames to obtain the target feature frame sequence.
  • In other embodiments of the present disclosure, the first processing unit 1502 is configured to: determine that the number of time-domain splicing frames is greater than one frame; according to the time-domain-first-then-space-domain splicing strategy in the spatiotemporal domain, splice in the time domain at the same position of different frames in raster scan order; and splice in the space domain at adjacent positions in raster scan order, or splice in the space domain at adjacent positions in zigzag scan order.
  • In other embodiments of the present disclosure, the first processing unit 1502 is configured to: determine that the number of time-domain splicing frames is greater than one frame; according to the space-domain-first-then-time-domain splicing strategy in the spatiotemporal domain, splice in the space domain at adjacent positions in raster scan order, or splice in the space domain at adjacent positions in zigzag scan order; and splice in the time domain at the same position of different frames in raster scan order.
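The two strategies differ only in the order in which channel slots are visited. A hedged sketch follows: slot_order is a hypothetical helper, and the zigzag pattern shown (reversing every other spatial row) is one plausible reading of the Z-shaped scan illustrated in FIG. 12.

    def slot_order(frame_count, rows, cols, time_first=True, zigzag=False):
        """Yield (frame, row, col) slots in the order in which sorted
        channels are placed on the frame grid."""
        def spatial():
            for r in range(rows):
                cs = range(cols) if (not zigzag or r % 2 == 0) else range(cols - 1, -1, -1)
                for c in cs:
                    yield r, c
        if time_first:
            for r, c in spatial():           # same slot across frames first
                for n in range(frame_count):
                    yield n, r, c
        else:
            for n in range(frame_count):     # fill each frame, then the next
                for r, c in spatial():
                    yield n, r, c

For example, list(slot_order(2, 1, 2)) gives [(0, 0, 0), (1, 0, 0), (0, 0, 1), (1, 0, 1)], matching the time-domain-first placement at the same position of different frames.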
  • In other embodiments of the present disclosure, the first processing unit 1502 is configured to determine that the number of time-domain splicing frames is one frame, and splice the sorted channel feature data in the space domain according to the splicing strategy to obtain the target feature frame sequence.
  • In other embodiments of the present disclosure, the first processing unit 1502 is configured to splice the sorted channel feature data in the space domain according to a fill-first-then-splice strategy.
  • In other embodiments of the present disclosure, the first processing unit 1502 is configured to pad each piece of sorted feature data in the space domain and splice the padded feature data in the space domain, where there are gaps between the padded feature data of adjacent channels.
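A minimal sketch of such padding, assuming a constant fill value (the disclosure does not fix the value; edge replication would serve the same purpose), with pad_channel a hypothetical name:

    import numpy as np

    def pad_channel(channel, gap=4, value=0):
        """Surround one channel's (h, w) feature data with a border of
        gap//2 samples so that, after splicing, the feature data of
        adjacent channels is separated by a gap of 'gap' samples."""
        return np.pad(channel, gap // 2, mode="constant",
                      constant_values=value)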
  • In other embodiments of the present disclosure, the encoding unit 1503 is configured to write the height of the padded feature data and the width of the padded feature data into the code stream; write the number of channels corresponding to the feature data of the multiple channels, the height of the feature data of a single channel, and the width of the feature data of a single channel into the code stream; and write the number of time-domain splicing frames into the code stream.
  • the first obtaining unit 1501 is configured to obtain an image to be processed
  • the first processing unit 1502 is configured to perform feature extraction on the image to be processed through a neural network model to obtain feature data of multiple channels.
  • FIG. 19 is a schematic diagram of the composition and structure of a decoding device provided by an embodiment of the present disclosure.
  • the decoding device 160 includes a decoding unit 1601 and a second processing unit 1602, wherein:
  • the decoding unit 1601 is configured to parse the code stream to obtain the reconstructed feature frame sequence
  • the second processing unit 1602 is configured to inverse-sort the reconstructed feature frame sequence to obtain reconstructed feature data of multiple channels.
  • the decoding unit 1601 is configured to parse the code stream to obtain the channel sequence correspondence, the number of channels, the number of time-domain splicing frames, the height of the feature data of a single channel, and the width of the feature data of a single channel.
  • the second processing unit 1602 is configured to: determine the position of the feature data of each channel in the reconstructed feature frame sequence based on the number of channels, the number of time-domain splicing frames, the height of the feature data of a single channel, and the width of the feature data of a single channel; based on the channel sequence correspondence, determine the original channel order of the feature data at different positions in the reconstructed feature frame sequence; and, based on the original channel order, inverse-sort the feature data at different positions in the reconstructed feature frame sequence to obtain the reconstructed feature data of the multiple channels.
  • the decoding unit 1601 is configured to parse the code stream to obtain the height of the padded feature data and the width of the padded feature data;
  • the second processing unit 1602 is configured to determine the position of the feature data of each channel in the reconstructed feature frame sequence based on the number of channels, the number of time-domain splicing frames, the height of the padded feature data, the width of the padded feature data, the height of the feature data of a single channel, and the width of the feature data of a single channel.
  • FIG. 20 is a schematic diagram of the composition and structure of an encoding device provided by an embodiment of the present disclosure.
  • As shown in FIG. 20, the encoding device 170 (the encoding device 170 in FIG. 20 corresponds to the encoding device 150 in FIG. 18) includes a first memory 1701 and a first processor 1702, where:
  • the first processor 1702 is configured to implement the encoding method provided by the embodiment of the present disclosure when executing the encoding instruction stored in the first memory 1701 .
  • The first processor 1702 may be implemented by software, hardware, firmware, or a combination thereof, and may use circuits, one or more application-specific integrated circuits (ASICs), one or more general-purpose integrated circuits, one or more microprocessors, one or more programmable logic devices, a combination of the foregoing circuits or devices, or other suitable circuits or devices, so that the processor can perform the corresponding steps of the foregoing encoding methods.
  • FIG. 21 is a schematic structural diagram of a decoding device provided by an embodiment of the present disclosure.
  • As shown in FIG. 21, the decoding device 180 (the decoding device 180 in FIG. 21 corresponds to the decoding device 160 in FIG. 19) includes a second memory 1801 and a second processor 1802, where:
  • the second processor 1802 is configured to implement the decoding method provided by the embodiment of the present disclosure when executing the decoding instruction stored in the second memory 1801.
  • The second processor 1802 may be implemented by software, hardware, firmware, or a combination thereof, and may use circuits, one or more application-specific integrated circuits (ASICs), one or more general-purpose integrated circuits, one or more microprocessors, one or more programmable logic devices, a combination of the foregoing circuits or devices, or other suitable circuits or devices, so that the processor can perform the corresponding steps of the foregoing decoding methods.
  • Each component in the embodiments of the present disclosure may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units can be implemented in the form of hardware, or can be implemented in the form of software function modules.
  • If the integrated unit is implemented in the form of a software function module and is not sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on such an understanding, the technical solution of this embodiment, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a cloud server, a network device, or the like) or a processor to execute all or part of the steps of the method of this embodiment.
  • The foregoing storage media include various media that can store program code, such as a ferromagnetic random access memory (FRAM), a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), a flash memory, a magnetic surface memory, an optical disc, or a compact disc read-only memory (CD-ROM); the embodiments of the present disclosure are not limited thereto.
  • Embodiments of the present disclosure further provide a computer-readable storage medium storing executable encoding instructions for causing a first processor to implement, when executed, the encoding method provided by the embodiments of the present disclosure.
  • Embodiments of the present disclosure further provide a computer-readable storage medium storing executable decoding instructions for causing a second processor to implement, when executed, the decoding method provided by the embodiments of the present disclosure.
  • In summary, the embodiments of the present disclosure provide an encoding method, a decoding method, an encoder, a decoder, and a storage medium for feature data. The encoding method obtains feature data of multiple channels corresponding to an image to be processed; determines feature data of a reference channel in the feature data of the multiple channels; takes the feature data of the reference channel as the starting object for sorting and sorts the feature data of the multiple channels in descending order of similarity between the feature data of the multiple channels, to obtain sorted feature data of the multiple channels; splices the sorted feature data of the multiple channels to obtain a target feature frame sequence; and encodes the target feature frame sequence to generate a code stream. In other words, given the feature data of multiple channels, the feature data of one channel is taken as the benchmark, that is, the feature data of the reference channel is determined, and the feature data of all channels is sorted in descending order of similarity to the feature data of the reference channel. After sorting, the correlation between the feature data of adjacent channels in the spatiotemporal domain is relatively large, so that feature data channels with higher similarity in adjacent regions can be referred to in subsequent coding, thereby improving the coding efficiency of the feature data.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

Embodiments of the present disclosure provide an encoding method, a decoding method, an encoder, a decoder, and a storage medium for feature data. The encoding method includes: obtaining feature data of multiple channels corresponding to an image to be processed; determining feature data of a reference channel in the feature data of the multiple channels; taking the feature data of the reference channel as the starting object for sorting, sorting the feature data of the multiple channels in descending order of similarity between the feature data of the multiple channels, to obtain sorted feature data of the multiple channels; splicing the sorted feature data of the multiple channels to obtain a target feature frame sequence; and encoding the target feature frame sequence to generate a code stream.

Description

Encoding method, decoding method, device, and storage medium for feature data — Technical Field
Embodiments of the present disclosure relate to encoding and decoding technologies in the field of communications, and in particular to an encoding method, a decoding method, an encoder, a decoder, and a storage medium for feature data.
Background
At present, in conventional video encoding and decoding, the feature map codec pipeline comprises three main modules: pre-quantization/inverse pre-quantization, repacking/inverse repacking, and conventional video encoding/decoding. The pre-quantized and repacked feature map array data is fed into a conventional video encoder for compression in the form of luma-chroma (YUV) video data, and the bitstream produced by the conventional video encoder is contained in the feature map data bitstream. Repacking/inverse repacking has several selectable modes: superposition of the feature maps in a specified order, and tiling of the feature maps in the default order or in a specified order.
However, in the superposition mode, only a single list is used to describe the order of the feature channels, and the reference relationships between feature channels in the video codec are neither guided nor designed, so the coding efficiency between the superposed feature channels is not high. In the tiling mode, the multi-channel data of the features is tiled into one image in a single list order, with the multi-channel data closely adjacent; as a result, when the tiled image is encoded with existing feature data processing methods, the block partitioning operation places data of multiple channels into the same coding unit. Because of the discontinuity between the data of different channels, the correlation of the data of different channels within the same coding unit is poor, so the efficiency of existing feature data processing methods cannot be brought into full play.
It can thus be seen that, in the related art, encoding based on feature data suffers at least from the problem of low coding efficiency.
Summary
Embodiments of the present disclosure provide an encoding method, a decoding method, an encoder, a decoder, and a storage medium for feature data. By sorting the feature data of all channels, the correlation between adjacent channels in the spatiotemporal domain after sorting is relatively large, so that subsequent encoding can refer to feature data channels with higher similarity in adjacent regions, thereby improving the coding efficiency of the feature data.
The technical solutions of the embodiments of the present disclosure may be implemented as follows:
In a first aspect, an embodiment of the present disclosure provides an encoding method for feature data, including: obtaining feature data of multiple channels corresponding to an image to be processed; determining feature data of a reference channel in the feature data of the multiple channels; taking the feature data of the reference channel as the starting object for sorting, sorting the feature data of the multiple channels in descending order of similarity between the feature data of the multiple channels, to obtain sorted feature data of the multiple channels; splicing the sorted feature data of the multiple channels to obtain a target feature frame sequence; and encoding the target feature frame sequence to generate a code stream.
In a second aspect, an embodiment of the present disclosure further provides a decoding method for feature data, including: parsing a code stream to obtain a reconstructed feature frame sequence; and inverse-sorting the reconstructed feature frame sequence to obtain reconstructed feature data of multiple channels.
In a third aspect, an embodiment of the present disclosure provides an encoder including a first obtaining unit, a first processing unit, and an encoding unit, where: the first obtaining unit is configured to obtain feature data of multiple channels corresponding to an image to be processed; the first processing unit is configured to determine feature data of a reference channel in the feature data of the multiple channels; the first processing unit is configured to take the feature data of the reference channel as the starting object for sorting and sort the feature data of the multiple channels in descending order of similarity between the feature data of the multiple channels, to obtain sorted feature data of the multiple channels; the first processing unit is configured to splice the sorted feature data of the multiple channels to obtain a target feature frame sequence; and the encoding unit is configured to encode the target feature frame sequence to generate a code stream.
In a fourth aspect, an embodiment of the present disclosure provides a decoder including a decoding unit and a second processing unit, where: the decoding unit is configured to parse a code stream to obtain a reconstructed feature frame sequence; and the second processing unit is configured to inverse-sort the reconstructed feature frame sequence to obtain reconstructed feature data of multiple channels.
In a fifth aspect, an embodiment of the present disclosure provides an encoder, including: a first memory configured to store a computer program runnable on a first processor; and the first processor, configured to perform, when running the computer program, the encoding method of the first aspect.
In a sixth aspect, an embodiment of the present disclosure further provides a decoder, including: a second memory configured to store a computer program runnable on a second processor; and the second processor, configured to perform, when running the computer program, the decoding method of the second aspect.
In a seventh aspect, an embodiment of the present disclosure provides a computer-readable storage medium storing executable encoding instructions for causing a first processor to implement, when executed, the encoding method of the first aspect.
In an eighth aspect, an embodiment of the present disclosure provides a computer-readable storage medium storing executable decoding instructions for causing a second processor to implement, when executed, the decoding method of the second aspect.
Embodiments of the present disclosure provide an encoding method, a decoding method, an encoder, a decoder, and a storage medium for feature data. The encoding method obtains feature data of multiple channels corresponding to an image to be processed; determines feature data of a reference channel in the feature data of the multiple channels; takes the feature data of the reference channel as the starting object for sorting and sorts the feature data of the multiple channels in descending order of similarity between the feature data of the multiple channels, to obtain sorted feature data of the multiple channels; splices the sorted feature data of the multiple channels to obtain a target feature frame sequence; and encodes the target feature frame sequence to generate a code stream. In other words, given the feature data of multiple channels, the present disclosure takes the feature data of one channel as the benchmark, that is, determines the feature data of the reference channel, and sorts the feature data of all channels in descending order of similarity to the feature data of the reference channel. In this way, the correlation between the feature data of adjacent channels in the spatiotemporal domain after sorting is relatively large, so that subsequent encoding can refer to feature data channels with higher similarity in adjacent regions, thereby improving the coding efficiency of the feature data.
Brief Description of the Drawings
FIG. 1 is a schematic diagram of the "pre-analysis then compression" framework provided by an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of an encoding pipeline in the related art provided by an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of another encoding pipeline in the related art provided by an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of spatiotemporal splicing in the related art provided by an embodiment of the present disclosure;
FIG. 5 is a first schematic flowchart of an exemplary encoding method for feature data provided by an embodiment of the present disclosure;
FIG. 6 is a second schematic flowchart of an exemplary encoding method for feature data provided by an embodiment of the present disclosure;
FIG. 7 is a third schematic flowchart of an exemplary encoding method for feature data provided by an embodiment of the present disclosure;
FIG. 8 is a fourth schematic flowchart of an exemplary encoding method for feature data provided by an embodiment of the present disclosure;
FIG. 9 is a fifth schematic flowchart of an exemplary encoding method for feature data provided by an embodiment of the present disclosure;
FIG. 10 is a sixth schematic flowchart of an exemplary encoding method for feature data provided by an embodiment of the present disclosure;
FIG. 11 is a schematic diagram of raster-scan splicing provided by an embodiment of the present disclosure;
FIG. 12 is a schematic diagram of zigzag-scan splicing provided by an embodiment of the present disclosure;
FIG. 13 is a seventh schematic flowchart of an exemplary encoding method for feature data provided by an embodiment of the present disclosure;
FIG. 14 is a schematic diagram of padding between the feature data of spatially adjacent channels provided by an embodiment of the present disclosure;
FIG. 15 is a first schematic flowchart of an exemplary decoding method for feature data provided by an embodiment of the present disclosure;
FIG. 16 is a second schematic flowchart of an exemplary decoding method for feature data provided by an embodiment of the present disclosure;
FIG. 17 is a third schematic flowchart of an exemplary decoding method for feature data provided by an embodiment of the present disclosure;
FIG. 18 is a schematic structural diagram of an encoder provided by an embodiment of the present disclosure;
FIG. 19 is a schematic structural diagram of a decoder provided by an embodiment of the present disclosure;
FIG. 20 is a schematic structural diagram of an encoder provided by an embodiment of the present disclosure;
FIG. 21 is a schematic structural diagram of a decoder provided by an embodiment of the present disclosure.
Detailed Description
To make the objects, technical solutions, and advantages of the embodiments of the present disclosure clearer, the specific technical solutions of the present disclosure are described in further detail below with reference to the accompanying drawings in the embodiments of the present disclosure. The following embodiments are used to illustrate the present disclosure, but are not intended to limit its scope.
Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by those skilled in the technical field to which the present disclosure belongs. The terms used herein are only for the purpose of describing the embodiments of the present disclosure and are not intended to limit the present disclosure.
In the following description, reference is made to "some embodiments", which describe a subset of all possible embodiments; it should be understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict.
It should be noted that the terms "first/second/third" in the embodiments of the present disclosure merely distinguish similar objects and do not imply a specific ordering of the objects; it should be understood that "first/second/third" may, where permitted, be interchanged in a specific order or sequence, so that the embodiments of the present disclosure described herein can be implemented in an order other than that illustrated or described herein.
Before the embodiments of the present disclosure are described in further detail, the nouns and terms involved in the embodiments of the present disclosure are explained; they are applicable to the following interpretations.
1) A three-dimensional feature data tensor (3D Feature Data Tensor) includes the number of channels (Channel, C), the height (Height, H), and the width (Width, W).
2) Feature data refers to the output data at the intermediate layer of neural networks.
In intelligent-analysis-oriented application scenarios, video and images are not only presented to users for high-quality viewing, but are increasingly used to analyze and understand the semantic information they contain. To meet the more distinctive analysis demands that intelligent analysis tasks place on video and image coding, compression coding is currently shifting from the traditional direct compression of images to compression of the feature data output by the intermediate layers of intelligent-analysis task networks.
An end-side device such as a camera first pre-analyzes the captured or input original video and image data with a task network, extracts feature data sufficient for cloud-side analysis, and compresses, encodes, and transmits the feature data. After receiving the corresponding bitstream, the cloud-side device reconstructs the corresponding feature data according to the syntax information of the bitstream and feeds it into the specific task network for further analysis. This "pre-analysis then compression" framework is shown in FIG. 1. Under this framework, a large amount of feature data is transmitted between the end-side device and the cloud-side device; the purpose of feature data compression is to compress and encode, in a recoverable manner, the feature data extracted from the existing task network for further intelligent analysis and processing in the cloud.
Referring to FIG. 1, in an application scenario such as face recognition, after an end-side device such as an image capture device captures a portrait, it feeds the portrait into a neural network for face recognition. Here, the neural network includes at least one neural network; for example, the neural network includes task network A, task network B, and task network C, and these neural networks may be the same or different. Taking a 10-layer neural network as an example, because the local computing power of the image capture device is insufficient, only 5 layers can be executed. After the intermediate layer of the neural network outputs the original feature data, the image capture device processes the original feature data to obtain feature data that meets the data input conditions of the feature encoding device; further, the image capture device sends the qualifying feature data to the encoding device, and the encoding device encodes the qualifying feature data and writes it into a code stream. The encoding device then sends the code stream to a feature decoding device; here, the feature decoding device may be deployed in a cloud device such as a cloud server. That is, after obtaining the code stream, the end-side device hands it over to the cloud server for processing. The cloud server decodes and reconstructs the code stream through the feature decoding device to obtain reconstructed feature data; finally, the cloud server feeds the reconstructed feature data corresponding to each channel into the 6th layer of the neural network and continues execution through the 10th layer to obtain the recognition result.
To address the efficient coding of video and images for such intelligent-analysis task scenarios, the Moving Picture Experts Group (MPEG) established the Video Coding for Machines (VCM) standard working group at its 127th meeting in July 2019 to study this technology. Its goal is to define a bitstream for compressed video, or for feature information extracted from video, such that the same bitstream can be used to perform multiple intelligent analysis tasks without significantly degrading analysis performance, while the decompressed information is friendlier to intelligent analysis tasks and the performance loss at the same bit rate is smaller. Meanwhile, the multimedia subcommittee under the National Information Technology Standardization Technical Committee held its first working group meeting in Hangzhou, Zhejiang Province in January 2020 and correspondingly established the Data Compression for Machines (DCM) standard working group to study applications of this technology, aiming to support machine-intelligence applications or hybrid human-machine intelligent applications through efficient data representation and compression.
The VCM standard working group has designed a corresponding potential coding flowchart, as shown in FIG. 2, to improve the coding efficiency of video and images for intelligent analysis tasks. Video and images can be passed directly through a video and image encoder optimized for the task, or feature data can be extracted by network pre-analysis and encoded, with the decoded feature data fed into the subsequent network for further analysis. If existing video and image coding standards are to be reused to compress the extracted feature data, the floating-point feature data must be converted to fixed-point representation and transformed into input suitable for the existing codec standards; for example, the multi-channel feature data is spliced into a feature frame sequence of one or more frames in YUV format and fed into the video encoder for compression.
The compression of the feature data output by intermediate layers of task networks deserves in-depth study in the codec process, for example, studying the coding efficiency of lossless and lossy compression of the feature data output at different levels of commonly used task networks. By compressing feature data with the reference software of the video coding standard H.265/HEVC, the applicant has found that within a large bit-rate interval the signal fidelity of the feature data varies little, whereas when the bit rate falls below a certain threshold the signal fidelity of the feature data drops sharply. Another example is studying lossy compression of feature data with existing video coding standards, introducing the lossy compression into network training, and proposing strategies to improve task accuracy under lossy compression.
Since reusing conventional video coding standards requires converting the feature data into a YUV-format feature frame sequence, studies of the conversion have spliced the multi-channel feature data output by a task network, in channel order, into single-frame and multi-frame feature frame sequences in the spatial domain for compression coding. The experimental results show that for feature data output by shallow network layers the two approaches have similar coding efficiency, while for feature data output by deep network layers, splicing into a single frame is clearly more efficient than a multi-frame feature frame sequence. Regarding evaluation metrics for feature data coding efficiency, when task accuracy is used as the metric, compressed feature data can in some cases achieve higher performance than the target data, so corresponding metrics have been established for the three tasks of image classification, image retrieval, and image recognition. Because a task network may be over-fitted or under-fitted after training, the task performance of the feature data at higher bit rates may exceed the target performance, and establishing a separate set of metrics for each task generalizes poorly; therefore, a suitable bit-rate interval can be selected, that is, the coding efficiency of feature data is measured while disregarding cases of excessively high bit rate or excessively low task performance.
In addition, a neural network can be used to reduce the dimensionality of the feature data to achieve the goal of compressing the data volume.
In summary, current research on feature data compression mainly has three problems. First, the studied compression techniques only target special application scenarios, namely the few task networks whose intermediate-layer data volume is smaller than the target input, and perform poorly on the feature data output by most other task networks. Second, the studied methods only consider the compressed data volume without considering task quality; for example, using a neural network to reduce the dimensionality of the feature data makes it difficult to meet high-accuracy task requirements, and, because the post-compression task quality is not considered, the compression of the feature data cannot be properly guided or evaluated. Third, feature data is compressed by combining conventional video and image coding techniques, but the differences between feature data and conventional video and images are not considered, so existing video and image coding techniques are not used efficiently to achieve high coding efficiency.
The codec pipelines in the related art are further described below.
Related art 1: the feature data codec pipeline, as shown in FIG. 3, includes three main modules: pre-quantization/inverse pre-quantization, repacking/inverse repacking, and conventional video encoding/decoding. The modules are as follows.
Pre-quantization/inverse pre-quantization: when the target input feature map is floating-point, the feature map needs to be pre-quantized into integer data meeting the input requirements of a conventional video encoder.
Repacking/inverse repacking: the repacking module transforms the three-dimensional array of the target feature map into YUV-format information meeting the input requirements of a conventional video encoder, and improves the conventional video encoder's coding efficiency for the feature map data by changing how the feature maps are combined. Repacking/inverse repacking has several selectable modes: superposition of the feature maps in a specified order, and tiling of the feature maps in the default order or in a specified order.
Superposition of feature maps in a specified order: in this mode, each channel of the feature map corresponds to one frame of the conventional video encoder's input data. The height and width of the feature map are padded to the height and width required by the conventional video encoder's input. The channel order of the feature map is stored in a repacking order list (repack_order_list) for the feature map, whose content may default to the natural order array (for example, [0, 1, 2, 3, ...]).
In the superposition mode, only a single list is used to describe the order of the feature channels; the order of the feature channels is not optimally arranged according to the correlation between feature channels, and the reference relationships between feature channels in the video codec are neither guided nor designed, so the coding efficiency between the superposed feature channels is not high.
Tiling of feature maps in the default or a specified order: in this mode, multiple channels of the feature map are tiled and spliced into one two-dimensional array as one frame of the conventional video encoder's input data. The height and width of the spliced array are padded to those required by the conventional video encoder's input. The splicing order is the channel order of the target feature map, arranged first along the width of the array and then along the height; when the current frame is full, the next frame is created and tiling continues until all channels of the feature map have been tiled. The channel order is recorded by repack_order_list, whose content may default to the natural order array (for example, [0, 1, 2, 3, ...]).
In the tiling mode, the multi-channel data of the feature is tiled into one image in a single list order, with the multi-channel data closely adjacent; consequently, when the tiled image is encoded with existing codec methods, the block partitioning operation places data of multiple channels into the same coding unit. Because of the discontinuity between the data of different channels, the correlation of the data of different channels within the same coding unit is poor, so the efficiency of existing codec methods cannot be brought into full play and the compression of the feature data is not good enough.
Conventional video encoding/decoding: the pre-quantized and repacked feature map array data is fed into a conventional video encoder for compression in the form of YUV video data, and the bitstream produced by the conventional video encoder is contained in the feature map data bitstream. For an HEVC video encoder, the feature map array is input in YUV 4:0:0 format; for an AVS3 video encoder, it is input in YUV 4:2:0 format.
Related art 2: in MPEG Immersive Video, there is a technique for re-expressing and re-arranging the image content captured by the various cameras at the same time instant, for efficient expression and coding of visual information. Specifically, in MPEG Immersive Video, multiple cameras are placed in the scene to be captured according to certain positional relationships; these cameras are also called reference viewpoints. There is some visual redundancy between the contents captured by the reference viewpoints, so the encoder side needs to re-express and re-organize the images of all reference viewpoints to remove the inter-viewpoint visual redundancy, and the decoder side needs to parse and restore the re-expressed and re-organized information.
On the encoder side, the images of the reference viewpoints are re-expressed by cropping rectangular sub-block images (patches) of various sizes from the reference viewpoint images. After all necessary patches are cropped, they are sorted from large to small. In that order, the patches are placed one by one onto a to-be-filled image of larger resolution, called an atlas. When a patch is placed, the top-left pixel of the patch always falls on the top-left pixel of one of the 8x8 image blocks into which the to-be-filled image is partitioned. Each time a patch is placed, its placement index, the coordinates of its top-left pixel, and its resolution are recorded and stored in order in a repacking order list for the patches. After all patches are placed, the atlas and the patch information list are sent to a conventional video codec for encoding.
On the decoder side, after the reconstructed atlas and the patch information list are obtained, the pixels inside the patches placed in the atlas are rendered one by one in the placement order recorded in the patch information list, thereby synthesizing an image at the viewer's viewpoint.
The scheme for re-expressing and re-arranging visual information in MPEG Immersive Video places the patches only according to the strategy of sorting by patch area from large to small. The texture similarity and spatial position similarity between the patches are not considered during placement, so when the re-organized atlas image is fed into a conventional video codec, the efficiency of existing codec methods cannot be fully exploited.
Related art 3: a spatiotemporal splicing method for feature data based on similarity measurement, as shown in FIG. 4. Experiments were built on the multi-channel feature data output by the intermediate layers of the Visual Geometry Group (VGG) network and the residual network ResNet under an image recognition task; by reusing the existing video coding standard H.265/HEVC to compress the feature data, the coding efficiency can be improved by 2.27% on average over the spatial-only arrangement method.
On the encoder side, the feature data output at a specific level is currently spliced into two frames in channel order; the similarity between the two frames is measured by the mean square error (MSE), the feature data channels of the two frames are iteratively swapped and the similarity between the two frames is computed, and finally the arrangement that maximizes the similarity of the two frames is obtained. A list corresponding to the target channel order and the new channel arrangement order is transmitted to the decoder side.
After the corresponding feature data is reconstructed on the decoder side, the target arrangement of the feature data is restored using the list corresponding to the target channels and the new channel arrangement order, and fed into the subsequent task network for further inference analysis.
Under the premise of dividing the feature data into two frames in channel order, similarity is maximized by swapping feature data channels between the two frames; the correlation between feature data channels within the same frame is not considered, nor is the arrangement method for more than two frames, so the correlation between different channels is not fully exploited during encoding to achieve the best coding efficiency.
To solve the problems in the related art and to fully mine and exploit the similarity between the channels of feature data, the present disclosure provides a technique for spatiotemporal sorting, splicing, encoding, and decoding. The basic idea of the technique is as follows. In the preprocessing stage, the multi-channel feature data output by the intermediate layer of the neural network is sorted, and the feature data of the channels is spliced, in the sorted order and in a specific manner, into a multi-frame feature frame sequence in the spatiotemporal domain. In the encoding stage, the feature frame sequence is encoded with an optimized inter-frame reference structure, the key preprocessing information is also encoded, and the final code stream is obtained. In the decoding stage, the reconstructed feature frame sequence and the reconstructed key preprocessing information are parsed from the received code stream. In the post-processing stage, the reconstructed feature frame sequence is post-processed according to the reconstructed key preprocessing information to obtain reconstructed feature data, which is used by the subsequent network for further task inference analysis.
An embodiment of the present disclosure provides an encoding method for feature data, applied to an encoder; referring to FIG. 5, the method includes the following steps:
Step 501: Obtain feature data of multiple channels corresponding to an image to be processed.
In the embodiments of the present disclosure, step 501 of obtaining the feature data of multiple channels corresponding to the image to be processed may be implemented as follows: obtain the image to be processed, and perform feature extraction on the image to be processed through a neural network model to obtain the feature data of the multiple channels.
In some embodiments, after obtaining the image to be processed, the encoder feeds it into the neural network model and obtains the feature data of each channel output by the intermediate layer of the neural network model. Here, the channels of an image are its feature maps; one channel detects one particular feature, and the magnitude of a value somewhere in the channel reflects the strength of that feature at that location.
Step 502: Determine feature data of a reference channel in the feature data of the multiple channels.
In the embodiments of the present disclosure, the feature data of the reference channel may be the feature data of any one of the multiple channels.
The feature data of the reference channel is determined in order to fix the starting object for the subsequent sorting of the feature data of the multiple channels.
Step 503: Taking the feature data of the reference channel as the starting object for sorting, sort the feature data of the multiple channels in descending order of similarity between the feature data of the multiple channels, to obtain sorted feature data of the multiple channels.
In the embodiments of the present disclosure, once the feature data of the reference channel is determined, the feature data of the multiple channels is sorted, starting from the feature data of the reference channel, in descending order of similarity between the feature data of the multiple channels; that is, the feature data of all channels is sorted from the most similar to the least similar compared with the feature data of the reference channel, yielding the sorted feature data of the multiple channels. It should be noted that, after sorting, the correlation of the feature data between adjacent channels in the spatiotemporal domain is relatively large.
Step 504: Splice the sorted feature data of the multiple channels to obtain a target feature frame sequence.
In the embodiments of the present disclosure, based on the information redundancy between the feature data of the multiple channels output by the intermediate layer of the neural network, the feature data of the multiple channels is sorted according to similarity and then arranged into the target feature frame sequence in the sorted order in the time domain and the space domain, or in the space domain alone, so that subsequent encoding can refer to feature data with higher similarity in adjacent regions, improving the coding efficiency of the feature data.
Step 505: Encode the target feature frame sequence to generate a code stream.
In the embodiments of the present disclosure, when the sorted feature data of the multiple channels is spliced into the target feature frame sequence, if the splicing is performed first in the time domain and then in the space domain, inter-frame coding techniques can be better used to encode the feature data, whereas if the splicing is performed first in the space domain and then in the time domain, intra-frame coding techniques can be better used, so that techniques in existing video coding standards can be reused to encode the feature data efficiently.
The encoding method for feature data provided by the embodiments of the present disclosure obtains the feature data of multiple channels corresponding to the image to be processed; determines the feature data of the reference channel in the feature data of the multiple channels; takes the feature data of the reference channel as the starting object for sorting and sorts the feature data of the multiple channels in descending order of similarity, to obtain the sorted feature data of the multiple channels; splices the sorted feature data of the multiple channels to obtain the target feature frame sequence; and encodes the target feature frame sequence to generate a code stream. In other words, given the feature data of multiple channels, the feature data of one channel is taken as the benchmark, that is, the feature data of the reference channel is determined, and the feature data of all channels is sorted in descending order of similarity to it; after sorting, the correlation between the feature data of adjacent channels in the spatiotemporal domain is relatively large, so that subsequent encoding can refer to feature data channels with higher similarity in adjacent regions, improving the coding efficiency of the feature data.
An embodiment of the present disclosure provides an encoding method for feature data, applied to an encoder; referring to FIG. 6, the method includes the following steps:
Step 601: Obtain feature data of multiple channels corresponding to an image to be processed.
Step 602: When the accumulated sum of the feature data values in the feature data of the multiple channels satisfies a target threshold, determine the feature data of the channel corresponding to that accumulated sum as the feature data of the reference channel.
Here, the accumulated sum of the feature data values satisfying the target threshold includes: the accumulated sum of the feature data values being the largest, or the accumulated sum of the feature data values being the smallest. Choosing the feature data of the channel whose accumulated sum of feature data values is the largest or the smallest as the feature data of the reference channel can improve the coding efficiency.
Step 603: Taking the feature data of the reference channel as the starting object for sorting, sort the feature data of the multiple channels in descending order of similarity between the feature data of the multiple channels, to obtain sorted feature data of the multiple channels.
In the embodiments of the present disclosure, after the starting object for sorting is determined, the similarity between the feature data of each remaining channel and the feature data of the current channel can be determined based on an iterative algorithm; here, the similarity measure may rely on the sum of absolute differences (SAD) and/or the mean squared error (MSE), and the feature data of the channel with the greatest similarity is selected in turn as the feature data of the next channel after sorting.
Exemplarily, SAD may be computed as SAD(A, B) = Σ_{i=1}^{h} Σ_{j=1}^{w} |A(i, j) − B(i, j)|, where A and B denote the feature data of the two channels being compared and h and w are their height and width. [The original formula is provided as an image in the source document.]
Exemplarily, MSE may be computed as MSE(A, B) = (1 / (h·w)) Σ_{i=1}^{h} Σ_{j=1}^{w} (A(i, j) − B(i, j))². [The original formula is provided as an image in the source document.]
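A minimal sketch of this greedy ordering under the SAD measure follows (order_by_similarity is a hypothetical name, and the feature data is assumed to be held as a (C, h, w) array):

    import numpy as np

    def sad(a, b):
        """Sum of absolute differences between two channels."""
        return float(np.abs(a - b).sum())

    def order_by_similarity(feat, ref_idx, metric=sad):
        """Greedy ordering: start from the reference channel and repeatedly
        append the remaining channel most similar (smallest SAD or MSE) to
        the channel just placed.  Returns a list whose i-th entry is the
        original index of the i-th encoded channel (channel_idx[i])."""
        remaining = set(range(feat.shape[0])) - {ref_idx}
        order = [ref_idx]
        while remaining:
            cur = feat[order[-1]]
            nxt = min(remaining, key=lambda k: metric(cur, feat[k]))
            order.append(nxt)
            remaining.remove(nxt)
        return order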
Step 604: Obtain the channel sequence correspondence between the original channel order of the feature data of the multiple channels in the image to be processed and the encoding channel order in the sorted feature data of the multiple channels.
Here, the encoding channel order refers to the channel order of the feature data of each channel after sorting. Subsequent encoding is performed with reference to the sorted channel order, which is therefore called the encoding channel order.
In the embodiments of the present disclosure, after the feature data of all channels has been sorted by similarity, the channel order correspondence before and after sorting is stored. In one implementable scenario, the channel order correspondence before and after sorting may be stored as a sorting list channel_idx.
The sorting list may take multiple forms, including but not limited to: a one-dimensional list, a two-dimensional list, and a three-dimensional list.
In some embodiments of the present disclosure, when the number of time-domain splicing frames is one frame, an original channel order of the Xth channel corresponds to an encoding channel order of the Ith channel.
In this case, the sorting list is the one-dimensional list channel_idx[I] = X, where X may be the original channel order, before sorting, of the feature data placed at the Ith channel after sorting.
In other embodiments of the present disclosure, the correspondence between the original channel order and the encoding channel order includes: when the number of time-domain splicing frames is at least two frames, an original channel order of the Xth channel corresponding to an encoding channel order of the Ith channel of the Nth frame.
In this case, the sorting list is the two-dimensional list channel_idx[N][I] = X, where X may be the original channel order, before sorting, of the feature data placed at the Ith channel of the Nth frame after sorting.
In still other embodiments of the present disclosure, the correspondence between the original channel order and the encoding channel order includes: when the number of time-domain splicing frames is at least two frames, an original channel order of the Xth channel corresponding to an encoding channel order of the Ith channel of the Mth region of the Nth frame.
In this case, the sorting list is the three-dimensional list channel_idx[N][M][I] = X, where X may be the original channel order, before sorting, of the feature data placed at the Ith channel of the Mth region of the Nth frame after sorting.
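For concreteness, hypothetical contents of the three list shapes for C = 8 channels (the values below are purely illustrative, not taken from the disclosure):

    channel_idx_1d = [3, 1, 7, 0, 2, 6, 4, 5]           # channel_idx[I] = X
    channel_idx_2d = [[3, 1, 7, 0], [2, 6, 4, 5]]       # channel_idx[N][I] = X
    channel_idx_3d = [[[3, 1], [7, 0]],
                      [[2, 6], [4, 5]]]                 # channel_idx[N][M][I] = X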
Step 605: Splice the sorted feature data of the multiple channels to obtain a target feature frame sequence.
In the embodiments of the present disclosure, the sorted feature data is spliced in the spatiotemporal domain in a specific splicing manner, being spliced in the time domain into a target feature frame sequence whose number of time-domain splicing frames is frame_count. The number of time-domain splicing frames is the number of frames, set by the encoding end, obtained after splicing the sorted feature data of the multiple channels in the time domain.
In some embodiments of the present disclosure, if the number of time-domain splicing frames frame_count is 1, the feature data is spliced only in the space domain after sorting. The encoding end can flexibly set the number of time-domain splicing frames according to actual needs.
In some embodiments of the present disclosure, assume that each spliced frame holds the feature data of row rows and col columns of channels and that the number of channels of the feature data is C. If C < row * col * frame_count, the vacant feature data channel slots in the last frame can be filled so that the frame is complete for encoding, as sketched below.
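A minimal sketch of this splicing with filling, assuming space-domain-first raster placement and a zero fill value (both are assumptions; splice_with_fill is a hypothetical name):

    import numpy as np

    def splice_with_fill(sorted_feat, frame_count, rows, cols):
        """Tile sorted (C, h, w) feature data into frame_count frames of
        rows x cols channel slots in raster order, filling any slot
        beyond the C available channels with zeros."""
        C, h, w = sorted_feat.shape
        frames = np.zeros((frame_count, rows * h, cols * w),
                          dtype=sorted_feat.dtype)
        for i in range(min(C, frame_count * rows * cols)):
            n, rem = divmod(i, rows * cols)
            r, c = divmod(rem, cols)
            frames[n, r*h:(r+1)*h, c*w:(c+1)*w] = sorted_feat[i]
        return frames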
Step 606: Encode the target feature frame sequence to generate a code stream, and write the channel sequence correspondence into the code stream.
It should be noted that for descriptions of the same steps and content in this embodiment as in other embodiments, reference can be made to the descriptions in the other embodiments, and details are not repeated here.
An embodiment of the present disclosure provides an encoding method for feature data, applied to an encoder; referring to FIG. 7, the method includes the following steps:
Step 701: Obtain feature data of multiple channels corresponding to an image to be processed.
Step 702: Determine feature data of a reference channel in the feature data of the multiple channels.
Step 703: Taking the feature data of the reference channel as the starting object for sorting, sort the feature data of the multiple channels in descending order of similarity between the feature data of the multiple channels, to obtain sorted feature data of the multiple channels.
Step 704: Determine that the number of time-domain splicing frames is greater than one frame, and splice the sorted feature data in the spatiotemporal domain according to the splicing strategy to obtain the target feature frame sequence.
In the embodiments of the present disclosure, referring to FIG. 8, step 704 of determining that the number of time-domain splicing frames is greater than one frame and splicing the sorted feature data in the spatiotemporal domain according to the splicing strategy to obtain the target feature frame sequence may be implemented through the following steps:
Step 801: Determine that the number of time-domain splicing frames is greater than one frame, and splice the sorted feature data in the spatiotemporal domain according to the splicing strategy to obtain spliced feature data.
Step 802: Determine the product of the number of rows of the spliced feature data, the number of columns of the spliced feature data, and the number of time-domain splicing frames.
Step 803: Determine that the number of channels of the feature data of the multiple channels is less than the product, and fill the regions lacking feature data channels in the spliced frames to obtain the target feature frame sequence.
Here, filling the regions lacking feature data channels in the spliced frames, that is, filling the regions lacking feature data channels in the spliced feature frame sequence, improves the coding efficiency. A region lacking feature data channels may be a region in the last frame of the spliced feature frame sequence, or a region in at least one frame other than the last frame.
In the embodiments of the present disclosure, referring to FIG. 9, step 704 of determining that the number of time-domain splicing frames is greater than one frame and splicing the sorted feature data in the spatiotemporal domain according to the splicing strategy to obtain the target feature frame sequence may also be implemented through the following steps:
Step 901: Determine that the number of time-domain splicing frames is greater than one frame, and, according to the time-domain-first-then-space-domain splicing strategy in the spatiotemporal domain, splice in the time domain at the same position of different frames in raster scan order.
Step 902: Splice in the space domain at adjacent positions in raster scan order, or splice in the space domain at adjacent positions in zigzag scan order.
Here, by splicing first in the time domain and then in the space domain, inter-frame coding techniques can be better used to encode the feature data, so that techniques in existing video coding standards can be reused to encode the feature data efficiently.
In the embodiments of the present disclosure, referring to FIG. 10, step 704 of determining that the number of time-domain splicing frames is greater than one frame and splicing the sorted feature data in the spatiotemporal domain according to the splicing strategy to obtain the target feature frame sequence may also be implemented through the following steps:
Step 1001: Determine that the number of time-domain splicing frames is greater than one frame, and, according to the space-domain-first-then-time-domain splicing strategy in the spatiotemporal domain, splice in the space domain at adjacent positions in raster scan order, or splice in the space domain at adjacent positions in zigzag scan order.
Step 1002: Splice in the time domain at the same position of different frames in raster scan order.
Here, by splicing first in the space domain and then in the time domain, intra-frame coding techniques can be better used to encode the feature data, so that techniques in existing video coding standards can be reused to encode the feature data efficiently.
Step 705: Determine that the number of time-domain splicing frames is one frame, and splice the sorted feature data in the space domain according to the splicing strategy to obtain the target feature frame sequence.
Step 706: Encode the target feature frame sequence to generate a code stream.
Step 707: Write the number of time-domain splicing frames, the number of channels corresponding to the feature data of the multiple channels, the height of the feature data of a single channel, and the width of the feature data of a single channel into the code stream.
In one implementable scenario, raster-scan splicing is further described, taking splicing into a video sequence with a total of 4 frames as an example; a schematic diagram of raster-scan splicing is shown in FIG. 11. The placement orders of the sorted feature data include, but are not limited to:
first splicing in the time domain at the same position of different frames in raster scan order, and then splicing in the space domain at adjacent positions in raster scan order;
first splicing in the space domain at adjacent positions in raster scan order, and then splicing in the time domain at the same position of different frames in raster scan order.
In one implementable scenario, zigzag-scan splicing is further described, taking splicing into a video sequence with a total of 4 frames as an example; a schematic diagram of zigzag splicing is shown in FIG. 12. The placement orders of the sorted feature data include, but are not limited to:
first splicing in the time domain at the same position of different frames in raster scan order, and then splicing in the space domain at adjacent positions in zigzag scan order;
first splicing in the space domain at adjacent positions in zigzag scan order, and then splicing in the time domain at the same position of different frames in raster scan order.
In the embodiments of the present disclosure, in addition to the bitstream information generated by conventional video encoding, the following additional information is transmitted; the additional information is also called the spatiotemporal arrangement information of the feature data: the channel number C of the feature data, the height h of the feature data of a single channel, the width w of the feature data of a single channel, the sorting list channel_idx, and the number of time-domain splicing frames frame_count.
It should be noted that for descriptions of the same steps and content in this embodiment as in other embodiments, reference can be made to the descriptions in the other embodiments, and details are not repeated here.
An embodiment of the present disclosure provides an encoding method for feature data, applied to an encoder; referring to FIG. 13, the method includes the following steps:
Step 1101: Obtain feature data of multiple channels corresponding to an image to be processed.
Step 1102: Determine feature data of a reference channel in the feature data of the multiple channels.
Step 1103: Taking the feature data of the reference channel as the starting object for sorting, sort the feature data of the multiple channels in descending order of similarity between the feature data of the multiple channels, to obtain sorted feature data of the multiple channels.
Step 1104: Splice the sorted feature data in the space domain according to the fill-first-then-splice strategy.
In the embodiments of the present disclosure, step 1104 of splicing the sorted feature data in the space domain according to the fill-first-then-splice strategy may be implemented as follows: pad each piece of sorted feature data in the space domain, and splice the padded feature data in the space domain, where there are gaps between the padded feature data of adjacent channels.
Referring to FIG. 14, padding each piece of sorted feature data in the space domain includes padding between the feature data of adjacent channels, ensuring that there are gaps between the padded feature data of adjacent channels. Further, the gaps between the feature data of adjacent channels may be of the same size; for example, the top, bottom, left, and right distances between each small box and each dashed box are the same. In the embodiments of the present disclosure, padding between the feature data of adjacent channels reduces the mutual influence of values between different channels and improves the signal fidelity at channel boundaries.
Step 1105: Encode the target feature frame sequence to generate a code stream.
Step 1106: Write the number of time-domain splicing frames, the height of the padded feature data, and the width of the padded feature data into the code stream, and write the number of channels corresponding to the feature data of the multiple channels, the height of the feature data of a single channel, and the width of the feature data of a single channel into the code stream.
Here, the spatiotemporal arrangement information of the feature data further includes the height of the padded feature data and the width of the padded feature data.
In other embodiments of the present disclosure, the fill-first-then-splice scheme of step 1104 in the space domain is also applicable to step 901, step 1001, and step 705. For example, when step 901 is performed, it is determined that the number of time-domain splicing frames is greater than one frame; according to the time-domain-first-then-space-domain splicing strategy in the spatiotemporal domain, splicing is performed in the time domain at the same position of different frames in raster scan order; and, based on the fill-first-then-splice strategy in the space domain, each piece of sorted feature data is padded in the space domain, and the padded feature data is spliced in the space domain at adjacent positions in raster scan order, or at adjacent positions in zigzag scan order.
For example, when step 1001 is performed, it is determined that the number of time-domain splicing frames is greater than one frame; based on the fill-first-then-splice strategy in the space domain, each piece of sorted feature data is padded in the space domain, and the padded feature data is spliced in the space domain at adjacent positions in raster scan order, or at adjacent positions in zigzag scan order; splicing is then performed in the time domain at the same position of different frames in raster scan order.
For example, when step 705 is performed, it is determined that the number of time-domain splicing frames is one frame; based on the fill-first-then-splice strategy in the space domain, each piece of sorted feature data is padded in the space domain, and the padded feature data is spliced in the space domain according to the splicing strategy to obtain the target feature frame sequence.
In one implementation, the spatiotemporal arrangement information of the feature data may be recorded in supplemental enhancement information (for example, the Supplemental Enhancement Information (SEI) of the existing video coding standards H.265/HEVC and H.266/VVC, or the Extension Data of the AVS standards). For example, a new SEI category, namely the Feature data quantization SEI message, is added in the sei_payload() of sei_message() in sei_rbsp() of the existing video coding standards AVC/HEVC/VVC/EVC; its payloadType may be defined as any number not used by other SEI messages, for example 183. In this case, the syntax structure is shown in Table 1.
Table 1: sei_payload() syntax structure. [The table is provided as an image in the original document.]
If the sorting list is a one-dimensional sorting list, its syntax structure is as follows. [The syntax structure is provided as an image in the original document.]
The syntax elements may be encoded with different efficient entropy coding schemes. The syntax elements are:
feature_channel_count: used to describe that the number of channels of the feature data is feature_channel_count;
feature_frame_count: used to describe that the number of frames after splicing of the feature data is feature_frame_count;
feature_single_channel_height: used to describe that the height of the feature data of a single channel is feature_single_channel_height;
feature_single_channel_width: used to describe that the width of the feature data of a single channel is feature_single_channel_width;
channel_idx[I]: used to describe the original channel order channel_idx[I], before sorting, of the feature data placed at the Ith channel after sorting.
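As a toy illustration of carrying this arrangement information, the following Python sketch packs the five signalled fields with fixed 16-bit big-endian integers; a real SEI payload would use the entropy-coded descriptors of the standard, so the layout here is purely an assumption:

    import struct

    def pack_arrangement_info(C, frame_count, h, w, channel_idx):
        """Serialize feature_channel_count, feature_frame_count, the
        single-channel height/width, and the one-dimensional channel_idx
        list as fixed 16-bit fields (illustrative only)."""
        payload = struct.pack(">HHHH", C, frame_count, h, w)
        payload += struct.pack(">%dH" % len(channel_idx), *channel_idx)
        return payload

    def unpack_arrangement_info(buf):
        """Inverse of pack_arrangement_info."""
        C, frame_count, h, w = struct.unpack_from(">HHHH", buf, 0)
        channel_idx = list(struct.unpack_from(">%dH" % C, buf, 8))
        return C, frame_count, h, w, channel_idx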
It should be noted that for descriptions of the same steps and content in the embodiments of the present disclosure as in other embodiments, reference can be made to the descriptions in the other embodiments, and details are not repeated here.
An embodiment of the present disclosure provides a decoding method for feature data, applied to a decoder; referring to FIG. 15, the method includes the following steps:
Step 1201: Parse the code stream to obtain a reconstructed feature frame sequence.
Step 1202: Inverse-sort the reconstructed feature frame sequence to obtain reconstructed feature data of multiple channels.
The decoding method provided by the embodiments of the present disclosure parses the code stream to obtain the reconstructed feature frame sequence, and inverse-sorts the reconstructed feature frame sequence to obtain the reconstructed feature data of the multiple channels, so that the feature data of the multiple channels before the spatiotemporal sorting can be accurately recovered for use by the subsequent network for further task inference analysis.
An embodiment of the present disclosure provides a decoding method for feature data, applied to a decoder; referring to FIG. 16, the method includes the following steps:
Step 1301: Parse the code stream to obtain the reconstructed feature frame sequence, the channel sequence correspondence, the number of channels, the number of time-domain splicing frames, the height of the feature data of a single channel, and the width of the feature data of a single channel.
Step 1302: Determine the position of the feature data of each channel in the reconstructed feature frame sequence based on the number of channels, the number of time-domain splicing frames, the height of the feature data of a single channel, and the width of the feature data of a single channel.
Step 1303: Based on the channel sequence correspondence, determine the original channel order of the feature data at different positions in the reconstructed feature frame sequence.
Step 1304: Based on the original channel order, inverse-sort the feature data at different positions in the reconstructed feature frame sequence to obtain the reconstructed feature data of the multiple channels.
Exemplarily, after the decoding end decodes the reconstructed feature frame sequence and the reconstructed spatiotemporal arrangement information of the feature data, it performs a spatiotemporal inverse arrangement operation on the reconstructed feature frame sequence to obtain the reconstructed feature data. The steps are as follows:
based on the channel number C of the feature data, the number of time-domain splicing frames frame_count, the height h of the feature data of a single channel, and the width w of the feature data of a single channel in the reconstructed spatiotemporal arrangement information, determine the position of the feature data of each channel in the feature frame sequence;
based on the sorting list channel_idx in the reconstructed spatiotemporal arrangement information (taking the one-dimensional sorting list channel_idx[I] = X as an example), determine the original channel order of the feature data of each channel before sorting; once the original channel order of the feature data of all channels has been determined, inverse-sort the feature data at different positions in the reconstructed feature frame sequence based on the original channel order to obtain the reconstructed feature data of the multiple channels.
It should be noted that for descriptions of the same steps and content in this embodiment as in other embodiments, reference can be made to the descriptions in the other embodiments, and details are not repeated here.
An embodiment of the present disclosure provides a decoding method for feature data, applied to a decoder; referring to FIG. 17, the method includes the following steps:
Step 1401: Parse the code stream to obtain the reconstructed feature frame sequence, the channel sequence correspondence, the number of channels, the number of time-domain splicing frames, the height of the padded feature data, the width of the padded feature data, the height of the feature data of a single channel, and the width of the feature data of a single channel.
Step 1402: Determine the position of the feature data of each channel in the reconstructed feature frame sequence based on the number of channels, the number of time-domain splicing frames, the height of the padded feature data, the width of the padded feature data, the height of the feature data of a single channel, and the width of the feature data of a single channel.
Step 1403: Based on the channel sequence correspondence, determine the original channel order of the feature data at different positions in the reconstructed feature frame sequence.
Step 1404: Based on the original channel order, inverse-sort the feature data at different positions in the reconstructed feature frame sequence to obtain the reconstructed feature data of the multiple channels.
It should be noted that for descriptions of the same steps and content in this embodiment as in other embodiments, reference can be made to the descriptions in the other embodiments, and details are not repeated here.
The present disclosure has at least the following beneficial effects: based on the information redundancy between different channels of the multi-channel feature data output by the intermediate layer of the neural network, all channels of the multi-channel feature data are sorted according to similarity and then arranged, in the sorted order, into a feature frame sequence in the time domain and the space domain, so that feature data channels with high similarity in adjacent regions can be referred to during encoding, improving the coding efficiency of the feature data. If the splicing is performed first in the time domain and then in the space domain, inter-frame coding techniques can be better used to encode the feature data; if it is performed first in the space domain and then in the time domain, intra-frame coding techniques can be better used; in either case, techniques in existing video coding standards can be reused to encode the feature data efficiently.
In other words, in order to efficiently reuse techniques in existing video coding standards to encode the multi-channel feature data output by the intermediate layer of the neural network, the present disclosure sorts all channels of the feature data according to similarity and arranges them into a feature frame sequence in the time domain and the space domain. Since the correlation between adjacent channels in the time domain and the space domain after the arrangement is relatively large, the present disclosure can better utilize existing intra-frame prediction and inter-frame prediction, further improving the coding efficiency of the feature data. In order to restore the multi-channel feature data before the spatiotemporal arrangement after decoding, the spatiotemporal arrangement information of the feature data needs to be recorded in the code stream.
FIG. 18 is a schematic diagram of the composition and structure of an encoding device provided by an embodiment of the present disclosure. As shown in FIG. 18, the encoding device 150 includes a first obtaining unit 1501, a first processing unit 1502, and an encoding unit 1503, where:
the first obtaining unit 1501 is configured to obtain feature data of multiple channels corresponding to an image to be processed;
the first processing unit 1502 is configured to determine feature data of a reference channel in the feature data of the multiple channels;
the first processing unit 1502 is configured to take the feature data of the reference channel as the starting object for sorting and sort the feature data of the multiple channels in descending order of similarity between the feature data of the multiple channels, to obtain sorted feature data of the multiple channels;
the first processing unit 1502 is configured to splice the sorted feature data of the multiple channels to obtain a target feature frame sequence;
the encoding unit 1503 is configured to encode the target feature frame sequence to generate a code stream.
In other embodiments of the present disclosure, the first processing unit 1502 is configured to, when the accumulated sum of the feature data values in the feature data of the multiple channels satisfies a target threshold, determine the feature data of the channel corresponding to the accumulated sum as the feature data of the reference channel.
In other embodiments of the present disclosure, the accumulated sum of the feature data values satisfying the target threshold includes: the accumulated sum of the feature data values being the largest, or the accumulated sum of the feature data values being the smallest.
In other embodiments of the present disclosure, the first obtaining unit 1501 is configured to obtain the channel sequence correspondence between the original channel order of the feature data of the multiple channels in the image to be processed and the encoding channel order in the sorted feature data of the multiple channels; and the encoding unit 1503 is configured to write the channel sequence correspondence into the code stream.
In other embodiments of the present disclosure, the channel sequence correspondence includes: when the number of time-domain splicing frames is one frame, the original channel order being the Xth channel and the corresponding encoding channel order being the Ith channel; and when the number of time-domain splicing frames is at least two frames, the original channel order being the Xth channel and the corresponding encoding channel order being the Ith channel of the Nth frame.
In other embodiments of the present disclosure, the first processing unit 1502 is configured to determine that the number of time-domain splicing frames is greater than one frame, and splice the sorted feature data in the spatiotemporal domain according to the splicing strategy to obtain the target feature frame sequence.
In other embodiments of the present disclosure, the first processing unit 1502 is configured to: determine that the number of time-domain splicing frames is greater than one frame, and splice the sorted feature data in the spatiotemporal domain according to the splicing strategy to obtain spliced feature data; determine the product of the number of rows of the spliced feature data, the number of columns of the spliced feature data, and the number of time-domain splicing frames; and determine that the number of channels of the feature data of the multiple channels is less than the product, and fill the regions lacking feature data channels in the spliced frames to obtain the target feature frame sequence.
In other embodiments of the present disclosure, the first processing unit 1502 is configured to: determine that the number of time-domain splicing frames is greater than one frame; according to the time-domain-first-then-space-domain splicing strategy in the spatiotemporal domain, splice in the time domain at the same position of different frames in raster scan order; and splice in the space domain at adjacent positions in raster scan order, or splice in the space domain at adjacent positions in zigzag scan order.
In other embodiments of the present disclosure, the first processing unit 1502 is configured to: determine that the number of time-domain splicing frames is greater than one frame; according to the space-domain-first-then-time-domain splicing strategy in the spatiotemporal domain, splice in the space domain at adjacent positions in raster scan order, or splice in the space domain at adjacent positions in zigzag scan order; and splice in the time domain at the same position of different frames in raster scan order.
In other embodiments of the present disclosure, the first processing unit 1502 is configured to determine that the number of time-domain splicing frames is one frame, and splice the sorted channel feature data in the space domain according to the splicing strategy to obtain the target feature frame sequence.
In other embodiments of the present disclosure, the first processing unit 1502 is configured to splice the sorted channel feature data in the space domain according to a fill-first-then-splice strategy.
In other embodiments of the present disclosure, the first processing unit 1502 is configured to pad each piece of sorted feature data in the space domain and splice the padded feature data in the space domain, where there are gaps between the padded feature data of adjacent channels.
In other embodiments of the present disclosure, the encoding unit 1503 is configured to write the height of the padded feature data and the width of the padded feature data into the code stream; write the number of channels corresponding to the feature data of the multiple channels, the height of the feature data of a single channel, and the width of the feature data of a single channel into the code stream; and write the number of time-domain splicing frames into the code stream.
In other embodiments of the present disclosure, the first obtaining unit 1501 is configured to obtain the image to be processed, and the first processing unit 1502 is configured to perform feature extraction on the image to be processed through a neural network model to obtain the feature data of the multiple channels.
FIG. 19 is a schematic diagram of the composition and structure of a decoding device provided by an embodiment of the present disclosure. As shown in FIG. 19, the decoding device 160 includes a decoding unit 1601 and a second processing unit 1602, where:
the decoding unit 1601 is configured to parse the code stream to obtain a reconstructed feature frame sequence;
the second processing unit 1602 is configured to inverse-sort the reconstructed feature frame sequence to obtain reconstructed feature data of multiple channels.
In other embodiments of the present disclosure, the decoding unit 1601 is configured to parse the code stream to obtain the channel sequence correspondence, the number of channels, the number of time-domain splicing frames, the height of the feature data of a single channel, and the width of the feature data of a single channel.
The second processing unit 1602 is configured to: determine the position of the feature data of each channel in the reconstructed feature frame sequence based on the number of channels, the number of time-domain splicing frames, the height of the feature data of a single channel, and the width of the feature data of a single channel; based on the channel sequence correspondence, determine the original channel order of the feature data at different positions in the reconstructed feature frame sequence; and, based on the original channel order, inverse-sort the feature data at different positions in the reconstructed feature frame sequence to obtain the reconstructed feature data of the multiple channels.
In other embodiments of the present disclosure, the decoding unit 1601 is configured to parse the code stream to obtain the height of the padded feature data and the width of the padded feature data; and the second processing unit 1602 is configured to determine the position of the feature data of each channel in the reconstructed feature frame sequence based on the number of channels, the number of time-domain splicing frames, the height of the padded feature data, the width of the padded feature data, the height of the feature data of a single channel, and the width of the feature data of a single channel.
FIG. 20 is a schematic diagram of the composition and structure of an encoding device provided by an embodiment of the present disclosure. As shown in FIG. 20, the encoding device 170 (the encoding device 170 in FIG. 20 corresponds to the encoding device 150 in FIG. 18) includes a first memory 1701 and a first processor 1702, where: the first processor 1702 is configured to implement the encoding method provided by the embodiments of the present disclosure when executing the encoding instructions stored in the first memory 1701.
The first processor 1702 may be implemented by software, hardware, firmware, or a combination thereof, and may use circuits, one or more application-specific integrated circuits (ASICs), one or more general-purpose integrated circuits, one or more microprocessors, one or more programmable logic devices, a combination of the foregoing circuits or devices, or other suitable circuits or devices, so that the processor can perform the corresponding steps of the foregoing encoding methods.
FIG. 21 is a schematic diagram of the composition and structure of a decoding device provided by an embodiment of the present disclosure. As shown in FIG. 21, the decoding device 180 (the decoding device 180 in FIG. 21 corresponds to the decoding device 160 in FIG. 19) includes a second memory 1801 and a second processor 1802, where: the second processor 1802 is configured to implement the decoding method provided by the embodiments of the present disclosure when executing the decoding instructions stored in the second memory 1801.
The second processor 1802 may be implemented by software, hardware, firmware, or a combination thereof, and may use circuits, one or more application-specific integrated circuits (ASICs), one or more general-purpose integrated circuits, one or more microprocessors, one or more programmable logic devices, a combination of the foregoing circuits or devices, or other suitable circuits or devices, so that the processor can perform the corresponding steps of the foregoing decoding methods.
The components in the embodiments of the present disclosure may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software function module.
If the integrated unit is implemented in the form of a software function module and is not sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on such an understanding, the technical solution of this embodiment, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a cloud server, a network device, or the like) or a processor to execute all or part of the steps of the method of this embodiment. The foregoing storage media include various media that can store program code, such as a ferromagnetic random access memory (FRAM), a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), a flash memory, a magnetic surface memory, an optical disc, or a compact disc read-only memory (CD-ROM); the embodiments of the present disclosure are not limited thereto.
Embodiments of the present disclosure further provide a computer-readable storage medium storing executable encoding instructions for causing a first processor to implement, when executed, the encoding method provided by the embodiments of the present disclosure.
Embodiments of the present disclosure further provide a computer-readable storage medium storing executable decoding instructions for causing a second processor to implement, when executed, the decoding method provided by the embodiments of the present disclosure.
Industrial Applicability
Embodiments of the present disclosure provide an encoding method, a decoding method, an encoder, a decoder, and a storage medium for feature data: feature data of multiple channels corresponding to an image to be processed is obtained; feature data of a reference channel in the feature data of the multiple channels is determined; taking the feature data of the reference channel as the starting object for sorting, the feature data of the multiple channels is sorted in descending order of similarity between the feature data of the multiple channels, to obtain sorted feature data of the multiple channels; the sorted feature data of the multiple channels is spliced to obtain a target feature frame sequence; and the target feature frame sequence is encoded to generate a code stream. In other words, given the feature data of multiple channels, the present disclosure takes the feature data of one channel as the benchmark, that is, determines the feature data of the reference channel, and sorts the feature data of all channels in descending order of similarity to the feature data of the reference channel. In this way, the correlation between the feature data of adjacent channels in the spatiotemporal domain after sorting is relatively large, so that subsequent encoding can refer to feature data channels with higher similarity in adjacent regions, thereby improving the coding efficiency of the feature data.

Claims (25)

  1. An encoding method for feature data, comprising:
    obtaining feature data of multiple channels corresponding to an image to be processed;
    determining feature data of a reference channel in the feature data of the multiple channels;
    taking the feature data of the reference channel as a starting object for sorting, and sorting the feature data of the multiple channels in descending order of similarity between the feature data of the multiple channels, to obtain sorted feature data of the multiple channels;
    splicing the sorted feature data of the multiple channels to obtain a target feature frame sequence; and
    encoding the target feature frame sequence to generate a code stream.
  2. The method according to claim 1, wherein determining the feature data of the reference channel in the feature data of the multiple channels comprises:
    when an accumulated sum of feature data values in the feature data of the multiple channels satisfies a target threshold, determining the feature data of the channel corresponding to the accumulated sum as the feature data of the reference channel.
  3. The method according to claim 2, wherein the accumulated sum of the feature data values satisfying the target threshold comprises: the accumulated sum of the feature data values being the largest, or the accumulated sum of the feature data values being the smallest.
  4. The method according to claim 1, wherein, after taking the feature data of the reference channel as the starting object for sorting and sorting the feature data of the multiple channels in descending order of similarity between the feature data of the multiple channels to obtain the sorted feature data of the multiple channels, the method comprises:
    obtaining a channel sequence correspondence between an original channel order of the feature data of the multiple channels in the image to be processed and an encoding channel order in the sorted feature data of the multiple channels; and
    writing the channel sequence correspondence into the code stream.
  5. The method according to claim 4, wherein the channel sequence correspondence comprises:
    when a number of time-domain splicing frames is one frame, the original channel order being an Xth channel and the corresponding encoding channel order being an Ith channel; and
    when the number of time-domain splicing frames is at least two frames, the original channel order being the Xth channel and the corresponding encoding channel order being the Ith channel of an Nth frame.
  6. The method according to claim 1, wherein splicing the sorted feature data to obtain the target feature frame sequence comprises:
    determining that a number of time-domain splicing frames is greater than one frame, and splicing the sorted feature data in the spatiotemporal domain according to a splicing strategy to obtain the target feature frame sequence.
  7. The method according to claim 6, wherein determining that the number of time-domain splicing frames is greater than one frame and splicing the sorted feature data in the spatiotemporal domain according to the splicing strategy to obtain the target feature frame sequence comprises:
    determining that the number of time-domain splicing frames is greater than one frame, and splicing the sorted feature data in the spatiotemporal domain according to the splicing strategy to obtain spliced feature data;
    determining a product of a number of rows of the spliced feature data, a number of columns of the spliced feature data, and the number of time-domain splicing frames; and
    determining that a number of channels of the feature data of the multiple channels is less than the product, and filling a region lacking feature data channels in the spliced frames to obtain the target feature frame sequence.
  8. The method according to claim 6, wherein determining that the number of time-domain splicing frames is greater than one frame and splicing the sorted feature data in the spatiotemporal domain according to the splicing strategy comprises:
    determining that the number of time-domain splicing frames is greater than one frame, and, according to a time-domain-first-then-space-domain splicing strategy in the spatiotemporal domain, splicing in the time domain at the same position of different frames in raster scan order; and
    splicing in the space domain at adjacent positions in raster scan order, or splicing in the space domain at adjacent positions in zigzag scan order.
  9. The method according to claim 6, wherein determining that the number of time-domain splicing frames is greater than one frame and splicing the sorted feature data in the spatiotemporal domain according to the splicing strategy comprises:
    determining that the number of time-domain splicing frames is greater than one frame, and, according to a space-domain-first-then-time-domain splicing strategy in the spatiotemporal domain, splicing in the space domain at adjacent positions in raster scan order, or splicing in the space domain at adjacent positions in zigzag scan order; and
    splicing in the time domain at the same position of different frames in raster scan order.
  10. The method according to claim 1, wherein splicing the sorted feature data to obtain the target feature frame sequence comprises:
    determining that a number of time-domain splicing frames is one frame, and splicing the sorted feature data in the space domain according to a splicing strategy to obtain the target feature frame sequence.
  11. The method according to claim 1, wherein splicing the sorted feature data comprises:
    splicing the sorted feature data in the space domain according to a fill-first-then-splice strategy.
  12. The method according to claim 11, wherein splicing the sorted feature data in the space domain according to the fill-first-then-splice strategy comprises:
    padding each piece of sorted feature data in the space domain, and splicing the padded feature data in the space domain, wherein there are gaps between the padded feature data of adjacent channels.
  13. The method according to claim 12, wherein, after splicing the padded feature data in the space domain, the method comprises:
    writing a height of the padded feature data and a width of the padded feature data into the code stream.
  14. The method according to claim 1, further comprising:
    writing a number of channels corresponding to the feature data of the multiple channels, a height of the feature data of a single channel, and a width of the feature data of a single channel into the code stream.
  15. The method according to claim 5, further comprising:
    writing the number of time-domain splicing frames into the code stream.
  16. The method according to claim 1, further comprising:
    obtaining the image to be processed; and
    performing feature extraction on the image to be processed through a neural network model to obtain the feature data of the multiple channels.
  17. A decoding method for feature data, comprising:
    parsing a code stream to obtain a reconstructed feature frame sequence; and
    inverse-sorting the reconstructed feature frame sequence to obtain reconstructed feature data of multiple channels.
  18. The method according to claim 17, further comprising:
    parsing the code stream to obtain a channel sequence correspondence, a number of channels, a number of time-domain splicing frames, a height of feature data of a single channel, and a width of feature data of a single channel; and
    determining a position of the feature data of each channel in the reconstructed feature frame sequence based on the number of channels, the number of time-domain splicing frames, the height of the feature data of a single channel, and the width of the feature data of a single channel;
    correspondingly, inverse-sorting the reconstructed feature frame sequence to obtain the reconstructed feature data of the multiple channels comprises:
    determining, based on the channel sequence correspondence, an original channel order of the feature data at different positions in the reconstructed feature frame sequence; and
    inverse-sorting, based on the original channel order, the feature data at different positions in the reconstructed feature frame sequence to obtain the reconstructed feature data of the multiple channels.
  19. The method according to claim 18, further comprising:
    parsing the code stream to obtain a height of padded feature data and a width of padded feature data;
    correspondingly, determining the position of the feature data of each channel in the reconstructed feature frame sequence based on the number of channels, the number of time-domain splicing frames, the height of the feature data of a single channel, and the width of the feature data of a single channel comprises:
    determining the position of the feature data of each channel in the reconstructed feature frame sequence based on the number of channels, the number of time-domain splicing frames, the height of the padded feature data, the width of the padded feature data, the height of the feature data of a single channel, and the width of the feature data of a single channel.
  20. An encoder, comprising a first obtaining unit, a first processing unit, and an encoding unit, wherein:
    the first obtaining unit is configured to obtain feature data of multiple channels corresponding to an image to be processed;
    the first processing unit is configured to determine feature data of a reference channel in the feature data of the multiple channels;
    the first processing unit is configured to take the feature data of the reference channel as a starting object for sorting and sort the feature data of the multiple channels in descending order of similarity between the feature data of the multiple channels, to obtain sorted feature data of the multiple channels;
    the first processing unit is configured to splice the sorted feature data of the multiple channels to obtain a target feature frame sequence; and
    the encoding unit is configured to encode the target feature frame sequence to generate a code stream.
  21. An encoder, comprising a first memory and a first processor, wherein:
    the first memory is configured to store a computer program runnable on the first processor; and
    the first processor is configured to perform, when running the computer program, the method according to any one of claims 1 to 16.
  22. A decoder, comprising a decoding unit and a second processing unit, wherein:
    the decoding unit is configured to parse a code stream to obtain a reconstructed feature frame sequence; and
    the second processing unit is configured to inverse-sort the reconstructed feature frame sequence to obtain reconstructed feature data of multiple channels.
  23. A decoder, comprising a second memory and a second processor, wherein:
    the second memory is configured to store a computer program runnable on the second processor; and
    the second processor is configured to perform, when running the computer program, the method according to any one of claims 17 to 19.
  24. A computer-readable storage medium storing executable encoding instructions for causing a first processor to implement, when executed, the method according to any one of claims 1 to 16.
  25. A computer-readable storage medium storing executable decoding instructions for causing a second processor to implement, when executed, the method according to any one of claims 17 to 19.
PCT/CN2021/078550 2021-03-01 2021-03-01 Feature data encoding method, decoding method, device, and storage medium WO2022183346A1 (zh)

Priority Applications (4)

Application Number Priority Date Filing Date Title
PCT/CN2021/078550 WO2022183346A1 (zh) 2021-03-01 2021-03-01 特征数据的编码方法、解码方法、设备及存储介质
CN202180094401.6A CN116868570A (zh) 2021-03-01 2021-03-01 特征数据的编码方法、解码方法、设备及存储介质
EP21928437.9A EP4304176A1 (en) 2021-03-01 2021-03-01 Feature data encoding method, feature data decoding method, devices, and storage medium
US18/458,937 US20230412820A1 (en) 2021-03-01 2023-08-30 Methods for encoding and decoding feature data, and decoder

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/078550 WO2022183346A1 (zh) 2021-03-01 2021-03-01 特征数据的编码方法、解码方法、设备及存储介质

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/458,937 Continuation US20230412820A1 (en) 2021-03-01 2023-08-30 Methods for encoding and decoding feature data, and decoder

Publications (1)

Publication Number Publication Date
WO2022183346A1 true WO2022183346A1 (zh) 2022-09-09

Family

ID=83153804

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/078550 WO2022183346A1 (zh) 2021-03-01 2021-03-01 特征数据的编码方法、解码方法、设备及存储介质

Country Status (4)

Country Link
US (1) US20230412820A1 (zh)
EP (1) EP4304176A1 (zh)
CN (1) CN116868570A (zh)
WO (1) WO2022183346A1 (zh)


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110494892A (zh) * 2017-05-31 2019-11-22 Samsung Electronics Co., Ltd. Method and apparatus for processing multi-channel feature map images
CN109254946A (zh) * 2018-08-31 2019-01-22 Zhengzhou Yunhai Information Technology Co., Ltd. Image feature extraction method, apparatus and device, and readable storage medium
WO2021011315A1 (en) * 2019-07-15 2021-01-21 Facebook Technologies, Llc System and method for shift-based information mixing across channels for shufflenet-like neural networks

Also Published As

Publication number Publication date
US20230412820A1 (en) 2023-12-21
CN116868570A (zh) 2023-10-10
EP4304176A1 (en) 2024-01-10

Similar Documents

Publication Publication Date Title
US20210203997A1 (en) Hybrid video and feature coding and decoding
CN103220528B (zh) 通过使用大型变换单元编码和解码图像的方法和设备
US9083947B2 (en) Video encoder, video decoder, method for video encoding and method for video decoding, separately for each colour plane
CN110691250B (zh) 结合块匹配和串匹配的图像压缩装置
CN103621096A (zh) 用于使用自适应滤波对图像进行编码和解码的方法和设备
US20160050440A1 (en) Low-complexity depth map encoder with quad-tree partitioned compressed sensing
US20130163676A1 (en) Methods and apparatus for decoding video signals using motion compensated example-based super-resolution for video compression
WO2020001325A1 (zh) 一种图像编码方法、解码方法、编码器、解码器及存储介质
US11838519B2 (en) Image encoding/decoding method and apparatus for signaling image feature information, and method for transmitting bitstream
CN104704826A (zh) 两步量化和编码方法和装置
JPWO2006035883A1 (ja) 画像処理装置、画像処理方法、および画像処理プログラム
US20230396787A1 (en) Video compression method and apparatus, computer device, and storage medium
Zhu et al. Video coding with spatio-temporal texture synthesis and edge-based inpainting
US20230388490A1 (en) Encoding method, decoding method, and device
WO2022183346A1 (zh) 特征数据的编码方法、解码方法、设备及存储介质
RU2766557C1 (ru) Устройство обработки изображений и способ выполнения эффективного удаления блочности
CN114846789B (zh) 用于指示条带的图像分割信息的解码器及对应方法
Misra et al. Video feature compression for machine tasks
CN1672420A (zh) 压缩包括交替镜头的视频序列的数字数据的方法
US20230412817A1 (en) Encoding method, decoding method, and decoder
WO2022073159A1 (zh) 特征数据的编解码方法、装置、设备及存储介质
RU2787812C2 (ru) Способ и аппаратура предсказания видеоизображений
RU2779474C1 (ru) Устройство обработки изображений и способ выполнения эффективного удаления блочности
CN116708787A (zh) 编解码方法和装置
CN115604486A (zh) 视频图像的编解码方法及装置

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21928437

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 202180094401.6

Country of ref document: CN

WWE Wipo information: entry into national phase

Ref document number: 2021928437

Country of ref document: EP

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2021928437

Country of ref document: EP

Effective date: 20231002